Datasets and Formats
Your model is only as good as your data. Here’s how to format it correctly.
Supported File Formats
AITraining supports multiple data sources:
| Format | How It’s Loaded | Use Case |
|---|---|---|
| JSONL | pandas.read_json(lines=True) | LLM training, conversations |
| CSV | pandas.read_csv() | Classification, tabular data |
| HF Dataset ID | datasets.load_dataset() | Remote datasets from Hub |
| Local HF Dataset | load_from_disk() | Pre-processed datasets |
Parquet files are supported indirectly through HuggingFace datasets that expose Parquet format.
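As a quick sketch, here is how each row in the table maps to a loader call; the file paths and dataset ID below are placeholders:

```python
import pandas as pd
from datasets import load_dataset, load_from_disk

# JSONL: one JSON object per line
df = pd.read_json("train.jsonl", lines=True)

# CSV: tabular data with a header row
df = pd.read_csv("train.csv")

# HF Dataset ID: fetched from the HuggingFace Hub
ds = load_dataset("username/my-dataset")

# Local HF dataset: previously saved with save_to_disk()
ds = load_from_disk("./my_dataset")
```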
Common Formats
CSV (Most Common)
Simple and universal. Works for classification, regression, and basic tasks.
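For example, a two-column classification file (the text and label column names are illustrative):

```csv
text,label
"Great product, arrived on time",positive
"Broke after two days",negative
```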
JSON/JSONL
Better for complex data, conversations, and nested structures.
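One JSON object per line makes nested structures easy to store; for example, a single conversation record (contents illustrative):

```json
{"messages": [{"role": "user", "content": "What is JSONL?"}, {"role": "assistant", "content": "JSON Lines: one JSON object per line."}]}
```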
Folders for Images
Organize images by category:
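Typically each subfolder name becomes the class label, as in this hypothetical layout:

```
images/
├── cats/
│   ├── cat_001.jpg
│   └── cat_002.jpg
└── dogs/
    ├── dog_001.jpg
    └── dog_002.jpg
```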
Data Quality Basics
Balance Your Classes
Bad:
- 1,000 positive examples
- 50 negative examples

Good:
- 500 positive examples
- 500 negative examples
Clean Your Data
Remove:
- Duplicates
- Empty values
- Obvious errors
- Inconsistent formatting
Size Guidelines
| Task Type | Minimum | Good | Great |
|---|---|---|---|
| Text Classification | 100 | 1,000 | 10,000+ |
| Image Classification | 200 | 2,000 | 20,000+ |
| Language Generation | 50 | 500 | 5,000+ |
Required Columns by Trainer
Different trainers require specific columns:
| Trainer | Required Columns | Optional |
|---|---|---|
| sft / default | text (or messages) | - |
| dpo | prompt, chosen, rejected | - |
| orpo | prompt, chosen, rejected | - |
| reward | text (chosen), rejected | - |
Special Formats
DPO/ORPO (Preference Data)
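One JSONL record with the required prompt, chosen, and rejected columns (contents illustrative):

```json
{"prompt": "What is the capital of France?", "chosen": "The capital of France is Paris.", "rejected": "It might be Lyon, but I am not sure."}
```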
Token Classification
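Token classification commonly uses parallel lists of tokens and per-token labels; the column names tokens and tags here are an assumption, not confirmed by this doc:

```json
{"tokens": ["John", "lives", "in", "Berlin"], "tags": ["B-PER", "O", "O", "B-LOC"]}
```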
Conversation Format
Conversations expect lists of {role, content} objects:
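For example:

```json
{"messages": [{"role": "system", "content": "You are a helpful assistant."}, {"role": "user", "content": "What's 2+2?"}, {"role": "assistant", "content": "2+2 equals 4."}]}
```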
Tool Role Support
AITraining supports the tool role for function calling training data:
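A sketch of a conversation containing a tool message (the tool output is illustrative):

```json
{"messages": [
  {"role": "user", "content": "What's the weather in Paris?"},
  {"role": "tool", "content": "{\"temperature_c\": 18, \"condition\": \"cloudy\"}"},
  {"role": "assistant", "content": "It's 18°C and cloudy in Paris right now."}
]}
```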
Automatic compatibility: For models that don’t support the tool role natively (like Gemma), AITraining automatically converts tool messages to user messages with a [Tool Result] prefix. Models with native tool support (Llama 3.1+, Qwen, etc.) use their native format.
Legacy format support: The older OpenAI function role (used before tool was introduced) is also supported and handled identically to the tool role.
Tool Calls (Function Calling)
AITraining also supports the tool_calls field for training models to make function calls:
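An assistant message carrying an OpenAI-style tool_calls entry (the function name and arguments are illustrative):

```json
{"role": "assistant", "content": "", "tool_calls": [
  {"id": "call_1", "type": "function",
   "function": {"name": "get_weather", "arguments": "{\"city\": \"Paris\"}"}}
]}
```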
Smart format detection: AITraining detects if your model supports tool_calls natively:
- Qwen, Llama 3.1+: Uses native <tool_call> format
- Gemma, older models: Serializes tool calls as OpenAI-format JSON in content
Tool Call Format Transformation
For models without native tool support, AITraining serializes tool calls as OpenAI-format JSON appended to the assistant content. The input (a message with a tool_calls field) and the serialized output are sketched below.
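A hedged sketch of the transformation; exact whitespace and separators in the serialized string are assumptions, but the preserved id/type/function structure matches the description below:

```
Input (tool_calls field):
{"role": "assistant", "content": "", "tool_calls": [{"id": "call_1", "type": "function", "function": {"name": "get_weather", "arguments": "{\"city\": \"Paris\"}"}}]}

Output (tool call serialized into content):
{"role": "assistant", "content": "{\"id\": \"call_1\", \"type\": \"function\", \"function\": {\"name\": \"get_weather\", \"arguments\": \"{\\\"city\\\": \\\"Paris\\\"}\"}}"}
```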
The serialized format preserves the full OpenAI structure with id, type, and function fields. This matches the format described in system prompt instructions for better model learning.
Message Alternation Handling
Some models (Gemma, Mistral) require strict user/assistant alternation. AITraining automatically fixes common issues: consecutive same-role messages are merged, as in the sketch below. These fixes only apply when the tokenizer rejects the original format. Models that accept flexible message ordering keep the original structure.
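A hedged illustration of the merge (the separator used to join the contents is an assumption):

```
Before:
[{"role": "user", "content": "Hello"}, {"role": "user", "content": "Are you there?"}, {"role": "assistant", "content": "Yes, I'm here."}]

After:
[{"role": "user", "content": "Hello\nAre you there?"}, {"role": "assistant", "content": "Yes, I'm here."}]
```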
Automatic Dataset Conversion
AITraining can automatically detect and convert common dataset formats. No manual preprocessing needed.
Supported Formats
| Format | Detection | Example Columns |
|---|---|---|
| Alpaca | Auto | instruction, input, output |
| ShareGPT | Auto | conversations with from/value |
| Messages | Auto | messages with role/content |
| Q&A | Auto | question/answer, query/response |
| User/Assistant | Auto | user, assistant |
| DPO | Auto | prompt, chosen, rejected |
| Plain Text | Auto | text |
Using Auto-Conversion
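As an illustration, an Alpaca-format record is detected by its instruction/input/output columns and rewritten into the messages structure. Exactly how instruction and input are combined is an assumption here:

```
Detected (Alpaca):
{"instruction": "Translate to French", "input": "Hello, world", "output": "Bonjour, le monde"}

Converted (messages):
{"messages": [{"role": "user", "content": "Translate to French\n\nHello, world"}, {"role": "assistant", "content": "Bonjour, le monde"}]}
```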
Chat Templates
Chat templates format your data into the model’s expected conversation structure.
| Option | Description |
|---|---|
| tokenizer | Use the model’s built-in chat template (default for SFT/DPO/ORPO) |
| chatml | Standard ChatML format |
| zephyr | Zephyr/Mistral format |
| none | No template (plain text) |
The unified renderer applies templates consistently. Legacy template paths are still supported for backwards compatibility.
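For reference, the chatml option wraps each turn in standard ChatML markers:

```
<|im_start|>user
What's 2+2?<|im_end|>
<|im_start|>assistant
2+2 equals 4.<|im_end|>
```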
Conversation Extension
Merge single-turn examples into multi-turn conversations, as sketched below.
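A before/after sketch; which examples are merged together is an implementation detail, so this only illustrates the shape:

```
Before: two single-turn examples
{"messages": [{"role": "user", "content": "Hi"}, {"role": "assistant", "content": "Hello!"}]}
{"messages": [{"role": "user", "content": "How are you?"}, {"role": "assistant", "content": "Doing well, thanks."}]}

After: one multi-turn conversation
{"messages": [{"role": "user", "content": "Hi"}, {"role": "assistant", "content": "Hello!"}, {"role": "user", "content": "How are you?"}, {"role": "assistant", "content": "Doing well, thanks."}]}
```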
Processed Dataset Output
After processing, your dataset will have:
| Column | Description |
|---|---|
| text | Formatted training data with chat template applied |
| _original_messages | Original messages column (preserved for inspection) |
| _original_* | Other original columns renamed with prefix |
Original columns are renamed to _original_* to prevent other tools from auto-detecting and incorrectly using unprocessed data.
Saving Processed Data
Control where processed data is saved with --save-processed-data:
| Option | Behavior |
|---|---|
| auto | Save locally; also push to Hub if source was from Hub |
| local | Save only to {project}/data_processed/ |
| hub | Push only to Hub as private dataset |
| both | Save locally and push to Hub |
| none | Don’t save processed data |
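A hypothetical invocation; only the --save-processed-data flag is documented above, so the entrypoint name and the elided arguments are placeholders:

```sh
# Hypothetical: the aitraining entrypoint and the other flags are placeholders.
aitraining ... --save-processed-data both
```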
Quick Tips
- Start small - Test with 100 examples before scaling up
- Validate early - Check your format works before collecting thousands of examples
- Keep it consistent - Same format throughout your dataset
- Document everything - Note any preprocessing or special rules
- Use auto-convert - Let AITraining detect and convert formats automatically