Datasets and Formats
Your model is only as good as your data. Here’s how to format it correctly.
AITraining supports multiple data sources:
| Format | How It’s Loaded | Use Case |
|---|---|---|
| JSONL | `pandas.read_json(lines=True)` | LLM training, conversations |
| CSV | `pandas.read_csv()` | Classification, tabular data |
| HF Dataset ID | `datasets.load_dataset()` | Remote datasets from Hub |
| Local HF Dataset | `load_from_disk()` | Pre-processed datasets |
Parquet files are supported indirectly through HuggingFace datasets that expose Parquet format.
CSV (Most Common)
Simple and universal. Works for classification, regression, and basic tasks.
```csv
text,label
"This product is amazing",positive
"Terrible experience",negative
"Average quality",neutral
```
JSON/JSONL
Better for complex data, conversations, and nested structures.
{"messages": [
{"role": "user", "content": "What is Python?"},
{"role": "assistant", "content": "Python is a programming language"}
]}
Folders for Images
Organize images by category:
```
dataset/
  cats/
    cat1.jpg
    cat2.jpg
  dogs/
    dog1.jpg
    dog2.jpg
```
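With folder-based image datasets, the parent directory name becomes the label. A small sketch of that inference, building the example layout in a temp directory (this illustrates the convention; the actual loader is part of AITraining):

```python
import pathlib
import tempfile

# Recreate the example layout, then derive (path, label) pairs the way
# folder-based loaders do: label = name of the containing directory.
root = pathlib.Path(tempfile.mkdtemp()) / "dataset"
layout = {"cats": ["cat1.jpg", "cat2.jpg"], "dogs": ["dog1.jpg", "dog2.jpg"]}
for cls, names in layout.items():
    (root / cls).mkdir(parents=True)
    for name in names:
        (root / cls / name).touch()

samples = sorted((p, p.parent.name) for p in root.glob("*/*.jpg"))
labels = sorted({label for _, label in samples})
print(labels)  # the class names inferred from folder names
```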
Data Quality Basics
Balance Your Classes
Bad:
- 1000 positive examples
- 50 negative examples
Good:
- 500 positive examples
- 500 negative examples
Clean Your Data
Remove:
- Duplicates
- Empty values
- Obvious errors
- Inconsistent formatting
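The four cleaning steps above can be sketched in a few lines. This is an illustration with made-up rows, not AITraining's internal preprocessing:

```python
rows = [
    {"text": "Great product", "label": "positive"},
    {"text": "Great product", "label": "positive"},   # duplicate
    {"text": "", "label": "negative"},                # empty value
    {"text": "  okay  ", "label": "NEUTRAL"},         # inconsistent formatting
]

seen, cleaned = set(), []
for row in rows:
    text = row["text"].strip()          # normalize whitespace
    label = row["label"].strip().lower()  # normalize label casing
    if not text or not label:            # drop empty values
        continue
    key = (text, label)
    if key in seen:                      # drop exact duplicates
        continue
    seen.add(key)
    cleaned.append({"text": text, "label": label})
```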
Size Guidelines
| Task Type | Minimum | Good | Great |
|---|---|---|---|
| Text Classification | 100 | 1,000 | 10,000+ |
| Image Classification | 200 | 2,000 | 20,000+ |
| Language Generation | 50 | 500 | 5,000+ |
Required Columns by Trainer
Different trainers require specific columns:
| Trainer | Required Columns | Optional |
|---|---|---|
| sft / default | `text` (or `messages`) | - |
| dpo | `prompt`, `chosen`, `rejected` | - |
| orpo | `prompt`, `chosen`, `rejected` | - |
| reward | `text` (chosen), `rejected` | - |
If required columns are missing, you’ll get a clear validation error listing the missing and available columns.
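A sketch of what that validation amounts to; the function name and exact message here are illustrative, not AITraining's actual internals:

```python
def validate_columns(available, required):
    """Raise a clear error listing missing vs. available columns."""
    missing = [c for c in required if c not in available]
    if missing:
        raise ValueError(
            f"Missing required columns: {missing}. "
            f"Available columns: {sorted(available)}"
        )

# Required columns per trainer, from the table above.
REQUIRED = {
    "sft": ["text"],
    "dpo": ["prompt", "chosen", "rejected"],
    "orpo": ["prompt", "chosen", "rejected"],
}

# A DPO dataset missing its "rejected" column fails fast with both lists.
try:
    validate_columns(["prompt", "chosen"], REQUIRED["dpo"])
except ValueError as e:
    message = str(e)
```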
DPO/ORPO (Preference Data)
```json
{
  "prompt": "Explain gravity",
  "chosen": "Gravity is a force that attracts objects...",
  "rejected": "gravity is thing that make stuff fall"
}
```
Token Classification
```
John B-PERSON
Smith I-PERSON
visited O
Paris B-LOCATION
```
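Each line pairs a token with its BIO tag. Parsing this layout into the parallel token/tag lists that token-classification trainers typically consume is straightforward:

```python
# The whitespace-separated token/tag example from above.
raw = """John B-PERSON
Smith I-PERSON
visited O
Paris B-LOCATION"""

tokens, tags = [], []
for line in raw.splitlines():
    token, tag = line.split()  # one token and one tag per line
    tokens.append(token)
    tags.append(tag)
```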
Conversation Data
Conversation datasets expect lists of {role, content} objects:
```json
{"messages": [
  {"role": "user", "content": "Hello"},
  {"role": "assistant", "content": "Hi there!"}
]}
```
Or ShareGPT format (auto-detected and converted):
{"conversations": [
{"from": "human", "value": "Hello"},
{"from": "assistant", "value": "Hi there!"}
]}
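The ShareGPT-to-messages conversion described above boils down to renaming keys and roles. A minimal sketch (the role mapping is an assumption about common ShareGPT variants, not AITraining's exact table):

```python
# Map ShareGPT "from" names onto standard roles; "human" -> "user" is the
# important rename, other roles pass through.
ROLE_MAP = {"human": "user", "gpt": "assistant",
            "assistant": "assistant", "system": "system"}

def sharegpt_to_messages(example):
    """Convert a ShareGPT record to the messages format."""
    return {"messages": [
        {"role": ROLE_MAP.get(turn["from"], turn["from"]),
         "content": turn["value"]}
        for turn in example["conversations"]
    ]}

converted = sharegpt_to_messages({"conversations": [
    {"from": "human", "value": "Hello"},
    {"from": "assistant", "value": "Hi there!"},
]})
```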
Automatic Dataset Conversion
AITraining can automatically detect and convert common dataset formats. No manual preprocessing needed.
| Format | Detection | Example Columns |
|---|---|---|
| Alpaca | Auto | `instruction`, `input`, `output` |
| ShareGPT | Auto | `conversations` with from/value |
| Messages | Auto | `messages` with role/content |
| Q&A | Auto | `question`/`answer`, `query`/`response` |
| User/Assistant | Auto | `user`, `assistant` |
| DPO | Auto | `prompt`, `chosen`, `rejected` |
| Plain Text | Auto | `text` |
Column mapping is optional - use it to convert varied column names to the expected format.
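In spirit, detection is a cascade of checks on column names. This is a simplified sketch based on the table above; AITraining's real detector may inspect values too, not just column names:

```python
def detect_format(columns):
    """Guess a dataset format from its column names (illustrative only)."""
    cols = set(columns)
    if {"instruction", "output"} <= cols:
        return "alpaca"
    if "conversations" in cols:
        return "sharegpt"
    if "messages" in cols:
        return "messages"
    if {"prompt", "chosen", "rejected"} <= cols:
        return "dpo"
    if {"question", "answer"} <= cols or {"query", "response"} <= cols:
        return "qa"
    if {"user", "assistant"} <= cols:
        return "user_assistant"
    if "text" in cols:
        return "text"
    return "unknown"
```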
Using Auto-Conversion
```bash
aitraining llm --train \
  --model google/gemma-3-270m \
  --data-path tatsu-lab/alpaca \
  --auto-convert-dataset \
  --chat-template gemma3 \
  --trainer sft
```
Chat Templates
Chat templates format your data into the model’s expected conversation structure.
| Option | Description |
|---|---|
| tokenizer | Use the model’s built-in chat template (default for SFT/DPO/ORPO) |
| chatml | Standard ChatML format |
| zephyr | Zephyr/Mistral format |
| none | No template (plain text) |
Templates are auto-selected based on your trainer, or specify manually:
```bash
--chat-template tokenizer  # Use model's template (recommended)
--chat-template chatml     # Force ChatML
--chat-template none       # Disable for plain text
```
The unified renderer applies templates consistently. Legacy template paths are still supported for backwards compatibility.
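To make the idea concrete, here is a minimal sketch of what the chatml option produces. Real templates come from the model's tokenizer and can differ in details (BOS tokens, generation prompts), so treat this as an approximation:

```python
def render_chatml(messages):
    """Render messages in the standard ChatML wire format."""
    return "".join(
        f"<|im_start|>{m['role']}\n{m['content']}<|im_end|>\n"
        for m in messages
    )

text = render_chatml([
    {"role": "user", "content": "Hello"},
    {"role": "assistant", "content": "Hi there!"},
])
print(text)
```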
Conversation Extension
Merge single-turn examples into multi-turn conversations:
```bash
aitraining llm --train \
  --model google/gemma-3-270m \
  --data-path ./qa_pairs.jsonl \
  --auto-convert-dataset \
  --conversation-extension 3 \
  --trainer sft
```
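Conceptually, `--conversation-extension 3` packs three single-turn examples into one multi-turn conversation. A sketch of that merging, assuming sequential grouping of question/answer pairs (the actual grouping strategy is an implementation detail of AITraining):

```python
def extend_conversations(examples, n):
    """Merge every n single-turn Q&A examples into one multi-turn record."""
    merged = []
    for i in range(0, len(examples), n):
        messages = []
        for ex in examples[i:i + n]:
            messages.append({"role": "user", "content": ex["question"]})
            messages.append({"role": "assistant", "content": ex["answer"]})
        merged.append({"messages": messages})
    return merged

pairs = [{"question": f"q{i}", "answer": f"a{i}"} for i in range(6)]
out = extend_conversations(pairs, 3)  # 6 single-turn pairs -> 2 conversations
```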
Quick Tips
- Start small - Test with 100 examples before scaling up
- Validate early - Check your format works before collecting thousands of examples
- Keep it consistent - Same format throughout your dataset
- Document everything - Note any preprocessing or special rules
- Use auto-convert - Let AITraining detect and convert formats automatically
Next Steps