
Datasets and Formats

Your model is only as good as your data. Here’s how to format it correctly.

Supported File Formats

AITraining supports multiple data sources:
| Format | How It’s Loaded | Use Case |
|---|---|---|
| JSONL | pandas.read_json(lines=True) | LLM training, conversations |
| CSV | pandas.read_csv() | Classification, tabular data |
| HF Dataset ID | datasets.load_dataset() | Remote datasets from Hub |
| Local HF Dataset | load_from_disk() | Pre-processed datasets |
Parquet files are supported indirectly through HuggingFace datasets that expose Parquet format.
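
As a rough illustration of the two pandas loaders named in the table, the sketch below reads small in-memory samples instead of real files. The sample rows and variable names are made up for the example, and it does not show how AITraining wires these calls internally.

```python
import io

import pandas as pd

# Tiny in-memory stand-ins for files on disk (illustrative data only).
jsonl_text = '{"text": "Hello world"}\n{"text": "Second example"}\n'
csv_text = (
    "text,label\n"
    '"This product is amazing",positive\n'
    '"Terrible experience",negative\n'
)

# JSONL: one JSON object per line, hence lines=True.
jsonl_df = pd.read_json(io.StringIO(jsonl_text), lines=True)

# CSV: the header row supplies the column names.
csv_df = pd.read_csv(io.StringIO(csv_text))

print(jsonl_df.columns.tolist(), len(jsonl_df))
print(csv_df.columns.tolist(), len(csv_df))
```

With real data, a file path would replace the `io.StringIO(...)` argument.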

Common Formats

CSV (Most Common)

Simple and universal. Works for classification, regression, and basic tasks.
text,label
"This product is amazing",positive
"Terrible experience",negative
"Average quality",neutral

JSON/JSONL

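Better for complex data, conversations, and nested structures. In a JSONL file each record occupies a single line; the example here is wrapped across lines only for readability.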
{"messages": [
  {"role": "user", "content": "What is Python?"},
  {"role": "assistant", "content": "Python is a programming language"}
]}
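
A minimal sketch of writing and reading such records with Python's standard `json` module. The helper names `to_jsonl` and `from_jsonl` are hypothetical and not part of AITraining.

```python
import json

records = [
    {"messages": [
        {"role": "user", "content": "What is Python?"},
        {"role": "assistant", "content": "Python is a programming language"},
    ]},
]


def to_jsonl(rows):
    """Serialize each record as one compact JSON object per line."""
    return "\n".join(json.dumps(row, ensure_ascii=False) for row in rows) + "\n"


def from_jsonl(text):
    """Parse JSONL text back into a list of records, skipping blank lines."""
    return [json.loads(line) for line in text.splitlines() if line.strip()]


jsonl_text = to_jsonl(records)
round_tripped = from_jsonl(jsonl_text)
print(jsonl_text, end="")
```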

Folders for Images

Organize images by category:
dataset/
  cats/
    cat1.jpg
    cat2.jpg
  dogs/
    dog1.jpg
    dog2.jpg
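
To sanity-check a folder layout like this before training, one option is to walk it and use each subfolder name as the label. This is only a sketch: `list_labeled_images` and the accepted file suffixes are assumptions for illustration, not AITraining's loader.

```python
import tempfile
from pathlib import Path

# Assumed set of image extensions; adjust to match your data.
IMAGE_SUFFIXES = {".jpg", ".jpeg", ".png"}


def list_labeled_images(root):
    """Return sorted (relative_path, label) pairs; the subfolder name is the label."""
    root = Path(root)
    pairs = []
    for class_dir in sorted(p for p in root.iterdir() if p.is_dir()):
        for image in sorted(class_dir.iterdir()):
            if image.suffix.lower() in IMAGE_SUFFIXES:
                pairs.append((image.relative_to(root).as_posix(), class_dir.name))
    return pairs


# Build a throwaway copy of the layout above with empty placeholder files.
with tempfile.TemporaryDirectory() as tmp:
    for rel in ["cats/cat1.jpg", "cats/cat2.jpg", "dogs/dog1.jpg", "dogs/dog2.jpg"]:
        path = Path(tmp) / rel
        path.parent.mkdir(parents=True, exist_ok=True)
        path.touch()
    labeled = list_labeled_images(tmp)

print(labeled)
```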

Data Quality Basics

Balance Your Classes

Bad:
  • 1000 positive examples
  • 50 negative examples
Good:
  • 500 positive examples
  • 500 negative examples
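
One quick way to quantify balance is the ratio between the largest and smallest class. The helper below is an illustrative sketch; `imbalance_ratio` is a hypothetical name, and no particular threshold is implied.

```python
from collections import Counter


def imbalance_ratio(labels):
    """Largest class count divided by smallest class count (1.0 means balanced)."""
    counts = Counter(labels)
    return max(counts.values()) / min(counts.values())


skewed = ["positive"] * 1000 + ["negative"] * 50
balanced = ["positive"] * 500 + ["negative"] * 500

skewed_ratio = imbalance_ratio(skewed)
balanced_ratio = imbalance_ratio(balanced)
print(skewed_ratio, balanced_ratio)
```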

Clean Your Data

Remove:
  • Duplicates
  • Empty values
  • Obvious errors
  • Inconsistent formatting
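
The cleanup steps above might look like this in pandas. The sample rows and the choice to lowercase labels are assumptions for the sketch; "obvious errors" usually need task-specific rules that are not shown here.

```python
import pandas as pd

raw = pd.DataFrame({
    "text": ["Great product", "Great product", "  Terrible experience ", None, "Average quality"],
    "label": ["positive", "positive", "Negative", "neutral", "neutral"],
})

cleaned = (
    raw.dropna(subset=["text", "label"])  # remove empty values
    .assign(
        text=lambda df: df["text"].str.strip(),                # normalize whitespace
        label=lambda df: df["label"].str.strip().str.lower(),  # normalize label casing
    )
)
# Drop rows whose text became empty, then remove exact duplicates.
cleaned = cleaned[cleaned["text"] != ""].drop_duplicates().reset_index(drop=True)

print(cleaned)
```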

Size Guidelines

| Task Type | Minimum | Good | Great |
|---|---|---|---|
| Text Classification | 100 | 1,000 | 10,000+ |
| Image Classification | 200 | 2,000 | 20,000+ |
| Language Generation | 50 | 500 | 5,000+ |
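
If it helps to check a dataset against these guidelines programmatically, the thresholds can be encoded as a lookup. The task keys and the `size_tier` helper are hypothetical names chosen for this sketch.

```python
# (minimum, good, great) example counts, copied from the guidelines above.
SIZE_GUIDELINES = {
    "text_classification": (100, 1_000, 10_000),
    "image_classification": (200, 2_000, 20_000),
    "language_generation": (50, 500, 5_000),
}


def size_tier(task, n_examples):
    """Map an example count to the highest guideline tier it reaches."""
    minimum, good, great = SIZE_GUIDELINES[task]
    if n_examples >= great:
        return "great"
    if n_examples >= good:
        return "good"
    if n_examples >= minimum:
        return "minimum"
    return "below minimum"


print(size_tier("text_classification", 2_500))
print(size_tier("image_classification", 150))
```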

Required Columns by Trainer

Different trainers require specific columns:
| Trainer | Required Columns | Optional |
|---|---|---|
| sft / default | text (or messages) | - |
| dpo | prompt, chosen, rejected | - |
| orpo | prompt, chosen, rejected | - |
| reward | text (chosen), rejected | - |
If required columns are missing, you’ll get a clear validation error listing the missing and available columns.
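
To catch a missing column before launching a run, a pre-flight check along these lines can help. It is a sketch based only on the requirements listed above; `REQUIRED_COLUMNS` and `missing_columns` are hypothetical and do not reproduce AITraining's own validator or its error messages.

```python
# Each requirement is a tuple of acceptable alternatives for one column slot.
REQUIRED_COLUMNS = {
    "sft": [("text", "messages")],
    "dpo": [("prompt",), ("chosen",), ("rejected",)],
    "orpo": [("prompt",), ("chosen",), ("rejected",)],
    "reward": [("text",), ("rejected",)],
}


def missing_columns(trainer, available):
    """Return the requirements that no available column satisfies."""
    available = set(available)
    missing = []
    for alternatives in REQUIRED_COLUMNS[trainer]:
        if not available.intersection(alternatives):
            missing.append(" or ".join(alternatives))
    return missing


print(missing_columns("dpo", ["prompt", "chosen"]))
print(missing_columns("sft", ["messages"]))
```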

Special Formats

DPO/ORPO (Preference Data)

{
  "prompt": "Explain gravity",
  "chosen": "Gravity is a force that attracts objects...",
  "rejected": "gravity is thing that make stuff fall"
}

Token Classification

John    B-PERSON
Smith   I-PERSON
visited O
Paris   B-LOCATION

Conversation Format

Conversations expect lists of {role, content} objects:
{"messages": [
  {"role": "user", "content": "Hello"},
  {"role": "assistant", "content": "Hi there!"}
]}
Or ShareGPT format (auto-detected and converted):
{"conversations": [
  {"from": "human", "value": "Hello"},
  {"from": "assistant", "value": "Hi there!"}
]}
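
Conceptually, the ShareGPT conversion renames keys and maps speaker names onto roles. The sketch below shows one way to do it by hand; `ROLE_MAP` and `sharegpt_to_messages` are assumptions for illustration and may differ from what AITraining does internally.

```python
# Assumed mapping from ShareGPT speaker names to chat roles.
ROLE_MAP = {
    "human": "user",
    "user": "user",
    "gpt": "assistant",
    "assistant": "assistant",
    "system": "system",
}


def sharegpt_to_messages(record):
    """Convert {"conversations": [{"from", "value"}]} to {"messages": [{"role", "content"}]}."""
    return {
        "messages": [
            {"role": ROLE_MAP[turn["from"]], "content": turn["value"]}
            for turn in record["conversations"]
        ]
    }


sharegpt_record = {"conversations": [
    {"from": "human", "value": "Hello"},
    {"from": "assistant", "value": "Hi there!"},
]}
converted = sharegpt_to_messages(sharegpt_record)
print(converted)
```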

Automatic Dataset Conversion

AITraining can automatically detect and convert common dataset formats. No manual preprocessing needed.

Supported Formats

| Format | Detection | Example Columns |
|---|---|---|
| Alpaca | Auto | instruction, input, output |
| ShareGPT | Auto | conversations with from/value |
| Messages | Auto | messages with role/content |
| Q&A | Auto | question/answer, query/response |
| User/Assistant | Auto | user, assistant |
| DPO | Auto | prompt, chosen, rejected |
| Plain Text | Auto | text |
Column mapping is optional - use it to convert varied column names to the expected format.
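
Column-based detection can be pictured as a series of set checks on the column names. The sketch below is a guess at the general idea only: the check order, the returned labels, and the `detect_format` name are all assumptions, not AITraining's actual detection logic.

```python
def detect_format(columns):
    """Guess a dataset format from its column names (illustrative heuristic)."""
    cols = set(columns)
    if {"prompt", "chosen", "rejected"} <= cols:
        return "dpo"
    if "messages" in cols:
        return "messages"
    if "conversations" in cols:
        return "sharegpt"
    if {"instruction", "output"} <= cols:
        return "alpaca"
    if {"question", "answer"} <= cols or {"query", "response"} <= cols:
        return "qa"
    if {"user", "assistant"} <= cols:
        return "user_assistant"
    if "text" in cols:
        return "plain_text"
    return "unknown"


print(detect_format(["instruction", "input", "output"]))
print(detect_format(["text"]))
```

More specific formats are checked before plain text, since a dataset may carry a `text` column alongside structured ones.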

Using Auto-Conversion

aitraining llm --train \
  --model google/gemma-3-270m \
  --data-path tatsu-lab/alpaca \
  --auto-convert-dataset \
  --chat-template gemma3 \
  --trainer sft
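
For intuition about what converting an Alpaca-style row involves, here is a hand-written sketch that turns instruction/input/output into a two-message conversation. Joining the instruction and input with a blank line is an assumption; the converter built into AITraining may format the prompt differently.

```python
def alpaca_to_messages(row):
    """Turn one Alpaca-style row into a user/assistant message pair."""
    instruction = row["instruction"].strip()
    extra = (row.get("input") or "").strip()
    # Assumption: append the optional input after a blank line.
    user_content = f"{instruction}\n\n{extra}" if extra else instruction
    return {"messages": [
        {"role": "user", "content": user_content},
        {"role": "assistant", "content": row["output"].strip()},
    ]}


# Made-up rows for illustration.
with_input = alpaca_to_messages(
    {"instruction": "Translate to French", "input": "Good morning", "output": "Bonjour"}
)
without_input = alpaca_to_messages(
    {"instruction": "Name a primary color", "input": "", "output": "Red"}
)
print(with_input)
print(without_input)
```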

Chat Templates

Chat templates format your data into the model’s expected conversation structure.
| Option | Description |
|---|---|
| tokenizer | Use the model’s built-in chat template (default for SFT/DPO/ORPO) |
| chatml | Standard ChatML format |
| zephyr | Zephyr/Mistral format |
| none | No template (plain text) |
Templates are auto-selected based on your trainer, or specify manually:
--chat-template tokenizer  # Use model's template (recommended)
--chat-template chatml     # Force ChatML
--chat-template none       # Disable for plain text
The unified renderer applies templates consistently. Legacy template paths are still supported for backwards compatibility.
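
To make the idea of a template concrete, the sketch below renders a message list in the widely used ChatML layout, where each turn is wrapped in `<|im_start|>` and `<|im_end|>` markers. `render_chatml` is a hypothetical helper written for this example; it is not AITraining's renderer, and a model's built-in tokenizer template may differ in details.

```python
def render_chatml(messages, add_generation_prompt=False):
    """Render {role, content} messages as a ChatML-style string."""
    parts = [
        f"<|im_start|>{m['role']}\n{m['content']}<|im_end|>\n"
        for m in messages
    ]
    if add_generation_prompt:
        # Open an assistant turn for the model to complete at inference time.
        parts.append("<|im_start|>assistant\n")
    return "".join(parts)


messages = [
    {"role": "user", "content": "Hello"},
    {"role": "assistant", "content": "Hi there!"},
]
rendered = render_chatml(messages)
print(rendered)
```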

Conversation Extension

Merge single-turn examples into multi-turn conversations:
aitraining llm --train \
  --model google/gemma-3-270m \
  --data-path ./qa_pairs.jsonl \
  --auto-convert-dataset \
  --conversation-extension 3 \
  --trainer sft
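
The exact behavior of `--conversation-extension` is not spelled out here, so the following is only one plausible reading: consecutive single-turn Q&A pairs are grouped, up to N at a time, into one multi-turn conversation. Both the grouping rule and the `extend_conversations` helper are assumptions for illustration.

```python
def extend_conversations(pairs, turns_per_conversation):
    """Group consecutive (question, answer) pairs into multi-turn conversations."""
    merged = []
    for start in range(0, len(pairs), turns_per_conversation):
        messages = []
        for question, answer in pairs[start:start + turns_per_conversation]:
            messages.append({"role": "user", "content": question})
            messages.append({"role": "assistant", "content": answer})
        merged.append({"messages": messages})
    return merged


# Made-up single-turn examples.
qa_pairs = [
    ("What is Python?", "A programming language."),
    ("Is it compiled?", "It is usually interpreted."),
    ("Who created it?", "Guido van Rossum."),
    ("What is pip?", "Python's package installer."),
]
conversations = extend_conversations(qa_pairs, 3)
print(len(conversations), [len(c["messages"]) for c in conversations])
```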

Quick Tips

  1. Start small - Test with 100 examples before scaling up
  2. Validate early - Check your format works before collecting thousands of examples
  3. Keep it consistent - Same format throughout your dataset
  4. Document everything - Note any preprocessing or special rules
  5. Use auto-convert - Let AITraining detect and convert formats automatically
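
For the "start small" tip, a test subset can be carved out of a larger JSONL source by taking its first non-blank lines. The sketch below uses an in-memory stand-in for the file; `head_jsonl` is a hypothetical helper.

```python
import io
import json
from itertools import islice


def head_jsonl(source, n=100):
    """Return the first n non-blank lines from an iterable of JSONL lines."""
    return list(islice((line for line in source if line.strip()), n))


# In-memory stand-in for a larger JSONL file (illustrative data only).
big_file = io.StringIO(
    "".join(json.dumps({"text": f"example {i}"}) + "\n" for i in range(250))
)
subset = head_jsonl(big_file, n=100)
print(len(subset), subset[0].strip())
```

With a real file, pass an open file handle in place of the `io.StringIO` object and write `subset` to a new file for the trial run.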
