Datasets and Formats

Your model is only as good as your data. Here’s how to format it correctly.

Supported File Formats

AITraining supports multiple data sources:
| Format | How It’s Loaded | Use Case |
| --- | --- | --- |
| JSONL | pandas.read_json(lines=True) | LLM training, conversations |
| CSV | pandas.read_csv() | Classification, tabular data |
| HF Dataset ID | datasets.load_dataset() | Remote datasets from Hub |
| Local HF Dataset | load_from_disk() | Pre-processed datasets |
Parquet files are supported indirectly through HuggingFace datasets that expose Parquet format.

Common Formats

CSV (Most Common)

Simple and universal. Works for classification, regression, and basic tasks.
text,label
"This product is amazing",positive
"Terrible experience",negative
"Average quality",neutral

JSON/JSONL

Better for complex data, conversations, and nested structures.
{"messages": [
  {"role": "user", "content": "What is Python?"},
  {"role": "assistant", "content": "Python is a programming language"}
]}
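In a .jsonl file each record occupies a single line; the example above appears to be wrapped for readability. As a minimal sketch using only the standard library, records can be checked line by line before training. The helper name `parse_jsonl` is hypothetical and not part of AITraining.

```python
import json

def parse_jsonl(lines):
    """Parse an iterable of JSONL lines into a list of dicts, skipping blank lines."""
    records = []
    for line_no, line in enumerate(lines, start=1):
        line = line.strip()
        if not line:
            continue
        try:
            records.append(json.loads(line))
        except json.JSONDecodeError as exc:
            # Report the offending line so it is easy to locate in the file.
            raise ValueError(f"line {line_no}: invalid JSON ({exc.msg})") from exc
    return records

# One record per line; the two string literals below form a single line.
sample = [
    '{"messages": [{"role": "user", "content": "What is Python?"}, '
    '{"role": "assistant", "content": "Python is a programming language"}]}',
]
records = parse_jsonl(sample)
```

Passing an open file object instead of a list works the same way, since files iterate line by line.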

Folders for Images

Organize images by category:
dataset/
  cats/
    cat1.jpg
    cat2.jpg
  dogs/
    dog1.jpg
    dog2.jpg
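
With this layout, the label of each image is simply the name of the folder that contains it. A small sketch of that mapping follows; `label_from_path` is a hypothetical helper, shown only to illustrate the convention.

```python
from pathlib import PurePosixPath

def label_from_path(path):
    """Return the name of the folder that directly contains the image."""
    return PurePosixPath(path).parent.name

paths = ["dataset/cats/cat1.jpg", "dataset/dogs/dog2.jpg"]
# Pair every image path with the category inferred from its parent folder.
labeled = [(p, label_from_path(p)) for p in paths]
```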

Data Quality Basics

Balance Your Classes

Bad:
  • 1000 positive examples
  • 50 negative examples
Good:
  • 500 positive examples
  • 500 negative examples
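
One way to spot an imbalance like the "Bad" case is to count labels and compare the largest class to the smallest. The sketch below assumes labels are available as a plain list; `class_balance` is a hypothetical helper, and what ratio counts as acceptable is a judgment call for your task.

```python
from collections import Counter

def class_balance(labels):
    """Return per-class counts and the ratio of the largest to the smallest class."""
    counts = Counter(labels)
    if not counts:
        raise ValueError("no labels given")
    ratio = max(counts.values()) / min(counts.values())
    return counts, ratio

# The "Bad" example above: 1000 positive vs. 50 negative.
bad_counts, bad_ratio = class_balance(["positive"] * 1000 + ["negative"] * 50)
# The "Good" example above: 500 of each.
good_counts, good_ratio = class_balance(["positive"] * 500 + ["negative"] * 500)
```

A ratio near 1.0 indicates balanced classes; the "Bad" example yields a ratio of 20.0.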

Clean Your Data

Remove:
  • Duplicates
  • Empty values
  • Obvious errors
  • Inconsistent formatting
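
As a rough sketch of these cleaning steps for the text/label CSV layout shown earlier: the function below normalizes whitespace and label casing, then drops empty rows and duplicates. The name `clean_rows` and the specific normalization rules are assumptions; "obvious errors" still need a manual review.

```python
def clean_rows(rows):
    """Normalize rows with 'text' and 'label' keys, dropping empties and duplicates."""
    seen = set()
    cleaned = []
    for row in rows:
        # Collapse runs of whitespace and unify label casing.
        text = " ".join((row.get("text") or "").split())
        label = (row.get("label") or "").strip().lower()
        if not text or not label:
            continue  # empty value
        key = (text, label)
        if key in seen:
            continue  # duplicate after normalization
        seen.add(key)
        cleaned.append({"text": text, "label": label})
    return cleaned

rows = [
    {"text": "This product  is amazing", "label": "Positive"},
    {"text": "This product is amazing", "label": "positive "},
    {"text": "", "label": "negative"},
    {"text": "Average quality", "label": None},
]
cleaned = clean_rows(rows)
```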

Size Guidelines

| Task Type | Minimum | Good | Great |
| --- | --- | --- | --- |
| Text Classification | 100 | 1,000 | 10,000+ |
| Image Classification | 200 | 2,000 | 20,000+ |
| Language Generation | 50 | 500 | 5,000+ |

Required Columns by Trainer

Different trainers require specific columns:
| Trainer | Required Columns | Optional |
| --- | --- | --- |
| sft / default | text (or messages) | - |
| dpo | prompt, chosen, rejected | - |
| orpo | prompt, chosen, rejected | - |
| reward | text (chosen), rejected | - |
If required columns are missing, you’ll get a clear validation error listing the missing and available columns.
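
To catch missing columns before launching a run, a pre-flight check along these lines may help. It mirrors the table above but is only a sketch: the names `REQUIRED_COLUMNS`, `missing_columns`, and `validate_columns` are hypothetical, and the error text is not AITraining's actual message.

```python
# Required columns per trainer, taken from the table above.
# Each inner tuple lists interchangeable alternatives.
REQUIRED_COLUMNS = {
    "sft": [("text", "messages")],
    "dpo": [("prompt",), ("chosen",), ("rejected",)],
    "orpo": [("prompt",), ("chosen",), ("rejected",)],
    "reward": [("text",), ("rejected",)],
}

def missing_columns(trainer, available):
    """List the requirements that no available column satisfies."""
    available = set(available)
    return [
        " or ".join(alternatives)
        for alternatives in REQUIRED_COLUMNS[trainer]
        if not available.intersection(alternatives)
    ]

def validate_columns(trainer, available):
    """Raise a ValueError naming the missing and the available columns."""
    missing = missing_columns(trainer, available)
    if missing:
        raise ValueError(
            f"Missing columns for {trainer}: {missing}; available: {sorted(available)}"
        )

validate_columns("sft", ["messages"])  # passes silently
```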

Special Formats

DPO/ORPO (Preference Data)

{
  "prompt": "Explain gravity",
  "chosen": "Gravity is a force that attracts objects...",
  "rejected": "gravity is thing that make stuff fall"
}

Token Classification

John    B-PERSON
Smith   I-PERSON
visited O
Paris   B-LOCATION

Conversation Format

Conversations expect lists of {role, content} objects:
{"messages": [
  {"role": "user", "content": "Hello"},
  {"role": "assistant", "content": "Hi there!"}
]}
Or ShareGPT format (auto-detected and converted):
{"conversations": [
  {"from": "human", "value": "Hello"},
  {"from": "assistant", "value": "Hi there!"}
]}
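
The conversion between the two layouts amounts to renaming keys and mapping speaker names to roles. The sketch below shows the idea; `sharegpt_to_messages` and the contents of `ROLE_MAP` are assumptions (including `gpt`, a speaker name commonly seen in ShareGPT data), and AITraining's own converter may handle more cases.

```python
# Assumed mapping from ShareGPT speaker names to chat roles.
ROLE_MAP = {
    "human": "user",
    "user": "user",
    "gpt": "assistant",
    "assistant": "assistant",
    "system": "system",
}

def sharegpt_to_messages(record):
    """Convert {"conversations": [{from, value}]} to {"messages": [{role, content}]}."""
    messages = []
    for turn in record["conversations"]:
        role = ROLE_MAP.get(turn["from"])
        if role is None:
            raise ValueError(f"unknown speaker: {turn['from']!r}")
        messages.append({"role": role, "content": turn["value"]})
    return {"messages": messages}

converted = sharegpt_to_messages({"conversations": [
    {"from": "human", "value": "Hello"},
    {"from": "assistant", "value": "Hi there!"},
]})
```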

Tool Role Support

AITraining supports the tool role for function calling training data:
{"messages": [
  {"role": "user", "content": "What's 2+2?"},
  {"role": "assistant", "content": "Let me calculate that."},
  {"role": "tool", "content": "4"},
  {"role": "assistant", "content": "The answer is 4."}
]}
Automatic compatibility: For models that don’t support the tool role natively (like Gemma), AITraining automatically converts tool messages to user messages with a [Tool Result] prefix. Models with native tool support (Llama 3.1+, Qwen, etc.) use their native format.
Legacy format support: The older OpenAI function role (used before tool was introduced) is also supported and handled identically to tool role.
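
The fallback for models without a tool role can be pictured roughly as follows. This is an illustrative sketch, not AITraining's code: `tool_to_user` is a hypothetical name, and the exact spacing after the [Tool Result] prefix is an assumption.

```python
def tool_to_user(messages, prefix="[Tool Result]"):
    """Rewrite tool (and legacy function) messages as prefixed user messages."""
    converted = []
    for message in messages:
        if message["role"] in ("tool", "function"):
            converted.append(
                {"role": "user", "content": f"{prefix} {message['content']}"}
            )
        else:
            converted.append(dict(message))  # copy unchanged messages
    return converted

fallback = tool_to_user([
    {"role": "user", "content": "What's 2+2?"},
    {"role": "assistant", "content": "Let me calculate that."},
    {"role": "tool", "content": "4"},
    {"role": "assistant", "content": "The answer is 4."},
])
```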

Tool Calls (Function Calling)

AITraining also supports the tool_calls field for training models to make function calls:
{"messages": [
  {"role": "user", "content": "What's the weather in Paris?"},
  {
    "role": "assistant",
    "content": "Let me check.",
    "tool_calls": [{"function": {"name": "get_weather", "arguments": "{\"city\": \"Paris\"}"}}]
  },
  {"role": "tool", "content": "Sunny, 20C"},
  {"role": "assistant", "content": "It's sunny and 20C in Paris."}
]}
Smart format detection: AITraining detects if your model supports tool_calls natively:
  • Qwen, Llama 3.1+: Uses native <tool_call> format
  • Gemma, older models: Serializes tool calls as OpenAI-format JSON in content
At inference, for models that use the serialized form, parse the JSON from the assistant output to extract tool calls.
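
A possible way to do that parsing, assuming the assistant output contains free text plus one JSON object with a tool_calls list: scan for the first decodable object that has that key. `extract_tool_calls` is a hypothetical helper; real model output may need more defensive handling.

```python
import json

def extract_tool_calls(output):
    """Return the tool_calls list from the first JSON object that has one, else []."""
    decoder = json.JSONDecoder()
    pos = output.find("{")
    while pos != -1:
        try:
            # raw_decode parses one JSON value starting at pos, ignoring what follows.
            obj, _ = decoder.raw_decode(output, pos)
        except json.JSONDecodeError:
            obj = None
        if isinstance(obj, dict) and isinstance(obj.get("tool_calls"), list):
            return obj["tool_calls"]
        pos = output.find("{", pos + 1)
    return []

payload = {
    "content": "Let me check.",
    "tool_calls": [
        {"function": {"name": "get_weather", "arguments": "{\"city\": \"Paris\"}"}}
    ],
}
output = "Let me check.\n" + json.dumps(payload)
calls = extract_tool_calls(output)
# The arguments field is itself a JSON string and needs a second decode.
arguments = json.loads(calls[0]["function"]["arguments"])
```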

Tool Call Format Transformation

For models without native tool support, AITraining serializes tool calls as OpenAI-format JSON appended to the assistant content.
Input (message with tool_calls field):
{
  "role": "assistant",
  "content": "Let me search for that.",
  "tool_calls": [{"id": "call_123", "type": "function", "function": {"name": "search", "arguments": "{\"query\": \"weather\"}"}}]
}
Output (serialized in content):
Let me search for that.
{"content": "Let me search for that.", "tool_calls": [{"id": "call_123", "type": "function", "function": {"name": "search", "arguments": "{\"query\": \"weather\"}"}}]}
The serialized format preserves the full OpenAI structure with id, type, and function fields. This matches the format described in system prompt instructions for better model learning.
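
The transformation shown above can be approximated in a few lines. This sketch reproduces the layout of the example (original content, a newline, then a JSON object holding content and tool_calls); `serialize_tool_calls` is a hypothetical name, and details such as whitespace may differ from what AITraining emits.

```python
import json

def serialize_tool_calls(message):
    """Append the message's tool calls to its content as OpenAI-format JSON."""
    content = message.get("content") or ""
    tool_calls = message.get("tool_calls")
    if not tool_calls:
        return content  # nothing to serialize
    payload = json.dumps({"content": content, "tool_calls": tool_calls})
    return f"{content}\n{payload}" if content else payload

message = {
    "role": "assistant",
    "content": "Let me search for that.",
    "tool_calls": [{
        "id": "call_123",
        "type": "function",
        "function": {"name": "search", "arguments": "{\"query\": \"weather\"}"},
    }],
}
serialized = serialize_tool_calls(message)
```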

Message Alternation Handling

Some models (Gemma, Mistral) require strict user/assistant alternation. AITraining automatically fixes common issues.
Consecutive same-role messages are merged:
// Before (would fail on Gemma)
[
  {"role": "assistant", "content": "Hello!"},
  {"role": "assistant", "content": "How can I help?"}
]

// After (auto-fixed)
[
  {"role": "assistant", "content": "Hello!\nHow can I help?"}
]
Missing user before assistant gets a placeholder:
// Before (system → assistant, no user)
[
  {"role": "system", "content": "You are helpful"},
  {"role": "assistant", "content": "Hello!"}
]

// After (auto-fixed)
[
  {"role": "system", "content": "You are helpful"},
  {"role": "user", "content": "[Continued]"},
  {"role": "assistant", "content": "Hello!"}
]
These fixes only apply when the tokenizer rejects the original format. Models that accept flexible message ordering keep the original structure.
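
Both fixes are simple list transformations. The sketch below reproduces the two before/after examples; the function names are hypothetical, and unlike AITraining it applies the fixes unconditionally rather than only when the tokenizer rejects the input.

```python
def merge_consecutive(messages):
    """Merge adjacent messages that share a role, joining content with a newline."""
    merged = []
    for message in messages:
        if merged and merged[-1]["role"] == message["role"]:
            merged[-1]["content"] += "\n" + message["content"]
        else:
            merged.append(dict(message))  # copy so the input is not mutated
    return merged

def add_missing_user(messages, placeholder="[Continued]"):
    """Insert a placeholder user turn when an assistant turn has no user before it."""
    fixed = []
    for message in messages:
        previous_role = fixed[-1]["role"] if fixed else None
        if message["role"] == "assistant" and previous_role in (None, "system"):
            fixed.append({"role": "user", "content": placeholder})
        fixed.append(dict(message))
    return fixed

merged = merge_consecutive([
    {"role": "assistant", "content": "Hello!"},
    {"role": "assistant", "content": "How can I help?"},
])
padded = add_missing_user([
    {"role": "system", "content": "You are helpful"},
    {"role": "assistant", "content": "Hello!"},
])
```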

Automatic Dataset Conversion

AITraining can automatically detect and convert common dataset formats. No manual preprocessing needed.

Supported Formats

| Format | Detection | Example Columns |
| --- | --- | --- |
| Alpaca | Auto | instruction, input, output |
| ShareGPT | Auto | conversations with from/value |
| Messages | Auto | messages with role/content |
| Q&A | Auto | question/answer, query/response |
| User/Assistant | Auto | user, assistant |
| DPO | Auto | prompt, chosen, rejected |
| Plain Text | Auto | text |
Column mapping is optional. Use it when your column names differ from the expected ones.
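
Detection of this kind can be pictured as a check on column names, followed by a per-format conversion. The sketch below is illustrative only: the function names, the returned labels, the order of the checks, and the way the Alpaca input field is joined to the instruction are all assumptions rather than AITraining's actual logic.

```python
def detect_format(columns):
    """Guess the dataset format from its column names (order of checks is assumed)."""
    cols = set(columns)
    if {"prompt", "chosen", "rejected"} <= cols:
        return "dpo"
    if "messages" in cols:
        return "messages"
    if "conversations" in cols:
        return "sharegpt"
    if {"instruction", "output"} <= cols:
        return "alpaca"
    if {"question", "answer"} <= cols or {"query", "response"} <= cols:
        return "qa"
    if {"user", "assistant"} <= cols:
        return "user_assistant"
    if "text" in cols:
        return "plain_text"
    return None

def alpaca_to_messages(row):
    """Turn an Alpaca row into a single user/assistant exchange."""
    prompt = row["instruction"]
    if row.get("input"):
        prompt += "\n\n" + row["input"]  # optional context goes after the instruction
    return {"messages": [
        {"role": "user", "content": prompt},
        {"role": "assistant", "content": row["output"]},
    ]}

row = {"instruction": "Translate to French", "input": "Hello", "output": "Bonjour"}
detected = detect_format(row.keys())
converted_row = alpaca_to_messages(row)
```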

Using Auto-Conversion

aitraining llm --train \
  --model google/gemma-3-270m \
  --data-path tatsu-lab/alpaca \
  --auto-convert-dataset \
  --chat-template gemma3 \
  --trainer sft

Chat Templates

Chat templates format your data into the model’s expected conversation structure.
| Option | Description |
| --- | --- |
| tokenizer | Use the model’s built-in chat template (default for SFT/DPO/ORPO) |
| chatml | Standard ChatML format |
| zephyr | Zephyr/Mistral format |
| none | No template (plain text) |
Templates are auto-selected based on your trainer, or you can specify one manually:
--chat-template tokenizer  # Use model's template (recommended)
--chat-template chatml     # Force ChatML
--chat-template none       # Disable for plain text
The unified renderer applies templates consistently. Legacy template paths are still supported for backwards compatibility.
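
To make the effect of a template concrete, here is a minimal sketch of ChatML rendering, in which each turn is wrapped in `<|im_start|>` and `<|im_end|>` markers. `render_chatml` is a hypothetical helper; the output of AITraining's renderer may differ in details such as trailing newlines or generation prompts.

```python
def render_chatml(messages):
    """Render a list of {role, content} messages as a ChatML string."""
    return "".join(
        f"<|im_start|>{m['role']}\n{m['content']}<|im_end|>\n" for m in messages
    )

rendered = render_chatml([
    {"role": "user", "content": "Hello"},
    {"role": "assistant", "content": "Hi there!"},
])
```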

Conversation Extension

Merge single-turn examples into multi-turn conversations:
aitraining llm --train \
  --model google/gemma-3-270m \
  --data-path ./qa_pairs.jsonl \
  --auto-convert-dataset \
  --conversation-extension 3 \
  --trainer sft
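
The exact merging strategy behind --conversation-extension is not described here. One plausible reading, shown purely as an assumption, is that the value sets how many single-turn examples are concatenated into each conversation. `extend_conversations` is a hypothetical helper illustrating that reading.

```python
def extend_conversations(examples, turns=3):
    """Concatenate every `turns` consecutive single-turn examples into one conversation."""
    if turns < 1:
        raise ValueError("turns must be at least 1")
    merged = []
    for start in range(0, len(examples), turns):
        messages = []
        for example in examples[start:start + turns]:
            messages.extend(example["messages"])
        merged.append({"messages": messages})
    return merged

# Four single-turn Q&A examples.
single_turn = [
    {"messages": [
        {"role": "user", "content": f"Question {i}"},
        {"role": "assistant", "content": f"Answer {i}"},
    ]}
    for i in range(4)
]
extended = extend_conversations(single_turn, turns=3)
```

Under this reading, four single-turn examples become one three-turn conversation and one leftover single-turn conversation.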

Processed Dataset Output

After processing, your dataset will have:
| Column | Description |
| --- | --- |
| text | Formatted training data with chat template applied |
| _original_messages | Original messages column (preserved for inspection) |
| _original_* | Other original columns renamed with prefix |
Original columns are renamed to _original_* to prevent other tools from auto-detecting and incorrectly using unprocessed data.
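
The renaming scheme can be sketched for a single row as follows. `prefix_original_columns` is a hypothetical helper; it only illustrates the resulting column layout and is not AITraining's implementation.

```python
def prefix_original_columns(row, formatted_text):
    """Build a processed row: formatted text plus the originals under _original_*."""
    processed = {"text": formatted_text}
    for name, value in row.items():
        processed[f"_original_{name}"] = value
    return processed

raw = {"messages": [{"role": "user", "content": "Hello"}], "source": "demo"}
processed = prefix_original_columns(raw, "<formatted conversation>")
```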

Saving Processed Data

Control where processed data is saved with --save-processed-data:
| Option | Behavior |
| --- | --- |
| auto | Save locally; also push to Hub if source was from Hub |
| local | Save only to {project}/data_processed/ |
| hub | Push only to Hub as private dataset |
| both | Save locally and push to Hub |
| none | Don’t save processed data |

Quick Tips

  1. Start small - Test with 100 examples before scaling up
  2. Validate early - Check your format works before collecting thousands of examples
  3. Keep it consistent - Same format throughout your dataset
  4. Document everything - Note any preprocessing or special rules
  5. Use auto-convert - Let AITraining detect and convert formats automatically

Next Steps