Datasets and Formats

Your model is only as good as your data. Here’s how to format it correctly.

Supported File Formats

AITraining supports multiple data sources:

Format	How It’s Loaded	Use Case
JSONL	`pandas.read_json(lines=True)`	LLM training, conversations
CSV	`pandas.read_csv()`	Classification, tabular data
HF Dataset ID	`datasets.load_dataset()`	Remote datasets from Hub
Local HF Dataset	`load_from_disk()`	Pre-processed datasets

Parquet files are supported indirectly: Hugging Face datasets that expose a Parquet format can be loaded through the HF Dataset ID path.
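
To see what the first two loaders in the table produce, here is a small sketch that reads in-memory JSONL and CSV text with pandas. The sample strings and variable names are made up for illustration, and AITraining may pass additional arguments internally. The Hub and local-disk rows of the table correspond to `datasets.load_dataset()` and `load_from_disk()`, which are left out here because they need network access or an existing saved dataset.

```python
import io

import pandas as pd

# Illustrative sample data (not shipped with AITraining).
jsonl_text = (
    '{"messages": [{"role": "user", "content": "Hi"}, '
    '{"role": "assistant", "content": "Hello!"}]}\n'
    '{"messages": [{"role": "user", "content": "Bye"}, '
    '{"role": "assistant", "content": "Goodbye!"}]}\n'
)
csv_text = (
    "text,label\n"
    '"This product is amazing",positive\n'
    '"Terrible experience",negative\n'
)

# JSONL: each line is one JSON object and becomes one row.
df_jsonl = pd.read_json(io.StringIO(jsonl_text), lines=True)

# CSV: the header row names the columns, one example per row.
df_csv = pd.read_csv(io.StringIO(csv_text))

print(df_jsonl.columns.tolist(), len(df_jsonl))
print(df_csv.columns.tolist(), len(df_csv))
```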

Common Formats

CSV (Most Common)

Simple and universal. Works for classification, regression, and basic tasks.

text,label
"This product is amazing",positive
"Terrible experience",negative
"Average quality",neutral

JSON/JSONL

Better for complex data, conversations, and nested structures.

{"messages": [
  {"role": "user", "content": "What is Python?"},
  {"role": "assistant", "content": "Python is a programming language"}
]}
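
The example above is pretty-printed for readability. In a JSONL file each record sits on a single line, because line-based readers such as `pandas.read_json(lines=True)` expect one JSON object per line. A minimal sketch using only the standard library follows; the helper names `write_jsonl` and `read_jsonl` and the file name `train.jsonl` are illustrative, not part of AITraining.

```python
import json
import os
import tempfile


def write_jsonl(records, path):
    """Write one compact JSON object per line."""
    with open(path, "w", encoding="utf-8") as f:
        for record in records:
            f.write(json.dumps(record, ensure_ascii=False) + "\n")


def read_jsonl(path):
    """Read a JSONL file back into a list of dicts, skipping blank lines."""
    with open(path, encoding="utf-8") as f:
        return [json.loads(line) for line in f if line.strip()]


records = [
    {"messages": [
        {"role": "user", "content": "What is Python?"},
        {"role": "assistant", "content": "Python is a programming language"},
    ]},
]

path = os.path.join(tempfile.mkdtemp(), "train.jsonl")
write_jsonl(records, path)
loaded = read_jsonl(path)
```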

Folders for Images

Organize images by category:

dataset/
  cats/
    cat1.jpg
    cat2.jpg
  dogs/
    dog1.jpg
    dog2.jpg
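
Before training, it can help to confirm that every image sits in exactly one category folder. The sketch below walks a directory laid out like the tree above and derives each label from the parent folder name. The function name `list_labeled_images` and the extension list are assumptions for illustration; how AITraining itself scans the folder is not described here.

```python
import tempfile
from pathlib import Path

IMAGE_EXTENSIONS = {".jpg", ".jpeg", ".png"}  # assumed set, extend as needed


def list_labeled_images(root):
    """Return sorted (relative_path, label) pairs; label = parent folder name."""
    root = Path(root)
    pairs = []
    for class_dir in sorted(p for p in root.iterdir() if p.is_dir()):
        for image in sorted(class_dir.iterdir()):
            if image.suffix.lower() in IMAGE_EXTENSIONS:
                pairs.append((image.relative_to(root).as_posix(), class_dir.name))
    return pairs


# Build a throwaway copy of the layout above with empty placeholder files.
root = Path(tempfile.mkdtemp()) / "dataset"
for name in ("cats/cat1.jpg", "cats/cat2.jpg", "dogs/dog1.jpg", "dogs/dog2.jpg"):
    target = root / name
    target.parent.mkdir(parents=True, exist_ok=True)
    target.touch()

pairs = list_labeled_images(root)
```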

Data Quality Basics

Balance Your Classes

Bad:

1000 positive examples
50 negative examples

Good:

500 positive examples
500 negative examples
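
One way to check balance is to count labels and compare the largest class with the smallest. The sketch below also shows random downsampling as a simple fix; collecting more examples of the rare class is often the better option, since downsampling throws data away. The helpers `imbalance_ratio` and `downsample` are illustrative and not AITraining features.

```python
import random
from collections import Counter


def imbalance_ratio(labels):
    """Size of the largest class divided by the size of the smallest."""
    counts = Counter(labels)
    return max(counts.values()) / min(counts.values())


def downsample(rows, label_key="label", seed=0):
    """Randomly trim every class to the size of the smallest class."""
    rng = random.Random(seed)
    by_label = {}
    for row in rows:
        by_label.setdefault(row[label_key], []).append(row)
    target = min(len(group) for group in by_label.values())
    balanced = []
    for group in by_label.values():
        balanced.extend(rng.sample(group, target))
    return balanced


# The "Bad" case above: 1000 positive, 50 negative.
rows = [{"text": f"p{i}", "label": "positive"} for i in range(1000)]
rows += [{"text": f"n{i}", "label": "negative"} for i in range(50)]

ratio_before = imbalance_ratio(r["label"] for r in rows)  # 1000 / 50 = 20.0
balanced = downsample(rows)
ratio_after = imbalance_ratio(r["label"] for r in balanced)
```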

Clean Your Data

Remove:

Duplicates
Empty values
Obvious errors
Inconsistent formatting
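
Three of these four checks are easy to automate. The following sketch drops empty values, collapses stray whitespace, lowercases labels, and removes duplicates that differ only in case or spacing. Obvious errors still need a human look. The function `clean_rows` and its column names are assumptions for illustration.

```python
def clean_rows(rows, text_key="text", label_key="label"):
    """Drop empty values and duplicates; normalize whitespace and label case."""
    seen = set()
    cleaned = []
    for row in rows:
        text = " ".join(str(row.get(text_key) or "").split())
        label = str(row.get(label_key) or "").strip().lower()
        if not text or not label:
            continue  # empty value
        key = (text.lower(), label)
        if key in seen:
            continue  # duplicate after normalization
        seen.add(key)
        cleaned.append({text_key: text, label_key: label})
    return cleaned


rows = [
    {"text": "This product is amazing", "label": "positive"},
    {"text": "  this product is   amazing ", "label": "Positive"},
    {"text": "", "label": "negative"},
    {"text": "Terrible experience", "label": None},
    {"text": "Average quality", "label": "neutral"},
]
cleaned = clean_rows(rows)
```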

Size Guidelines

Task Type	Minimum	Good	Great
Text Classification	100	1,000	10,000+
Image Classification	200	2,000	20,000+
Language Generation	50	500	5,000+

Required Columns by Trainer

Different trainers require specific columns:

Trainer	Required Columns	Optional
`sft` / `default`	`text` (or `messages`)	-
`dpo`	`prompt`, `chosen`, `rejected`	-
`orpo`	`prompt`, `chosen`, `rejected`	-
`reward`	`text` (chosen), `rejected`	-

If required columns are missing, you’ll get a clear validation error listing the missing and available columns.
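
You can run the same kind of check yourself before starting a job. The sketch below encodes the table above as a dictionary and raises an error that names the missing and available columns. The names `REQUIRED_COLUMNS` and `check_columns` are hypothetical, and the wording of AITraining's actual validation error may differ.

```python
# Each tuple lists alternatives: any one of them satisfies the requirement.
REQUIRED_COLUMNS = {
    "sft": [("text", "messages")],
    "default": [("text", "messages")],
    "dpo": [("prompt",), ("chosen",), ("rejected",)],
    "orpo": [("prompt",), ("chosen",), ("rejected",)],
    "reward": [("text",), ("rejected",)],
}


def check_columns(trainer, available):
    """Raise ValueError naming missing and available columns."""
    available = list(available)
    missing = [
        " or ".join(alternatives)
        for alternatives in REQUIRED_COLUMNS[trainer]
        if not any(name in available for name in alternatives)
    ]
    if missing:
        raise ValueError(
            f"Trainer '{trainer}' is missing columns: {missing}. "
            f"Available columns: {available}"
        )


check_columns("sft", ["messages"])  # passes silently

try:
    check_columns("dpo", ["prompt", "chosen"])
except ValueError as exc:
    error_message = str(exc)
```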

Special Formats

DPO/ORPO (Preference Data)

{
  "prompt": "Explain gravity",
  "chosen": "Gravity is a force that attracts objects...",
  "rejected": "gravity is thing that make stuff fall"
}

Token Classification

John    B-PERSON
Smith   I-PERSON
visited O
Paris   B-LOCATION
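
If your annotations are stored as token/tag lines like the ones above, a short parser can turn them into parallel lists. This sketch assumes blank lines separate sentences (a common CoNLL-style convention) and uses `tokens` and `tags` as output keys; both choices are assumptions, so check which column names your trainer expects.

```python
def parse_token_lines(text):
    """Split 'token<whitespace>tag' lines into sentences of tokens and tags."""
    sentences = []
    tokens, tags = [], []
    for line in text.splitlines():
        if not line.strip():
            # Blank line: close the current sentence, if any.
            if tokens:
                sentences.append({"tokens": tokens, "tags": tags})
                tokens, tags = [], []
            continue
        token, tag = line.split()
        tokens.append(token)
        tags.append(tag)
    if tokens:
        sentences.append({"tokens": tokens, "tags": tags})
    return sentences


sample = """John    B-PERSON
Smith   I-PERSON
visited O
Paris   B-LOCATION
"""
sentences = parse_token_lines(sample)
```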

Conversation Format

Conversations expect lists of {role, content} objects:

{"messages": [
  {"role": "user", "content": "Hello"},
  {"role": "assistant", "content": "Hi there!"}
]}

Or ShareGPT format (auto-detected and converted):

{"conversations": [
  {"from": "human", "value": "Hello"},
  {"from": "assistant", "value": "Hi there!"}
]}
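
The conversion from ShareGPT to the messages format is mostly a renaming of keys and roles. Here is a rough sketch of that mapping; the role table (including the `gpt` alias that many ShareGPT datasets use) and the function name `sharegpt_to_messages` are assumptions, and AITraining's own converter may handle more cases.

```python
# Assumed mapping from ShareGPT speaker names to chat roles.
SHAREGPT_ROLES = {
    "human": "user",
    "user": "user",
    "gpt": "assistant",
    "assistant": "assistant",
    "system": "system",
}


def sharegpt_to_messages(record):
    """Rename from/value to role/content and map speaker names to roles."""
    return {
        "messages": [
            {"role": SHAREGPT_ROLES[turn["from"]], "content": turn["value"]}
            for turn in record["conversations"]
        ]
    }


record = {"conversations": [
    {"from": "human", "value": "Hello"},
    {"from": "assistant", "value": "Hi there!"},
]}
converted = sharegpt_to_messages(record)
```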

Tool Role Support

AITraining supports the `tool` role for function calling training data:

{"messages": [
  {"role": "user", "content": "What's 2+2?"},
  {"role": "assistant", "content": "Let me calculate that."},
  {"role": "tool", "content": "4"},
  {"role": "assistant", "content": "The answer is 4."}
]}

Automatic compatibility: For models that don’t support the `tool` role natively (like Gemma), AITraining automatically converts tool messages to user messages with a `[Tool Result]` prefix. Models with native tool support (Llama 3.1+, Qwen, etc.) use their native format.

Legacy format support: The older OpenAI `function` role (used before `tool` was introduced) is also supported and handled identically to the `tool` role.
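
To make the fallback concrete, the sketch below rewrites tool and legacy function messages as user messages with the prefix. The exact separator after the prefix (a single space here) and the helper name `tool_messages_to_user` are assumptions; AITraining applies its own conversion automatically, so you do not need to run anything like this yourself.

```python
TOOL_ROLES = {"tool", "function"}  # "function" is the legacy OpenAI role


def tool_messages_to_user(messages, prefix="[Tool Result]"):
    """Rewrite tool/function messages as user messages with a prefix."""
    converted = []
    for message in messages:
        if message["role"] in TOOL_ROLES:
            converted.append(
                {"role": "user", "content": f"{prefix} {message['content']}"}
            )
        else:
            converted.append(dict(message))
    return converted


messages = [
    {"role": "user", "content": "What's 2+2?"},
    {"role": "assistant", "content": "Let me calculate that."},
    {"role": "tool", "content": "4"},
    {"role": "assistant", "content": "The answer is 4."},
]
converted = tool_messages_to_user(messages)
```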

Tool Calls (Function Calling)

AITraining also supports the `tool_calls` field for training models to make function calls:

{"messages": [
  {"role": "user", "content": "What's the weather in Paris?"},
  {
    "role": "assistant",
    "content": "Let me check.",
    "tool_calls": [{"function": {"name": "get_weather", "arguments": "{\"city\": \"Paris\"}"}}]
  },
  {"role": "tool", "content": "Sunny, 20C"},
  {"role": "assistant", "content": "It's sunny and 20C in Paris."}
]}

Smart format detection: AITraining detects whether your model supports `tool_calls` natively:

Qwen, Llama 3.1+: Uses the native `<tool_call>` format
Gemma, older models: Serializes tool calls as OpenAI-format JSON in `content`

At inference, parse the JSON from the assistant output to extract tool calls.

Tool Call Format Transformation

For models without native tool support, AITraining serializes tool calls as OpenAI-format JSON appended to the assistant content.

Input (message with `tool_calls` field):

{
  "role": "assistant",
  "content": "Let me search for that.",
  "tool_calls": [{"id": "call_123", "type": "function", "function": {"name": "search", "arguments": "{\"query\": \"weather\"}"}}]
}

Output (serialized in content):

Let me search for that.
{"content": "Let me search for that.", "tool_calls": [{"id": "call_123", "type": "function", "function": {"name": "search", "arguments": "{\"query\": \"weather\"}"}}]}

The serialized format preserves the full OpenAI structure with `id`, `type`, and `function` fields. It matches the format described in the system prompt instructions, which helps the model learn it.
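
At inference time you need to go the other way and recover the tool calls from the generated text. The sketch below scans an assistant output for the first JSON object that contains a `tool_calls` key, based on the serialized output shown above. The function name `extract_tool_calls` is hypothetical, and real model output can be messier (truncated JSON, several objects), so treat this as a starting point.

```python
import json


def extract_tool_calls(output):
    """Return (text_before_json, tool_calls) from an assistant output string."""
    decoder = json.JSONDecoder()
    start = output.find("{")
    while start != -1:
        try:
            payload, _ = decoder.raw_decode(output, start)
        except json.JSONDecodeError:
            payload = None  # not valid JSON at this brace, keep scanning
        if isinstance(payload, dict) and "tool_calls" in payload:
            return output[:start].strip(), payload["tool_calls"]
        start = output.find("{", start + 1)
    return output.strip(), []


# Rebuild the serialized output from the example above.
call = {
    "id": "call_123",
    "type": "function",
    "function": {"name": "search", "arguments": json.dumps({"query": "weather"})},
}
output = "Let me search for that.\n" + json.dumps(
    {"content": "Let me search for that.", "tool_calls": [call]}
)

text, tool_calls = extract_tool_calls(output)
# `arguments` is itself a JSON string, so it needs a second decode.
arguments = json.loads(tool_calls[0]["function"]["arguments"])
```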

Message Alternation Handling

Some models (Gemma, Mistral) require strict user/assistant alternation. AITraining automatically fixes common issues.

Consecutive same-role messages are merged:

// Before (would fail on Gemma)
[
  {"role": "assistant", "content": "Hello!"},
  {"role": "assistant", "content": "How can I help?"}
]

// After (auto-fixed)
[
  {"role": "assistant", "content": "Hello!\nHow can I help?"}
]

Missing user before assistant gets a placeholder:

// Before (system → assistant, no user)
[
  {"role": "system", "content": "You are helpful"},
  {"role": "assistant", "content": "Hello!"}
]

// After (auto-fixed)
[
  {"role": "system", "content": "You are helpful"},
  {"role": "user", "content": "[Continued]"},
  {"role": "assistant", "content": "Hello!"}
]

These fixes only apply when the tokenizer rejects the original format. Models that accept flexible message ordering keep the original structure.
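
The two fixes can be sketched as small list transformations. The version below reproduces the before/after examples in this section, but it applies the fixes unconditionally and leaves out the tokenizer check. The function names are illustrative, and the newline separator and `[Continued]` placeholder are taken from the examples above.

```python
def merge_consecutive(messages):
    """Join neighbouring messages that share a role, separated by a newline."""
    merged = []
    for message in messages:
        if merged and merged[-1]["role"] == message["role"]:
            merged[-1]["content"] += "\n" + message["content"]
        else:
            merged.append(dict(message))
    return merged


def insert_missing_user(messages, placeholder="[Continued]"):
    """Add a placeholder user turn when an assistant turn has no user before it."""
    fixed = []
    for message in messages:
        previous = fixed[-1]["role"] if fixed else None
        if message["role"] == "assistant" and previous in (None, "system"):
            fixed.append({"role": "user", "content": placeholder})
        fixed.append(dict(message))
    return fixed


merged = merge_consecutive([
    {"role": "assistant", "content": "Hello!"},
    {"role": "assistant", "content": "How can I help?"},
])
padded = insert_missing_user([
    {"role": "system", "content": "You are helpful"},
    {"role": "assistant", "content": "Hello!"},
])
```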

Automatic Dataset Conversion

AITraining can automatically detect and convert common dataset formats. No manual preprocessing needed.

Supported Formats

Format	Detection	Example Columns
Alpaca	Auto	`instruction`, `input`, `output`
ShareGPT	Auto	`conversations` with `from`/`value`
Messages	Auto	`messages` with `role`/`content`
Q&A	Auto	`question`/`answer`, `query`/`response`
User/Assistant	Auto	`user`, `assistant`
DPO	Auto	`prompt`, `chosen`, `rejected`
Plain Text	Auto	`text`

Column mapping is optional - use it to convert varied column names to the expected format.
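
Detection of this kind usually comes down to looking at column names. The following sketch guesses a format from the columns in the table above and converts one Alpaca row to the messages format. The check order, the returned labels, and the way `input` is appended to the instruction are all assumptions; the real detector presumably also inspects nested keys such as `from`/`value`.

```python
def detect_format(columns):
    """Guess the dataset format from column names, most specific first."""
    cols = set(columns)
    if "messages" in cols:
        return "messages"
    if "conversations" in cols:
        return "sharegpt"
    if {"prompt", "chosen", "rejected"} <= cols:
        return "dpo"
    if {"instruction", "output"} <= cols:
        return "alpaca"
    if {"question", "answer"} <= cols or {"query", "response"} <= cols:
        return "qa"
    if {"user", "assistant"} <= cols:
        return "user_assistant"
    if "text" in cols:
        return "plain_text"
    return "unknown"


def alpaca_to_messages(row):
    """Fold the optional `input` field into the user turn."""
    prompt = row["instruction"]
    if row.get("input"):
        prompt += "\n\n" + row["input"]
    return {"messages": [
        {"role": "user", "content": prompt},
        {"role": "assistant", "content": row["output"]},
    ]}


row = {"instruction": "Translate to French", "input": "Hello", "output": "Bonjour"}
detected = detect_format(row.keys())
converted = alpaca_to_messages(row)
```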

Using Auto-Conversion

aitraining llm --train \
  --model google/gemma-3-270m \
  --data-path tatsu-lab/alpaca \
  --auto-convert-dataset \
  --chat-template gemma3 \
  --trainer sft

Chat Templates

Chat templates format your data into the model’s expected conversation structure.

Option	Description
`tokenizer`	Use the model’s built-in chat template (default for SFT/DPO/ORPO)
`chatml`	Standard ChatML format
`zephyr`	Zephyr/Mistral format
`none`	No template (plain text)

Templates are auto-selected based on your trainer, or you can specify one manually:

--chat-template tokenizer  # Use model's template (recommended)
--chat-template chatml     # Force ChatML
--chat-template none       # Disable for plain text

The unified renderer applies templates consistently. Legacy template paths are still supported for backwards compatibility.
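
To give a feel for what a template does, here is a minimal ChatML renderer: each message becomes `<|im_start|>role`, a newline, the content, and `<|im_end|>`. The function `render_chatml` is illustrative only. AITraining's `chatml` option may differ in details such as special tokens or trailing newlines, and the `tokenizer` option relies on the model's own template instead.

```python
def render_chatml(messages, add_generation_prompt=False):
    """Render messages as ChatML text, one <|im_start|>...<|im_end|> block each."""
    text = "".join(
        f"<|im_start|>{m['role']}\n{m['content']}<|im_end|>\n" for m in messages
    )
    if add_generation_prompt:
        # Open an assistant turn so the model continues from here.
        text += "<|im_start|>assistant\n"
    return text


messages = [
    {"role": "user", "content": "Hello"},
    {"role": "assistant", "content": "Hi there!"},
]
rendered = render_chatml(messages)
print(rendered)
```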

Conversation Extension

Merge single-turn examples into multi-turn conversations:

aitraining llm --train \
  --model google/gemma-3-270m \
  --data-path ./qa_pairs.jsonl \
  --auto-convert-dataset \
  --conversation-extension 3 \
  --trainer sft
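
Conceptually, the extension stitches several question/answer pairs into one longer conversation. The sketch below reads the value 3 as "up to three exchanges per conversation" and groups pairs in order. That reading, the grouping strategy, and the function name `extend_conversations` are all assumptions; the actual option may sample or combine examples differently.

```python
def extend_conversations(pairs, turns=3):
    """Group consecutive (question, answer) pairs into multi-turn conversations."""
    conversations = []
    for start in range(0, len(pairs), turns):
        messages = []
        for question, answer in pairs[start:start + turns]:
            messages.append({"role": "user", "content": question})
            messages.append({"role": "assistant", "content": answer})
        conversations.append({"messages": messages})
    return conversations


pairs = [
    ("What is Python?", "A programming language."),
    ("Is it typed?", "It is dynamically typed."),
    ("Who created it?", "Guido van Rossum."),
    ("What is pip?", "Python's package installer."),
]
conversations = extend_conversations(pairs, turns=3)
```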

Processed Dataset Output

After processing, your dataset will have:

Column	Description
`text`	Formatted training data with chat template applied
`_original_messages`	Original messages column (preserved for inspection)
`_original_*`	Other original columns renamed with prefix

Original columns are renamed to `_original_*` to prevent other tools from auto-detecting and incorrectly using unprocessed data.
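
The renaming can be pictured as follows: the rendered string goes into `text`, and every source column is kept under a prefixed name. This is a sketch of the resulting row shape only; `finalize_row` is a hypothetical helper and the rendered text is a stand-in value.

```python
def finalize_row(row, rendered_text):
    """Keep the rendered `text` and prefix every source column."""
    processed = {"text": rendered_text}
    for name, value in row.items():
        processed[f"_original_{name}"] = value
    return processed


row = {
    "messages": [
        {"role": "user", "content": "Hello"},
        {"role": "assistant", "content": "Hi there!"},
    ],
    "source": "demo",
}
processed = finalize_row(row, rendered_text="<rendered conversation>")
```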

Saving Processed Data

Control where processed data is saved with `--save-processed-data`:

Option	Behavior
`auto`	Save locally; also push to Hub if source was from Hub
`local`	Save only to `{project}/data_processed/`
`hub`	Push only to Hub as private dataset
`both`	Save locally and push to Hub
`none`	Don’t save processed data
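
The only option whose result depends on context is `auto`. The small sketch below spells out the table as a lookup, with `auto` resolved from whether the source dataset came from the Hub. The function `resolve_save_targets` is hypothetical and only mirrors the documented behavior.

```python
def resolve_save_targets(option, source_is_hub):
    """Return the set of destinations implied by a --save-processed-data value."""
    if option == "auto":
        # Always save locally; also push when the data came from the Hub.
        return {"local", "hub"} if source_is_hub else {"local"}
    targets = {
        "local": {"local"},
        "hub": {"hub"},
        "both": {"local", "hub"},
        "none": set(),
    }
    return set(targets[option])


auto_from_hub = resolve_save_targets("auto", source_is_hub=True)
auto_from_disk = resolve_save_targets("auto", source_is_hub=False)
```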

Quick Tips

Start small - Test with 100 examples before scaling up
Validate early - Check your format works before collecting thousands of examples
Keep it consistent - Same format throughout your dataset
Document everything - Note any preprocessing or special rules
Use auto-convert - Let AITraining detect and convert formats automatically

Next Steps

Hyperparameters

Training Tasks
