
Dataset Guide

Your dataset is the most important factor in training success. A small, high-quality dataset beats a massive, noisy one every time.

The Dataset Size Problem

This is critical: Small models + Large datasets = Overfitting
| Model Size | Recommended Dataset Size | Max Dataset Size |
|---|---|---|
| 270M - 500M | 1,000 - 5,000 | 10,000 |
| 1B - 3B | 5,000 - 20,000 | 50,000 |
| 7B - 13B | 20,000 - 100,000 | 500,000 |
| 30B+ | 100,000+ | No practical limit |

Why Does This Happen?

Think of it like this:
  • Small model = Small brain = Can only memorize so much
  • Large dataset = Lots of information
  • Result = Model just memorizes examples instead of learning patterns
Example: Training gemma-3-270m on the full Alpaca dataset (52k examples):
  • Model memorizes: “When asked about France’s capital, say Paris”
  • But doesn’t learn: “How to answer geography questions in general”

How to Fix It

Use --max-samples in the wizard:
Maximum samples (optional, for testing/debugging): 5000
Or in the CLI:
aitraining llm --train \
  --model google/gemma-3-270m \
  --data-path tatsu-lab/alpaca \
  --max-samples 5000 \
  ...

Dataset Formats

The wizard automatically detects your dataset format.

Alpaca Format (Most Common)

{
  "instruction": "Write a poem about the ocean",
  "input": "",
  "output": "The waves crash upon the shore..."
}
Columns: instruction, input (optional), output
Good for: Instruction following, Q&A, task completion

ShareGPT / Conversation Format

{
  "conversations": [
    {"from": "human", "value": "Hello! How are you?"},
    {"from": "gpt", "value": "I'm doing well, thank you!"},
    {"from": "human", "value": "Can you help me with Python?"},
    {"from": "gpt", "value": "Of course! What do you need help with?"}
  ]
}
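The `from`/`value` keys map directly onto the `role`/`content` keys of the messages format below. A minimal sketch of that mapping (the `sharegpt_to_messages` helper and the role-name table are illustrative conventions, not part of the wizard):

```python
# Sketch: convert a ShareGPT-style record to OpenAI-style messages.
# The role names ("human" -> "user", "gpt" -> "assistant") follow the
# common convention; adjust if your data uses different labels.
ROLE_MAP = {"human": "user", "gpt": "assistant", "system": "system"}

def sharegpt_to_messages(record):
    return {
        "messages": [
            {"role": ROLE_MAP[turn["from"]], "content": turn["value"]}
            for turn in record["conversations"]
        ]
    }

example = {
    "conversations": [
        {"from": "human", "value": "Hello! How are you?"},
        {"from": "gpt", "value": "I'm doing well, thank you!"},
    ]
}
print(sharegpt_to_messages(example)["messages"][0]["role"])  # user
```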
Good for: Chatbots, multi-turn conversations

Messages Format (OpenAI-style)

{
  "messages": [
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": "What is 2+2?"},
    {"role": "assistant", "content": "2+2 equals 4."}
  ]
}
Good for: API-style training, system prompts

Q&A Format

{
  "question": "What is the capital of France?",
  "answer": "The capital of France is Paris."
}
Columns: question/query/prompt + answer/response
Good for: Simple question answering

DPO Format (Preference Training)

{
  "prompt": "Explain quantum physics",
  "chosen": "Quantum physics is a branch of science that studies...",
  "rejected": "idk its like small particles or something lol"
}
Required for: DPO, ORPO trainers

Plain Text

{
  "text": "This is a document about machine learning. It covers various topics..."
}
Good for: Continued pretraining, domain adaptation

Automatic Format Detection

The wizard analyzes your dataset and suggests conversion:
🔄 Dataset Format Analysis:
✓ Detected dataset format: alpaca
  • Your dataset is in alpaca format
  • This can be converted to the standard messages format for better compatibility

Do you want to analyze and convert your dataset to the model's chat format? (y/N): y

What Conversion Does

  1. Normalizes your data to a standard format
  2. Applies the correct chat template for your model
  3. Handles special tokens properly
Example: Alpaca → Messages for Gemma
Before:
{"instruction": "Translate to French", "input": "Hello", "output": "Bonjour"}
After:
<start_of_turn>user
Translate to French

Hello<end_of_turn>
<start_of_turn>model
Bonjour<end_of_turn>
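Step 1 of the conversion (normalizing an Alpaca record into messages) can be sketched in a few lines; steps 2-3 are then handled by the model's chat template (e.g. `apply_chat_template` on a transformers tokenizer). The helper name and the blank-line join between instruction and input are assumptions, mirroring the example above:

```python
# Sketch: normalize an Alpaca record into the standard messages format.
# Instruction and input are joined with a blank line, matching the
# "Translate to French / Hello" layout above (an assumption about the
# exact join rule, not this tool's verified behavior).
def alpaca_to_messages(record):
    user = record["instruction"]
    if record.get("input"):
        user += "\n\n" + record["input"]
    return {
        "messages": [
            {"role": "user", "content": user},
            {"role": "assistant", "content": record["output"]},
        ]
    }

before = {"instruction": "Translate to French", "input": "Hello", "output": "Bonjour"}
after = alpaca_to_messages(before)
```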

Using Local Data

CSV Files

Create a CSV with your examples:
instruction,input,output
"Write a poem about cats","","Soft paws, gentle eyes..."
"Translate to Spanish","Hello","Hola"
"Summarize this","Long article text here","Brief summary"
Then in the wizard:
Dataset (number, HF ID, or command): ./my_data/training.csv
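If you want to sanity-check such a CSV before training, Python's csv module reads it straight into Alpaca-style dicts (an inspection sketch, not something the wizard requires; an inline string stands in for `./my_data/training.csv` so the snippet is self-contained):

```python
import csv
import io

# Sketch: read an Alpaca-style CSV into a list of dicts for inspection.
csv_text = '''instruction,input,output
"Write a poem about cats","","Soft paws, gentle eyes..."
"Translate to Spanish","Hello","Hola"
'''
rows = list(csv.DictReader(io.StringIO(csv_text)))
print(rows[1]["output"])  # Hola
```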

JSON/JSONL Files

Create a .jsonl file (one JSON object per line):
{"instruction": "Write a poem", "output": "..."}
{"instruction": "Translate", "input": "Hello", "output": "Hola"}

Folder Structure

Put all your files in a folder:
my_data/
  train.jsonl
  validation.jsonl  (optional)
Then:
Dataset (number, HF ID, or command): ./my_data
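The folder layout maps naturally onto named splits (file stem → split name). A minimal loader sketch showing that mapping, using a temporary directory; the `load_folder` helper is illustrative, not this tool's actual implementation:

```python
import json
import tempfile
from pathlib import Path

# Sketch: treat each .jsonl file in a folder as a named split.
def load_folder(folder):
    return {
        p.stem: [json.loads(line) for line in p.read_text().splitlines()]
        for p in Path(folder).glob("*.jsonl")
    }

with tempfile.TemporaryDirectory() as d:
    Path(d, "train.jsonl").write_text('{"instruction": "Translate", "output": "Hola"}\n')
    splits = load_folder(d)
```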

Dataset Quality Tips

500 high-quality examples beat 50,000 mediocre ones. Each example should be:
  • Accurate and correct
  • Well-formatted
  • Representative of what you want the model to do
Include varied examples:
  • Different topics
  • Different lengths
  • Different styles
  • Edge cases
If you want a customer support bot, train on customer support conversations. If you want a code assistant, train on code examples. Don’t train on general data and expect specific skills.
Remove:
  • Duplicates
  • Broken examples
  • Inconsistent formatting
  • Low-quality responses
If you have categories, try to have similar numbers of each. 1000 examples of category A + 50 examples of category B = model ignores B.
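Deduplication and category balancing are easy to script before training. A minimal sketch (the `category` field and the per-category cap are assumptions about your data, not a required schema):

```python
import json
from collections import Counter

# Sketch: drop exact duplicates, then cap each category so none dominates.
def clean(examples, max_per_category=None):
    seen, per_cat, out = set(), Counter(), []
    for ex in examples:
        key = json.dumps(ex, sort_keys=True)  # exact-duplicate fingerprint
        if key in seen:
            continue
        seen.add(key)
        cat = ex.get("category", "default")
        if max_per_category and per_cat[cat] >= max_per_category:
            continue
        per_cat[cat] += 1
        out.append(ex)
    return out

data = [{"category": "A", "output": str(i % 3)} for i in range(10)]
balanced = clean(data, max_per_category=2)
```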

For Learning/Testing

| Dataset | Size | Format | Best For |
|---|---|---|---|
| tatsu-lab/alpaca | 52k | Alpaca | General instruction following |
| databricks/databricks-dolly-15k | 15k | Alpaca | Business/professional tasks |
| OpenAssistant/oasst1 | 10k+ | Conversation | Helpful assistant behavior |

For Specific Tasks

| Dataset | Size | Format | Best For |
|---|---|---|---|
| sahil2801/CodeAlpaca-20k | 20k | Alpaca | Code generation |
| WizardLM/WizardLM_evol_instruct_70k | 70k | Alpaca | Complex reasoning |
| timdettmers/openassistant-guanaco | 9k | Conversation | Helpful chat |

For Preference Training (DPO/ORPO)

| Dataset | Size | Format | Best For |
|---|---|---|---|
| Anthropic/hh-rlhf | 170k | DPO | Helpful and harmless |
| argilla/ultrafeedback-binarized-preferences | 60k | DPO | General preferences |

Train/Validation Splits

What They Are

  • Train split: Data the model learns from
  • Validation split: Data to check if the model is learning (not memorizing)

When to Use Validation

Use a validation split if:
  • You have 1,000+ examples
  • You want to detect overfitting
  • You’re experimenting with hyperparameters
Skip validation if:
  • You have < 500 examples (every example matters)
  • You’re doing a quick test run
  • You’ll evaluate separately after training

Setting Splits in the Wizard

✓ Dataset loaded. Splits found: train, test, validation
✓ Using split: train (auto-selected from: train, test, validation)

Validation split name (optional) [validation]:
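If your dataset has only a train split, you can carve off a validation set yourself before pointing the wizard at the files. A reproducible 90/10 sketch (Hugging Face datasets' `train_test_split` method does the equivalent for `Dataset` objects):

```python
import random

# Sketch: a seeded 90/10 train/validation split over a list of examples.
def split(examples, val_fraction=0.1, seed=42):
    shuffled = examples[:]
    random.Random(seed).shuffle(shuffled)  # seed makes the split reproducible
    n_val = int(len(shuffled) * val_fraction)
    return shuffled[n_val:], shuffled[:n_val]

data = [{"id": i} for i in range(100)]
train, val = split(data)
```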

Limiting Dataset Size

For testing or to prevent overfitting:
Maximum samples (optional, for testing/debugging): 1000
This is especially useful when:
  1. First training run: Use 100-500 samples to verify everything works
  2. Small model: Limit to 1,000-5,000 for 270M-1B models
  3. Quick iteration: Test different settings with smaller data
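The same capping can be done offline if you would rather subsample once and keep a fixed file. A sketch; note it samples randomly rather than taking the first N, since many datasets are sorted or grouped and a head-slice would skew the mix:

```python
import random

# Sketch: cap a dataset at max_samples via seeded random sampling.
def cap_samples(examples, max_samples, seed=42):
    if len(examples) <= max_samples:
        return examples[:]
    return random.Random(seed).sample(examples, max_samples)

data = [{"id": i} for i in range(52_000)]
subset = cap_samples(data, 5_000)
```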

Column Mapping

If your dataset has non-standard column names, the wizard asks:
📝 Column Mapping:

For instruction tuning (SFT):
• Should contain complete conversations or instruction-response pairs

Text column name [text]: my_instruction_column
✓ text_column: my_instruction_column

DPO/ORPO Required Columns

DPO/ORPO requires three columns:
  • Prompt column: the instruction/question
  • Chosen column: the preferred response
  • Rejected column: the non-preferred response

Prompt column name [REQUIRED] [prompt]: question
Chosen response column [REQUIRED] [chosen]: good_response
Rejected response column [REQUIRED] [rejected]: bad_response
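If you would rather fix the columns once in the data than re-enter the mapping every run, a per-record key rename does it (the column names mirror the prompts above; Hugging Face datasets' `rename_columns` is the library-level equivalent):

```python
# Sketch: rename non-standard DPO columns to prompt/chosen/rejected.
RENAME = {"question": "prompt", "good_response": "chosen", "bad_response": "rejected"}

def rename_columns(record):
    return {RENAME.get(k, k): v for k, v in record.items()}

row = {"question": "Explain quantum physics",
       "good_response": "Quantum physics is a branch of science that studies...",
       "bad_response": "idk its like small particles or something lol"}
fixed = rename_columns(row)
```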

Next Steps