
Dataset Guide

Your dataset is the most important factor in training success. A small, high-quality dataset beats a massive, noisy one every time.

The Dataset Size Problem

This is critical: Small models + Large datasets = Overfitting
| Model Size | Recommended Dataset Size | Max Dataset Size |
|---|---|---|
| 270M - 500M | 1,000 - 5,000 | 10,000 |
| 1B - 3B | 5,000 - 20,000 | 50,000 |
| 7B - 13B | 20,000 - 100,000 | 500,000 |
| 30B+ | 100,000+ | No practical limit |

Why Does This Happen?

Think of it like this:
  • Small model = Small brain = Can only memorize so much
  • Large dataset = Lots of information
  • Result = Model just memorizes examples instead of learning patterns
Example: Training gemma-3-270m on the full Alpaca dataset (52k examples):
  • Model memorizes: “When asked about France’s capital, say Paris”
  • But doesn’t learn: “How to answer geography questions in general”

How to Fix It

Use --max-samples in the wizard:
Maximum samples (optional, for testing/debugging): 5000
Or in the CLI:
aitraining llm --train \
  --model google/gemma-3-270m \
  --data-path tatsu-lab/alpaca \
  --max-samples 5000 \
  ...

Dataset Formats

The wizard automatically detects your dataset format.

Alpaca Format (Most Common)

{
  "instruction": "Write a poem about the ocean",
  "input": "",
  "output": "The waves crash upon the shore..."
}
Columns: instruction, input (optional), output
Good for: Instruction following, Q&A, task completion

ShareGPT / Conversation Format

{
  "conversations": [
    {"from": "human", "value": "Hello! How are you?"},
    {"from": "gpt", "value": "I'm doing well, thank you!"},
    {"from": "human", "value": "Can you help me with Python?"},
    {"from": "gpt", "value": "Of course! What do you need help with?"}
  ]
}
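The `from`/`value` keys map directly onto the `role`/`content` keys of the messages format below. A minimal sketch of that mapping (the `sharegpt_to_messages` helper and the role-name table are illustrative conventions, not part of the wizard):

```python
# Sketch: convert a ShareGPT-style record to OpenAI-style messages.
# The role names ("human" -> "user", "gpt" -> "assistant") follow the
# common convention; adjust if your data uses different labels.
ROLE_MAP = {"human": "user", "gpt": "assistant", "system": "system"}

def sharegpt_to_messages(record):
    return {
        "messages": [
            {"role": ROLE_MAP[turn["from"]], "content": turn["value"]}
            for turn in record["conversations"]
        ]
    }

example = {
    "conversations": [
        {"from": "human", "value": "Hello! How are you?"},
        {"from": "gpt", "value": "I'm doing well, thank you!"},
    ]
}
print(sharegpt_to_messages(example)["messages"][0]["role"])  # user
```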
Good for: Chatbots, multi-turn conversations

Messages Format (OpenAI-style)

{
  "messages": [
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": "What is 2+2?"},
    {"role": "assistant", "content": "2+2 equals 4."}
  ]
}
Good for: API-style training, system prompts

Q&A Format

{
  "question": "What is the capital of France?",
  "answer": "The capital of France is Paris."
}
Columns: question/query/prompt + answer/response
Good for: Simple question answering

DPO Format (Preference Training)

{
  "prompt": "Explain quantum physics",
  "chosen": "Quantum physics is a branch of science that studies...",
  "rejected": "idk its like small particles or something lol"
}
Required for: DPO, ORPO trainers

Plain Text

{
  "text": "This is a document about machine learning. It covers various topics..."
}
Good for: Continued pretraining, domain adaptation

Automatic Format Detection

The wizard analyzes your dataset and suggests conversion:
🔄 Dataset Format Analysis:
✓ Detected dataset format: alpaca
  • Your dataset is in alpaca format
  • This can be converted to the standard messages format for better compatibility

Do you want to analyze and convert your dataset to the model's chat format? (y/N): y

What Conversion Does

  1. Normalizes your data to a standard format
  2. Applies the correct chat template for your model
  3. Handles special tokens properly
Example: Alpaca → Messages for Gemma
Before:
{"instruction": "Translate to French", "input": "Hello", "output": "Bonjour"}
After:
<start_of_turn>user
Translate to French

Hello<end_of_turn>
<start_of_turn>model
Bonjour<end_of_turn>
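Step 1 of the conversion (normalizing an Alpaca record into messages) can be sketched in a few lines; steps 2-3 are then handled by the model's chat template (e.g. `apply_chat_template` on a transformers tokenizer). The helper name and the blank-line join between instruction and input are assumptions, mirroring the example above:

```python
# Sketch: normalize an Alpaca record into the standard messages format.
# Instruction and input are joined with a blank line, matching the
# "Translate to French / Hello" layout above (an assumption about the
# exact join rule, not this tool's verified behavior).
def alpaca_to_messages(record):
    user = record["instruction"]
    if record.get("input"):
        user += "\n\n" + record["input"]
    return {
        "messages": [
            {"role": "user", "content": user},
            {"role": "assistant", "content": record["output"]},
        ]
    }

before = {"instruction": "Translate to French", "input": "Hello", "output": "Bonjour"}
after = alpaca_to_messages(before)
```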

Using Local Data

CSV Files

Create a CSV with your examples:
instruction,input,output
"Write a poem about cats","","Soft paws, gentle eyes..."
"Translate to Spanish","Hello","Hola"
"Summarize this","Long article text here","Brief summary"
Then in the wizard:
Dataset (number, HF ID, or command): ./my_data/training.csv
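If you want to sanity-check such a CSV before training, Python's csv module reads it straight into Alpaca-style dicts (an inspection sketch, not something the wizard requires; an inline string stands in for `./my_data/training.csv` so the snippet is self-contained):

```python
import csv
import io

# Sketch: read an Alpaca-style CSV into a list of dicts for inspection.
csv_text = '''instruction,input,output
"Write a poem about cats","","Soft paws, gentle eyes..."
"Translate to Spanish","Hello","Hola"
'''
rows = list(csv.DictReader(io.StringIO(csv_text)))
print(rows[1]["output"])  # Hola
```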

JSON/JSONL Files

Create a .jsonl file (one JSON object per line):
{"instruction": "Write a poem", "output": "..."}
{"instruction": "Translate", "input": "Hello", "output": "Hola"}

Folder Structure

Put all your files in a folder:
my_data/
  train.jsonl
  validation.jsonl  (optional)
Then:
Dataset (number, HF ID, or command): ./my_data
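The folder layout maps naturally onto named splits (file stem → split name). A minimal loader sketch showing that mapping, using a temporary directory; the `load_folder` helper is illustrative, not this tool's actual implementation:

```python
import json
import tempfile
from pathlib import Path

# Sketch: treat each .jsonl file in a folder as a named split.
def load_folder(folder):
    return {
        p.stem: [json.loads(line) for line in p.read_text().splitlines()]
        for p in Path(folder).glob("*.jsonl")
    }

with tempfile.TemporaryDirectory() as d:
    Path(d, "train.jsonl").write_text('{"instruction": "Translate", "output": "Hola"}\n')
    splits = load_folder(d)
```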

Dataset Quality Tips

500 high-quality examples beat 50,000 mediocre ones. Each example should be:
  • Accurate and correct
  • Well-formatted
  • Representative of what you want the model to do
Include varied examples:
  • Different topics
  • Different lengths
  • Different styles
  • Edge cases
If you want a customer support bot, train on customer support conversations. If you want a code assistant, train on code examples. Don’t train on general data and expect specific skills.
Remove:
  • Duplicates
  • Broken examples
  • Inconsistent formatting
  • Low-quality responses
If you have categories, try to have similar numbers of each. 1000 examples of category A + 50 examples of category B = model ignores B.
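Deduplication and category balancing are easy to script before training. A minimal sketch (the `category` field and the per-category cap are assumptions about your data, not a required schema):

```python
import json
from collections import Counter

# Sketch: drop exact duplicates, then cap each category so none dominates.
def clean(examples, max_per_category=None):
    seen, per_cat, out = set(), Counter(), []
    for ex in examples:
        key = json.dumps(ex, sort_keys=True)  # exact-duplicate fingerprint
        if key in seen:
            continue
        seen.add(key)
        cat = ex.get("category", "default")
        if max_per_category and per_cat[cat] >= max_per_category:
            continue
        per_cat[cat] += 1
        out.append(ex)
    return out

data = [{"category": "A", "output": str(i % 3)} for i in range(10)]
balanced = clean(data, max_per_category=2)
```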

For Learning/Testing

| Dataset | Size | Format | Best For |
|---|---|---|---|
| tatsu-lab/alpaca | 52k | Alpaca | General instruction following |
| databricks/databricks-dolly-15k | 15k | Alpaca | Business/professional tasks |
| OpenAssistant/oasst1 | 10k+ | Conversation | Helpful assistant behavior |

For Specific Tasks

| Dataset | Size | Format | Best For |
|---|---|---|---|
| sahil2801/CodeAlpaca-20k | 20k | Alpaca | Code generation |
| WizardLM/WizardLM_evol_instruct_70k | 70k | Alpaca | Complex reasoning |
| timdettmers/openassistant-guanaco | 9k | Conversation | Helpful chat |

For Preference Training (DPO/ORPO)

| Dataset | Size | Format | Best For |
|---|---|---|---|
| Anthropic/hh-rlhf | 170k | DPO | Helpful and harmless |
| argilla/ultrafeedback-binarized-preferences | 60k | DPO | General preferences |

Train/Validation Splits

What They Are

  • Train split: Data the model learns from
  • Validation split: Data to check if the model is learning (not memorizing)

When to Use Validation

Use a validation split if:
  • You have 1,000+ examples
  • You want to detect overfitting
  • You’re experimenting with hyperparameters
Skip validation if:
  • You have < 500 examples (every example matters)
  • You’re doing a quick test run
  • You’ll evaluate separately after training

Setting Splits in the Wizard

✓ Dataset loaded. Splits found: train, test, validation
✓ Using split: train (auto-selected from: train, test, validation)

Validation split name (optional) [validation]:
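If your dataset has only a train split, you can carve off a validation set yourself before pointing the wizard at the files. A reproducible 90/10 sketch (Hugging Face datasets' `train_test_split` method does the equivalent for `Dataset` objects):

```python
import random

# Sketch: a seeded 90/10 train/validation split over a list of examples.
def split(examples, val_fraction=0.1, seed=42):
    shuffled = examples[:]
    random.Random(seed).shuffle(shuffled)  # seed makes the split reproducible
    n_val = int(len(shuffled) * val_fraction)
    return shuffled[n_val:], shuffled[:n_val]

data = [{"id": i} for i in range(100)]
train, val = split(data)
```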

Limiting Dataset Size

For testing or to prevent overfitting:
Maximum samples (optional, for testing/debugging): 1000
This is especially useful when:
  1. First training run: Use 100-500 samples to verify everything works
  2. Small model: Limit to 1,000-5,000 for 270M-1B models
  3. Quick iteration: Test different settings with smaller data
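The same capping can be done offline if you would rather subsample once and keep a fixed file. A sketch; note it samples randomly rather than taking the first N, since many datasets are sorted or grouped and a head-slice would skew the mix:

```python
import random

# Sketch: cap a dataset at max_samples via seeded random sampling.
def cap_samples(examples, max_samples, seed=42):
    if len(examples) <= max_samples:
        return examples[:]
    return random.Random(seed).sample(examples, max_samples)

data = [{"id": i} for i in range(52_000)]
subset = cap_samples(data, 5_000)
```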

Column Mapping

If your dataset has non-standard column names, the wizard asks:
📝 Column Mapping:

For instruction tuning (SFT):
• Should contain complete conversations or instruction-response pairs

Text column name [text]: my_instruction_column
✓ text_column: my_instruction_column

DPO/ORPO Required Columns

DPO/ORPO requires three columns:
  • Prompt column: the instruction/question
  • Chosen column: the preferred response
  • Rejected column: the non-preferred response

Prompt column name [REQUIRED] [prompt]: question
Chosen response column [REQUIRED] [chosen]: good_response
Rejected response column [REQUIRED] [rejected]: bad_response
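If you would rather fix the columns once in the data than re-enter the mapping every run, a per-record key rename does it (the column names mirror the prompts above; Hugging Face datasets' `rename_columns` is the library-level equivalent):

```python
# Sketch: rename non-standard DPO columns to prompt/chosen/rejected.
RENAME = {"question": "prompt", "good_response": "chosen", "bad_response": "rejected"}

def rename_columns(record):
    return {RENAME.get(k, k): v for k, v in record.items()}

row = {"question": "Explain quantum physics",
       "good_response": "Quantum physics is a branch of science that studies...",
       "bad_response": "idk its like small particles or something lol"}
fixed = rename_columns(row)
```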

Next Steps