# Dataset Guide
Your dataset is the most important factor in training success. A small, high-quality dataset beats a massive, noisy one every time.

## The Dataset Size Problem

Small models can overfit badly on datasets far larger than they can absorb: instead of learning general patterns, they memorize individual examples.
### Why Does This Happen?
Think of it like this:

- Small model = small brain = can only memorize so much
- Large dataset = lots of information
- Result = the model just memorizes examples instead of learning patterns
For example, gemma-3-270m on the full Alpaca dataset (52k examples):
- Model memorizes: “When asked about France’s capital, say Paris”
- But doesn’t learn: “How to answer geography questions in general”
### How to Fix It
Use `--max-samples` in the wizard to cap how many examples are used for training.
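If you prepare data outside the wizard, the equivalent cap is straightforward with the Hugging Face `datasets` library (a sketch; the sample count is illustrative):

```python
from datasets import load_dataset

# Load the full Alpaca dataset (~52k examples).
dataset = load_dataset("tatsu-lab/alpaca", split="train")

# Shuffle, then keep only 1,000 examples -- roughly what
# --max-samples 1000 does inside the wizard.
dataset = dataset.shuffle(seed=42).select(range(1000))
print(len(dataset))  # 1000
```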
## Dataset Formats
The wizard automatically detects your dataset format.

### Alpaca Format (Most Common)
Fields: `instruction`, `input` (optional), `output`
Good for: Instruction following, Q&A, task completion
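A representative record, shown as a Python dict (a JSON record looks identical; the content is illustrative):

```python
example = {
    "instruction": "What is the capital of France?",
    "input": "",  # optional context; often left empty
    "output": "The capital of France is Paris.",
}
```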
### ShareGPT / Conversation Format
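Good for: Multi-turn conversations

The common ShareGPT convention stores turns under a `conversations` key; a representative record (content illustrative):

```python
example = {
    "conversations": [
        {"from": "human", "value": "What is the capital of France?"},
        {"from": "gpt", "value": "The capital of France is Paris."},
    ]
}
```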
### Messages Format (OpenAI-style)
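Good for: Chat data exported from OpenAI-style APIs

A representative record (the `messages` list follows the OpenAI chat convention; content illustrative):

```python
example = {
    "messages": [
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": "What is the capital of France?"},
        {"role": "assistant", "content": "The capital of France is Paris."},
    ]
}
```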
### Q&A Format
Fields: `question`/`query`/`prompt` plus `answer`/`response`
Good for: Simple question answering
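A representative record (any of the listed field-name pairs works; content illustrative):

```python
example = {
    "question": "What is the capital of France?",
    "answer": "The capital of France is Paris.",
}
```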
### DPO Format (Preference Training)
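DPO data pairs each prompt with a preferred and a rejected response; the usual field names are `prompt`, `chosen`, and `rejected` (content illustrative):

```python
example = {
    "prompt": "Explain gravity to a child.",
    "chosen": "Gravity is like an invisible hug from the Earth that keeps your feet on the ground.",
    "rejected": "Gravity is a fundamental force. Consult a physics textbook.",
}
```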
### Plain Text
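Good for: Continued pretraining on raw text

Records carry a single `text` field (content illustrative):

```python
example = {"text": "Gravity is the force by which a planet draws objects toward its center."}
```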
## Automatic Format Detection
The wizard analyzes your dataset and suggests a conversion.

### What Conversion Does
- Normalizes your data to a standard format
- Applies the correct chat template for your model
- Handles special tokens properly
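For example, applying a chat template by hand looks roughly like this (a sketch using the Hugging Face `transformers` API; the checkpoint name is illustrative):

```python
from transformers import AutoTokenizer

# Checkpoint name is illustrative -- substitute the model you are training.
tokenizer = AutoTokenizer.from_pretrained("google/gemma-3-270m-it")

messages = [
    {"role": "user", "content": "What is the capital of France?"},
    {"role": "assistant", "content": "The capital of France is Paris."},
]

# Renders the conversation with the model's own template,
# placing its special tokens automatically.
print(tokenizer.apply_chat_template(messages, tokenize=False))
```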
## Using Local Data
### CSV Files
Create a CSV with your examples:
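A minimal way to produce one (column names follow the Alpaca convention here; use whichever fields match your format):

```python
import csv

rows = [
    {"instruction": "What is the capital of France?", "output": "The capital of France is Paris."},
    {"instruction": "Name a primary color.", "output": "Red is a primary color."},
]

with open("train.csv", "w", newline="") as f:
    writer = csv.DictWriter(f, fieldnames=["instruction", "output"])
    writer.writeheader()
    writer.writerows(rows)
```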
### JSON/JSONL Files

Create a `.jsonl` file (one JSON object per line):
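For example (same illustrative rows as above):

```python
import json

rows = [
    {"instruction": "What is the capital of France?", "output": "The capital of France is Paris."},
    {"instruction": "Name a primary color.", "output": "Red is a primary color."},
]

with open("train.jsonl", "w") as f:
    for row in rows:
        f.write(json.dumps(row) + "\n")
```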
### Folder Structure
Put all your files in a folder and point the wizard at the directory.
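To sanity-check the folder before training, the Hugging Face `datasets` library can load it directly (a sketch; the folder and file pattern are illustrative):

```python
from datasets import load_dataset

# Load every JSONL file in the folder as a single training split.
dataset = load_dataset("json", data_files="my_data/*.jsonl", split="train")
print(len(dataset), dataset[0])
```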
## Dataset Quality Tips

### Quality > Quantity
500 high-quality examples beat 50,000 mediocre ones. Each example should be:
- Accurate and correct
- Well-formatted
- Representative of what you want the model to do
### Diversity matters
Include varied examples:
- Different topics
- Different lengths
- Different styles
- Edge cases
### Match your use case
If you want a customer support bot, train on customer support conversations.
If you want a code assistant, train on code examples.
Don’t train on general data and expect specific skills.
### Clean your data
Remove:
- Duplicates (see the dedup sketch after this list)
- Broken examples
- Inconsistent formatting
- Low-quality responses
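Exact duplicates are the easiest of these to automate away; a minimal sketch (file names are illustrative):

```python
import json

seen = set()
with open("data.jsonl") as src, open("data.clean.jsonl", "w") as dst:
    for line in src:
        record = json.loads(line)
        # Serialize with sorted keys so identical records always match.
        fingerprint = json.dumps(record, sort_keys=True)
        if fingerprint not in seen:
            seen.add(fingerprint)
            dst.write(json.dumps(record) + "\n")
```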
### Balance your classes
If you have categories, try to have similar numbers of each.
1000 examples of category A + 50 examples of category B = model ignores B.
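One simple fix is to downsample every class to the size of the smallest one (a sketch; the `category` field and the example records are illustrative):

```python
import random
from collections import defaultdict

random.seed(42)

examples = [
    {"category": "A", "text": "first A example"},
    {"category": "A", "text": "second A example"},
    {"category": "B", "text": "only B example"},
]

# Group examples by label, then cap each group at the smallest group's size.
by_label = defaultdict(list)
for ex in examples:
    by_label[ex["category"]].append(ex)

target = min(len(group) for group in by_label.values())
balanced = []
for group in by_label.values():
    balanced.extend(random.sample(group, target))

random.shuffle(balanced)
print(balanced)
```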
## Popular Datasets
### For Learning/Testing
| Dataset | Size | Format | Best For |
|---|---|---|---|
| tatsu-lab/alpaca | 52k | Alpaca | General instruction following |
| databricks/databricks-dolly-15k | 15k | Alpaca | Business/professional tasks |
| OpenAssistant/oasst1 | 10k+ | Conversation | Helpful assistant behavior |
### For Specific Tasks
| Dataset | Size | Format | Best For |
|---|---|---|---|
| sahil2801/CodeAlpaca-20k | 20k | Alpaca | Code generation |
| WizardLM/WizardLM_evol_instruct_70k | 70k | Alpaca | Complex reasoning |
| timdettmers/openassistant-guanaco | 9k | Conversation | Helpful chat |
### For Preference Training (DPO/ORPO)
| Dataset | Size | Format | Best For |
|---|---|---|---|
| Anthropic/hh-rlhf | 170k | DPO | Helpful and harmless |
| argilla/ultrafeedback-binarized-preferences | 60k | DPO | General preferences |
## Train/Validation Splits
### What They Are
- Train split: Data the model learns from
- Validation split: Data to check if the model is learning (not memorizing)
### When to Use Validation
Use a validation split if:

- You have 1,000+ examples
- You want to detect overfitting
- You’re experimenting with hyperparameters

Skip the validation split if:

- You have < 500 examples (every example matters)
- You’re doing a quick test run
- You’ll evaluate separately after training
### Setting Splits in the Wizard
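Whatever ratio you choose in the wizard, the equivalent operation outside it is a single call in the Hugging Face `datasets` library (a sketch; the 10% ratio is illustrative):

```python
from datasets import load_dataset

dataset = load_dataset("tatsu-lab/alpaca", split="train")

# Hold out 10% of the examples for validation.
splits = dataset.train_test_split(test_size=0.1, seed=42)
train_data, val_data = splits["train"], splits["test"]
print(len(train_data), len(val_data))
```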
## Limiting Dataset Size
For testing or to prevent overfitting:

- First training run: use 100-500 samples to verify everything works
- Small model: limit to 1,000-5,000 samples for 270M-1B models
- Quick iteration: test different settings with smaller data