# Choosing the Right Model
The model you choose dramatically affects training time, quality, and hardware requirements. This guide helps you make the right choice.

## Model Size vs Hardware
The golden rule: a model needs roughly 2x its parameter count in GB of memory just to hold its weights in fp16, and substantially more to train. In practice, a 7B model needs ~112GB of VRAM for full training, ~16GB with LoRA, or ~6GB with LoRA + int4 quantization (see the formula below).
### Quick Reference
| Your Hardware | Max Model Size | Recommended Models |
|---|---|---|
| MacBook Air M1 (8GB) | 500M - 1B | google/gemma-3-270m |
| MacBook Pro M2 (16GB) | 1B - 3B | google/gemma-2-2b, Llama-3.2-1B |
| MacBook Pro M3 Max (36-64GB) | 7B - 13B | Llama-3.1-8B, Mistral-7B |
| RTX 3060/3070 (8-12GB) | 1B - 3B | gemma-2-2b, Llama-3.2-3B |
| RTX 3090/4090 (24GB) | 7B - 13B | Llama-3.1-8B, Mistral-7B |
| A100 (40-80GB) | 30B - 70B | Llama-3.1-70B with quantization |
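For the A100 row, "with quantization" means loading the weights in 4-bit. If you're scripting outside the wizard, a 4-bit load looks roughly like this (a sketch assuming the Hugging Face transformers + bitsandbytes stack and access to the gated Llama weights):

```python
# Sketch: loading a large model in 4-bit to fit a tighter VRAM budget.
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig

bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,                      # 4-bit weights: ~0.5 bytes/param
    bnb_4bit_compute_dtype=torch.bfloat16,  # compute in bf16 for stability
)

model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-3.1-70B",
    quantization_config=bnb_config,
    device_map="auto",  # spread layers across available GPUs
)
```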
### Memory Estimation Formula
Multiply parameter count by bytes per parameter: full training in fp16 with Adam needs ~16 bytes/param (weights, gradients, and optimizer state), LoRA freezes the base weights at 2 bytes/param plus ~2GB of adapter and activation overhead, and int4 quantization cuts the frozen weights to 0.5 bytes/param. For a 7B model:
- Full training: 7B × 16 = ~112GB (needs multi-GPU)
- With LoRA: 7B × 2 + 2GB = ~16GB
- With LoRA + int4: 7B × 0.5 + 2GB = ~6GB
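The same rule of thumb in code (a sketch; the bytes-per-parameter constants are approximations, not measurements):

```python
# Sketch: quick VRAM estimate from the rules of thumb above.
def estimate_vram_gb(params_billions: float, mode: str = "lora") -> float:
    bytes_per_param = {
        "full": 16.0,       # fp16 weights + grads + Adam optimizer state
        "lora": 2.0,        # frozen fp16 base weights
        "lora_int4": 0.5,   # 4-bit quantized base weights
    }[mode]
    overhead_gb = 0.0 if mode == "full" else 2.0  # adapter + activations
    return params_billions * bytes_per_param + overhead_gb

print(estimate_vram_gb(7, "full"))       # ~112.0
print(estimate_vram_gb(7, "lora"))       # ~16.0
print(estimate_vram_gb(7, "lora_int4"))  # ~5.5
```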
## Base vs Instruction-Tuned Models
This is one of the most important decisions you’ll make.

### Base Models (Pretrained)
Examples: google/gemma-2-2b, meta-llama/Llama-3.2-1B
What they are: Trained on raw text to predict the next word. They know language but don’t know how to be helpful.
When to use:
- You have lots of training data (10k+ examples)
- You want full control over the model’s behavior
- You’re training for a specific format (not chat)
- You want to create your own instruction style
### Instruction-Tuned Models (IT/Instruct)
Examples: google/gemma-2-2b-it, meta-llama/Llama-3.2-1B-Instruct
What they are: Base models that have already been trained to follow instructions and be helpful.
When to use:
- You have limited training data (100-5k examples)
- You want to refine existing helpful behavior
- You’re building a chatbot or assistant
- You want faster results with less data
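One practical difference: instruction-tuned checkpoints ship a chat template that wraps your text in the special tokens they were trained on. A quick sketch using the Hugging Face tokenizer API (assuming that's your stack):

```python
# Sketch: an instruction-tuned model expects its own chat format,
# which the tokenizer's chat template applies for you.
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("google/gemma-2-2b-it")

messages = [{"role": "user", "content": "Summarize this ticket in one line."}]
prompt = tokenizer.apply_chat_template(
    messages, tokenize=False, add_generation_prompt=True
)
print(prompt)  # your text wrapped in Gemma's turn markers
```

Training data for an instruction-tuned model should go through the same template, so the model sees the format it already knows.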
### Decision Matrix
| Situation | Use Base | Use Instruction-Tuned |
|---|---|---|
| Less than 1k examples | | ✓ |
| 1k - 10k examples | Depends | ✓ |
| 10k+ examples | ✓ | |
| Chat/assistant use case | | ✓ |
| Custom format (not chat) | ✓ | |
| Domain-specific (medical, legal) | ✓ | ✓ (either works) |
| Code generation | ✓ | |
| Creative writing | ✓ | ✓ (either works) |
## Model Families

### Google Gemma
Versions: Gemma 2, Gemma 3

| Model | Size | Best For |
|---|---|---|
| google/gemma-3-270m | 270M | Testing, learning, CPU/Apple Silicon |
| google/gemma-2-2b | 2B | Consumer GPUs, good quality/speed balance |
| google/gemma-2-9b | 9B | High quality on good hardware |
| google/gemma-2-27b | 27B | Best Gemma quality, needs serious hardware |

Add the -it suffix for instruction-tuned versions (e.g. google/gemma-2-2b-it).
### Meta Llama
Versions: Llama 3.1, Llama 3.2

| Model | Size | Best For |
|---|---|---|
| meta-llama/Llama-3.2-1B | 1B | Mobile, edge devices |
| meta-llama/Llama-3.2-3B | 3B | Consumer hardware |
| meta-llama/Llama-3.1-8B | 8B | General purpose, excellent quality |
| meta-llama/Llama-3.1-70B | 70B | Production quality, needs cloud GPU |
### Mistral

| Model | Size | Best For |
|---|---|---|
| mistralai/Mistral-7B-v0.3 | 7B | Great quality/efficiency ratio |
| mistralai/Mixtral-8x7B | 8x7B | MoE architecture, fast inference |
### Qwen (Alibaba)

| Model | Size | Best For |
|---|---|---|
| Qwen/Qwen2.5-0.5B | 500M | Ultra-small, edge devices |
| Qwen/Qwen2.5-3B | 3B | Balanced for consumer hardware |
| Qwen/Qwen2.5-7B | 7B | Excellent multilingual, especially Chinese |
## Searching for Models
In the wizard, use these commands:

### Sorting Options
| Option | When to Use |
|---|---|
| Trending | See what’s popular right now |
| Downloads | Most proven/used models |
| Likes | Community favorites |
| Recent | Newest releases |
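The same sorts are available programmatically if you want to search outside the wizard; a sketch using the huggingface_hub package (assuming the wizard is searching the Hugging Face Hub):

```python
# Sketch: search the Hub by downloads, like the wizard's "Downloads" sort.
from huggingface_hub import list_models

for model in list_models(
    search="gemma",
    sort="downloads",  # also: "likes", "last_modified"
    direction=-1,      # descending
    limit=5,
):
    print(model.id, model.downloads)
```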
## Tips for Choosing
### Start small, scale up
Always start with a smaller model like gemma-3-270m. Get your pipeline working, verify your dataset is formatted correctly, then scale up to larger models.
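A minimal smoke test is enough to validate the pipeline before committing GPU-hours to a larger model (a sketch; assumes a recent transformers version with Gemma 3 support):

```python
# Sketch: verify load -> tokenize -> generate end-to-end with a tiny model.
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "google/gemma-3-270m"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id)

inputs = tokenizer("The capital of France is", return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=10)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```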
### Don't chase the biggest model
A well-trained 3B model often beats a poorly-trained 7B model. Focus on data quality first, then scale the model.
### Match model to data
If you only have 500 examples, a 270M-1B model is plenty. A 7B model will just memorize your data instead of learning general patterns.
### Consider inference costs
If you’re deploying the model, remember that larger models cost more to run: serving cost scales roughly with parameter count, so a 1B model is about 7x cheaper to serve than a 7B model.
### Try instruction-tuned first
Unless you have 10k+ high-quality examples, start with an instruction-tuned model. You’ll get better results faster.