Choosing the Right Model

The model you choose dramatically affects training time, quality, and hardware requirements. This guide helps you make the right choice.

Model Size vs Hardware

The golden rule: a model needs roughly 2x its parameter count in GB of memory just to hold its weights in 16-bit precision, and training adds overhead on top of that. A 7B model needs ~112GB of VRAM for full training, or ~16GB with LoRA (see the formula below).

Quick Reference

| Your Hardware | Max Model Size | Recommended Models |
|---|---|---|
| MacBook Air M1 (8GB) | 500M - 1B | google/gemma-3-270m |
| MacBook Pro M2 (16GB) | 1B - 3B | google/gemma-2-2b, Llama-3.2-1B |
| MacBook Pro M3 Max (36-64GB) | 7B - 13B | Llama-3.1-8B, Mistral-7B |
| RTX 3060/3070 (8-12GB) | 1B - 3B | gemma-2-2b, Llama-3.2-3B |
| RTX 3090/4090 (24GB) | 7B - 13B | Llama-3.1-8B, Mistral-7B |
| A100 (40-80GB) | 30B - 70B | Llama-3.1-70B with quantization |

Memory Estimation Formula

Full training:   params × 4 bytes × 4 (model + optimizer + gradients + activations)
With LoRA:       params × 2 bytes + ~2GB
With LoRA + int4: params × 0.5 bytes + ~2GB
Example: 7B model
  • Full training: 7B × 16 = ~112GB (needs multi-GPU)
  • With LoRA: 7B × 2 + 2GB = ~16GB
  • With LoRA + int4: 7B × 0.5 + 2GB = ~6GB
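
If you want this arithmetic as code, here is a minimal sketch of the rule of thumb above (the function name and constants are illustrative; real usage varies with batch size, sequence length, and optimizer):

```python
def estimate_training_memory_gb(params_billions: float, method: str = "full") -> float:
    """Rough VRAM estimate using this guide's rule of thumb."""
    bytes_per_param = {
        "full": 16,        # params x 4 bytes x 4 (model + optimizer + gradients + activations)
        "lora": 2,         # frozen fp16 weights; adapters add little
        "lora_int4": 0.5,  # 4-bit quantized base weights
    }[method]
    overhead_gb = 0 if method == "full" else 2  # ~2GB adapter/optimizer overhead
    return params_billions * bytes_per_param + overhead_gb

print(estimate_training_memory_gb(7, "full"))       # 112.0 -> needs multi-GPU
print(estimate_training_memory_gb(7, "lora"))       # 16.0
print(estimate_training_memory_gb(7, "lora_int4"))  # 5.5  (the ~6GB above)
```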

Base vs Instruction-Tuned Models

This is one of the most important decisions you’ll make.

Base Models (Pretrained)

Examples: google/gemma-2-2b, meta-llama/Llama-3.2-1B

What they are: Trained on raw text to predict the next word. They know language but don’t know how to be helpful.

When to use:
  • You have lots of training data (10k+ examples)
  • You want full control over the model’s behavior
  • You’re training for a specific format (not chat)
  • You want to create your own instruction style
Example behavior before training:
User: What is the capital of France?
Model: The question was first posed in 1789 when...

Instruction-Tuned Models (IT/Instruct)

Examples: google/gemma-2-2b-it, meta-llama/Llama-3.2-1B-Instruct

What they are: Base models that have already been trained to follow instructions and be helpful.

When to use:
  • You have limited training data (100-5k examples)
  • You want to refine existing helpful behavior
  • You’re building a chatbot or assistant
  • You want faster results with less data
Example behavior before training:
User: What is the capital of France?
Model: The capital of France is Paris.
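
In practice the difference shows up in how you prompt (and later, in how you format training data). Below is a minimal transformers sketch, reusing the Gemma IDs from the examples above: the base model receives raw text to continue, while the instruction-tuned model receives the same question wrapped in its chat template:

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

question = "What is the capital of France?"

# Base model: plain next-token completion on raw text.
base_id = "google/gemma-2-2b"
base_tok = AutoTokenizer.from_pretrained(base_id)
base_model = AutoModelForCausalLM.from_pretrained(base_id)
inputs = base_tok(question, return_tensors="pt")
out = base_model.generate(**inputs, max_new_tokens=30)
print(base_tok.decode(out[0], skip_special_tokens=True))

# Instruction-tuned model: same question, wrapped in the chat template.
it_id = "google/gemma-2-2b-it"
it_tok = AutoTokenizer.from_pretrained(it_id)
it_model = AutoModelForCausalLM.from_pretrained(it_id)
prompt = it_tok.apply_chat_template(
    [{"role": "user", "content": question}],
    tokenize=False,
    add_generation_prompt=True,
)
inputs = it_tok(prompt, return_tensors="pt")
out = it_model.generate(**inputs, max_new_tokens=30)
print(it_tok.decode(out[0], skip_special_tokens=True))
```

The base model will tend to continue the text; the -it model will answer it.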

Decision Matrix

| Situation | Use Base | Use Instruction-Tuned |
|---|---|---|
| Less than 1k examples | | ✓ |
| 1k - 10k examples | Depends | Depends |
| 10k+ examples | ✓ | |
| Chat/assistant use case | | ✓ |
| Custom format (not chat) | ✓ | |
| Domain-specific (medical, legal) | ✓ | ✓ (either works) |
| Code generation | | ✓ |
| Creative writing | ✓ | ✓ (either works) |

Model Families

Google Gemma

Versions: Gemma 2, Gemma 3
| Model | Size | Best For |
|---|---|---|
| google/gemma-3-270m | 270M | Testing, learning, CPU/Apple Silicon |
| google/gemma-2-2b | 2B | Consumer GPUs, good quality/speed balance |
| google/gemma-2-9b | 9B | High quality on good hardware |
| google/gemma-2-27b | 27B | Best Gemma quality, needs serious hardware |

Strengths: Great for smaller sizes, efficient, good multilingual support

Tip: Add the -it suffix for instruction-tuned versions (e.g., google/gemma-2-2b-it)

Meta Llama

Versions: Llama 3.1, Llama 3.2
| Model | Size | Best For |
|---|---|---|
| meta-llama/Llama-3.2-1B | 1B | Mobile, edge devices |
| meta-llama/Llama-3.2-3B | 3B | Consumer hardware |
| meta-llama/Llama-3.1-8B | 8B | General purpose, excellent quality |
| meta-llama/Llama-3.1-70B | 70B | Production quality, needs cloud GPU |

Strengths: Excellent quality, strong reasoning, great community support

Note: Requires accepting the license on HuggingFace first
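
Because Llama checkpoints are gated, authenticate once before downloading them. A minimal sketch with huggingface_hub, assuming you have already accepted the license on the model page:

```python
from huggingface_hub import login

# One-time setup: paste a HuggingFace access token when prompted.
# After this, from_pretrained() can download gated Llama weights.
login()
```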

Mistral

| Model | Size | Best For |
|---|---|---|
| mistralai/Mistral-7B-v0.3 | 7B | Great quality/efficiency ratio |
| mistralai/Mixtral-8x7B | 8x7B | MoE architecture, fast inference |

Strengths: Efficient, fast inference, good at code

Tip: Mistral often punches above its weight class

Qwen (Alibaba)

| Model | Size | Best For |
|---|---|---|
| Qwen/Qwen2.5-0.5B | 500M | Ultra-small, edge devices |
| Qwen/Qwen2.5-3B | 3B | Balanced for consumer hardware |
| Qwen/Qwen2.5-7B | 7B | Excellent multilingual, especially Chinese |
Strengths: Excellent multilingual, especially Asian languages

Searching for Models

In the wizard, use these commands:
# Search by name
/search llama

# Search by capability
/search code
/search multilingual

# Filter by size
/filter

# Sort options
/sort

Sorting Options

| Option | When to Use |
|---|---|
| Trending | See what’s popular right now |
| Downloads | Most proven/used models |
| Likes | Community favorites |
| Recent | Newest releases |
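
Outside the wizard, you can run equivalent searches against the Hub directly with huggingface_hub; a sketch:

```python
from huggingface_hub import HfApi

api = HfApi()

# Roughly what "/search llama" sorted by Downloads does:
for model in api.list_models(search="llama", sort="downloads", direction=-1, limit=5):
    print(model.id, model.downloads)
```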

Tips for Choosing

Always start with a smaller model like gemma-3-270m. Get your pipeline working, verify your dataset is formatted correctly, then scale up to larger models.
A well-trained 3B model often beats a poorly trained 7B model. Focus on data quality first, then scale the model.
If you only have 500 examples, a 270M-1B model is plenty; a 7B model will likely just memorize your data instead of learning general patterns.
If you’re deploying the model, remember: larger models cost more to run. A 1B model is roughly 7x cheaper to serve than a 7B model.
Unless you have 10k+ high-quality examples, start with an instruction-tuned model. You’ll get better results faster.

Validating Your Choice

After selecting a model, the wizard validates it exists:
✓ Model: google/gemma-3-270m
If it doesn’t exist:
❌ Model 'google/gemma3-270m' not found on HuggingFace Hub.
  Suggestions: Did you mean 'google/gemma-3-270m'?
  Check the model ID at https://huggingface.co/models

Try again with a different model? [Y/n]:
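
If you are scripting rather than using the wizard, the same existence check is straightforward with huggingface_hub (a sketch of one way to do it, not necessarily how the wizard implements it):

```python
from huggingface_hub import model_info
from huggingface_hub.utils import RepositoryNotFoundError

def hub_model_exists(model_id: str) -> bool:
    """Return True if model_id resolves on the HuggingFace Hub."""
    try:
        model_info(model_id)
        return True
    except RepositoryNotFoundError:
        return False

print(hub_model_exists("google/gemma-3-270m"))  # True
print(hub_model_exists("google/gemma3-270m"))   # False -- missing hyphen
```
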

Next Steps