Evaluation Metrics
You can’t improve what you don’t measure. Here’s how to tell if your model is actually working.
Classification Metrics
Accuracy
The simplest metric - what percentage did you get right?
Precision & Recall
Precision: Of the ones you predicted positive, how many were actually positive?
Recall: Of all the actual positives, how many did you find?
Example for spam detection:
- Precision: Of emails marked spam, how many were actually spam?
- Recall: Of all spam emails, how many did you catch?
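For concreteness, here is a minimal sketch that computes both for the spam example in plain Python (the labels are made up for illustration); scikit-learn’s precision_score and recall_score give the same numbers.

```python
# Precision and recall for a toy spam-detection example (no dependencies).
y_true = [1, 0, 1, 1, 0, 1, 0, 0]  # 1 = spam, 0 = not spam (illustrative labels)
y_pred = [1, 0, 1, 0, 0, 1, 1, 0]  # model predictions

tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)  # true positives
fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)  # false positives
fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)  # false negatives

precision = tp / (tp + fp)  # of emails marked spam, how many were spam
recall = tp / (tp + fn)     # of all spam emails, how many were caught
print(f"precision={precision:.2f} recall={recall:.2f}")  # 0.75 and 0.75 here
```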
F1 Score
Combines precision and recall into one number: the harmonic mean, F1 = 2 * precision * recall / (precision + recall).
Generation Metrics
Perplexity
How surprised the model is by the test data. Lower is better.
- Good model: Perplexity = 10-50
- Bad model: Perplexity = 100+
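As a rough sketch of how it’s computed: perplexity is the exponential of the average negative log-likelihood the model assigns to the test tokens. The log-probabilities below are made-up numbers for illustration.

```python
import math

# token_log_probs: natural-log probabilities the model assigned to each test token
# (illustrative values, not from a real model).
token_log_probs = [-2.1, -0.4, -3.3, -1.0, -0.7]

avg_nll = -sum(token_log_probs) / len(token_log_probs)  # average negative log-likelihood
perplexity = math.exp(avg_nll)                          # exponentiate to get perplexity
print(f"perplexity = {perplexity:.1f}")
```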
BLEU Score
Compares generated text to reference text. Used for translation and summarization.
- BLEU = 0: No overlap
- BLEU = 1: Perfect match
- BLEU > 0.3: Usually decent
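A minimal sentence-level example using NLTK’s sentence_bleu (the sentences are invented for illustration; for reporting results, corpus-level BLEU via a tool like sacreBLEU is more common):

```python
# Sentence-level BLEU with NLTK (pip install nltk). Scores are on the 0-1 scale used above.
from nltk.translate.bleu_score import sentence_bleu, SmoothingFunction

reference = ["the", "cat", "sat", "on", "the", "mat"]  # reference translation, tokenized
candidate = ["the", "cat", "is", "on", "the", "mat"]   # model output, tokenized

# Smoothing avoids a zero score when some higher-order n-grams have no overlap.
score = sentence_bleu([reference], candidate,
                      smoothing_function=SmoothingFunction().method1)
print(f"BLEU = {score:.2f}")
```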
Human Evaluation
Sometimes the best metric is asking humans:
- Is this response helpful?
- Does this summary capture the main points?
- Is this translation natural?
Loss Curves
Training Loss vs Validation Loss
Watch both during training.
Good pattern:
- Both decrease
- Stay close together
- Plateau eventually
Overfitting:
- Training loss keeps dropping
- Validation loss increases
- Gap widens
Underfitting:
- Both stay high
- Little improvement
- Need more capacity or data
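A quick sketch of how you might plot the two curves and flag where validation loss stops improving; the per-epoch numbers are invented to show an overfitting pattern.

```python
import matplotlib.pyplot as plt

# Per-epoch losses collected during training (illustrative numbers).
train_loss = [2.3, 1.6, 1.1, 0.8, 0.6, 0.45, 0.35, 0.28]
val_loss   = [2.4, 1.7, 1.3, 1.1, 1.0, 1.05, 1.15, 1.30]  # starts rising: overfitting

plt.plot(train_loss, label="training loss")
plt.plot(val_loss, label="validation loss")
plt.xlabel("epoch")
plt.ylabel("loss")
plt.legend()
plt.show()

# Simple programmatic check: find the epoch where validation loss bottomed out.
best_epoch = min(range(len(val_loss)), key=lambda i: val_loss[i])
print(f"validation loss stopped improving after epoch {best_epoch}")
```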
Task-Specific Metrics
Image Classification
- Top-1 Accuracy: Correct class is the top prediction
- Top-5 Accuracy: Correct class in top 5 predictions
- Confusion Matrix: See which classes get confused
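A small sketch of top-k accuracy and a confusion matrix using NumPy and scikit-learn; the logits and labels are toy values, and top-5 is shown as top-2 because the example only has three classes.

```python
import numpy as np
from sklearn.metrics import confusion_matrix

# One row of class scores per image (toy values); labels are the true class ids.
logits = np.array([[0.1, 2.0, 0.3],
                   [1.5, 0.2, 0.1],
                   [0.2, 0.3, 1.9],
                   [0.9, 1.1, 0.4]])
labels = np.array([1, 0, 2, 0])

top1 = (logits.argmax(axis=1) == labels).mean()  # top-1 accuracy

k = 2  # stands in for k=5 in this tiny example
topk_preds = np.argsort(logits, axis=1)[:, -k:]  # indices of the k highest scores per row
topk = np.mean([labels[i] in topk_preds[i] for i in range(len(labels))])

print(f"top-1 = {top1:.2f}, top-{k} = {topk:.2f}")
print(confusion_matrix(labels, logits.argmax(axis=1)))  # rows = true class, cols = predicted
```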
Object Detection
- mAP (mean Average Precision): Overall detection quality
- IoU (Intersection over Union): How well predicted boxes overlap the ground-truth boxes
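IoU is easy to compute by hand; here is a minimal sketch for axis-aligned boxes given as (x1, y1, x2, y2):

```python
def iou(box_a, box_b):
    """Intersection over Union for two axis-aligned boxes (x1, y1, x2, y2)."""
    # Intersection rectangle
    x1 = max(box_a[0], box_b[0])
    y1 = max(box_a[1], box_b[1])
    x2 = min(box_a[2], box_b[2])
    y2 = min(box_a[3], box_b[3])
    inter = max(0, x2 - x1) * max(0, y2 - y1)

    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    return inter / (area_a + area_b - inter)

print(iou((0, 0, 10, 10), (5, 5, 15, 15)))  # 25 / 175 ≈ 0.14
```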
NER/Token Classification
- Entity-level F1: Complete entities correct
- Token-level accuracy: Individual tokens correct
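A small sketch using the seqeval library, which scores complete entities rather than individual tags (assuming your predictions are already BIO-encoded):

```python
# Entity-level F1 with seqeval (pip install seqeval); tags are BIO-encoded per token.
from seqeval.metrics import f1_score, accuracy_score

y_true = [["B-PER", "I-PER", "O", "B-LOC"]]  # one sentence, toy tags
y_pred = [["B-PER", "I-PER", "O", "O"]]      # missed the LOC entity

print("entity-level F1:", f1_score(y_true, y_pred))          # only complete entities count
print("token-level accuracy:", accuracy_score(y_true, y_pred))
```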
Quick Reference
| Task | Primary Metric | Good Score |
|---|---|---|
| Binary Classification | F1 Score | > 0.8 |
| Multi-class Classification | Accuracy | > 0.9 |
| Generation | Perplexity | < 50 |
| Translation | BLEU | > 0.3 |
| Summarization | ROUGE | > 0.4 |
| Q&A | Exact Match | > 0.7 |
Enhanced Evaluation in AITraining
AITraining supports enhanced evaluation with multiple built-in and custom metrics.
Enable Enhanced Evaluation
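Treat the following as a hypothetical sketch rather than the documented interface: the import path, the TrainingConfig class, and the evaluation_metrics field are assumptions for illustration, so check the AITraining docs for the real option names.

```python
# Hypothetical sketch only: the module, class, and field names are assumptions,
# not the confirmed AITraining interface.
from aitraining import Trainer, TrainingConfig  # assumed import path

config = TrainingConfig(
    model="my-base-model",                               # placeholder model name
    evaluation_metrics=["perplexity", "bleu", "rouge"],  # assumed option; metrics from the table below
)

Trainer(config).train()  # evaluation would run with the selected metrics
```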
Available Metrics
| Metric | Description |
|---|---|
| perplexity | Model uncertainty (lower is better) |
| bleu | N-gram overlap with reference |
| rouge | Recall-Oriented Understudy for Gisting Evaluation |
| accuracy | Classification accuracy |
| f1 | F1 score for classification |
Python API
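A hedged sketch of what an evaluation call might look like; the evaluate function, its parameters, and the file paths are assumptions for illustration, not the confirmed AITraining API.

```python
# Hypothetical sketch only: evaluate(), its parameters, and the paths are illustrative assumptions.
from aitraining import evaluate  # assumed import path

results = evaluate(
    model="./checkpoints/best",   # placeholder checkpoint path
    dataset="validation.jsonl",   # placeholder validation file
    metrics=["perplexity", "bleu", "rouge"],
)
print(results)  # expected: a dict mapping metric names to scores
```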
Custom Metrics
Register custom metrics for specialized evaluation:
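A hedged sketch of registering a custom metric; register_metric is an assumed name for the registration hook, so check the actual AITraining API for the real one.

```python
# Hypothetical sketch only: register_metric is an assumed name, not the confirmed hook.
from aitraining import register_metric  # assumed import path

@register_metric("exact_match")
def exact_match(predictions, references):
    """Fraction of predictions that exactly match their reference string."""
    matches = sum(p.strip() == r.strip() for p, r in zip(predictions, references))
    return matches / len(references)
```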
Practical Tips
- Always use a validation set - Never evaluate on training data
- Consider the task - Accuracy isn’t always best
- Watch trends - Improvement matters more than absolute numbers
- Use multiple metrics - No single metric tells the whole story
Red Flags
- Training accuracy 100%, validation 60% → Overfitting
- All metrics stuck → Learning rate might be wrong
- Metrics jumping around → Batch size too small
- Perfect scores immediately → Data leak or bug
Rethinking AI Evaluation
Traditional benchmarks may not capture true intelligence. Our research explores new approaches to evaluating AI reasoning.
The Child Benchmark: A New Way to Test AGI
Why we should evaluate AI like we evaluate children’s development