
Evaluation Metrics

You can’t improve what you don’t measure. Here’s how to tell if your model is actually working.

Classification Metrics

Accuracy

The simplest metric: what percentage of predictions did you get right?
Accuracy = Correct Predictions / Total Predictions
Example: 90/100 correct = 90% accuracy.
Problem: Accuracy is misleading with imbalanced data. If 95% of emails are not spam, a model that always predicts "not spam" gets 95% accuracy while catching zero spam.
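
A minimal sketch of this pitfall in plain Python (the labels are made up):

y_true = [0] * 95 + [1] * 5   # 95 "not spam", 5 "spam"
y_pred = [0] * 100            # a model that always predicts "not spam"

correct = sum(t == p for t, p in zip(y_true, y_pred))
print(f"Accuracy: {correct / len(y_true):.2f}")  # 0.95, yet it catches zero spam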

Precision & Recall

Precision: Of the ones you predicted positive, how many were actually positive?
Recall: Of all the actual positives, how many did you find?
Example for spam detection:
  • Precision: Of emails marked spam, how many were actually spam?
  • Recall: Of all spam emails, how many did you catch?

F1 Score

Combines precision and recall into one number.
F1 = 2 × (Precision × Recall) / (Precision + Recall)
Use when you care about both false positives and false negatives equally.
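
As a rough sketch, these formulas reduce to a few lines of Python; the counts below are made up for a hypothetical spam filter:

true_positives = 40   # spam correctly flagged
false_positives = 10  # legitimate mail flagged as spam
false_negatives = 20  # spam that slipped through

precision = true_positives / (true_positives + false_positives)
recall = true_positives / (true_positives + false_negatives)
f1 = 2 * (precision * recall) / (precision + recall)
print(f"Precision: {precision:.2f}, Recall: {recall:.2f}, F1: {f1:.2f}")
# Precision: 0.80, Recall: 0.67, F1: 0.73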

Generation Metrics

Perplexity

How surprised the model is by the test data. Lower is better. As a rough guide (actual values depend heavily on the tokenizer and domain):
  • Good model: Perplexity = 10-50
  • Bad model: Perplexity = 100+
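
Perplexity is the exponential of the average per-token cross-entropy loss, so it can be sketched with made-up loss values:

import math

token_nlls = [2.1, 3.4, 1.8, 2.9, 2.5]   # hypothetical per-token negative log-likelihoods
mean_nll = sum(token_nlls) / len(token_nlls)
print(f"Perplexity: {math.exp(mean_nll):.1f}")  # exp(2.54) ≈ 12.7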

BLEU Score

Compares generated text to reference text using n-gram overlap. Used mainly for machine translation (ROUGE is the more common choice for summarization).
  • BLEU = 0: No overlap
  • BLEU = 1: Perfect match
  • BLEU > 0.3: Usually decent
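
One way to compute a sentence-level BLEU score, assuming NLTK is installed (the sentences and smoothing choice here are illustrative):

from nltk.translate.bleu_score import sentence_bleu, SmoothingFunction

reference = "the cat sat on the mat".split()
candidate = "the cat is on the mat".split()

# Smoothing avoids a hard zero when some higher-order n-grams never match.
score = sentence_bleu([reference], candidate,
                      smoothing_function=SmoothingFunction().method1)
print(f"BLEU: {score:.2f}")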

Human Evaluation

Sometimes the best metric is asking humans:
  • Is this response helpful?
  • Does this summary capture the main points?
  • Is this translation natural?

Loss Curves

Training Loss vs Validation Loss

Watch both during training; a rough programmatic check follows the patterns below.
Good pattern:
  • Both decrease
  • Stay close together
  • Plateau eventually
Overfitting:
  • Training loss keeps dropping
  • Validation loss increases
  • Gap widens
Underfitting:
  • Both stay high
  • Little improvement
  • Need more capacity or data
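
A sketch of how these patterns might be flagged programmatically; the loss values and thresholds below are arbitrary:

train_loss = [2.5, 1.8, 1.2, 0.8, 0.5, 0.3]   # per-epoch training loss (made up)
val_loss   = [2.6, 1.9, 1.4, 1.3, 1.5, 1.8]   # per-epoch validation loss (made up)

gap = val_loss[-1] - train_loss[-1]
val_rising = val_loss[-1] > min(val_loss) + 0.1   # validation has moved off its best value

if val_rising and gap > 0.5:
    print("Overfitting: validation loss rising while the gap widens")
elif min(train_loss) > 2.0:
    print("Underfitting: training loss never got low")
else:
    print("Pattern looks healthy")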

Task-Specific Metrics

Image Classification

  • Top-1 Accuracy: Correct class is the top prediction
  • Top-5 Accuracy: Correct class in top 5 predictions
  • Confusion Matrix: See which classes get confused
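
A small NumPy sketch of top-1 and top-5 accuracy over hypothetical logits:

import numpy as np

rng = np.random.default_rng(0)
logits = rng.normal(size=(4, 10))        # 4 samples, 10 classes (made up)
labels = np.array([2, 0, 7, 5])

ranked = np.argsort(-logits, axis=1)     # classes sorted best-first per sample
top1 = np.mean(ranked[:, 0] == labels)
top5 = np.mean([label in row[:5] for row, label in zip(ranked, labels)])
print(f"Top-1: {top1:.2f}  Top-5: {top5:.2f}")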

Object Detection

  • mAP (mean Average Precision): Overall detection quality
  • IoU (Intersection over Union): How well boxes overlap
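
IoU for axis-aligned boxes is a few lines of arithmetic; the (x1, y1, x2, y2) boxes below are made up:

def iou(box_a, box_b):
    # Intersection rectangle, clamped to zero area if the boxes don't overlap.
    x1, y1 = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
    x2, y2 = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])
    inter = max(0, x2 - x1) * max(0, y2 - y1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    return inter / (area_a + area_b - inter)

print(f"{iou((0, 0, 10, 10), (5, 5, 15, 15)):.2f}")  # 25 / 175 ≈ 0.14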

NER/Token Classification

  • Entity-level F1: Complete entities correct
  • Token-level accuracy: Individual tokens correct
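
Entity-level F1 can be computed with the seqeval library (assuming it is installed); the tag sequences below are made up:

from seqeval.metrics import f1_score

y_true = [["B-PER", "I-PER", "O", "B-LOC"]]
y_pred = [["B-PER", "I-PER", "O", "O"]]

# One of the two true entities is recovered exactly, with no spurious entities,
# so precision = 1.0, recall = 0.5, F1 ≈ 0.67.
print(f1_score(y_true, y_pred))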

Quick Reference

Task                       | Primary Metric | Good Score
Binary Classification      | F1 Score       | > 0.8
Multi-class Classification | Accuracy       | > 0.9
Generation                 | Perplexity     | < 50
Translation                | BLEU           | > 0.3
Summarization              | ROUGE          | > 0.4
Q&A                        | Exact Match    | > 0.7

Enhanced Evaluation in AITraining

AITraining supports enhanced evaluation with multiple built-in and custom metrics.

Enable Enhanced Evaluation

aitraining llm --train \
  --model google/gemma-3-270m \
  --data-path ./data.jsonl \
  --project-name my-model \
  --use-enhanced-eval \
  --eval-metrics "perplexity,bleu"

Available Metrics

Metric     | Description
perplexity | Model uncertainty (lower is better)
bleu       | N-gram overlap with reference
rouge      | Recall-Oriented Understudy for Gisting Evaluation
accuracy   | Classification accuracy
f1         | F1 score for classification

Python API

from autotrain.trainers.clm.params import LLMTrainingParams

params = LLMTrainingParams(
    model="google/gemma-3-270m",
    data_path="./data.jsonl",
    project_name="my-model",

    use_enhanced_eval=True,
    eval_metrics=["perplexity", "bleu"],
)

Custom Metrics

Register custom metrics for specialized evaluation:
from autotrain.metrics import register_metric

@register_metric("my_custom_metric")
def compute_custom_metric(predictions, references):
    # Your custom scoring logic
    score = ...
    return {"my_custom_metric": score}

# Then use it in training
params = LLMTrainingParams(
    # ... other params as above (model, data_path, project_name)
    use_enhanced_eval=True,
    eval_metrics=["perplexity", "my_custom_metric"],
)

Practical Tips

  1. Always use a validation set - Never evaluate on training data
  2. Consider the task - Accuracy isn't always the right metric
  3. Watch trends - Whether metrics are improving matters more than their absolute values
  4. Multiple metrics - No single metric tells the whole story

Red Flags

  • Training accuracy 100%, validation 60% → Overfitting
  • All metrics stuck → Learning rate might be wrong
  • Metrics jumping around → Batch size too small
  • Perfect scores immediately → Data leak or bug

Rethinking AI Evaluation

Traditional benchmarks may not capture true intelligence. Our research explores new approaches to evaluating AI reasoning.

The Child Benchmark: A New Way to Test AGI

Why we should evaluate AI like we evaluate children’s development

Next Steps