Benchmarking

Evaluate and compare model performance.

Quick Evaluation

Using Enhanced Evaluation

aitraining llm --train \
  --model google/gemma-3-270m \
  --data-path ./data \
  --project-name my-model \
  --use-enhanced-eval \
  --eval-metrics "perplexity,accuracy"

Evaluation Only (Without Training)

python -c "
from transformers import AutoModelForCausalLM, AutoTokenizer
import torch

model = AutoModelForCausalLM.from_pretrained('./my-model')
tokenizer = AutoTokenizer.from_pretrained('./my-model')

# Compute perplexity on the test data
# ...
"

Metrics

Available Metrics

Metric | Description | Use Case
perplexity | Language modeling quality | LLMs
accuracy | Classification accuracy | Classification
f1 | F1 score | Classification
bleu | Translation quality | Seq2Seq
rouge | Summarization quality | Seq2Seq

Custom Evaluation

Enhanced evaluation runs on your validation set during training:
aitraining llm --train \
  --model google/gemma-3-270m \
  --data-path ./data \
  --valid-split validation \
  --project-name my-model \
  --use-enhanced-eval \
  --eval-metrics "perplexity,accuracy"
Enhanced evaluation uses the validation data specified with --valid-split. To evaluate on a separate test set after training, use the LM Evaluation Harness shown below or a custom script.

Inference Speed

Throughput Test

# benchmark_speed.py
import time
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

def benchmark(model_path, num_samples=100, max_tokens=50):
    model = AutoModelForCausalLM.from_pretrained(model_path)
    tokenizer = AutoTokenizer.from_pretrained(model_path)

    if torch.cuda.is_available():
        model = model.cuda()

    prompt = "The quick brown fox"
    inputs = tokenizer(prompt, return_tensors="pt")
    if torch.cuda.is_available():
        inputs = {k: v.cuda() for k, v in inputs.items()}

    # Warmup runs
    for _ in range(5):
        model.generate(**inputs, max_new_tokens=max_tokens)

    # Timed benchmark
    start = time.time()
    for _ in range(num_samples):
        model.generate(**inputs, max_new_tokens=max_tokens)
    elapsed = time.time() - start

    tokens_per_second = (num_samples * max_tokens) / elapsed
    print(f"Throughput: {tokens_per_second:.2f} tokens/sec")
    print(f"Latency: {elapsed/num_samples*1000:.2f} ms/sample")

benchmark("./my-model")
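
The loop above generates one prompt at a time, so it mostly measures per-request latency. A hedged variant that batches several prompts (the prompt, batch size, and padding setup here are illustrative assumptions) gives a better picture of raw throughput:

# benchmark_batched.py -- sketch of batched throughput; prompts and batch size are illustrative
import time
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

def benchmark_batched(model_path, batch_size=8, max_tokens=50):
    model = AutoModelForCausalLM.from_pretrained(model_path)
    tokenizer = AutoTokenizer.from_pretrained(model_path)

    # Causal LMs often ship without a pad token; left padding keeps generation aligned
    if tokenizer.pad_token is None:
        tokenizer.pad_token = tokenizer.eos_token
    tokenizer.padding_side = "left"

    if torch.cuda.is_available():
        model = model.cuda()

    prompts = ["The quick brown fox"] * batch_size
    inputs = tokenizer(prompts, return_tensors="pt", padding=True)
    if torch.cuda.is_available():
        inputs = {k: v.cuda() for k, v in inputs.items()}

    # Warmup, then one timed batch
    model.generate(**inputs, max_new_tokens=max_tokens, pad_token_id=tokenizer.pad_token_id)
    start = time.time()
    model.generate(**inputs, max_new_tokens=max_tokens, pad_token_id=tokenizer.pad_token_id)
    elapsed = time.time() - start

    print(f"Batched throughput: {batch_size * max_tokens / elapsed:.2f} tokens/sec")

benchmark_batched("./my-model")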

Memory Usage

# benchmark_memory.py
import torch
from transformers import AutoModelForCausalLM

def measure_memory(model_path):
    # Reset the peak-memory counter so only this model's footprint is measured
    torch.cuda.reset_peak_memory_stats()

    model = AutoModelForCausalLM.from_pretrained(model_path)
    model = model.cuda()

    # max_memory_allocated() reports bytes; convert to gigabytes
    peak_memory = torch.cuda.max_memory_allocated() / 1024**3
    print(f"Peak memory: {peak_memory:.2f} GB")

measure_memory("./my-model")
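
To see how precision affects the footprint, the same measurement can be repeated with the checkpoint loaded in half precision. A sketch, assuming the model runs correctly in float16:

# benchmark_memory_fp16.py -- sketch; assumes the checkpoint works in float16
import torch
from transformers import AutoModelForCausalLM

def measure_memory_fp16(model_path):
    torch.cuda.empty_cache()
    torch.cuda.reset_peak_memory_stats()

    model = AutoModelForCausalLM.from_pretrained(model_path, torch_dtype=torch.float16)
    model = model.cuda()

    peak_memory = torch.cuda.max_memory_allocated() / 1024**3
    print(f"Peak memory (fp16): {peak_memory:.2f} GB")

measure_memory_fp16("./my-model")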

Model Comparison

Comparing Multiple Models

# compare_models.py
import json
from pathlib import Path

def compare_models(model_paths):
    results = []

    for path in model_paths:
        state_file = Path(path) / "trainer_state.json"
        if state_file.exists():
            with open(state_file) as f:
                state = json.load(f)
            results.append({
                "model": path,
                "best_metric": state.get("best_metric"),
                "epoch": state.get("epoch"),
            })

    # Sort by best_metric (typically eval_loss)
    results.sort(key=lambda x: x.get("best_metric") or float("inf"))

    print("Model Comparison:")
    print("-" * 50)
    for r in results:
        metric = r.get('best_metric')
        metric_str = f"{metric:.4f}" if metric is not None else "N/A"
        print(f"{r['model']}: best_metric={metric_str}")

compare_models([
    "./model-v1",
    "./model-v2",
    "./model-v3"
])

W&B Comparison

When logging to W&B, compare runs in the dashboard:
# Train multiple variants
aitraining llm --train --model modelA --project-name exp-a --log wandb
aitraining llm --train --model modelB --project-name exp-b --log wandb

# Compare the runs in the W&B dashboard
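
If you would rather pull the numbers into a script than read them off the dashboard, the W&B public API can list runs and their summary metrics. A sketch; the entity/project paths and the eval_loss key are assumptions about your logging setup:

# compare_wandb.py -- sketch using the W&B public API; entity/project names are placeholders
import wandb

api = wandb.Api()
for project in ["my-entity/exp-a", "my-entity/exp-b"]:
    for run in api.runs(project):
        # run.summary holds the last logged value for each metric
        print(f"{project}/{run.name}: eval_loss={run.summary.get('eval_loss')}")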

Standard Benchmarks

LM Evaluation Harness

For standard benchmarks such as HellaSwag, ARC, and MMLU, use the LM Evaluation Harness after training:
pip install lm-eval

lm_eval --model hf \
  --model_args pretrained=./my-model \
  --tasks hellaswag,arc_easy,arc_challenge \
  --batch_size 8
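
The harness also exposes a Python entry point if you want the scores as a dict rather than console output. A sketch, assuming lm-eval 0.4+ where simple_evaluate is the public API:

# run_lm_eval.py -- sketch; assumes lm-eval >= 0.4 exposing simple_evaluate
import json
import lm_eval

results = lm_eval.simple_evaluate(
    model="hf",
    model_args="pretrained=./my-model",
    tasks=["hellaswag", "arc_easy"],
    batch_size=8,
)
print(json.dumps(results["results"], indent=2, default=str))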

Common Benchmark Tasks

Task | Description
hellaswag | Commonsense reasoning
arc_easy | Science questions (easy)
arc_challenge | Science questions (hard)
mmlu | Multi-task language understanding
winogrande | Commonsense reasoning
truthfulqa | Truthfulness evaluation

Reporting

Generating a Report

# generate_report.py
import json
from datetime import datetime

def generate_report(model_path, metrics, benchmark_results):
    report = {
        "model": model_path,
        "date": datetime.now().isoformat(),
        "metrics": metrics,
        "benchmarks": benchmark_results,
    }

    with open("benchmark_report.json", 'w') as f:
        json.dump(report, f, indent=2)

    # Print a summary
    print(f"\nBenchmark Report - {model_path}")
    print("=" * 50)
    print(f"Eval Loss: {metrics.get('eval_loss', 'N/A')}")
    print(f"Perplexity: {metrics.get('perplexity', 'N/A')}")
    for name, score in benchmark_results.items():
        print(f"{name}: {score}")
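
A call tying the earlier steps together might look like the following; the metric values shown are placeholders, not real results:

# Example usage -- the numbers below are placeholders, not measured results
generate_report(
    model_path="./my-model",
    metrics={"eval_loss": 1.85, "perplexity": 6.4},
    benchmark_results={"hellaswag": 0.42, "arc_easy": 0.58},
)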

Next Steps