Benchmarking
Evaluate and compare model performance.
Quick Evaluation
Using Enhanced Evaluation
aitraining llm --train \
--model google/gemma-3-270m \
--data-path ./data \
--project-name my-model \
--use-enhanced-eval \
--eval-metrics "perplexity,accuracy"
Evaluation Only (Without Training)
python -c "
from transformers import AutoModelForCausalLM, AutoTokenizer
import torch
model = AutoModelForCausalLM.from_pretrained('./my-model')
tokenizer = AutoTokenizer.from_pretrained('./my-model')
# Compute perplexity on the test data
# ...
"
Metrics
Available Metrics
| Metric | Description | Use Case |
|---|---|---|
| perplexity | Language modeling quality | LLMs |
| accuracy | Classification accuracy | Classification |
| f1 | F1 score | Classification |
| bleu | Translation quality | Seq2Seq |
| rouge | Summarization quality | Seq2Seq |
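These metrics can also be computed outside of training with the Hugging Face evaluate library; a minimal sketch with toy predictions and references (pip install evaluate; the rouge metric additionally needs rouge_score):

# metrics_sketch.py -- toy example using the evaluate library
import evaluate

accuracy = evaluate.load("accuracy")
f1 = evaluate.load("f1")
bleu = evaluate.load("bleu")

# Toy labels for illustration only
print(accuracy.compute(predictions=[0, 1, 1, 0], references=[0, 1, 0, 0]))
print(f1.compute(predictions=[0, 1, 1, 0], references=[0, 1, 0, 0]))
# BLEU takes text predictions and a list of reference texts per prediction
print(bleu.compute(predictions=["the cat sat on the mat"],
                   references=[["the cat sat on the mat"]]))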
Custom Evaluation
Enhanced evaluation runs on your validation set during training:
aitraining llm --train \
--model google/gemma-3-270m \
--data-path ./data \
--valid-split validation \
--project-name my-model \
--use-enhanced-eval \
--eval-metrics "perplexity,accuracy"
Enhanced evaluation uses the validation data specified by --valid-split. To evaluate on a separate test set after training, use the LM Evaluation Harness shown below or a custom script.
Inference Speed
Throughput Test
# benchmark_speed.py
import time
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

def benchmark(model_path, num_samples=100, max_tokens=50):
    model = AutoModelForCausalLM.from_pretrained(model_path)
    tokenizer = AutoTokenizer.from_pretrained(model_path)
    if torch.cuda.is_available():
        model = model.cuda()

    prompt = "The quick brown fox"
    inputs = tokenizer(prompt, return_tensors="pt")
    if torch.cuda.is_available():
        inputs = {k: v.cuda() for k, v in inputs.items()}

    # Warmup
    for _ in range(5):
        model.generate(**inputs, max_new_tokens=max_tokens)

    # Benchmark
    start = time.time()
    for _ in range(num_samples):
        model.generate(**inputs, max_new_tokens=max_tokens)
    elapsed = time.time() - start

    tokens_per_second = (num_samples * max_tokens) / elapsed
    print(f"Throughput: {tokens_per_second:.2f} tokens/sec")
    print(f"Latency: {elapsed/num_samples*1000:.2f} ms/sample")

benchmark("./my-model")
Memory Usage
# benchmark_memory.py
import torch
from transformers import AutoModelForCausalLM

def measure_memory(model_path):
    # Requires a CUDA device
    torch.cuda.reset_peak_memory_stats()
    model = AutoModelForCausalLM.from_pretrained(model_path)
    model = model.cuda()
    peak_memory = torch.cuda.max_memory_allocated() / 1024**3
    print(f"Peak memory: {peak_memory:.2f} GB")

measure_memory("./my-model")
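The measurement above captures memory after the weights are loaded onto the GPU; peak memory during generation is typically higher because of activations and the KV cache. A minimal extension of the same idea (CUDA device assumed):

# inference_memory_sketch.py -- sketch: peak memory including a generation pass
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model = AutoModelForCausalLM.from_pretrained("./my-model").cuda()
tokenizer = AutoTokenizer.from_pretrained("./my-model")
inputs = {k: v.cuda() for k, v in tokenizer("The quick brown fox", return_tensors="pt").items()}

torch.cuda.reset_peak_memory_stats()
model.generate(**inputs, max_new_tokens=256)
# Peak includes weights already resident plus generation-time allocations
print(f"Peak memory during generation: {torch.cuda.max_memory_allocated() / 1024**3:.2f} GB")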
Model Comparison
Comparing Multiple Models
# compare_models.py
import json
from pathlib import Path

def compare_models(model_paths):
    results = []
    for path in model_paths:
        state_file = Path(path) / "trainer_state.json"
        if state_file.exists():
            with open(state_file) as f:
                state = json.load(f)
            results.append({
                "model": path,
                "best_metric": state.get("best_metric"),
                "epoch": state.get("epoch"),
            })

    # Sort by best_metric (usually eval_loss, so lower is better)
    results.sort(key=lambda x: x.get("best_metric") or float("inf"))

    print("Model Comparison:")
    print("-" * 50)
    for r in results:
        metric = r.get("best_metric")
        metric_str = f"{metric:.4f}" if metric is not None else "N/A"
        print(f"{r['model']}: best_metric={metric_str}")

compare_models([
    "./model-v1",
    "./model-v2",
    "./model-v3"
])
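Besides best_metric, trainer_state.json from the Hugging Face Trainer also contains a log_history list, which is useful for comparing how eval_loss evolved over training rather than just its best value; a minimal sketch:

# eval_history_sketch.py -- sketch: read eval_loss over time from trainer_state.json
import json
from pathlib import Path

state = json.loads((Path("./model-v1") / "trainer_state.json").read_text())
for entry in state.get("log_history", []):
    if "eval_loss" in entry:
        print(f"step {entry.get('step')}: eval_loss={entry['eval_loss']:.4f}")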
W&B Comparison
When logging to W&B, compare runs in the dashboard:
# Train multiple variants
aitraining llm --train --model modelA --project-name exp-a --log wandb
aitraining llm --train --model modelB --project-name exp-b --log wandb
# Compare in the W&B dashboard
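Runs can also be compared programmatically through the W&B public API; a minimal sketch, where "my-entity/exp-a" is a placeholder for your own entity and project, and the summary keys depend on what the trainer logs:

# wandb_compare_sketch.py -- sketch using the W&B public API
import wandb

api = wandb.Api()
for run in api.runs("my-entity/exp-a"):  # placeholder entity/project
    # "eval/loss" is an assumed key; inspect run.summary for the actual names
    print(run.name, run.summary.get("eval/loss"))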
Standard Benchmarks
LM Evaluation Harness
For standard benchmarks such as HellaSwag, ARC, and MMLU, use the LM Evaluation Harness after training:
pip install lm-eval
lm_eval --model hf \
--model_args pretrained=./my-model \
--tasks hellaswag,arc_easy,arc_challenge \
--batch_size 8
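The same evaluation can also be driven from Python via lm_eval.simple_evaluate; a rough sketch (argument names follow recent lm-eval 0.4.x releases and may differ in older versions):

# lm_eval_sketch.py -- sketch using the lm-eval Python API
import lm_eval

results = lm_eval.simple_evaluate(
    model="hf",
    model_args="pretrained=./my-model",
    tasks=["hellaswag", "arc_easy"],
    batch_size=8,
)
print(results["results"])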
Common Benchmark Tasks
| Task | Description |
|---|---|
| hellaswag | Commonsense reasoning |
| arc_easy | Science questions (easy) |
| arc_challenge | Science questions (hard) |
| mmlu | Multi-task language understanding |
| winogrande | Commonsense reasoning |
| truthfulqa | Truthfulness evaluation |
Reporting
Generating a Report
# generate_report.py
import json
from datetime import datetime

def generate_report(model_path, metrics, benchmark_results):
    report = {
        "model": model_path,
        "date": datetime.now().isoformat(),
        "metrics": metrics,
        "benchmarks": benchmark_results,
    }
    with open("benchmark_report.json", "w") as f:
        json.dump(report, f, indent=2)

    # Print a summary
    print(f"\nBenchmark Report - {model_path}")
    print("=" * 50)
    print(f"Eval Loss: {metrics.get('eval_loss', 'N/A')}")
    print(f"Perplexity: {metrics.get('perplexity', 'N/A')}")
    for name, score in benchmark_results.items():
        print(f"{name}: {score}")