Benchmarking
Evaluate and compare model performance.
Quick Evaluation
Using Enhanced Evaluation
aitraining llm --train \
--model google/gemma-3-270m \
--data-path ./data \
--project-name my-model \
--use-enhanced-eval \
--eval-metrics "perplexity,accuracy"
Evaluation Only (Without Training)
python -c "
from transformers import AutoModelForCausalLM, AutoTokenizer
import torch
model = AutoModelForCausalLM.from_pretrained('./my-model')
tokenizer = AutoTokenizer.from_pretrained('./my-model')
# Compute perplexity on the test data
# ...
"
Metrics
Available Metrics
| Metric | Description | Use Case |
|---|---|---|
| perplexity | Language modeling quality | LLMs |
| accuracy | Classification accuracy | Classification |
| f1 | F1 score | Classification |
| bleu | Translation quality | Seq2Seq |
| rouge | Summarization quality | Seq2Seq |
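These metrics can also be computed outside of training with the Hugging Face evaluate library; a minimal sketch with toy predictions and references (pip install evaluate; the rouge metric additionally needs rouge_score):

# metrics_sketch.py -- toy example using the evaluate library
import evaluate

accuracy = evaluate.load("accuracy")
f1 = evaluate.load("f1")
bleu = evaluate.load("bleu")

# Toy labels for illustration only
print(accuracy.compute(predictions=[0, 1, 1, 0], references=[0, 1, 0, 0]))
print(f1.compute(predictions=[0, 1, 1, 0], references=[0, 1, 0, 0]))
# BLEU takes text predictions and a list of reference texts per prediction
print(bleu.compute(predictions=["the cat sat on the mat"],
                   references=[["the cat sat on the mat"]]))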
Custom Evaluation
Enhanced evaluation runs on your validation set during training:
aitraining llm --train \
--model google/gemma-3-270m \
--data-path ./data \
--valid-split validation \
--project-name my-model \
--use-enhanced-eval \
--eval-metrics "perplexity,accuracy"
Enhanced evaluation uses the validation data specified by --valid-split. To evaluate on a separate test set after training, use the LM Evaluation Harness shown below or a custom script.
Inference Speed
Throughput Test
# benchmark_speed.py
import time
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

def benchmark(model_path, num_samples=100, max_tokens=50):
    model = AutoModelForCausalLM.from_pretrained(model_path)
    tokenizer = AutoTokenizer.from_pretrained(model_path)
    if torch.cuda.is_available():
        model = model.cuda()

    prompt = "The quick brown fox"
    inputs = tokenizer(prompt, return_tensors="pt")
    if torch.cuda.is_available():
        inputs = {k: v.cuda() for k, v in inputs.items()}

    # Warmup
    for _ in range(5):
        model.generate(**inputs, max_new_tokens=max_tokens)

    # Benchmark
    start = time.time()
    for _ in range(num_samples):
        model.generate(**inputs, max_new_tokens=max_tokens)
    elapsed = time.time() - start

    tokens_per_second = (num_samples * max_tokens) / elapsed
    print(f"Throughput: {tokens_per_second:.2f} tokens/sec")
    print(f"Latency: {elapsed/num_samples*1000:.2f} ms/sample")

benchmark("./my-model")
Memory Usage
# benchmark_memory.py
import torch
from transformers import AutoModelForCausalLM

def measure_memory(model_path):
    # Requires a CUDA device
    torch.cuda.reset_peak_memory_stats()
    model = AutoModelForCausalLM.from_pretrained(model_path)
    model = model.cuda()
    peak_memory = torch.cuda.max_memory_allocated() / 1024**3
    print(f"Peak memory: {peak_memory:.2f} GB")

measure_memory("./my-model")
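The measurement above captures memory after the weights are loaded onto the GPU; peak memory during generation is typically higher because of activations and the KV cache. A minimal extension of the same idea (CUDA device assumed):

# inference_memory_sketch.py -- sketch: peak memory including a generation pass
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model = AutoModelForCausalLM.from_pretrained("./my-model").cuda()
tokenizer = AutoTokenizer.from_pretrained("./my-model")
inputs = {k: v.cuda() for k, v in tokenizer("The quick brown fox", return_tensors="pt").items()}

torch.cuda.reset_peak_memory_stats()
model.generate(**inputs, max_new_tokens=256)
# Peak includes weights already resident plus generation-time allocations
print(f"Peak memory during generation: {torch.cuda.max_memory_allocated() / 1024**3:.2f} GB")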
Model Comparison
Comparing Multiple Models
# compare_models.py
import json
from pathlib import Path

def compare_models(model_paths):
    results = []
    for path in model_paths:
        state_file = Path(path) / "trainer_state.json"
        if state_file.exists():
            with open(state_file) as f:
                state = json.load(f)
            results.append({
                "model": path,
                "best_metric": state.get("best_metric"),
                "epoch": state.get("epoch"),
            })

    # Sort by best_metric (usually eval_loss, so lower is better)
    results.sort(key=lambda x: x.get("best_metric") or float("inf"))

    print("Model Comparison:")
    print("-" * 50)
    for r in results:
        metric = r.get("best_metric")
        metric_str = f"{metric:.4f}" if metric is not None else "N/A"
        print(f"{r['model']}: best_metric={metric_str}")

compare_models([
    "./model-v1",
    "./model-v2",
    "./model-v3"
])
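Besides best_metric, trainer_state.json from the Hugging Face Trainer also contains a log_history list, which is useful for comparing how eval_loss evolved over training rather than just its best value; a minimal sketch:

# eval_history_sketch.py -- sketch: read eval_loss over time from trainer_state.json
import json
from pathlib import Path

state = json.loads((Path("./model-v1") / "trainer_state.json").read_text())
for entry in state.get("log_history", []):
    if "eval_loss" in entry:
        print(f"step {entry.get('step')}: eval_loss={entry['eval_loss']:.4f}")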
W&B Comparison
When logging to W&B, compare runs in the dashboard:
# Train multiple variants
aitraining llm --train --model modelA --project-name exp-a --log wandb
aitraining llm --train --model modelB --project-name exp-b --log wandb
# Compare in the W&B dashboard
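Runs can also be compared programmatically through the W&B public API; a minimal sketch, where "my-entity/exp-a" is a placeholder for your own entity and project, and the summary keys depend on what the trainer logs:

# wandb_compare_sketch.py -- sketch using the W&B public API
import wandb

api = wandb.Api()
for run in api.runs("my-entity/exp-a"):  # placeholder entity/project
    # "eval/loss" is an assumed key; inspect run.summary for the actual names
    print(run.name, run.summary.get("eval/loss"))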
Standard Benchmarks
LM Evaluation Harness
For standard benchmarks such as HellaSwag, ARC, and MMLU, use the LM Evaluation Harness after training:
pip install lm-eval
lm_eval --model hf \
--model_args pretrained=./my-model \
--tasks hellaswag,arc_easy,arc_challenge \
--batch_size 8
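The same evaluation can also be driven from Python via lm_eval.simple_evaluate; a rough sketch (argument names follow recent lm-eval 0.4.x releases and may differ in older versions):

# lm_eval_sketch.py -- sketch using the lm-eval Python API
import lm_eval

results = lm_eval.simple_evaluate(
    model="hf",
    model_args="pretrained=./my-model",
    tasks=["hellaswag", "arc_easy"],
    batch_size=8,
)
print(results["results"])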
Common Benchmark Tasks
| Task | Description |
|---|---|
| hellaswag | Commonsense reasoning |
| arc_easy | Science questions (easy) |
| arc_challenge | Science questions (hard) |
| mmlu | Multi-task language understanding |
| winogrande | Commonsense reasoning |
| truthfulqa | Truthfulness evaluation |
Reporting
Generating a Report
# generate_report.py
import json
from datetime import datetime

def generate_report(model_path, metrics, benchmark_results):
    report = {
        "model": model_path,
        "date": datetime.now().isoformat(),
        "metrics": metrics,
        "benchmarks": benchmark_results,
    }
    with open("benchmark_report.json", "w") as f:
        json.dump(report, f, indent=2)

    # Print a summary
    print(f"\nBenchmark Report - {model_path}")
    print("=" * 50)
    print(f"Eval Loss: {metrics.get('eval_loss', 'N/A')}")
    print(f"Perplexity: {metrics.get('perplexity', 'N/A')}")
    for name, score in benchmark_results.items():
        print(f"{name}: {score}")