Inference Mode

Run inference with your trained model from the CLI.

LLM Inference

Basic Usage

aitraining llm --inference \
  --model ./my-trained-model \
  --inference-prompts "What is machine learning?"

Multiple Prompts

Comma-separated:

aitraining llm --inference \
  --model ./my-model \
  --inference-prompts "How are you?,What is AI?,Explain transformers"

From a file:

# prompts.txt - one prompt per line
aitraining llm --inference \
  --model ./my-model \
  --inference-prompts prompts.txt
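
A prompts file can also be generated from Python. The following is a minimal sketch, assuming only what is stated above (one prompt per line); the helper name `write_prompts_file` is hypothetical and not part of AITraining.

```python
from pathlib import Path


def write_prompts_file(prompts, path="prompts.txt"):
    """Write one prompt per line (hypothetical helper, not part of AITraining)."""
    lines = []
    for prompt in prompts:
        # Collapse whitespace runs (including newlines) so each prompt
        # stays on a single line of the file.
        cleaned = " ".join(prompt.split())
        if cleaned:
            lines.append(cleaned)
    Path(path).write_text("\n".join(lines) + "\n", encoding="utf-8")
    # Report how many prompts were actually written.
    return len(lines)
```

Because each prompt occupies its own line, a file is the safer choice when a prompt itself contains a comma.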

Generation Parameters

aitraining llm --inference \
  --model ./my-model \
  --inference-prompts "Tell me a story" \
  --inference-max-tokens 500 \
  --inference-temperature 0.7 \
  --inference-top-p 0.9 \
  --inference-top-k 50

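Parameters

Parameter	Description	Default
`--inference-prompts`	Prompts (text or path to a file)	Required
`--inference-max-tokens`	Maximum number of tokens to generate	`256`
`--inference-temperature`	Sampling temperature	`1.0`
`--inference-top-p`	Nucleus (top-p) sampling threshold	`1.0`
`--inference-top-k`	Top-k sampling	`50`
`--inference-output`	Output file path	Auto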

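The CLI and the Chat UI use different defaults: the CLI uses temperature=1.0 and top_p=1.0, which leave the model's output distribution unscaled and untruncated, while the Chat UI defaults to temperature=0.7 and top_p=0.95 for more natural conversation.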
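
To make the effect of these three parameters concrete, here is a simplified sketch of how temperature, top-k, and top-p reshape a next-token distribution. It is an illustration in plain Python, not AITraining's implementation; real libraries differ in detail (for example in when they renormalize), and the function name `filter_distribution` is hypothetical.

```python
import math


def filter_distribution(logits, temperature=1.0, top_k=50, top_p=1.0):
    """Return {token_index: probability} after temperature, top-k and top-p."""
    # Temperature rescales the logits: values below 1.0 sharpen the distribution.
    scaled = [x / temperature for x in logits]
    # Softmax, subtracting the maximum for numerical stability.
    peak = max(scaled)
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    probs = [e / total for e in exps]
    # Rank tokens from most to least likely and keep only the top_k best.
    ranked = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    ranked = ranked[:top_k]
    # Top-p keeps the smallest prefix whose cumulative mass reaches top_p.
    kept, mass = [], 0.0
    for i in ranked:
        kept.append(i)
        mass += probs[i]
        if mass >= top_p:
            break
    # Renormalize the surviving probabilities so they sum to 1.
    return {i: probs[i] / mass for i in kept}
```

With temperature=1.0 and top_p=1.0 only the top-k cut changes anything, whereas lowering the temperature or top_p concentrates probability on the most likely tokens.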

Output

Results are saved as JSON:

[
  {
    "prompt": "What is machine learning?",
    "response": "Machine learning is..."
  }
]
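
A results file with this shape can be read back for inspection or scoring. The sketch below assumes the file is a JSON list of objects with `prompt` and `response` keys, as shown above; the function name `load_results` and any file path you pass to it are hypothetical, since the output path is chosen automatically unless `--inference-output` is set.

```python
import json


def load_results(path):
    """Read a results file shaped like the JSON above into (prompt, response) pairs."""
    with open(path, encoding="utf-8") as f:
        records = json.load(f)
    # .get() tolerates records that lack one of the two keys.
    return [(r.get("prompt", ""), r.get("response", "")) for r in records]
```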

Chat Interface

For interactive testing, use the chat interface:

aitraining chat

Then open http://localhost:7860/inference in your browser. The Chat UI lets you interactively load and test any local or Hub model.

Using Hub Models

Test a Hugging Face model directly:

aitraining llm --inference \
  --model meta-llama/Llama-3.2-1B \
  --inference-prompts "Hello!"

API Inference

The AITraining API provides a batch inference endpoint:

Batch Inference Request

import requests

response = requests.post("http://localhost:7860/api/batch_inference", json={
    "model_path": "./my-model",
    "prompts": ["Hello!", "What is AI?"],
    "max_new_tokens": 100,
    "temperature": 0.7,
    "top_p": 0.95,
    "top_k": 50,
    "do_sample": True
})

results = response.json()

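API Parameters

Parameter	Description	Default
`model_path`	Path to the model	Required
`prompts`	List of prompts	Required
`max_new_tokens`	Maximum number of tokens to generate	`100`
`temperature`	Sampling temperature	`0.7`
`top_p`	Nucleus (top-p) sampling threshold	`0.95`
`top_k`	Top-k sampling	`50`
`do_sample`	Whether to use sampling	`True`
`device`	Device to use (cuda/cpu)	Auto

API defaults differ from the CLI: the API defaults to max_new_tokens=100 (not 256), temperature=0.7 (not 1.0), and top_p=0.95 (not 1.0).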
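
Because the defaults differ, it can help to build the request body explicitly instead of relying on whatever the server fills in. The sketch below uses only the standard library and copies the defaults from the API parameter table; the helper names `build_payload` and `post_batch` are hypothetical, and the shape of the response body is not documented here, so it is returned as-is.

```python
import json
import urllib.request

# Defaults copied from the API parameter table above.
API_DEFAULTS = {
    "max_new_tokens": 100,
    "temperature": 0.7,
    "top_p": 0.95,
    "top_k": 50,
    "do_sample": True,
}


def build_payload(model_path, prompts, **overrides):
    """Merge explicit overrides onto the documented defaults."""
    # Reject names that do not appear in the parameter table.
    unknown = set(overrides) - set(API_DEFAULTS) - {"device"}
    if unknown:
        raise ValueError(f"unknown parameters: {sorted(unknown)}")
    return {"model_path": model_path, "prompts": list(prompts),
            **API_DEFAULTS, **overrides}


def post_batch(payload, url="http://localhost:7860/api/batch_inference", timeout=300):
    """POST the payload as JSON and return the decoded response body."""
    request = urllib.request.Request(
        url,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.loads(response.read().decode("utf-8"))
```

To mirror the CLI defaults through the API, one could call `build_payload("./my-model", prompts, max_new_tokens=256, temperature=1.0, top_p=1.0)` and pass the result to `post_batch`.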

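Batch Inference

Script Example

# batch_inference.py
import json

from transformers import AutoModelForCausalLM, AutoTokenizer


def batch_inference(model_path, prompts, output_path):
    # Load the model and tokenizer once, then reuse them for every prompt.
    model = AutoModelForCausalLM.from_pretrained(model_path)
    tokenizer = AutoTokenizer.from_pretrained(model_path)

    results = []
    for prompt in prompts:
        inputs = tokenizer(prompt, return_tensors="pt")
        outputs = model.generate(**inputs, max_new_tokens=256)
        # generate() returns the prompt followed by the continuation,
        # so decode only the newly generated tokens.
        new_tokens = outputs[0][inputs["input_ids"].shape[1]:]
        response = tokenizer.decode(new_tokens, skip_special_tokens=True)
        results.append({"prompt": prompt, "response": response})

    with open(output_path, "w", encoding="utf-8") as f:
        json.dump(results, f, indent=2, ensure_ascii=False)


# Usage: read one prompt per line, skipping blank lines.
with open("prompts.txt", encoding="utf-8") as f:
    prompts = [line.strip() for line in f if line.strip()]

batch_inference("./my-model", prompts, "results.json")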

Performance Tips

GPU Acceleration

Make sure CUDA is available:

python -c "import torch; print(torch.cuda.is_available())"
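
The same check can be folded into a script so that it falls back to the CPU when no GPU is visible. This is a minimal sketch; the helper name `pick_device` is hypothetical, and it treats a missing PyTorch installation as "cpu".

```python
def pick_device():
    """Return "cuda" when PyTorch can see a GPU, otherwise "cpu"."""
    try:
        import torch
    except ImportError:
        # Without PyTorch installed there is no GPU backend to use.
        return "cpu"
    return "cuda" if torch.cuda.is_available() else "cpu"
```

The returned string matches the values listed for the API's `device` parameter (cuda/cpu).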

Memory Optimization

For large models:

# Use quantization
aitraining llm --inference \
  --model ./my-model \
  --quantization int4 \
  --inference-prompts "Hello"

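Batching

When you have many prompts, processing them in batches is faster:

# Process prompts in batches
batch_size = 8
for i in range(0, len(prompts), batch_size):
    batch = prompts[i:i + batch_size]
    ...  # process the batch here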
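
The loop above only slices the prompt list. As a fuller sketch, the code below shows one way the slices could be fed to a Hugging Face causal language model in a single `generate()` call per batch. It assumes `model` and `tokenizer` were loaded with `AutoModelForCausalLM` and `AutoTokenizer` as in the script example; the helper names `chunked` and `generate_batched` are hypothetical.

```python
def chunked(items, batch_size):
    """Yield consecutive slices of at most batch_size items."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    for start in range(0, len(items), batch_size):
        yield items[start:start + batch_size]


def generate_batched(model, tokenizer, prompts, batch_size=8, max_new_tokens=256):
    """Generate one response per prompt, batch_size prompts per generate() call."""
    # Decoder-only models need left padding so generation continues
    # from real text rather than from padding tokens.
    tokenizer.padding_side = "left"
    if tokenizer.pad_token is None:
        tokenizer.pad_token = tokenizer.eos_token
    responses = []
    for batch in chunked(prompts, batch_size):
        inputs = tokenizer(batch, return_tensors="pt", padding=True).to(model.device)
        outputs = model.generate(**inputs, max_new_tokens=max_new_tokens)
        # Drop the (padded) prompt tokens so only new text is decoded.
        new_tokens = outputs[:, inputs["input_ids"].shape[1]:]
        responses.extend(tokenizer.batch_decode(new_tokens, skip_special_tokens=True))
    return responses
```

Larger batches use more memory, so `batch_size` is best tuned against the available GPU memory.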
