Inference Mode
Run inference using your trained models from the CLI.
LLM Inference
Basic Usage
aitraining llm --inference \
  --model ./my-trained-model \
  --inference-prompts "What is machine learning?"
Multiple Prompts
Comma-separated:
aitraining llm --inference \
  --model ./my-model \
  --inference-prompts "How are you?,What is AI?,Explain transformers"
From file:
# prompts.txt - one prompt per line
aitraining llm --inference \
  --model ./my-model \
  --inference-prompts prompts.txt
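If you build prompts programmatically, you can write them out in this one-prompt-per-line format and pass the file path to --inference-prompts. A minimal sketch (the file name and prompt list are placeholders):
# write_prompts.py - produce a prompts file for --inference-prompts
prompts = [
    "What is machine learning?",
    "Explain transformers in one paragraph.",
    "Summarize the benefits of fine-tuning.",
]
with open("prompts.txt", "w") as f:
    f.write("\n".join(prompts) + "\n")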
Generation Parameters
aitraining llm --inference \
  --model ./my-model \
  --inference-prompts "Tell me a story" \
  --inference-max-tokens 500 \
  --inference-temperature 0.7 \
  --inference-top-p 0.9 \
  --inference-top-k 50
Parameters
| Parameter | Description | Default |
|---|---|---|
| --inference-prompts | Prompts (text or file path) | Required |
| --inference-max-tokens | Max tokens to generate | 256 |
| --inference-temperature | Sampling temperature | 1.0 |
| --inference-top-p | Nucleus sampling | 1.0 |
| --inference-top-k | Top-k sampling | 50 |
| --inference-output | Output file path | Auto |
CLI vs Chat UI defaults differ: the CLI uses temperature=1.0 and top_p=1.0 (plain, unmodified sampling), while the Chat UI defaults to temperature=0.7 and top_p=0.95 for more focused, natural-sounding conversation.
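To make these knobs concrete, the sketch below shows roughly how temperature, top-k, and top-p interact when sampling a single next token. It is illustrative only, not AITraining's actual decoding code, and assumes a 1-D logits tensor from a causal language model:
import torch

def sample_next_token(logits, temperature=1.0, top_k=50, top_p=1.0):
    # Temperature rescales logits: <1.0 sharpens the distribution, >1.0 flattens it
    logits = logits / temperature
    # Top-k: mask out everything below the k-th highest-scoring token
    if top_k > 0:
        kth_best = torch.topk(logits, top_k).values[-1]
        logits[logits < kth_best] = float("-inf")
    # Top-p (nucleus): keep the smallest set of tokens whose cumulative probability reaches top_p
    if top_p < 1.0:
        sorted_logits, sorted_idx = torch.sort(logits, descending=True)
        cumulative = torch.softmax(sorted_logits, dim=-1).cumsum(dim=-1)
        drop = cumulative > top_p
        drop[1:] = drop[:-1].clone()  # shift right so the token that crosses top_p is kept
        drop[0] = False
        logits[sorted_idx[drop]] = float("-inf")
    probs = torch.softmax(logits, dim=-1)
    return torch.multinomial(probs, num_samples=1).item()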
Output
Results are saved to JSON:
[
  {
    "prompt": "What is machine learning?",
    "response": "Machine learning is..."
  }
]
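The output file is plain JSON, so it can be post-processed with standard tooling. For example, assuming the file is named results.json:
import json

with open("results.json") as f:
    results = json.load(f)
for item in results:
    print("Prompt:  ", item["prompt"])
    print("Response:", item["response"])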
Chat Interface
For interactive testing, use the Chat interface. Start the AITraining web app, then open http://localhost:7860/inference in your browser. The Chat UI lets you load and test any local or Hub model interactively.
Using Hub Models
Test Hugging Face models directly:
aitraining llm --inference \
  --model meta-llama/Llama-3.2-1B \
  --inference-prompts "Hello!"
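If you want to sanity-check a Hub model outside the CLI, a rough equivalent in plain transformers looks like this (gated models such as Llama require accepting the license and authenticating with a Hugging Face token first):
from transformers import pipeline

# Downloads the model from the Hub on first use
pipe = pipeline("text-generation", model="meta-llama/Llama-3.2-1B")
result = pipe("Hello!", max_new_tokens=50, do_sample=True, temperature=0.7)
print(result[0]["generated_text"])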
API Inference
The AITraining API provides batch inference endpoints:
Batch Inference Request
import requests
response = requests.post("http://localhost:7860/api/batch_inference", json={
    "model_path": "./my-model",
    "prompts": ["Hello!", "What is AI?"],
    "max_new_tokens": 100,
    "temperature": 0.7,
    "top_p": 0.95,
    "top_k": 50,
    "do_sample": True
})
results = response.json()
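The exact response schema may vary between AITraining versions; assuming it mirrors the CLI's JSON output (a list of prompt/response objects), the results can be consumed like this:
# Assumes each entry carries "prompt" and "response" keys, as in the CLI output
for item in results:
    print(item["prompt"], "->", item["response"])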
API Parameters
| Parameter | Description | Default |
|---|---|---|
| model_path | Path to the model | Required |
| prompts | List of prompts | Required |
| max_new_tokens | Max tokens to generate | 100 |
| temperature | Sampling temperature | 0.7 |
| top_p | Nucleus sampling | 0.95 |
| top_k | Top-k sampling | 50 |
| do_sample | Use sampling | True |
| device | Device to use (cuda/cpu) | Auto |
API defaults differ from CLI: The API uses max_new_tokens=100 (not 256) and temperature=0.7 (not 1.0) by default.
Batch Inference
Script Example
# batch_inference.py
import json
from transformers import AutoModelForCausalLM, AutoTokenizer

def batch_inference(model_path, prompts, output_path):
    # Load the trained model and its tokenizer
    model = AutoModelForCausalLM.from_pretrained(model_path)
    tokenizer = AutoTokenizer.from_pretrained(model_path)
    results = []
    for prompt in prompts:
        # Tokenize each prompt and generate a completion
        inputs = tokenizer(prompt, return_tensors="pt")
        outputs = model.generate(**inputs, max_new_tokens=256)
        response = tokenizer.decode(outputs[0], skip_special_tokens=True)
        results.append({"prompt": prompt, "response": response})
    # Save all prompt/response pairs to JSON
    with open(output_path, "w") as f:
        json.dump(results, f, indent=2)

# Usage
with open("prompts.txt") as f:
    prompts = [line.strip() for line in f if line.strip()]
batch_inference("./my-model", prompts, "results.json")
GPU Acceleration
Ensure CUDA is available:
python -c "import torch; print(torch.cuda.is_available())"
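When writing your own scripts (like the batch example above), both the model and the tokenized inputs need to be moved to the GPU explicitly. A minimal sketch:
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

device = "cuda" if torch.cuda.is_available() else "cpu"
model = AutoModelForCausalLM.from_pretrained("./my-model").to(device)
tokenizer = AutoTokenizer.from_pretrained("./my-model")
inputs = tokenizer("Hello", return_tensors="pt").to(device)
outputs = model.generate(**inputs, max_new_tokens=256)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))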
Memory Optimization
For large models:
# Use quantization
aitraining llm --inference \
  --model ./my-model \
  --quantization int4 \
  --inference-prompts "Hello"
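Outside the CLI, a comparable memory saving can be achieved in your own scripts with 4-bit loading through transformers and bitsandbytes. This is a sketch of the general technique, not necessarily what --quantization int4 does internally, and it requires the bitsandbytes package:
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig

# Load weights in 4-bit, computing in fp16, to cut memory use substantially
bnb_config = BitsAndBytesConfig(load_in_4bit=True, bnb_4bit_compute_dtype=torch.float16)
model = AutoModelForCausalLM.from_pretrained(
    "./my-model",
    quantization_config=bnb_config,
    device_map="auto",
)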
Batching
For many prompts, batch processing is faster:
# Process in batches
batch_size = 8
for i in range(0, len(prompts), batch_size):
    batch = prompts[i:i+batch_size]
    # Process batch
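A fuller sketch of that loop using transformers with padded batches follows. Left padding is used because decoder-only models generate from the end of the sequence, and prompts is assumed to be the list loaded earlier:
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model = AutoModelForCausalLM.from_pretrained("./my-model")
tokenizer = AutoTokenizer.from_pretrained("./my-model")
tokenizer.padding_side = "left"  # pad on the left so generation continues from each prompt
if tokenizer.pad_token is None:
    tokenizer.pad_token = tokenizer.eos_token  # many causal LMs define no pad token

batch_size = 8
results = []
for i in range(0, len(prompts), batch_size):
    batch = prompts[i:i + batch_size]
    inputs = tokenizer(batch, return_tensors="pt", padding=True)
    with torch.no_grad():
        outputs = model.generate(**inputs, max_new_tokens=256)
    responses = tokenizer.batch_decode(outputs, skip_special_tokens=True)
    results.extend({"prompt": p, "response": r} for p, r in zip(batch, responses))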
Next Steps