Inference Mode

Run inference using your trained models from the CLI.

LLM Inference

Basic Usage

aitraining llm --inference \
  --model ./my-trained-model \
  --inference-prompts "What is machine learning?"

Multiple Prompts

Comma-separated:
aitraining llm --inference \
  --model ./my-model \
  --inference-prompts "Hello, how are you?,What is AI?,Explain transformers"
From file:
# prompts.txt - one prompt per line
aitraining llm --inference \
  --model ./my-model \
  --inference-prompts prompts.txt
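
For reference, a prompts.txt could look like the following (the prompts themselves are only illustrative):
What is machine learning?
Explain transformers in one paragraph.
Summarize the benefits of fine-tuning.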

Generation Parameters

aitraining llm --inference \
  --model ./my-model \
  --inference-prompts "Tell me a story" \
  --inference-max-tokens 500 \
  --inference-temperature 0.7 \
  --inference-top-p 0.9 \
  --inference-top-k 50

Parameters

Parameter | Description | Default
--inference-prompts | Prompts (text or file path) | Required
--inference-max-tokens | Max tokens to generate | 256
--inference-temperature | Sampling temperature | 1.0
--inference-top-p | Nucleus sampling | 1.0
--inference-top-k | Top-k sampling | 50
--inference-output | Output file path | Auto
CLI vs Chat UI defaults differ: the CLI uses temperature=1.0 and top_p=1.0 (sampling from the model's unmodified distribution), while the Chat UI defaults to temperature=0.7 and top_p=0.95 for more focused, natural conversation.

Output

Results are saved to JSON:
[
  {
    "prompt": "What is machine learning?",
    "response": "Machine learning is..."
  }
]
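
To post-process the saved results programmatically, something like the following works (the file name below is illustrative; use the path reported by the CLI or set --inference-output explicitly):
import json

# Load the JSON written by the CLI (file name is illustrative)
with open("inference_results.json") as f:
    results = json.load(f)

for item in results:
    print(item["prompt"], "->", item["response"][:80])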

Chat Interface

For interactive testing, use the Chat interface:
aitraining chat
Then open http://localhost:7860/inference in your browser. The Chat UI lets you load and test any local or Hub model interactively.

Using Hub Models

Test Hugging Face models directly:
aitraining llm --inference \
  --model meta-llama/Llama-3.2-1B \
  --inference-prompts "Hello!"

API Inference

The AITraining API provides a batch inference endpoint:

Batch Inference Request

import requests

response = requests.post("http://localhost:7860/api/batch_inference", json={
    "model_path": "./my-model",
    "prompts": ["Hello!", "What is AI?"],
    "max_new_tokens": 100,
    "temperature": 0.7,
    "top_p": 0.95,
    "top_k": 50,
    "do_sample": True
})

results = response.json()
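
Generation on large models can take a while, so it is worth adding a timeout and surfacing HTTP errors before reading the payload. A minimal sketch (the response schema depends on your AITraining version, so inspect results rather than assuming specific keys):
import requests

# Same endpoint as above, with a timeout and an explicit error check
response = requests.post(
    "http://localhost:7860/api/batch_inference",
    json={"model_path": "./my-model", "prompts": ["Hello!", "What is AI?"]},
    timeout=600,  # seconds; generation can be slow on large models
)
response.raise_for_status()  # raise on 4xx/5xx instead of silently parsing an error body
results = response.json()
print(results)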

API Parameters

Parameter | Description | Default
model_path | Path to model | Required
prompts | List of prompts | Required
max_new_tokens | Max tokens to generate | 100
temperature | Sampling temperature | 0.7
top_p | Nucleus sampling | 0.95
top_k | Top-k sampling | 50
do_sample | Use sampling | True
device | Device to use (cuda/cpu) | Auto
API defaults differ from CLI: The API uses max_new_tokens=100 (not 256) and temperature=0.7 (not 1.0) by default.

Batch Inference

Script Example

# batch_inference.py
import json
from transformers import AutoModelForCausalLM, AutoTokenizer

def batch_inference(model_path, prompts, output_path):
    # Load the trained model and tokenizer once, then reuse them for every prompt
    model = AutoModelForCausalLM.from_pretrained(model_path)
    tokenizer = AutoTokenizer.from_pretrained(model_path)

    results = []
    for prompt in prompts:
        inputs = tokenizer(prompt, return_tensors="pt")
        outputs = model.generate(**inputs, max_new_tokens=256)
        # Decoding the full sequence includes the original prompt in the response
        response = tokenizer.decode(outputs[0], skip_special_tokens=True)
        results.append({"prompt": prompt, "response": response})

    with open(output_path, 'w') as f:
        json.dump(results, f, indent=2)

# Usage
with open("prompts.txt") as f:
    prompts = [line.strip() for line in f]

batch_inference("./my-model", prompts, "results.json")

Performance Tips

GPU Acceleration

Ensure CUDA is available:
python -c "import torch; print(torch.cuda.is_available())"

Memory Optimization

For large models:
# Use quantization
aitraining llm --inference \
  --model ./my-model \
  --quantization int4 \
  --inference-prompts "Hello"

Batching

For many prompts, batch processing is faster:
# Process in batches (assumes `model`, `tokenizer`, and `prompts` from the
# script above; decoder-only models need left padding and a pad token)
tokenizer.padding_side = "left"
tokenizer.pad_token = tokenizer.pad_token or tokenizer.eos_token
batch_size = 8
for i in range(0, len(prompts), batch_size):
    batch = prompts[i:i+batch_size]
    inputs = tokenizer(batch, return_tensors="pt", padding=True)
    outputs = model.generate(**inputs, max_new_tokens=256)
    responses = tokenizer.batch_decode(outputs, skip_special_tokens=True)

Next Steps