Inference Mode
Run inference using your trained models from the CLI.
LLM Inference
Basic Usage
aitraining llm --inference \
  --model ./my-trained-model \
  --inference-prompts "What is machine learning?"
Multiple Prompts
Comma-separated:
aitraining llm --inference \
  --model ./my-model \
  --inference-prompts "How are you?,What is AI?,Explain transformers"
From file:
# prompts.txt - one prompt per line
aitraining llm --inference \
  --model ./my-model \
  --inference-prompts prompts.txt
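If you build prompts programmatically, you can write them out in this one-prompt-per-line format and pass the file path to --inference-prompts. A minimal sketch (the file name and prompt list are placeholders):
# write_prompts.py - produce a prompts file for --inference-prompts
prompts = [
    "What is machine learning?",
    "Explain transformers in one paragraph.",
    "Summarize the benefits of fine-tuning.",
]
with open("prompts.txt", "w") as f:
    f.write("\n".join(prompts) + "\n")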
Generation Parameters
aitraining llm --inference \
  --model ./my-model \
  --inference-prompts "Tell me a story" \
  --inference-max-tokens 500 \
  --inference-temperature 0.7 \
  --inference-top-p 0.9 \
  --inference-top-k 50
Parameters
| Parameter | Description | Default |
|---|---|---|
| --inference-prompts | Prompts (text or file path) | Required |
| --inference-max-tokens | Max tokens to generate | 256 |
| --inference-temperature | Sampling temperature | 1.0 |
| --inference-top-p | Nucleus sampling | 1.0 |
| --inference-top-k | Top-k sampling | 50 |
| --inference-output | Output file path | Auto |
CLI vs Chat UI defaults differ: the CLI uses temperature=1.0 and top_p=1.0 (plain, unmodified sampling), while the Chat UI defaults to temperature=0.7 and top_p=0.95 for more focused, natural-sounding conversation.
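To make these knobs concrete, the sketch below shows roughly how temperature, top-k, and top-p interact when sampling a single next token. It is illustrative only, not AITraining's actual decoding code, and assumes a 1-D logits tensor from a causal language model:
import torch

def sample_next_token(logits, temperature=1.0, top_k=50, top_p=1.0):
    # Temperature rescales logits: <1.0 sharpens the distribution, >1.0 flattens it
    logits = logits / temperature
    # Top-k: mask out everything below the k-th highest-scoring token
    if top_k > 0:
        kth_best = torch.topk(logits, top_k).values[-1]
        logits[logits < kth_best] = float("-inf")
    # Top-p (nucleus): keep the smallest set of tokens whose cumulative probability reaches top_p
    if top_p < 1.0:
        sorted_logits, sorted_idx = torch.sort(logits, descending=True)
        cumulative = torch.softmax(sorted_logits, dim=-1).cumsum(dim=-1)
        drop = cumulative > top_p
        drop[1:] = drop[:-1].clone()  # shift right so the token that crosses top_p is kept
        drop[0] = False
        logits[sorted_idx[drop]] = float("-inf")
    probs = torch.softmax(logits, dim=-1)
    return torch.multinomial(probs, num_samples=1).item()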
Output
Results are saved to JSON:
[
  {
    "prompt": "What is machine learning?",
    "response": "Machine learning is..."
  }
]
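The output file is plain JSON, so it can be post-processed with standard tooling. For example, assuming the file is named results.json:
import json

with open("results.json") as f:
    results = json.load(f)
for item in results:
    print("Prompt:  ", item["prompt"])
    print("Response:", item["response"])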
Chat Interface
For interactive testing, use the Chat interface. Start the AITraining web app, then open http://localhost:7860/inference in your browser. The Chat UI lets you load and test any local or Hub model interactively.
Using Hub Models
Test Hugging Face models directly:
aitraining llm --inference \
  --model meta-llama/Llama-3.2-1B \
  --inference-prompts "Hello!"
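If you want to sanity-check a Hub model outside the CLI, a rough equivalent in plain transformers looks like this (gated models such as Llama require accepting the license and authenticating with a Hugging Face token first):
from transformers import pipeline

# Downloads the model from the Hub on first use
pipe = pipeline("text-generation", model="meta-llama/Llama-3.2-1B")
result = pipe("Hello!", max_new_tokens=50, do_sample=True, temperature=0.7)
print(result[0]["generated_text"])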
API Inference
The AITraining API provides batch inference endpoints:
Batch Inference Request
import requests
response = requests.post("http://localhost:7860/api/batch_inference", json={
    "model_path": "./my-model",
    "prompts": ["Hello!", "What is AI?"],
    "max_new_tokens": 100,
    "temperature": 0.7,
    "top_p": 0.95,
    "top_k": 50,
    "do_sample": True
})
results = response.json()
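The exact response schema may vary between AITraining versions; assuming it mirrors the CLI's JSON output (a list of prompt/response objects), the results can be consumed like this:
# Assumes each entry carries "prompt" and "response" keys, as in the CLI output
for item in results:
    print(item["prompt"], "->", item["response"])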
API Parameters
| Parameter | Description | Default |
|---|---|---|
| model_path | Path to the model | Required |
| prompts | List of prompts | Required |
| max_new_tokens | Max tokens to generate | 100 |
| temperature | Sampling temperature | 0.7 |
| top_p | Nucleus sampling | 0.95 |
| top_k | Top-k sampling | 50 |
| do_sample | Use sampling | True |
| device | Device to use (cuda/cpu) | Auto |
API defaults differ from CLI: The API uses max_new_tokens=100 (not 256) and temperature=0.7 (not 1.0) by default.
Batch Inference
Script Example
# batch_inference.py
import json
from transformers import AutoModelForCausalLM, AutoTokenizer

def batch_inference(model_path, prompts, output_path):
    # Load the trained model and its tokenizer
    model = AutoModelForCausalLM.from_pretrained(model_path)
    tokenizer = AutoTokenizer.from_pretrained(model_path)
    results = []
    for prompt in prompts:
        # Tokenize each prompt and generate a completion
        inputs = tokenizer(prompt, return_tensors="pt")
        outputs = model.generate(**inputs, max_new_tokens=256)
        response = tokenizer.decode(outputs[0], skip_special_tokens=True)
        results.append({"prompt": prompt, "response": response})
    # Save all prompt/response pairs to JSON
    with open(output_path, "w") as f:
        json.dump(results, f, indent=2)

# Usage
with open("prompts.txt") as f:
    prompts = [line.strip() for line in f if line.strip()]
batch_inference("./my-model", prompts, "results.json")
GPU Acceleration
Ensure CUDA is available:
python -c "import torch; print(torch.cuda.is_available())"
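When writing your own scripts (like the batch example above), both the model and the tokenized inputs need to be moved to the GPU explicitly. A minimal sketch:
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

device = "cuda" if torch.cuda.is_available() else "cpu"
model = AutoModelForCausalLM.from_pretrained("./my-model").to(device)
tokenizer = AutoTokenizer.from_pretrained("./my-model")
inputs = tokenizer("Hello", return_tensors="pt").to(device)
outputs = model.generate(**inputs, max_new_tokens=256)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))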
Memory Optimization
For large models:
# Use quantization
aitraining llm --inference \
  --model ./my-model \
  --quantization int4 \
  --inference-prompts "Hello"
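Outside the CLI, a comparable memory saving can be achieved in your own scripts with 4-bit loading through transformers and bitsandbytes. This is a sketch of the general technique, not necessarily what --quantization int4 does internally, and it requires the bitsandbytes package:
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig

# Load weights in 4-bit, computing in fp16, to cut memory use substantially
bnb_config = BitsAndBytesConfig(load_in_4bit=True, bnb_4bit_compute_dtype=torch.float16)
model = AutoModelForCausalLM.from_pretrained(
    "./my-model",
    quantization_config=bnb_config,
    device_map="auto",
)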
Batching
For many prompts, batch processing is faster:
# Process in batches
batch_size = 8
for i in range(0, len(prompts), batch_size):
    batch = prompts[i:i+batch_size]
    # Process batch
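A fuller sketch of that loop using transformers with padded batches follows. Left padding is used because decoder-only models generate from the end of the sequence, and prompts is assumed to be the list loaded earlier:
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model = AutoModelForCausalLM.from_pretrained("./my-model")
tokenizer = AutoTokenizer.from_pretrained("./my-model")
tokenizer.padding_side = "left"  # pad on the left so generation continues from each prompt
if tokenizer.pad_token is None:
    tokenizer.pad_token = tokenizer.eos_token  # many causal LMs define no pad token

batch_size = 8
results = []
for i in range(0, len(prompts), batch_size):
    batch = prompts[i:i + batch_size]
    inputs = tokenizer(batch, return_tensors="pt", padding=True)
    with torch.no_grad():
        outputs = model.generate(**inputs, max_new_tokens=256)
    responses = tokenizer.batch_decode(outputs, skip_special_tokens=True)
    results.extend({"prompt": p, "response": r} for p, r in zip(batch, responses))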
Next Steps