
Model Serving

Serve your trained models for production inference.

Chat Interface

The simplest way to test and interact with models:
aitraining chat
Opens a web interface at http://localhost:7860/inference. The Chat UI allows you to load any local or Hub model for interactive testing.
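If you want to confirm the UI is reachable from a script or another machine, a plain HTTP check is enough. The snippet below is a minimal sketch using the requests library and assumes the default host and port shown above.

import requests

# Confirm the chat UI is serving on the default host/port
resp = requests.get("http://localhost:7860/inference", timeout=5)
print(resp.status_code)  # expect a 2xx response once the UI has started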

Custom Port

aitraining chat --port 3000

Custom Host

aitraining chat --host 0.0.0.0

API Server

The API server is a training runner, not an inference server. It exposes minimal endpoints for health checks while running training jobs.

Start API Server

aitraining api
Starts the training API on http://127.0.0.1:7860 by default.

Parameters

Parameter | Description | Default
--port | Port to run the API on | 7860
--host | Host to bind to | 127.0.0.1
--task | Task to run (optional) | None

Custom Port/Host

aitraining api --port 8000 --host 0.0.0.0

Environment Variables

The API server reads configuration from environment variables:
Variable | Description
HF_TOKEN | Hugging Face token for authentication
AUTOTRAIN_USERNAME | Username for training
PROJECT_NAME | Name of the project
TASK_ID | Task identifier
PARAMS | Training parameters (JSON)
DATA_PATH | Path to training data
MODEL | Model to use
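If you launch the API programmatically, these variables can be set on the child process. The sketch below is illustrative only: every value is a placeholder, and the keys inside PARAMS are hypothetical examples of a JSON parameter payload.

import json
import os
import subprocess

# Illustrative only: placeholder values for the variables listed above
env = {
    **os.environ,
    "HF_TOKEN": "hf_xxx",                # Hugging Face token
    "AUTOTRAIN_USERNAME": "my-username",
    "PROJECT_NAME": "my-project",
    "TASK_ID": "1",                      # task identifier
    "PARAMS": json.dumps({"lr": 2e-5, "epochs": 3}),  # training parameters as JSON (example keys)
    "DATA_PATH": "./data",
    "MODEL": "my-base-model",            # model name or path
}

# Start the training API with the configuration above
subprocess.run(["aitraining", "api", "--port", "7860"], env=env, check=True)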

Endpoints

Endpoint | Description
GET / | Returns training status message
GET /health | Health check (returns "OK")
The API server automatically shuts down when no training jobs are active. For production inference, use vLLM or TGI instead.
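A quick way to poll the two endpoints above while a job is running, sketched with the requests library against the default address:

import requests

BASE = "http://127.0.0.1:7860"

# Health check: returns "OK" while the server is up
print(requests.get(f"{BASE}/health", timeout=5).text)

# Root endpoint: returns the current training status message
print(requests.get(f"{BASE}/", timeout=5).text)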

Production Deployment

Using vLLM

For production-grade serving with high throughput:
pip install vllm

python -m vllm.entrypoints.openai.api_server \
  --model ./my-trained-model \
  --port 8000
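Once the server is up, you can check which model it registered; vLLM's OpenAI-compatible server exposes GET /v1/models for this. A minimal sketch:

import requests

# List the models the vLLM server is serving
models = requests.get("http://localhost:8000/v1/models", timeout=5).json()
print([m["id"] for m in models["data"]])  # typically the path passed to --model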

Using Text Generation Inference (TGI)

docker run --gpus all -p 8080:80 \
  -v $(pwd)/my-model:/model \
  ghcr.io/huggingface/text-generation-inference:latest \
  --model-id /model
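TGI also serves its native /generate endpoint on the mapped port (8080 in the command above). A minimal request sketch using the requests library:

import requests

# Call TGI's native generate endpoint on the port mapped above
resp = requests.post(
    "http://localhost:8080/generate",
    json={"inputs": "Hello!", "parameters": {"max_new_tokens": 50}},
    timeout=60,
)
print(resp.json()["generated_text"])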

OpenAI-Compatible API

Both vLLM and TGI provide OpenAI-compatible endpoints:
from openai import OpenAI

client = OpenAI(
    base_url="http://localhost:8000/v1",
    api_key="dummy"  # Not needed for local
)

response = client.chat.completions.create(
    model="my-model",  # replace with the served model name (for vLLM, the --model value by default)
    messages=[
        {"role": "user", "content": "Hello!"}
    ]
)
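Streaming works through the same client: pass stream=True and iterate over the returned chunks. This sketch continues the snippet above and assumes the same client and model name.

# Stream tokens as they are generated
stream = client.chat.completions.create(
    model="my-model",
    messages=[{"role": "user", "content": "Hello!"}],
    stream=True,
)
for chunk in stream:
    delta = chunk.choices[0].delta.content
    if delta:
        print(delta, end="", flush=True)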

Docker Deployment

Dockerfile Example

FROM python:3.10-slim

WORKDIR /app

# Install dependencies
RUN pip install aitraining torch

# Expose port
EXPOSE 7860

# Run chat server
CMD ["aitraining", "chat", "--host", "0.0.0.0", "--port", "7860"]
Build and run:
docker build -t my-model-server .
docker run -p 7860:7860 my-model-server

With GPU

docker run --gpus all -p 7860:7860 my-model-server

Load Testing

Using hey

hey -n 100 -c 10 \
  -m POST \
  -H "Content-Type: application/json" \
  -d '{"model": "my-model", "prompt": "Hello", "max_tokens": 50}' \
  http://localhost:8000/v1/completions

Using locust

# locustfile.py
from locust import HttpUser, task

class ModelUser(HttpUser):
    @task
    def generate(self):
        # POST against the OpenAI-compatible completions endpoint
        self.client.post("/v1/completions", json={
            "model": "my-model",
            "prompt": "Hello, how are you?",
            "max_tokens": 50
        })

Run the load test against the serving host:
locust -f locustfile.py --host http://localhost:8000

Monitoring

Prometheus Metrics

If using vLLM or TGI, metrics are available at /metrics.
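You can confirm the exporter is live by fetching the endpoint directly. The sketch below assumes the vLLM server from earlier on port 8000; use port 8080 for the TGI container above.

import requests

# Fetch Prometheus metrics in plain-text exposition format
metrics = requests.get("http://localhost:8000/metrics", timeout=5).text
print("\n".join(metrics.splitlines()[:10]))  # preview the first few metric lines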

Logging

aitraining api --port 8000 2>&1 | tee server.log

Next Steps