# Model Serving
Serve your trained models for production inference.
## Chat Interface
The simplest way to test and interact with models:
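```bash
aitraining chat
```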
Opens a web interface at http://localhost:7860/inference. The Chat UI allows you to load any local or Hub model for interactive testing.
### Custom Port

```bash
aitraining chat --port 3000
```

### Custom Host

```bash
aitraining chat --host 0.0.0.0
```
## API Server
The API server is a training runner, not an inference server. It exposes minimal endpoints for health checks while running training jobs.
### Start API Server
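```bash
aitraining api
```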
Starts the training API on http://127.0.0.1:7860 by default.
### Parameters

| Parameter | Description | Default |
|-----------|-------------|---------|
| `--port` | Port to run the API on | `7860` |
| `--host` | Host to bind to | `127.0.0.1` |
| `--task` | Task to run (optional) | `None` |
### Custom Port/Host

```bash
aitraining api --port 8000 --host 0.0.0.0
```
### Environment Variables
The API server reads configuration from environment variables:
| Variable | Description |
|----------|-------------|
| `HF_TOKEN` | Hugging Face token for authentication |
| `AUTOTRAIN_USERNAME` | Username for training |
| `PROJECT_NAME` | Name of the project |
| `TASK_ID` | Task identifier |
| `PARAMS` | Training parameters (JSON) |
| `DATA_PATH` | Path to training data |
| `MODEL` | Model to use |
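For example, a launch script might export a subset of these variables before starting the server. This is only a sketch; every value below is a placeholder, not a default:

```bash
# Placeholder values -- substitute your own token, project name, data path, and model
export HF_TOKEN=hf_xxxxxxxxxxxx
export AUTOTRAIN_USERNAME=my-username
export PROJECT_NAME=my-project
export DATA_PATH=./data
export MODEL=my-base-model

aitraining api --port 7860 --host 127.0.0.1
```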
### Endpoints
| Endpoint | Description |
|----------|-------------|
| `GET /` | Returns training status message |
| `GET /health` | Health check (returns `"OK"`) |
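You can verify the server is reachable with a request to the health endpoint (assuming the default host and port):

```bash
# Prints "OK" while the server is running
curl http://127.0.0.1:7860/health
```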
The API server automatically shuts down when no training jobs are active. For production inference, use vLLM or TGI instead.
## Production Deployment

### Using vLLM
For production-grade serving with high throughput:
```bash
pip install vllm

python -m vllm.entrypoints.openai.api_server \
  --model ./my-trained-model \
  --port 8000
```
### Using Text Generation Inference (TGI)
```bash
docker run --gpus all -p 8080:80 \
  -v $(pwd)/my-model:/model \
  ghcr.io/huggingface/text-generation-inference:latest \
  --model-id /model
```
### OpenAI-Compatible API
Both vLLM and TGI provide OpenAI-compatible endpoints:
```python
from openai import OpenAI

client = OpenAI(
    base_url="http://localhost:8000/v1",
    api_key="dummy"  # Not needed for local servers
)

response = client.chat.completions.create(
    model="my-model",
    messages=[
        {"role": "user", "content": "Hello!"}
    ]
)
```
## Docker Deployment

### Dockerfile Example
```dockerfile
FROM python:3.10-slim

WORKDIR /app

# Install dependencies
RUN pip install aitraining torch

# Expose port
EXPOSE 7860

# Run chat server
CMD ["aitraining", "chat", "--host", "0.0.0.0", "--port", "7860"]
```
Build and run:
```bash
docker build -t my-model-server .
docker run -p 7860:7860 my-model-server
```
### With GPU

```bash
docker run --gpus all -p 7860:7860 my-model-server
```
## Load Testing

### Using hey
```bash
hey -n 100 -c 10 \
  -m POST \
  -H "Content-Type: application/json" \
  -d '{"prompt": "Hello", "max_tokens": 50}' \
  http://localhost:8000/generate
```
### Using locust
```python
# locustfile.py
from locust import HttpUser, task

class ModelUser(HttpUser):
    @task
    def generate(self):
        self.client.post("/generate", json={
            "prompt": "Hello, how are you?",
            "max_tokens": 50
        })
```

```bash
locust -f locustfile.py --host http://localhost:8000
```
## Monitoring
### Prometheus Metrics

If using vLLM or TGI, metrics are available at `/metrics`.
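As a quick check, you can fetch the metrics directly; the port depends on how the serving engine was started (8000 for the vLLM example above, 8080 for the TGI example):

```bash
# Dump the Prometheus metrics exposed by the serving engine
curl http://localhost:8000/metrics
```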
### Logging

```bash
aitraining api --port 8000 2>&1 | tee server.log
```
## Next Steps