Quantization
Quantization reduces memory usage by storing model weights in lower precision.
Quick Start
aitraining llm --train \
--model meta-llama/Llama-3.1-8B \
--data-path ./data.jsonl \
--project-name quantized-model \
--peft \
--quantization int4
Python API
from autotrain.trainers.clm.params import LLMTrainingParams
params = LLMTrainingParams(
model="meta-llama/Llama-3.2-8B",
data_path="./data.jsonl",
project_name="quantized-model",
peft=True,
quantization="int4", # or "int8"
lora_r=16,
)
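To actually launch a run from these params, the upstream AutoTrain SDK exposes a project launcher; the sketch below assumes this fork keeps that API (AutoTrainProject, the "local" backend, and process=True are assumptions, so check your installed version):
from autotrain.project import AutoTrainProject

# Assumption: this fork keeps upstream AutoTrain's launcher API.
project = AutoTrainProject(params=params, backend="local", process=True)
project.create()  # starts local training with the quantized LoRA setup above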
Quantization Options
| Option | Memory Reduction | Quality |
|---|---|---|
| None | 0% | Best |
| int8 | ~50% | Very Good |
| int4 | ~75% | Good |
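These reductions apply to the weights themselves. As a rough sanity check (weights only, ignoring activations, optimizer state, and quantization metadata), you can estimate the footprint per precision:
# Rough weight-only memory estimate; real usage is higher
# (activations, optimizer state, quantization metadata).
def weight_memory_gb(n_params: float, bits: int) -> float:
    return n_params * bits / 8 / 1e9

for name, bits in [("fp16", 16), ("int8", 8), ("int4", 4)]:
    print(f"8B params @ {name}: ~{weight_memory_gb(8e9, bits):.0f} GB")
# 8B params @ fp16: ~16 GB
# 8B params @ int8: ~8 GB   (about 50% smaller)
# 8B params @ int4: ~4 GB   (about 75% smaller)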
Supported Tasks
Quantization is available for:
| Task | Params Class | Notes |
|---|---|---|
| LLM | LLMTrainingParams | Full support |
| VLM | VLMTrainingParams | Full support |
| Seq2Seq | Seq2SeqParams | Full support |
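The pattern is the same across tasks: enable peft and set quantization on the corresponding params class. A minimal Seq2Seq sketch, assuming this fork mirrors the LLM parameter names (the model ID is illustrative):
from autotrain.trainers.seq2seq.params import Seq2SeqParams

params = Seq2SeqParams(
    model="google/flan-t5-base",   # illustrative seq2seq model
    data_path="./data.jsonl",
    project_name="quantized-seq2seq",
    peft=True,
    quantization="int8",           # or "int4"
)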
4-bit (QLoRA)
Maximum memory savings:
params = LLMTrainingParams(
...
quantization="int4",
)
8-bit
Better quality, smaller memory savings:
params = LLMTrainingParams(
...
quantization="int8",
)
Memory Requirements
Llama 3.1 8B
| Config | VRAM Required |
|---|---|
| Full precision | ~64 GB |
| LoRA (fp16) | ~18 GB |
| LoRA + 8bit | ~12 GB |
| LoRA + 4bit | ~8 GB |
Gemma 2 27B
| Config | VRAM Required |
|---|---|
| Full precision | ~108 GB |
| LoRA + 4bit | ~20 GB |
Best Practices
Use with LoRA
Quantization requires PEFT/LoRA to be enabled:
params = LLMTrainingParams(
...
peft=True, # Required for quantized training
quantization="int4",
)
Quantization only works when peft=True. Without PEFT enabled, the quantization setting will be ignored.
Adjust Learning Rate
Quantized training often benefits from a higher learning rate than the default (3e-5):
params = LLMTrainingParams(
...
peft=True,
quantization="int4",
lr=2e-4, # Higher LR works well with QLoRA
)
Use Flash Attention
Combine with Flash Attention for speed:
params = LLMTrainingParams(
...
quantization="int4",
use_flash_attention_2=True, # Requires Linux + CUDA + flash-attn package
)
Inference with Quantized Models
Load quantized models for inference:
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig
# 4-bit config (matches AITraining defaults)
bnb_config = BitsAndBytesConfig(
load_in_4bit=True,
bnb_4bit_quant_type="nf4",
bnb_4bit_compute_dtype=torch.float16,
bnb_4bit_use_double_quant=False,
)
model = AutoModelForCausalLM.from_pretrained(
"meta-llama/Llama-3.2-8B",
quantization_config=bnb_config,
)
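QLoRA training saves a LoRA adapter rather than new base weights, so after loading the quantized base model you typically attach the adapter from your project directory (the path below is illustrative; point it at your project_name output):
from peft import PeftModel

# Attach the LoRA adapter produced by training (illustrative path).
model = PeftModel.from_pretrained(model, "./quantized-model")
model.eval()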
Quantization only works on Linux: the bitsandbytes library that provides int4/int8 support is only available on Linux systems.
Apple Silicon (MPS) Note
Quantization is not compatible with Apple Silicon MPS. When you use quantization on a Mac with M1/M2/M3:
- Training automatically falls back to CPU
- You’ll see a warning message explaining this
- For faster training on Mac, skip quantization and use LoRA alone
# On Apple Silicon - use LoRA without quantization for MPS acceleration
aitraining llm --train \
--model google/gemma-3-270m \
--data-path ./data.jsonl \
--project-name mac-training \
--peft \
--lora-r 16
Environment variables for manual control:
- AUTOTRAIN_DISABLE_MPS=1 - Force CPU training
- AUTOTRAIN_ENABLE_MPS=1 - Force MPS even with quantization (may crash)
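If you launch training through the Python API instead of the CLI, the same variables can be set in-process before training starts, for example:
import os

# Must be set before training begins so device selection picks it up.
os.environ["AUTOTRAIN_DISABLE_MPS"] = "1"  # force CPU instead of MPS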
Quality Considerations
Quantization does reduce quality slightly. For critical applications:
- Test on your specific task
- Compare against a full-precision baseline (see the perplexity sketch below)
- Prefer 8-bit when quality matters more than memory savings
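One simple way to run that comparison is held-out perplexity on a few samples from your own task. The sketch below assumes you have already loaded the quantized+LoRA model and a full-precision baseline (for example with the inference code above) plus their tokenizer:
import math
import torch

def perplexity(model, tokenizer, texts):
    """Mean perplexity over held-out texts; lower is better."""
    model.eval()
    device = next(model.parameters()).device
    losses = []
    with torch.no_grad():
        for text in texts:
            enc = tokenizer(text, return_tensors="pt").to(device)
            out = model(**enc, labels=enc["input_ids"])
            losses.append(out.loss.item())
    return math.exp(sum(losses) / len(losses))

# eval_texts = [...]  # a handful of held-out samples from your task
# print(perplexity(quantized_model, tokenizer, eval_texts))
# print(perplexity(baseline_model, tokenizer, eval_texts))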
Next Steps