# Distributed Training

AITraining supports multi-GPU training through Accelerate, with optional DeepSpeed ZeRO-3 optimization for large models.
## Requirements

| Component | Required | Install |
|---|---|---|
| Accelerate | Yes | Included with AITraining |
| DeepSpeed | Optional | `pip install deepspeed` |
| Multiple GPUs | Yes | NVIDIA CUDA GPUs |
## Distribution Backends

| Backend | Value | Description |
|---|---|---|
| DDP | `ddp` or `None` | PyTorch Distributed Data Parallel (default) |
| DeepSpeed | `deepspeed` | DeepSpeed ZeRO-3 with automatic sharding |
## Quick Start

### DDP (Default)

With multiple GPUs, DDP is used automatically:

```bash
aitraining llm --train \
  --model meta-llama/Llama-3.2-1B \
  --data-path ./data \
  --project-name my-model \
  --trainer sft \
  --peft
```
### DeepSpeed

For large models, use DeepSpeed ZeRO-3:

```bash
aitraining llm --train \
  --model meta-llama/Llama-3.2-3B \
  --data-path ./data \
  --project-name my-model \
  --trainer sft \
  --distributed-backend deepspeed \
  --peft
```
## Python API

```python
from autotrain.trainers.clm.params import LLMTrainingParams
from autotrain.project import AutoTrainProject

params = LLMTrainingParams(
    model="meta-llama/Llama-3.2-3B",
    data_path="./data",
    project_name="distributed-model",
    trainer="sft",
    # Distribution
    distributed_backend="deepspeed",  # or None for DDP
    # Training
    epochs=3,
    batch_size=2,
    gradient_accumulation=4,
    mixed_precision="bf16",
    peft=True,
    lora_r=16,
)

project = AutoTrainProject(params=params, backend="local", process=True)
project.create()
```
## YAML Configuration

```yaml
task: llm-sft
backend: local
base_model: meta-llama/Llama-3.2-3B
project_name: distributed-model

data:
  path: ./data
  train_split: train
  valid_split: null
  chat_template: tokenizer
  column_mapping:
    text_column: text

log: wandb

params:
  distributed_backend: deepspeed  # or null for DDP
  epochs: 3
  batch_size: 2
  gradient_accumulation: 4
  mixed_precision: bf16
  peft: true
  lora_r: 16
```
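To run the file, AutoTrain-style tools typically accept a `--config` flag. Whether AITraining uses the exact same flag is not shown on this page, so treat the command below as an assumption and verify it against `aitraining --help`:

```bash
# Assumed invocation; confirm the exact flag with `aitraining --help`
aitraining --config config.yml
```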
## How It Works

### Accelerate Launch

Training is launched through Accelerate:

- AITraining detects the available GPUs
- Launches training via `accelerate launch` (a rough sketch follows this list)
- For DeepSpeed, adds `--use_deepspeed` and ZeRO-3 flags
- Logs `accelerate env` for debugging
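For orientation, the block below sketches roughly what such a launch looks like on a two-GPU machine. The module path `autotrain.trainers.clm` and the `--training_config` argument are assumptions inferred from the Python API above, not the verbatim command AITraining emits; `--multi_gpu`, `--num_processes`, `--num_machines`, and `--mixed_precision` are standard `accelerate launch` options.

```bash
# Illustrative sketch only; the module path and config argument are assumptions.
accelerate launch \
  --multi_gpu \
  --num_processes 2 \
  --num_machines 1 \
  --mixed_precision bf16 \
  --module autotrain.trainers.clm \
  --training_config my-model/training_params.json
```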
### DDP Settings

When using DDP:

- `ddp_find_unused_parameters=False` is set for performance (see the sketch after this list)
- Each GPU processes a portion of the batch
- Gradients are synchronized across GPUs
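A minimal sketch of how these DDP settings map onto `transformers.TrainingArguments`. AITraining builds the arguments internally, so the specific values here are illustrative assumptions rather than its actual defaults:

```python
# Sketch: DDP-relevant arguments as they appear in transformers.TrainingArguments.
# Values are illustrative; AITraining derives them from LLMTrainingParams.
from transformers import TrainingArguments

args = TrainingArguments(
    output_dir="my-model",
    per_device_train_batch_size=2,     # batch_size is applied per GPU under DDP
    gradient_accumulation_steps=4,
    bf16=True,
    ddp_find_unused_parameters=False,  # skips the unused-parameter scan each step
)
```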
### DeepSpeed ZeRO-3

When using DeepSpeed:

- Model parameters are sharded across GPUs
- Uses `--deepspeed_multinode_launcher standard` for multi-node runs
- ZeRO-3 configuration is applied automatically (a launch sketch follows this list)
- Model saving uses `accelerator.get_state_dict()` with unwrapping
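The DeepSpeed variant of the launch command looks roughly like the sketch below. As before, the module path and `--training_config` argument are assumptions; `--use_deepspeed`, `--zero_stage`, and `--deepspeed_multinode_launcher` are standard `accelerate launch` options.

```bash
# Illustrative ZeRO-3 launch sketch; module path and config argument are assumptions.
accelerate launch \
  --use_deepspeed \
  --num_processes 2 \
  --mixed_precision bf16 \
  --zero_stage 3 \
  --deepspeed_multinode_launcher standard \
  --module autotrain.trainers.clm \
  --training_config my-model/training_params.json
```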
## Multi-Node Training

For multi-node DeepSpeed training:

```bash
# On each node
aitraining llm --train \
  --model meta-llama/Llama-3.2-3B \
  --data-path ./data \
  --project-name my-model \
  --distributed-backend deepspeed \
  --peft
```

The `--deepspeed_multinode_launcher standard` flag is passed automatically.
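At the Accelerate level, a multi-node run also needs the machine count, each node's rank, and the main process address. This page does not state whether AITraining fills these in for you or relies on `accelerate config` on each node, so treat the flags below as an assumption shown only for orientation:

```bash
# Assumed Accelerate-level flags for a two-node, 16-GPU run; values are placeholders.
# Use --machine_rank 1 on the second node.
accelerate launch \
  --use_deepspeed \
  --num_machines 2 \
  --machine_rank 0 \
  --main_process_ip 10.0.0.1 \
  --main_process_port 29500 \
  --num_processes 16 \
  --deepspeed_multinode_launcher standard \
  --module autotrain.trainers.clm \
  --training_config my-model/training_params.json
```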
## Task-Specific Behavior

### LLM Training

- Default: DDP when multiple GPUs are detected
- DeepSpeed: set `--distributed-backend deepspeed` explicitly

### Seq2Seq and VLM

- Auto-selects DeepSpeed for many-GPU cases
- Uses multi-GPU DDP for PEFT + quantization + bf16 combinations
## Checkpointing with DeepSpeed

When using DeepSpeed, PEFT adapter saving is handled differently. The `SavePeftModelCallback` is not used; instead, saving uses `accelerator.get_state_dict(trainer.deepspeed)` and unwraps the model.
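A minimal sketch of that saving pattern, assuming a Hugging Face `Trainer` that exposes its `accelerator` and DeepSpeed engine as in recent `transformers` releases; this is illustrative, not AITraining's actual code:

```python
# Illustrative sketch of ZeRO-3 adapter saving; not AITraining's actual implementation.
def save_peft_adapter_zero3(trainer, output_dir: str) -> None:
    # Gather the full (unsharded) state dict from the ZeRO-3 engine.
    state_dict = trainer.accelerator.get_state_dict(trainer.deepspeed)
    # Strip the DeepSpeed/DDP wrappers to reach the underlying PEFT model.
    unwrapped = trainer.accelerator.unwrap_model(trainer.model)
    # Only the main process writes the adapter to disk.
    if trainer.accelerator.is_main_process:
        unwrapped.save_pretrained(output_dir, state_dict=state_dict)
```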
## GPU Selection

Control which GPUs to use:

```bash
# Use specific GPUs
CUDA_VISIBLE_DEVICES=0,1 aitraining llm --train ...

# Use all available GPUs (default)
aitraining llm --train ...
```
## Troubleshooting

### Check Accelerate Environment
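AITraining logs `accelerate env` during launch (see above); you can also run it directly to confirm the GPU count, mixed-precision setting, and installed versions that Accelerate sees:

```bash
# Prints the Accelerate environment: PyTorch/CUDA versions, GPU count, default config
accelerate env
```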
### Common Issues

| Issue | Solution |
|---|---|
| DeepSpeed not found | `pip install deepspeed` |
| NCCL errors | Check GPU connectivity and CUDA version |
| OOM errors | Reduce the batch size or use DeepSpeed |
| Slow training | Ensure GPUs are on the same PCIe bus |
## Next Steps