Distributed Training

AITraining supports multi-GPU training through Accelerate, with optional DeepSpeed Zero-3 optimization for large models.

Requirements

| Component     | Required       | Install                  |
| ------------- | -------------- | ------------------------ |
| Accelerate    | Yes (included) | Included with AITraining |
| DeepSpeed     | Optional       | pip install deepspeed    |
| Multiple GPUs | Yes            | NVIDIA CUDA GPUs         |
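
Before launching, the environment can be verified with a few lines of Python (a minimal sketch; it assumes PyTorch is installed, which AITraining requires):

import importlib.util
import torch

# Multi-GPU training needs more than one visible CUDA device.
print(f"CUDA GPUs visible: {torch.cuda.device_count()}")

# DeepSpeed is optional; check whether it can be imported.
if importlib.util.find_spec("deepspeed") is None:
    print("DeepSpeed not installed; run: pip install deepspeed")
else:
    print("DeepSpeed available")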

Distribution Backends

| Backend   | Value       | Description                                  |
| --------- | ----------- | -------------------------------------------- |
| DDP       | ddp or None | PyTorch Distributed Data Parallel (default)  |
| DeepSpeed | deepspeed   | DeepSpeed Zero-3 with automatic sharding     |

Quick Start

DDP (Default)

With multiple GPUs, DDP is used automatically:
aitraining llm --train \
  --model meta-llama/Llama-3.2-1B \
  --data-path ./data \
  --project-name my-model \
  --trainer sft \
  --peft

DeepSpeed

For large models, use DeepSpeed Zero-3:
aitraining llm --train \
  --model meta-llama/Llama-3.2-3B \
  --data-path ./data \
  --project-name my-model \
  --trainer sft \
  --distributed-backend deepspeed \
  --peft

Python API

from autotrain.trainers.clm.params import LLMTrainingParams
from autotrain.project import AutoTrainProject

params = LLMTrainingParams(
    model="meta-llama/Llama-3.2-3B",
    data_path="./data",
    project_name="distributed-model",

    trainer="sft",

    # Distribution
    distributed_backend="deepspeed",  # or None for DDP

    # Training
    epochs=3,
    batch_size=2,
    gradient_accumulation=4,
    mixed_precision="bf16",

    peft=True,
    lora_r=16,
)

project = AutoTrainProject(params=params, backend="local", process=True)
project.create()

YAML Configuration

task: llm-sft
backend: local
base_model: meta-llama/Llama-3.2-3B
project_name: distributed-model

data:
  path: ./data
  train_split: train
  valid_split: null
  chat_template: tokenizer
  column_mapping:
    text_column: text

log: wandb

params:
  distributed_backend: deepspeed  # or null for DDP
  epochs: 3
  batch_size: 2
  gradient_accumulation: 4
  mixed_precision: bf16
  peft: true
  lora_r: 16

How It Works

Accelerate Launch

Training is launched through Accelerate (a sketch of the equivalent command follows these steps):
  1. AITraining detects available GPUs
  2. Launches training via accelerate launch
  3. For DeepSpeed, adds --use_deepspeed and Zero-3 flags
  4. Logs accelerate env for debugging
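
Conceptually, the assembled command resembles the following. This is an illustrative sketch using Accelerate's public flags, not AITraining's exact internals; train.py stands in for the real entry point:

import subprocess
import torch

num_gpus = torch.cuda.device_count()

# Base launch command: one process per detected GPU.
cmd = ["accelerate", "launch", "--num_processes", str(num_gpus)]

use_deepspeed = True  # e.g., when --distributed-backend deepspeed is set
if use_deepspeed:
    # Accelerate's DeepSpeed flags; Zero-3 shards parameters across GPUs.
    cmd += ["--use_deepspeed", "--zero_stage", "3"]

cmd.append("train.py")  # placeholder training entry point
subprocess.run(cmd, check=True)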

DDP Settings

When using DDP (see the sketch after this list):
  • ddp_find_unused_parameters=False is set for performance
  • Each GPU processes a portion of the batch
  • Gradients are synchronized across GPUs
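
In Hugging Face Trainer terms, the DDP tuning above corresponds to arguments like these (a minimal sketch assuming the standard transformers API; names and values mirror the earlier examples):

from transformers import TrainingArguments

# ddp_find_unused_parameters=False skips DDP's per-step search for
# parameters that received no gradient, removing avoidable overhead.
args = TrainingArguments(
    output_dir="distributed-model",
    ddp_find_unused_parameters=False,
    per_device_train_batch_size=2,   # each GPU processes its own slice of the batch
    gradient_accumulation_steps=4,
    bf16=True,
)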

DeepSpeed Zero-3

When using DeepSpeed (see the sketch after this list):
  • Model parameters are sharded across GPUs
  • Uses --deepspeed_multinode_launcher standard for multi-node
  • Zero-3 configuration is applied automatically
  • Model saving uses accelerator.get_state_dict() with unwrapping
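
For comparison, enabling Zero-3 directly with Accelerate looks roughly like this (a sketch of Accelerate's DeepSpeedPlugin, not AITraining-specific code; it requires deepspeed to be installed):

from accelerate import Accelerator
from accelerate.utils import DeepSpeedPlugin

# Zero-3 shards optimizer state, gradients, and model parameters across GPUs.
plugin = DeepSpeedPlugin(zero_stage=3, gradient_accumulation_steps=4)
accelerator = Accelerator(deepspeed_plugin=plugin, mixed_precision="bf16")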

Multi-Node Training

For multi-node DeepSpeed training:
# On each node
aitraining llm --train \
  --model meta-llama/Llama-3.2-3B \
  --data-path ./data \
  --project-name my-model \
  --distributed-backend deepspeed \
  --peft
The --deepspeed_multinode_launcher standard flag is passed automatically.
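
If you were to drive Accelerate directly rather than through the aitraining CLI, each node would also need its rank and the rendezvous address. A hypothetical two-node sketch using Accelerate's standard multi-node flags; the address and train.py entry point are placeholders:

import subprocess

# Run with machine_rank=0 on the head node and machine_rank=1 on the other.
machine_rank = 0
cmd = [
    "accelerate", "launch",
    "--num_machines", "2",
    "--machine_rank", str(machine_rank),
    "--main_process_ip", "10.0.0.1",   # head-node address (placeholder)
    "--main_process_port", "29500",
    "--use_deepspeed",
    "--deepspeed_multinode_launcher", "standard",
    "train.py",                        # placeholder entry point
]
subprocess.run(cmd, check=True)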

Task-Specific Behavior

LLM Training

  • Default: DDP when multiple GPUs detected
  • DeepSpeed: Explicitly set --distributed-backend deepspeed

Seq2Seq and VLM

  • Auto-selects DeepSpeed for many-GPU cases
  • Uses multi-GPU DDP for PEFT + quantization + bf16 combinations

Checkpointing with DeepSpeed

When using DeepSpeed, PEFT adapter saving is handled differently: the SavePeftModelCallback is not used; instead, saving gathers a consolidated state dict via accelerator.get_state_dict(trainer.deepspeed) and unwraps the model before writing the adapter.
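A minimal sketch of that saving path; here accelerator is the Accelerate Accelerator driving training and trainer is the transformers Trainer that AITraining builds, both assumed to be in scope:

# Zero-3 shards parameters across GPUs, so gather a consolidated state dict.
state_dict = accelerator.get_state_dict(trainer.deepspeed)

# Unwrap the DDP/DeepSpeed wrappers to reach the underlying PEFT model,
# then save the adapter with the gathered weights.
unwrapped = accelerator.unwrap_model(trainer.model)
unwrapped.save_pretrained("distributed-model", state_dict=state_dict)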

GPU Selection

Control which GPUs to use:
# Use specific GPUs
CUDA_VISIBLE_DEVICES=0,1 aitraining llm --train ...

# Use all available GPUs (default)
aitraining llm --train ...
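
The same restriction can be applied from Python, provided the variable is set before CUDA is initialized (a minimal sketch):

import os

# Must be set before the first CUDA call; setting it before importing
# torch is the safest ordering.
os.environ["CUDA_VISIBLE_DEVICES"] = "0,1"

import torch
print(torch.cuda.device_count())  # reports only the visible GPUs (at most 2 here)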

Troubleshooting

Check Accelerate Environment

accelerate env

Common Issues

| Issue               | Solution                                 |
| ------------------- | ---------------------------------------- |
| DeepSpeed not found | pip install deepspeed                    |
| NCCL errors         | Check GPU connectivity and CUDA version  |
| OOM errors          | Reduce batch size or use DeepSpeed       |
| Slow training       | Ensure GPUs are on the same PCIe bus     |

Next Steps