# Distributed Training

AITraining supports multi-GPU training through Accelerate, with optional DeepSpeed ZeRO-3 optimization for large models.
## Requirements

| Component | Required | Install |
|---|---|---|
| Accelerate | Yes | Included with AITraining |
| DeepSpeed | Optional | `pip install deepspeed` |
| Multiple GPUs | Yes | NVIDIA CUDA GPUs |
## Distribution Backends

| Backend | Value | Description |
|---|---|---|
| DDP | `ddp` or `None` | PyTorch Distributed Data Parallel (default) |
| DeepSpeed | `deepspeed` | DeepSpeed ZeRO-3 with automatic sharding |
## Quick Start

### DDP (Default)

With multiple GPUs, DDP is used automatically:

```bash
aitraining llm --train \
  --model meta-llama/Llama-3.2-1B \
  --data-path ./data \
  --project-name my-model \
  --trainer sft \
  --peft
```
### DeepSpeed

For large models, use DeepSpeed ZeRO-3:

```bash
aitraining llm --train \
  --model meta-llama/Llama-3.2-3B \
  --data-path ./data \
  --project-name my-model \
  --trainer sft \
  --distributed-backend deepspeed \
  --peft
```
## Python API

```python
from autotrain.trainers.clm.params import LLMTrainingParams
from autotrain.project import AutoTrainProject

params = LLMTrainingParams(
    model="meta-llama/Llama-3.2-3B",
    data_path="./data",
    project_name="distributed-model",
    trainer="sft",
    # Distribution
    distributed_backend="deepspeed",  # or None for DDP
    # Training
    epochs=3,
    batch_size=2,
    gradient_accumulation=4,
    mixed_precision="bf16",
    peft=True,
    lora_r=16,
)

project = AutoTrainProject(params=params, backend="local", process=True)
project.create()
```
## YAML Configuration

```yaml
task: llm-sft
backend: local
base_model: meta-llama/Llama-3.2-3B
project_name: distributed-model

data:
  path: ./data
  train_split: train
  valid_split: null
  chat_template: tokenizer
  column_mapping:
    text_column: text

log: wandb

params:
  distributed_backend: deepspeed  # or null for DDP
  epochs: 3
  batch_size: 2
  gradient_accumulation: 4
  mixed_precision: bf16
  peft: true
  lora_r: 16
```
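To run the file, AutoTrain-style tools typically accept a `--config` flag. Whether AITraining uses the exact same flag is not shown on this page, so treat the command below as an assumption and verify it against `aitraining --help`:

```bash
# Assumed invocation; confirm the exact flag with `aitraining --help`
aitraining --config config.yml
```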
## How It Works

### Accelerate Launch

Training is launched through Accelerate:

- AITraining detects the available GPUs
- Launches training via `accelerate launch` (a rough sketch follows this list)
- For DeepSpeed, adds `--use_deepspeed` and ZeRO-3 flags
- Logs `accelerate env` for debugging
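For orientation, the block below sketches roughly what such a launch looks like on a two-GPU machine. The module path `autotrain.trainers.clm` and the `--training_config` argument are assumptions inferred from the Python API above, not the verbatim command AITraining emits; `--multi_gpu`, `--num_processes`, `--num_machines`, and `--mixed_precision` are standard `accelerate launch` options.

```bash
# Illustrative sketch only; the module path and config argument are assumptions.
accelerate launch \
  --multi_gpu \
  --num_processes 2 \
  --num_machines 1 \
  --mixed_precision bf16 \
  --module autotrain.trainers.clm \
  --training_config my-model/training_params.json
```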
### DDP Settings

When using DDP:

- `ddp_find_unused_parameters=False` is set for performance (see the sketch after this list)
- Each GPU processes a portion of the batch
- Gradients are synchronized across GPUs
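A minimal sketch of how these DDP settings map onto `transformers.TrainingArguments`. AITraining builds the arguments internally, so the specific values here are illustrative assumptions rather than its actual defaults:

```python
# Sketch: DDP-relevant arguments as they appear in transformers.TrainingArguments.
# Values are illustrative; AITraining derives them from LLMTrainingParams.
from transformers import TrainingArguments

args = TrainingArguments(
    output_dir="my-model",
    per_device_train_batch_size=2,     # batch_size is applied per GPU under DDP
    gradient_accumulation_steps=4,
    bf16=True,
    ddp_find_unused_parameters=False,  # skips the unused-parameter scan each step
)
```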
### DeepSpeed ZeRO-3

When using DeepSpeed:

- Model parameters are sharded across GPUs
- Uses `--deepspeed_multinode_launcher standard` for multi-node runs
- ZeRO-3 configuration is applied automatically (a launch sketch follows this list)
- Model saving uses `accelerator.get_state_dict()` with unwrapping
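The DeepSpeed variant of the launch command looks roughly like the sketch below. As before, the module path and `--training_config` argument are assumptions; `--use_deepspeed`, `--zero_stage`, and `--deepspeed_multinode_launcher` are standard `accelerate launch` options.

```bash
# Illustrative ZeRO-3 launch sketch; module path and config argument are assumptions.
accelerate launch \
  --use_deepspeed \
  --num_processes 2 \
  --mixed_precision bf16 \
  --zero_stage 3 \
  --deepspeed_multinode_launcher standard \
  --module autotrain.trainers.clm \
  --training_config my-model/training_params.json
```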
## Multi-Node Training

For multi-node DeepSpeed training:

```bash
# On each node
aitraining llm --train \
  --model meta-llama/Llama-3.2-3B \
  --data-path ./data \
  --project-name my-model \
  --distributed-backend deepspeed \
  --peft
```

The `--deepspeed_multinode_launcher standard` flag is passed automatically.
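At the Accelerate level, a multi-node run also needs the machine count, each node's rank, and the main process address. This page does not state whether AITraining fills these in for you or relies on `accelerate config` on each node, so treat the flags below as an assumption shown only for orientation:

```bash
# Assumed Accelerate-level flags for a two-node, 16-GPU run; values are placeholders.
# Use --machine_rank 1 on the second node.
accelerate launch \
  --use_deepspeed \
  --num_machines 2 \
  --machine_rank 0 \
  --main_process_ip 10.0.0.1 \
  --main_process_port 29500 \
  --num_processes 16 \
  --deepspeed_multinode_launcher standard \
  --module autotrain.trainers.clm \
  --training_config my-model/training_params.json
```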
## Task-Specific Behavior

### LLM Training

- Default: DDP when multiple GPUs are detected
- DeepSpeed: set `--distributed-backend deepspeed` explicitly

### Seq2Seq and VLM

- Auto-selects DeepSpeed for many-GPU cases
- Uses multi-GPU DDP for PEFT + quantization + bf16 combinations
## Checkpointing with DeepSpeed

When using DeepSpeed, PEFT adapter saving is handled differently. The `SavePeftModelCallback` is not used; instead, saving uses `accelerator.get_state_dict(trainer.deepspeed)` and unwraps the model.
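A minimal sketch of that saving pattern, assuming a Hugging Face `Trainer` that exposes its `accelerator` and DeepSpeed engine as in recent `transformers` releases; this is illustrative, not AITraining's actual code:

```python
# Illustrative sketch of ZeRO-3 adapter saving; not AITraining's actual implementation.
def save_peft_adapter_zero3(trainer, output_dir: str) -> None:
    # Gather the full (unsharded) state dict from the ZeRO-3 engine.
    state_dict = trainer.accelerator.get_state_dict(trainer.deepspeed)
    # Strip the DeepSpeed/DDP wrappers to reach the underlying PEFT model.
    unwrapped = trainer.accelerator.unwrap_model(trainer.model)
    # Only the main process writes the adapter to disk.
    if trainer.accelerator.is_main_process:
        unwrapped.save_pretrained(output_dir, state_dict=state_dict)
```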
## GPU Selection

Control which GPUs to use:

```bash
# Use specific GPUs
CUDA_VISIBLE_DEVICES=0,1 aitraining llm --train ...

# Use all available GPUs (default)
aitraining llm --train ...
```
## Troubleshooting

### Check Accelerate Environment
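AITraining logs `accelerate env` during launch (see above); you can also run it directly to confirm the GPU count, mixed-precision setting, and installed versions that Accelerate sees:

```bash
# Prints the Accelerate environment: PyTorch/CUDA versions, GPU count, default config
accelerate env
```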
### Common Issues

| Issue | Solution |
|---|---|
| DeepSpeed not found | `pip install deepspeed` |
| NCCL errors | Check GPU connectivity and CUDA version |
| OOM errors | Reduce the batch size or use DeepSpeed |
| Slow training | Ensure GPUs are on the same PCIe bus |
## Next Steps