
Batch Processing

Run multiple training experiments systematically.

Multiple Configs

Sequential Runs

Run different configs in sequence:
for config in configs/*.yaml; do
  echo "Running $config..."
  aitraining --config "$config"
done
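
The loop above keeps going even if one run fails. A minimal variant that records failures instead of silently moving on (a sketch; adapt the handling to your needs):

failed=()
for config in configs/*.yaml; do
  echo "Running $config..."
  aitraining --config "$config" || failed+=("$config")
done
[ ${#failed[@]} -gt 0 ] && printf 'Failed: %s\n' "${failed[@]}"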

Parallel Runs

Run on different GPUs simultaneously:
CUDA_VISIBLE_DEVICES=0 aitraining --config config1.yaml &
CUDA_VISIBLE_DEVICES=1 aitraining --config config2.yaml &
wait
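
Note that a bare wait discards exit codes, so a failed run can go unnoticed. To surface per-job failures, wait on each PID individually (a sketch):

CUDA_VISIBLE_DEVICES=0 aitraining --config config1.yaml & pid0=$!
CUDA_VISIBLE_DEVICES=1 aitraining --config config2.yaml & pid1=$!

wait "$pid0" || echo "config1.yaml failed"
wait "$pid1" || echo "config2.yaml failed"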

Parameter Sweeps

Manual Sweep

for lr in 1e-5 2e-5 5e-5; do
  for bs in 4 8 16; do
    aitraining llm --train \
      --model google/gemma-3-270m \
      --data-path ./data \
      --project-name "exp-lr${lr}-bs${bs}" \
      --lr "$lr" \
      --batch-size "$bs"
  done
done
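
The grid above runs serially. On a multi-GPU machine you can round-robin trials across devices; a sketch assuming two GPUs (NUM_GPUS is illustrative, the aitraining flags are unchanged):

NUM_GPUS=2
i=0
for lr in 1e-5 2e-5 5e-5; do
  for bs in 4 8 16; do
    CUDA_VISIBLE_DEVICES=$(( i % NUM_GPUS )) aitraining llm --train \
      --model google/gemma-3-270m \
      --data-path ./data \
      --project-name "exp-lr${lr}-bs${bs}" \
      --lr "$lr" \
      --batch-size "$bs" &
    i=$(( i + 1 ))
    # Once every GPU has a job, wait for the batch to finish
    (( i % NUM_GPUS == 0 )) && wait
  done
done
wait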

Built-in Sweeps

Use the hyperparameter sweep feature:
aitraining llm --train \
  --model google/gemma-3-270m \
  --data-path ./data \
  --project-name sweep-experiment \
  --use-sweep \
  --sweep-backend optuna \
  --sweep-n-trials 20

Experiment Scripts

Basic Script

#!/bin/bash
# experiments.sh

MODELS=(
  "google/gemma-3-270m"
  "google/gemma-2-2b"
)

TRAINERS=(
  "sft"
  "dpo"
)

for model in "${MODELS[@]}"; do
  for trainer in "${TRAINERS[@]}"; do
    name="$(basename "$model")-$trainer"
    aitraining llm --train \
      --model "$model" \
      --data-path ./data \
      --trainer "$trainer" \
      --project-name "$name"
  done
done
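
Make the script executable and run it, capturing the combined output if you like:

chmod +x experiments.sh
./experiments.sh 2>&1 | tee experiments.log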

With Logging

#!/bin/bash
# run_experiments.sh

LOG_DIR="logs/$(date +%Y%m%d_%H%M%S)"
mkdir -p "$LOG_DIR"

run_experiment() {
  local config=$1
  local name=$(basename "$config" .yaml)

  echo "[$(date)] Starting $name"
  aitraining --config "$config" 2>&1 | tee "$LOG_DIR/$name.log"
  echo "[$(date)] Finished $name"
}

for config in experiments/*.yaml; do
  run_experiment "$config"
done

echo "All experiments complete. Logs in $LOG_DIR"

Job Management

Background Jobs

# Start in background
nohup aitraining --config config.yaml > training.log 2>&1 &
echo $! > training.pid

# Check status
ps -p $(cat training.pid)

# Stop job
kill $(cat training.pid)
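
kill -0 probes a process without sending a signal, which makes a tidier status check:

# Prints "running" or "stopped" based on the saved PID
if kill -0 "$(cat training.pid)" 2>/dev/null; then
  echo "running"
else
  echo "stopped"
fi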

tmux Sessions

# Create session
tmux new-session -d -s training

# Run training
tmux send-keys -t training "aitraining --config config.yaml" Enter

# Attach to see output
tmux attach -t training

# Detach: Ctrl+B, D
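
tmux can also mirror the pane output to a file, so a log survives even while you are detached (pipe-pane is standard tmux; the log path is illustrative):

# Mirror pane output to a file (-o opens the pipe only if none is active)
tmux pipe-pane -t training -o 'cat >> training-session.log'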

Results Collection

Aggregate Metrics

import json
from pathlib import Path

results = []
for exp_dir in Path("experiments").glob("*/"):
    # Training state is saved in trainer_state.json
    state_file = exp_dir / "trainer_state.json"
    if state_file.exists():
        with open(state_file) as f:
            state = json.load(f)
        results.append({
            "experiment": exp_dir.name,
            "best_metric": state.get("best_metric"),
            "global_step": state.get("global_step"),
            "epoch": state.get("epoch"),
        })

# Sort ascending by best_metric (typically eval_loss, where lower is better)
results.sort(key=lambda x: x["best_metric"] if x["best_metric"] is not None else float("inf"))

# Print best
if results:
    print("Best experiment:", results[0]["experiment"])

Compare with W&B

When using --log wandb, all experiments are tracked. Set the W&B project via an environment variable:
# Set W&B project for all runs
export WANDB_PROJECT=my-experiments

aitraining llm --train \
  --model google/gemma-3-270m \
  --data-path ./data \
  --project-name exp-1 \
  --log wandb
View comparisons in the W&B dashboard.
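
W&B can also group related runs, which helps when one sweep launches many experiments (WANDB_RUN_GROUP is a standard W&B environment variable; the group name is illustrative):

# Group every run from this sweep under a single W&B group
export WANDB_PROJECT=my-experiments
export WANDB_RUN_GROUP="lr-sweep-$(date +%Y%m%d)"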
