Logging & Debugging

Monitor training progress and diagnose issues.

Logging Options

Weights & Biases

aitraining llm --train \
  --model google/gemma-3-270m \
  --data-path ./data \
  --project-name my-model \
  --log wandb
Features:
  • Real-time loss curves
  • Hardware metrics
  • Hyperparameter tracking
  • Model artifacts
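
To inspect a finished run programmatically (for example, to compare loss curves across runs), the W&B Python API can pull the logged history. This is a minimal sketch; the run path ("your-entity/my-model/abc123") and the metric key "train/loss" are placeholders you should replace with the values shown in your W&B dashboard.

import wandb

# Connect to the W&B API (uses WANDB_API_KEY from the environment)
api = wandb.Api()

# Hypothetical run path: replace entity, project, and run id with your own
run = api.run("your-entity/my-model/abc123")

# Pull the logged history for a metric; the key name depends on what the
# trainer logged (assumed here to be "train/loss")
history = run.history(keys=["train/loss"])
print(history.tail())

# Final summary metrics are also available
print(run.summary)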

TensorBoard

aitraining llm --train \
  --model google/gemma-3-270m \
  --data-path ./data \
  --project-name my-model \
  --log tensorboard
View in browser:
tensorboard --logdir my-model/runs
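
To read the logged scalars programmatically instead of through the browser UI, the event files can be loaded with TensorBoard's EventAccumulator. A minimal sketch, assuming the log directory matches the --logdir above and that a scalar tag such as "train/loss" exists; check the output of ea.Tags() for the actual tag names.

from tensorboard.backend.event_processing.event_accumulator import EventAccumulator

# Assumed log directory; if it contains one subdirectory per run,
# point this at the specific run directory instead
logdir = "my-model/runs"

ea = EventAccumulator(logdir)
ea.Reload()

# List the available scalar tags, then read one of them
print(ea.Tags()["scalars"])
for event in ea.Scalars("train/loss"):  # tag name is an assumption
    print(event.step, event.value)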

W&B Visualizer (LEET)

A built-in visualizer (LEET) that renders real-time training metrics directly in your terminal.
The W&B visualizer is enabled by default when using --log wandb. Use --no-wandb-visualizer to disable it.
# Visualizer is on by default with wandb
aitraining llm --train \
  --model google/gemma-3-270m \
  --data-path ./data \
  --project-name my-model \
  --log wandb

# To disable the terminal visualizer
aitraining llm --train \
  --model google/gemma-3-270m \
  --data-path ./data \
  --project-name my-model \
  --log wandb \
  --no-wandb-visualizer

Logging Steps

Control logging frequency:
aitraining llm --train \
  --model google/gemma-3-270m \
  --data-path ./data \
  --project-name my-model \
  --logging-steps 10  # Log every 10 steps

Verbose Output

Capture Full Logs

aitraining llm --train ... 2>&1 | tee training.log
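
Once the full output is captured with tee, you can scan the saved file for problems instead of scrolling through it. A minimal sketch that searches training.log for common error markers; the patterns are assumptions, so adjust them to whatever your runs actually emit.

import re

# Patterns that commonly indicate trouble (assumption; extend as needed)
pattern = re.compile(r"error|traceback|out of memory|nan", re.IGNORECASE)

with open("training.log") as f:
    for lineno, line in enumerate(f, start=1):
        if pattern.search(line):
            print(f"{lineno}: {line.rstrip()}")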

Environment Variables

These environment variables affect logging and debugging behavior:
Variable                Description
AUTOTRAIN_TUI_MODE=1    Suppresses logs when running in TUI mode (set automatically)
PAUSE_ON_FAILURE=0      Disable pausing on failure (default: 1, enabled)
WANDB_API_KEY           Weights & Biases API key for logging
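
If you launch training from a script, these variables can be set in the child process environment. A minimal sketch using subprocess; the CLI arguments simply mirror the examples above, and it assumes WANDB_API_KEY is already exported in your shell.

import os
import subprocess

env = os.environ.copy()
env["PAUSE_ON_FAILURE"] = "0"  # do not pause when a run fails
# WANDB_API_KEY is assumed to already be set in the inherited environment

subprocess.run(
    [
        "aitraining", "llm", "--train",
        "--model", "google/gemma-3-270m",
        "--data-path", "./data",
        "--project-name", "my-model",
        "--log", "wandb",
    ],
    env=env,
    check=True,
)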

Noise Suppression

These are set automatically to reduce log noise:
Variable                  Value    Effect
TF_CPP_MIN_LOG_LEVEL      3        Suppress TensorFlow warnings
TOKENIZERS_PARALLELISM    false    Disable tokenizer parallelism warnings
BITSANDBYTES_NOWELCOME    1        Suppress bitsandbytes welcome message

Common Issues

Out of Memory (OOM)

Symptoms:
  • “CUDA out of memory” error
  • Training crashes suddenly
Solutions:
# Reduce batch size
aitraining llm --train --batch-size 1 ...

# Gradient checkpointing is on by default; if you previously
# passed --disable-gradient-checkpointing, remove that flag

# Use gradient accumulation
aitraining llm --train \
  --batch-size 1 \
  --gradient-accumulation 8 \
  ...

# Enable auto batch size finding
aitraining llm --train --auto-find-batch-size ...

# Use quantization
aitraining llm --train --quantization int4 ...
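
Before reaching for smaller batches, it helps to see how much memory is actually free on the device. A minimal sketch using PyTorch's torch.cuda.mem_get_info; it assumes a CUDA-capable GPU and a working PyTorch install.

import torch

if torch.cuda.is_available():
    free, total = torch.cuda.mem_get_info()  # bytes on the current device
    print(f"free:  {free / 1e9:.2f} GB")
    print(f"total: {total / 1e9:.2f} GB")
    print(f"used:  {(total - free) / 1e9:.2f} GB")
else:
    print("No CUDA device visible")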

Slow Training

Check:
  1. GPU utilization:
nvidia-smi -l 1  # Watch GPU usage
  2. Enable optimizations:
aitraining llm --train \
  --use-flash-attention-2 \
  --packing \
  --mixed-precision bf16 \
  ...
  3. Data loading bottleneck (see the timing sketch after this list):
    • Ensure data is on fast storage (SSD)
    • Pre-process data to reduce tokenization overhead
    • Use smaller sequence lengths if possible
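
To check whether tokenization is the bottleneck rather than the GPU, time how fast your data tokenizes on its own. A minimal sketch; it assumes JSONL records with a "text" field at a hypothetical path ./data/train.jsonl, and that you can download the tokenizer of the model used in the examples above.

import json
import time

from transformers import AutoTokenizer

# Tokenizer of the model used in the examples above (assumes you have access)
tokenizer = AutoTokenizer.from_pretrained("google/gemma-3-270m")

# Assumed data path and "text" field; adapt to your dataset
texts = []
with open("./data/train.jsonl") as f:
    for line in f:
        texts.append(json.loads(line).get("text", ""))
        if len(texts) >= 1000:
            break

start = time.perf_counter()
tokenizer(texts, truncation=True, max_length=2048)
elapsed = time.perf_counter() - start
print(f"{len(texts) / elapsed:.0f} samples/sec tokenized")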

NaN Loss

Symptoms:
  • Loss becomes NaN
  • Training diverges
Solutions:
# Lower learning rate
aitraining llm --train --lr 1e-6 ...

# Add gradient clipping
aitraining llm --train --max-grad-norm 0.5 ...

# Use fp32 instead of fp16/bf16
aitraining llm --train --mixed-precision no ...
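
When debugging a custom model or your own training loop, a quick finiteness check on the loss and gradients narrows down where things diverge. A minimal sketch with plain PyTorch; the tiny model and loss below are stand-ins for illustration, not the aitraining internals.

import torch

def check_finite(model, loss):
    """Return True if the loss and all gradients are finite (no NaN/Inf)."""
    if not torch.isfinite(loss).all():
        print("non-finite loss:", loss.item())
        return False
    for name, param in model.named_parameters():
        if param.grad is not None and not torch.isfinite(param.grad).all():
            print("non-finite gradient in", name)
            return False
    return True

# Stand-in model and loss to show usage (assumption, not the real trainer)
model = torch.nn.Linear(4, 1)
loss = model(torch.randn(2, 4)).mean()
loss.backward()
print(check_finite(model, loss))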

Data Issues

Symptoms:
  • Unexpected behavior
  • Poor model quality
Debug steps:
# Check data format
import json
with open("data.jsonl") as f:
    for i, line in enumerate(f):
        try:
            data = json.loads(line)
            print(f"Line {i}: {list(data.keys())}")
        except json.JSONDecodeError:
            print(f"Line {i}: INVALID JSON")
        if i >= 5:
            break

# Preview data processing
aitraining llm --train \
  --max-samples 10 \
  --epochs 1 \
  ...

Checkpointing

Save Strategy

aitraining llm --train \
  --save-strategy steps \
  --save-steps 500 \
  --save-total-limit 3 \
  ...

Resume Training

If training crashes, resume from checkpoint:
aitraining llm --train \
  --model ./my-model/checkpoint-500 \
  --data-path ./data \
  ...
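
Checkpoints are written as numbered checkpoint-<step> directories under the project folder (the Hugging Face Trainer convention used in the example above), so the newest one can be found programmatically. A minimal sketch assuming that layout; verify the printed path before resuming from it.

from pathlib import Path

project_dir = Path("my-model")

# Assumes checkpoint-<step> directories, as produced in the examples above
checkpoints = sorted(
    (p for p in project_dir.glob("checkpoint-*") if p.is_dir()),
    key=lambda p: int(p.name.split("-")[-1]),
)
if checkpoints:
    print("latest checkpoint:", checkpoints[-1])
else:
    print("no checkpoints found in", project_dir)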

Monitoring Tools

GPU Monitoring

# Real-time GPU stats
watch -n 1 nvidia-smi

# GPU memory usage over time
nvidia-smi --query-gpu=memory.used --format=csv -l 5
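
If you prefer to collect GPU stats from Python (for example, to log them alongside training metrics), the NVML bindings expose the same numbers nvidia-smi prints. A minimal sketch using the pynvml module (provided by the nvidia-ml-py package); the three samples, 5-second interval, and device index 0 are arbitrary choices.

import time

import pynvml

pynvml.nvmlInit()
handle = pynvml.nvmlDeviceGetHandleByIndex(0)  # first GPU; adjust as needed

try:
    for _ in range(3):  # sample a few times as a demo
        mem = pynvml.nvmlDeviceGetMemoryInfo(handle)
        util = pynvml.nvmlDeviceGetUtilizationRates(handle)
        print(f"gpu {util.gpu}% | mem {mem.used / 1e9:.1f}/{mem.total / 1e9:.1f} GB")
        time.sleep(5)
finally:
    pynvml.nvmlShutdown()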

System Resources

# CPU and memory
htop

# Disk I/O
iostat -x 1
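
The same checks can be scripted with psutil when htop or iostat are not available, for example on a managed cluster. A minimal sketch; psutil must be installed separately.

import psutil

print("cpu %:", psutil.cpu_percent(interval=1))
mem = psutil.virtual_memory()
print(f"ram: {mem.used / 1e9:.1f}/{mem.total / 1e9:.1f} GB ({mem.percent}%)")
io = psutil.disk_io_counters()
if io:
    print(f"disk read: {io.read_bytes / 1e9:.1f} GB, written: {io.write_bytes / 1e9:.1f} GB")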

Debugging Checklist

  1. Check logs - Look for error messages
  2. Verify data - Ensure correct format
  3. Check GPU - Memory and utilization
  4. Try smaller - Reduce batch/model size
  5. Isolate issue - Minimal reproduction

Next Steps