Logging & Debugging
Monitor training progress and diagnose issues.
Logging Options
Weights & Biases
aitraining llm --train \
--model google/gemma-3-270m \
--data-path ./data \
--project-name my-model \
--log wandb
Features:
- Real-time loss curves
- Hardware metrics
- Hyperparameter tracking
- Model artifacts
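W&B logging requires authentication. A minimal setup sketch (the wandb CLI is installed with the wandb Python package; the key value is a placeholder):
# One-time interactive login; stores the key locally
wandb login

# Or set the key non-interactively, e.g. for CI jobs
export WANDB_API_KEY=your-key-here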
TensorBoard
aitraining llm --train \
--model google/gemma-3-270m \
--data-path ./data \
--project-name my-model \
--log tensorboard
View in browser:
tensorboard --logdir my-model/runs
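If training runs on a remote machine, TensorBoard can listen on all interfaces so you can open it from your local browser; --port and --bind_all are standard TensorBoard flags, and the port choice here is arbitrary:
# Serve on port 6006 and accept connections from other hosts
tensorboard --logdir my-model/runs --port 6006 --bind_all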
W&B Visualizer (LEET)
A built-in visualizer that renders real-time training metrics directly in your terminal.
The W&B visualizer is enabled by default when using --log wandb. Use --no-wandb-visualizer to disable it.
# Visualizer is on by default with wandb
aitraining llm --train \
--model google/gemma-3-270m \
--data-path ./data \
--project-name my-model \
--log wandb
# To disable the terminal visualizer
aitraining llm --train \
--model google/gemma-3-270m \
--data-path ./data \
--project-name my-model \
--log wandb \
--no-wandb-visualizer
Logging Steps
Control logging frequency:
aitraining llm --train \
--model google/gemma-3-270m \
--data-path ./data \
--project-name my-model \
--logging-steps 10 # Log every 10 steps
Verbose Output
Capture Full Logs
Pipe stdout and stderr to a file while still streaming them to the terminal:
aitraining llm --train ... 2>&1 | tee training.log
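To keep logs from separate runs apart, you can timestamp the file name; a small shell sketch:
# One log file per run, named by start time
aitraining llm --train ... 2>&1 | tee "training-$(date +%Y%m%d-%H%M%S).log"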
Environment Variables
These environment variables affect logging and debugging behavior:
| Variable | Description |
|---|---|
| AUTOTRAIN_TUI_MODE=1 | Suppresses logs when running in TUI mode (set automatically) |
| PAUSE_ON_FAILURE=0 | Disable pausing on failure (default: 1, enabled) |
| WANDB_API_KEY | Weights & Biases API key for logging |
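For example, a non-interactive job that authenticates to W&B and never pauses on failure (the API key value is a placeholder):
export WANDB_API_KEY=your-key-here   # placeholder; use your real key
export PAUSE_ON_FAILURE=0            # don't pause when a run fails
aitraining llm --train ...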
Noise Suppression
These are set automatically to reduce log noise:
| Variable | Value | Effect |
|---|---|---|
| TF_CPP_MIN_LOG_LEVEL | 3 | Suppress TensorFlow warnings |
| TOKENIZERS_PARALLELISM | false | Disable tokenizer parallelism warnings |
| BITSANDBYTES_NOWELCOME | 1 | Suppress bitsandbytes welcome message |
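If you need the suppressed output back while debugging, override the variable for a single run; TF_CPP_MIN_LOG_LEVEL=0 restores all TensorFlow messages:
# Show all TensorFlow log messages for this run only
TF_CPP_MIN_LOG_LEVEL=0 aitraining llm --train ...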
Common Issues
Out of Memory (OOM)
Symptoms:
- “CUDA out of memory” error
- Training crashes suddenly
Solutions:
# Reduce batch size
aitraining llm --train --batch-size 1 ...
# Enable gradient checkpointing (on by default)
# If disabled, re-enable:
# --disable-gradient-checkpointing false
# Use gradient accumulation
aitraining llm --train \
--batch-size 1 \
--gradient-accumulation 8 \
...
# Enable auto batch size finding
aitraining llm --train --auto-find-batch-size ...
# Use quantization
aitraining llm --train --quantization int4 ...
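These options can be combined. A sketch stacking the mitigations above (model and data paths as in the earlier examples):
aitraining llm --train \
  --model google/gemma-3-270m \
  --data-path ./data \
  --batch-size 1 \
  --gradient-accumulation 8 \
  --quantization int4 \
  ...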
Slow Training
Check:
- GPU utilization:
nvidia-smi -l 1 # Watch GPU usage
- Enable optimizations:
aitraining llm --train \
--use-flash-attention-2 \
--packing \
--mixed-precision bf16 \
...
- Data loading bottleneck:
- Ensure data is on fast storage (SSD); see the throughput check after this list
- Pre-process data to reduce tokenization overhead
- Use smaller sequence lengths if possible
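To rule out slow storage, measure raw sequential read throughput of the dataset file. A quick sketch with standard tools (the path is an assumption, and a repeat read may be served from the page cache):
# Read the dataset once and report MB/s (adjust the path to your dataset)
dd if=./data/train.jsonl of=/dev/null bs=1M status=progress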
NaN Loss
Symptoms:
- Loss becomes NaN
- Training diverges
Solutions:
# Lower learning rate
aitraining llm --train --lr 1e-6 ...
# Add gradient clipping
aitraining llm --train --max-grad-norm 0.5 ...
# Use fp32 instead of fp16/bf16
aitraining llm --train --mixed-precision no ...
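If one fix alone doesn't stabilize training, the mitigations can be stacked; a conservative sketch:
aitraining llm --train \
  --lr 1e-6 \
  --max-grad-norm 0.5 \
  --mixed-precision no \
  ...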
Data Issues
Symptoms:
- Unexpected behavior
- Poor model quality
Debug steps:
# Check data format
import json

with open("data.jsonl") as f:
    for i, line in enumerate(f):
        try:
            data = json.loads(line)
            print(f"Line {i}: {list(data.keys())}")
        except json.JSONDecodeError:
            print(f"Line {i}: INVALID JSON")
        if i >= 5:
            break
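Save the snippet as a script (the name check_data.py is just an example) and run it against your dataset before training:
python check_data.py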
# Preview data processing
aitraining llm --train \
--max-samples 10 \
--epochs 1 \
...
Checkpointing
Save Strategy
Save a checkpoint every 500 steps and keep only the 3 most recent:
aitraining llm --train \
--save-strategy steps \
--save-steps 500 \
--save-total-limit 3 \
...
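With these settings, checkpoints accumulate in the project directory roughly like this (a sketch; the exact layout depends on the trainer version):
my-model/
  checkpoint-500/
  checkpoint-1000/
  checkpoint-1500/   # with --save-total-limit 3, the oldest is removed when a new one is saved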
Resume Training
If training crashes, resume from checkpoint:
aitraining llm --train \
--model ./my-model/checkpoint-500 \
--data-path ./data \
...
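To resume from whatever checkpoint was written last, pick the highest-numbered directory. A shell sketch assuming the checkpoint-N naming shown above:
# GNU sort -V orders checkpoint-500 before checkpoint-1000 correctly
LATEST=$(ls -d my-model/checkpoint-* | sort -V | tail -n 1)
aitraining llm --train --model "$LATEST" --data-path ./data ...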
GPU Monitoring
# Real-time GPU stats
watch -n 1 nvidia-smi
# GPU memory usage over time
nvidia-smi --query-gpu=memory.used --format=csv -l 5
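To correlate memory spikes with training events, include a timestamp in each sample and keep a copy on disk; timestamp, memory.used, and utilization.gpu are standard nvidia-smi query fields:
# Sample every 5 seconds; print to terminal and append to gpu.log
nvidia-smi --query-gpu=timestamp,memory.used,utilization.gpu --format=csv -l 5 | tee gpu.log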
System Resources
# CPU and memory
htop
# Disk I/O
iostat -x 1
Debugging Checklist
- Check logs - Look for error messages
- Verify data - Ensure correct format
- Check GPU - Memory and utilization
- Try smaller - Reduce batch/model size
- Isolate issue - Minimal reproduction
Next Steps