LoRA & PEFT
Parameter-Efficient Fine-Tuning (PEFT) lets you train large models with far less memory than full fine-tuning.
What is LoRA?
LoRA (Low-Rank Adaptation) adds small trainable matrices to the model while keeping base weights frozen. This dramatically reduces memory usage and training time.
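Conceptually, a LoRA layer leaves the frozen weight W untouched and learns a low-rank update: the effective weight becomes W + (alpha / r) * B @ A, where A and B have rank r. A minimal PyTorch sketch of the idea (illustrative only, not the peft implementation; shapes and initialization are simplified):

import torch

# Frozen base weight plus a trainable rank-r update (a sketch of the LoRA idea).
d_out, d_in, r, alpha = 4096, 4096, 16, 32

W = torch.randn(d_out, d_in)                   # frozen base weight, never updated
A = torch.randn(r, d_in, requires_grad=True)   # trainable low-rank factor, shape (r, d_in)
B = torch.zeros(d_out, r, requires_grad=True)  # trainable, starts at zero so the initial update is zero

def lora_linear(x):
    # Effective weight is W + (alpha / r) * (B @ A); only A and B receive gradients.
    return x @ W.T + (alpha / r) * (x @ A.T @ B.T)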
Quick Start
aitraining llm --train \
--model meta-llama/Llama-3.2-1B \
--data-path ./data.jsonl \
--project-name lora-model \
--peft \
--lora-r 16 \
--lora-alpha 32
Python API
from autotrain.trainers.clm.params import LLMTrainingParams
from autotrain.project import AutoTrainProject
params = LLMTrainingParams(
model="meta-llama/Llama-3.2-1B",
data_path="./data.jsonl",
project_name="lora-model",
trainer="sft",
# LoRA configuration
peft=True,
lora_r=16,
lora_alpha=32,
lora_dropout=0.05,
target_modules="all-linear", # Default: all-linear
epochs=3,
batch_size=4,
lr=2e-4, # Higher LR works with LoRA
)
project = AutoTrainProject(params=params, backend="local", process=True)
project.create()
Parameters
| Parameter | Description | Default |
|---|---|---|
| peft | Enable LoRA | False |
| lora_r | Rank (size of adapters) | 16 |
| lora_alpha | Scaling factor | 32 |
| lora_dropout | Dropout rate | 0.05 |
| target_modules | Modules to adapt | all-linear |
Rank (lora_r)
Higher rank = more parameters = more capacity:
| Rank | Use Case |
|---|---|
| 8 | Simple tasks, very memory constrained |
| 16 | Standard (recommended) |
| 32-64 | Complex tasks, more memory available |
| 128+ | Near full fine-tuning capacity |
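For a sense of scale, each adapted linear layer adds roughly r * (d_in + d_out) trainable parameters, so adapter size grows linearly with rank. A quick back-of-the-envelope check for a hypothetical 4096x4096 projection:

# Trainable adapter parameters per linear layer, compared with the frozen base weight.
d_in = d_out = 4096                 # hypothetical projection size
full = d_in * d_out                 # parameters in the frozen base weight

for r in (8, 16, 32, 64, 128):
    lora = r * (d_in + d_out)
    print(f"r={r:<4} adapter params: {lora:>10,}  ({100 * lora / full:.2f}% of the base weight)")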
Alpha
The alpha/rank ratio affects learning:
# Standard scaling
lora_r=16
lora_alpha=32 # alpha/r = 2
# More aggressive
lora_r=16
lora_alpha=64 # alpha/r = 4
# Conservative
lora_r=16
lora_alpha=16 # alpha/r = 1
Target Modules
By default, LoRA targets all linear layers (all-linear). You can customize:
# All linear layers (default)
target_modules="all-linear"
# Attention layers only
target_modules="q_proj,k_proj,v_proj,o_proj"
# Include MLP
target_modules="q_proj,k_proj,v_proj,o_proj,gate_proj,up_proj,down_proj"
With Quantization
Combine LoRA with quantization for maximum memory savings:
aitraining llm --train \
--model meta-llama/Llama-3.2-8B \
--data-path ./data.jsonl \
--project-name quantized-lora \
--peft \
--quantization int4
params = LLMTrainingParams(
model="meta-llama/Llama-3.2-8B",
data_path="./data.jsonl",
project_name="quantized-lora",
peft=True,
lora_r=16,
quantization="int4", # or "int8"
)
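This combination is essentially the QLoRA recipe: the base weights sit in 4-bit precision while only the LoRA adapters train. If you want to reproduce the same setup directly with transformers and peft (a sketch under those assumptions, not the exact code path the trainer uses):

import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig
from peft import LoraConfig, get_peft_model, prepare_model_for_kbit_training

# Load the base model with 4-bit (NF4) quantized weights; requires bitsandbytes.
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
)
model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-3.2-8B", quantization_config=bnb_config
)

# Prepare the quantized model for training, then attach the LoRA adapters.
model = prepare_model_for_kbit_training(model)
model = get_peft_model(model, LoraConfig(r=16, lora_alpha=32, target_modules="all-linear"))
model.print_trainable_parameters()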
Memory Comparison
| Model | Full Fine-tune | LoRA | LoRA + 4bit |
|---|---|---|---|
| 1B | 8 GB | 4 GB | 3 GB |
| 7B | 56 GB | 16 GB | 8 GB |
| 13B | 104 GB | 32 GB | 16 GB |
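These figures are rough estimates; actual usage depends on sequence length, batch size, and gradient checkpointing. To measure what your own run really needs, you can ask PyTorch for the peak allocation:

import torch

# Report the peak GPU memory PyTorch allocated during (or after) a training run.
if torch.cuda.is_available():
    peak_gb = torch.cuda.max_memory_allocated() / 1024**3
    print(f"Peak GPU memory allocated: {peak_gb:.1f} GB")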
Merging Adapters
By default, LoRA adapters are automatically merged into the base model after training. This makes inference simpler: you get a single, self-contained model with no separate adapter files.
Default Behavior (Merged)
params = LLMTrainingParams(
...
peft=True,
# merge_adapter=True is the default
)
Save Adapters Only
To save only the adapter files (much smaller, but the base model is required at inference time):
aitraining llm --train \
--model meta-llama/Llama-3.2-1B \
--data-path ./data.jsonl \
--project-name lora-model \
--peft \
--no-merge-adapter
Manual Merge Later
aitraining tools merge-llm-adapter \
--base-model-path meta-llama/Llama-3.2-1B \
--adapter-path ./lora-model \
--output-folder ./merged-model
You must specify either --output-folder to save locally or --push-to-hub to upload to the Hugging Face Hub.
| Parameter | Description | Required |
|---|---|---|
| --base-model-path | Base model to merge the adapter into | Yes |
| --adapter-path | Path to the LoRA adapter | Yes |
| --output-folder | Local output directory | One of --output-folder / --push-to-hub |
| --push-to-hub | Push the merged model to the Hugging Face Hub | One of --output-folder / --push-to-hub |
| --token | Hugging Face token (for Hub push) | No |
| --pad-to-multiple-of | Pad vocabulary size to a multiple of this value | No |
Or in Python:
from peft import PeftModel
from transformers import AutoModelForCausalLM
# Load base model
model = AutoModelForCausalLM.from_pretrained("meta-llama/Llama-3.2-1B")
# Load and merge adapter
model = PeftModel.from_pretrained(model, "./lora-model")
model = model.merge_and_unload()
# Save merged model
model.save_pretrained("./merged-model")
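If the merged folder should be usable on its own, also save the tokenizer next to the weights (assumed here to be the base model's tokenizer):

from transformers import AutoTokenizer

# Save the tokenizer alongside the merged weights so ./merged-model is self-contained.
tokenizer = AutoTokenizer.from_pretrained("meta-llama/Llama-3.2-1B")
tokenizer.save_pretrained("./merged-model")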
Kohya Conversion
Convert LoRA adapters to a Kohya-compatible .safetensors format:
aitraining tools convert_to_kohya \
--adapter-path ./lora-model \
--output-path ./kohya-lora.safetensors
Loading Adapters
Use adapters without merging:
from peft import PeftModel
from transformers import AutoModelForCausalLM, AutoTokenizer
# Load base model
model = AutoModelForCausalLM.from_pretrained("meta-llama/Llama-3.2-1B")
tokenizer = AutoTokenizer.from_pretrained("meta-llama/Llama-3.2-1B")
# Load adapter
model = PeftModel.from_pretrained(model, "./lora-model")
# Use for inference
inputs = tokenizer("Hello", return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=50)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
Best Practices
Training
- Use a higher learning rate than for full fine-tuning (2e-4 to 1e-3)
- LoRA benefits from longer training
- Consider targeting all linear layers for complex tasks
Memory
- Start with lora_r=16
- Add quantization if needed
- Use gradient checkpointing (on by default)
Quality
- Higher rank generally = better quality
- Test on your specific task
- Compare with full fine-tuning if memory allows
Next Steps