
LoRA & PEFT

Parameter-Efficient Fine-Tuning (PEFT) lets you train large models with far less memory by updating only a small set of added parameters.

What is LoRA?

LoRA (Low-Rank Adaptation) adds small trainable low-rank matrices to the model while keeping the base weights frozen, so only a tiny fraction of parameters needs gradients and optimizer state. This dramatically reduces memory usage and training time.
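
Conceptually, each adapted layer computes y = W·x + (alpha / r)·B·A·x, where W is the frozen base weight and A, B are the small trainable matrices. The following is a minimal PyTorch sketch of that idea, not the actual PEFT implementation:
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    """Illustrative LoRA wrapper around a frozen linear layer (sketch only)."""

    def __init__(self, base: nn.Linear, r: int = 16, alpha: int = 32):
        super().__init__()
        self.base = base
        for p in self.base.parameters():
            p.requires_grad_(False)  # base weights stay frozen
        self.lora_A = nn.Parameter(torch.randn(r, base.in_features) * 0.01)
        self.lora_B = nn.Parameter(torch.zeros(base.out_features, r))  # zero init -> no change at start
        self.scaling = alpha / r     # the learned update is scaled by alpha / r

    def forward(self, x):
        # frozen path + small trainable low-rank path
        return self.base(x) + (x @ self.lora_A.T @ self.lora_B.T) * self.scaling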

Quick Start

aitraining llm --train \
  --model meta-llama/Llama-3.2-1B \
  --data-path ./data.jsonl \
  --project-name lora-model \
  --peft \
  --lora-r 16 \
  --lora-alpha 32

Python API

from autotrain.trainers.clm.params import LLMTrainingParams
from autotrain.project import AutoTrainProject

params = LLMTrainingParams(
    model="meta-llama/Llama-3.2-1B",
    data_path="./data.jsonl",
    project_name="lora-model",

    trainer="sft",

    # LoRA configuration
    peft=True,
    lora_r=16,
    lora_alpha=32,
    lora_dropout=0.05,
    target_modules="all-linear",  # Default: all-linear

    epochs=3,
    batch_size=4,
    lr=2e-4,  # Higher LR works with LoRA
)

project = AutoTrainProject(params=params, backend="local", process=True)
project.create()

Parameters

| Parameter | Description | Default |
|---|---|---|
| peft | Enable LoRA | False |
| lora_r | Rank (size of adapters) | 16 |
| lora_alpha | Scaling factor | 32 |
| lora_dropout | Dropout rate | 0.05 |
| target_modules | Modules to adapt | all-linear |
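
The trainer builds the PEFT configuration for you; the options above map roughly onto a peft LoraConfig, as in the sketch below (not the trainer's exact code):
from peft import LoraConfig

lora_config = LoraConfig(
    r=16,                         # lora_r
    lora_alpha=32,                # lora_alpha
    lora_dropout=0.05,            # lora_dropout
    target_modules="all-linear",  # target_modules
    task_type="CAUSAL_LM",
)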

Rank (lora_r)

Higher rank = more parameters = more capacity:
| Rank | Use Case |
|---|---|
| 8 | Simple tasks, very memory constrained |
| 16 | Standard (recommended) |
| 32-64 | Complex tasks, more memory available |
| 128+ | Near full fine-tuning capacity |
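
For a single linear layer of shape (d_out, d_in), LoRA adds r × (d_in + d_out) trainable parameters, so the rank directly controls adapter size. A quick back-of-envelope calculation with an illustrative 2048×2048 projection:
def lora_params(d_in: int, d_out: int, r: int) -> int:
    """Trainable parameters LoRA adds to one linear layer: A is (r, d_in), B is (d_out, r)."""
    return r * (d_in + d_out)

for r in (8, 16, 32, 64, 128):
    print(f"r={r:>3}: {lora_params(2048, 2048, r):,} trainable params per layer")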

Alpha

The LoRA update is scaled by alpha / r, so the alpha/rank ratio controls how strongly the adapters affect the model:
# Standard scaling
lora_r=16
lora_alpha=32  # alpha/r = 2

# More aggressive
lora_r=16
lora_alpha=64  # alpha/r = 4

# Conservative
lora_r=16
lora_alpha=16  # alpha/r = 1
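
Because the update is multiplied by alpha / r, raising alpha at a fixed rank amplifies the learned change, roughly similar in effect to raising the learning rate. The effective scale for the three configurations above:
for r, alpha in [(16, 32), (16, 64), (16, 16)]:
    print(f"r={r}, alpha={alpha} -> scaling = {alpha / r:g}")  # delta_W = (alpha / r) * B @ A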

Target Modules

By default, LoRA targets all linear layers (all-linear). You can customize:
# All linear layers (default)
target_modules="all-linear"

# Attention layers only
target_modules="q_proj,k_proj,v_proj,o_proj"

# Include MLP
target_modules="q_proj,k_proj,v_proj,o_proj,gate_proj,up_proj,down_proj"

With Quantization

Combine LoRA with quantization for maximum memory savings:
aitraining llm --train \
  --model meta-llama/Llama-3.2-8B \
  --data-path ./data.jsonl \
  --project-name quantized-lora \
  --peft \
  --quantization int4

Or in Python:

params = LLMTrainingParams(
    model="meta-llama/Llama-3.2-8B",
    data_path="./data.jsonl",
    project_name="quantized-lora",

    peft=True,
    lora_r=16,
    quantization="int4",  # or "int8"
)
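
This combination is essentially a QLoRA-style setup: the base model is loaded in 4-bit while the LoRA adapters train in higher precision. A rough sketch of an equivalent manual setup with transformers and peft (not the trainer's exact code; the model name follows the example above):
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig
from peft import LoraConfig, get_peft_model, prepare_model_for_kbit_training

bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
)

# Base weights load in 4-bit; only the LoRA adapters are trained
model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-3.2-8B", quantization_config=bnb_config
)
model = prepare_model_for_kbit_training(model)
lora_config = LoraConfig(r=16, lora_alpha=32, target_modules="all-linear", task_type="CAUSAL_LM")
model = get_peft_model(model, lora_config)
model.print_trainable_parameters()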

Memory Comparison

| Model | Full Fine-tune | LoRA | LoRA + 4-bit |
|---|---|---|---|
| 1B | 8 GB | 4 GB | 3 GB |
| 7B | 56 GB | 16 GB | 8 GB |
| 13B | 104 GB | 32 GB | 16 GB |

Merging Adapters

By default, LoRA adapters are automatically merged into the base model after training. This makes inference simpler: you get a single, standalone model ready to use.

Default Behavior (Merged)

params = LLMTrainingParams(
    ...
    peft=True,
    # merge_adapter=True is the default
)
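
Because the adapters are merged, the training output loads like any regular model, with no peft dependency at inference time. A small sketch, assuming the output directory matches the project name used above:
from transformers import AutoModelForCausalLM, AutoTokenizer

# Merged output behaves like a normal checkpoint -- no adapter loading needed
model = AutoModelForCausalLM.from_pretrained("./lora-model")
tokenizer = AutoTokenizer.from_pretrained("./lora-model")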

Save Adapters Only

To save only the adapter files (much smaller, but the base model is required at inference time):
aitraining llm --train \
  --model meta-llama/Llama-3.2-1B \
  --data-path ./data.jsonl \
  --project-name lora-model \
  --peft \
  --no-merge-adapter

Manual Merge Later

aitraining tools merge-llm-adapter \
  --base-model-path meta-llama/Llama-3.2-1B \
  --adapter-path ./lora-model \
  --output-folder ./merged-model

You must specify either --output-folder to save locally or --push-to-hub to upload to Hugging Face Hub.

Merge Tool Parameters

| Parameter | Description | Required |
|---|---|---|
| --base-model-path | Base model to merge adapter into | Yes |
| --adapter-path | Path to LoRA adapter | Yes |
| --output-folder | Local output directory | One of these two |
| --push-to-hub | Push to Hugging Face Hub | One of these two |
| --token | Hugging Face token (for hub push) | No |
| --pad-to-multiple-of | Pad vocab size | No |

Or in Python:
from peft import PeftModel
from transformers import AutoModelForCausalLM

# Load base model
model = AutoModelForCausalLM.from_pretrained("meta-llama/Llama-3.2-1B")

# Load and merge adapter
model = PeftModel.from_pretrained(model, "./lora-model")
model = model.merge_and_unload()

# Save merged model
model.save_pretrained("./merged-model")
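
If the merged model will be used on its own, it is usually worth saving a tokenizer next to it as well (a small addition, assuming the base model's tokenizer is the one you want):
from transformers import AutoTokenizer

# Keep the tokenizer alongside the merged weights so the folder is self-contained
tokenizer = AutoTokenizer.from_pretrained("meta-llama/Llama-3.2-1B")
tokenizer.save_pretrained("./merged-model")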

Convert to Kohya Format

Convert LoRA adapters to Kohya-compatible .safetensors format:
aitraining tools convert_to_kohya \
  --adapter-path ./lora-model \
  --output-path ./kohya-lora.safetensors

Loading Adapters

Use adapters without merging:
from peft import PeftModel
from transformers import AutoModelForCausalLM, AutoTokenizer

# Load base model
model = AutoModelForCausalLM.from_pretrained("meta-llama/Llama-3.2-1B")
tokenizer = AutoTokenizer.from_pretrained("meta-llama/Llama-3.2-1B")

# Load adapter
model = PeftModel.from_pretrained(model, "./lora-model")

# Use for inference
inputs = tokenizer("Hello", return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=50)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))

Best Practices

Training

  • Use a higher learning rate than for full fine-tuning (2e-4 to 1e-3)
  • LoRA often benefits from longer training
  • Consider targeting all linear layers for complex tasks

Memory

  • Start with lora_r=16
  • Add quantization if needed
  • Use gradient checkpointing (on by default)

Quality

  • Higher rank generally = better quality
  • Test on your specific task
  • Compare with full fine-tuning if memory allows

Next Steps