ORPO Training
ORPO combines SFT and preference optimization in a single training phase.
What is ORPO?
ORPO (Odds Ratio Preference Optimization) is a simpler alternative to DPO that doesn’t require a reference model. It optimizes preferences using odds ratios directly, reducing memory usage and training complexity.
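For intuition, here is a minimal sketch of the objective from the ORPO paper: the usual SFT loss on the chosen response plus a weighted odds-ratio term that pushes the model to prefer chosen over rejected completions. The function name and inputs are illustrative, not the library's internals; `beta` corresponds to the `dpo_beta` parameter shown below.

```python
import torch
import torch.nn.functional as F

def orpo_loss_sketch(sft_loss, chosen_logps, rejected_logps, beta=0.1):
    """Illustrative ORPO objective: SFT loss + beta * odds-ratio penalty."""
    # chosen_logps / rejected_logps: average per-token log-probabilities of the
    # chosen and rejected completions under the model being trained.
    # Log-odds of a completion: log(p / (1 - p)), computed stably from log p.
    log_odds_chosen = chosen_logps - torch.log1p(-torch.exp(chosen_logps))
    log_odds_rejected = rejected_logps - torch.log1p(-torch.exp(rejected_logps))
    # Penalize the model when the rejected completion has higher odds than the chosen one.
    odds_ratio_loss = -F.logsigmoid(log_odds_chosen - log_odds_rejected)
    return sft_loss + beta * odds_ratio_loss.mean()
```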
Quick Start
```bash
aitraining llm --train \
  --model google/gemma-2-2b \
  --data-path ./preferences.jsonl \
  --project-name gemma-orpo \
  --trainer orpo \
  --prompt-text-column prompt \
  --text-column chosen \
  --rejected-text-column rejected \
  --peft
```
ORPO requires `--prompt-text-column` and `--rejected-text-column`. `--text-column` defaults to `"text"`, so only specify it if your chosen column has a different name.
Python API
```python
from autotrain.trainers.clm.params import LLMTrainingParams
from autotrain.project import AutoTrainProject

params = LLMTrainingParams(
    model="google/gemma-2-2b",
    data_path="./preferences.jsonl",
    project_name="gemma-orpo",
    trainer="orpo",
    prompt_text_column="prompt",
    text_column="chosen",
    rejected_text_column="rejected",
    dpo_beta=0.1,                # Default: 0.1
    max_prompt_length=128,       # Default: 128
    max_completion_length=None,  # Default: None
    epochs=3,
    batch_size=2,
    lr=5e-5,
    peft=True,
    lora_r=16,
)

project = AutoTrainProject(params=params, backend="local", process=True)
project.create()
```
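Once training completes, the LoRA adapter and tokenizer are saved under the project directory. Below is a minimal sketch of loading them for inference with `peft` and `transformers`; the output path is assumed to match `project_name`, so adjust it to wherever your run actually writes its artifacts.

```python
from peft import AutoPeftModelForCausalLM
from transformers import AutoTokenizer

# Assumed output location; adjust if your run saves elsewhere
output_dir = "gemma-orpo"

model = AutoPeftModelForCausalLM.from_pretrained(output_dir)
tokenizer = AutoTokenizer.from_pretrained(output_dir)

inputs = tokenizer("What is AI?", return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=100)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```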
The data format is the same as for DPO: JSONL preference pairs with `prompt`, `chosen`, and `rejected` fields:
```json
{
  "prompt": "What is AI?",
  "chosen": "AI is artificial intelligence, a field of computer science focused on creating systems that can perform tasks requiring human intelligence.",
  "rejected": "AI is just robots."
}
```
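If your preference pairs live in Python objects, here is a minimal sketch for writing them out as JSON Lines (one object per line); the rows and file name are illustrative.

```python
import json

# Illustrative preference pairs; replace with your own data
pairs = [
    {
        "prompt": "What is AI?",
        "chosen": "AI is artificial intelligence, a field of computer science ...",
        "rejected": "AI is just robots.",
    },
]

# Write one JSON object per line (the JSONL format expected by --data-path)
with open("preferences.jsonl", "w") as f:
    for row in pairs:
        f.write(json.dumps(row) + "\n")
```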
ORPO vs DPO
| Aspect | ORPO | DPO |
|---|---|---|
| Reference model | Not needed | Not needed with PEFT, required for full fine-tuning |
| Memory usage | Lower | Higher (if using reference model) |
| Training speed | Faster | Slower |
| SFT phase | Combined | Separate |
| Complexity | Simpler | More options |
Parameters
| Parameter | Description | Default |
|---|---|---|
| `trainer` | Set to `"orpo"` | Required |
| `dpo_beta` | Odds-ratio weight | `0.1` |
| `max_prompt_length` | Max prompt tokens | `128` |
| `max_completion_length` | Max response tokens | `None` |
When to Use ORPO
Choose ORPO when:
- Memory is limited (no reference model needed)
- You want combined SFT + alignment
- You prefer a simpler training pipeline
- You are starting from a base model (not instruction-tuned)
Choose DPO when:
- You need fine-grained control
- You are working with models that are already instruction-tuned
- Reference model behavior is important
Example: Customer Support
```python
from autotrain.trainers.clm.params import LLMTrainingParams

params = LLMTrainingParams(
    model="google/gemma-2-2b",
    data_path="./support_preferences.jsonl",
    project_name="support-bot",
    trainer="orpo",
    prompt_text_column="prompt",
    text_column="chosen",
    rejected_text_column="rejected",
    dpo_beta=0.15,
    epochs=3,
    batch_size=2,
    gradient_accumulation=4,
    lr=2e-5,
    peft=True,
    lora_r=32,
    lora_alpha=64,
    log="wandb",
)
```
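Launching works the same way as in the Python API example above, continuing from the `params` defined in that block:

```python
from autotrain.project import AutoTrainProject

project = AutoTrainProject(params=params, backend="local", process=True)
project.create()
```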
Next Steps