
ORPO Training

ORPO combines SFT and preference optimization in a single training phase.

What is ORPO?

ORPO (Odds Ratio Preference Optimization) is a simpler alternative to DPO that doesn’t require a reference model. It optimizes preferences using odds ratios directly, reducing memory usage and training complexity.
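
Concretely, ORPO adds an odds-ratio penalty on top of the ordinary SFT cross-entropy loss instead of comparing against a frozen reference model. The sketch below is illustrative only (AutoTrain delegates the actual loss computation to its trainer backend); it assumes length-normalized log-probabilities for the chosen and rejected responses have already been computed, and uses beta as the odds-ratio weight (the role played by dpo_beta).

import torch
import torch.nn.functional as F

def orpo_preference_term(chosen_logps, rejected_logps, beta=0.1):
    """Illustrative odds-ratio term comparing chosen vs. rejected responses.

    chosen_logps / rejected_logps: per-example, length-normalized log P(y|x).
    """
    # odds(y|x) = P(y|x) / (1 - P(y|x)); computed in log space for stability
    log_odds_chosen = chosen_logps - torch.log1p(-torch.exp(chosen_logps))
    log_odds_rejected = rejected_logps - torch.log1p(-torch.exp(rejected_logps))
    # penalize cases where the rejected response has higher odds than the chosen one
    return -beta * F.logsigmoid(log_odds_chosen - log_odds_rejected).mean()

# Conceptually, the full ORPO objective is the SFT cross-entropy on the chosen
# responses plus the preference term above -- no reference model involved.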

Quick Start

aitraining llm --train \
  --model google/gemma-2-2b \
  --data-path ./preferences.jsonl \
  --project-name gemma-orpo \
  --trainer orpo \
  --prompt-text-column prompt \
  --text-column chosen \
  --rejected-text-column rejected \
  --peft
ORPO requires --prompt-text-column and --rejected-text-column. The --text-column defaults to "text", so only specify it if your chosen column has a different name.

Python API

from autotrain.trainers.clm.params import LLMTrainingParams
from autotrain.project import AutoTrainProject

params = LLMTrainingParams(
    model="google/gemma-2-2b",
    data_path="./preferences.jsonl",
    project_name="gemma-orpo",

    trainer="orpo",
    prompt_text_column="prompt",
    text_column="chosen",
    rejected_text_column="rejected",
    dpo_beta=0.1,                # odds-ratio weight; shares the dpo_beta name with DPO (default 0.1)
    max_prompt_length=128,       # maximum prompt tokens (default 128)
    max_completion_length=None,  # maximum response tokens; None keeps full responses (default None)

    epochs=3,
    batch_size=2,
    lr=5e-5,

    peft=True,
    lora_r=16,
)

project = AutoTrainProject(params=params, backend="local", process=True)
project.create()

Data Format

ORPO uses the same preference-pair format as DPO:
{
  "prompt": "What is AI?",
  "chosen": "AI is artificial intelligence, a field of computer science focused on creating systems that can perform tasks requiring human intelligence.",
  "rejected": "AI is just robots."
}
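
If your pairs live in Python, a minimal sketch for producing the expected JSONL file (one JSON object per line, with keys matching the column names you pass to the trainer) looks like this:

import json

pairs = [
    {
        "prompt": "What is AI?",
        "chosen": "AI is artificial intelligence, a field of computer science focused on creating systems that can perform tasks requiring human intelligence.",
        "rejected": "AI is just robots.",
    },
]

# One JSON object per line; the keys must match --prompt-text-column,
# --text-column, and --rejected-text-column.
with open("preferences.jsonl", "w") as f:
    for pair in pairs:
        f.write(json.dumps(pair) + "\n")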

ORPO vs DPO

Aspect           | ORPO       | DPO
Reference model  | Not needed | Not needed with PEFT; required for full fine-tuning
Memory usage     | Lower      | Higher (if using a reference model)
Training speed   | Faster     | Slower
SFT phase        | Combined   | Separate
Complexity       | Simpler    | More options

Parameters

Parameter             | Description         | Default
trainer               | Set to "orpo"       | Required
dpo_beta              | Odds ratio weight   | 0.1
max_prompt_length     | Max prompt tokens   | 128
max_completion_length | Max response tokens | None
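
The two length parameters control truncation of the tokenized pair. A conceptual sketch (not AutoTrain's internal implementation) of how such limits are typically applied:

from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("google/gemma-2-2b")

def truncate_pair(prompt, response, max_prompt_length=128, max_completion_length=None):
    # cap the prompt at max_prompt_length tokens
    prompt_ids = tokenizer(prompt)["input_ids"][:max_prompt_length]
    # cap the response only if max_completion_length is set
    response_ids = tokenizer(response, add_special_tokens=False)["input_ids"]
    if max_completion_length is not None:
        response_ids = response_ids[:max_completion_length]
    return prompt_ids + response_ids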

When to Use ORPO

Choose ORPO when:
  • Memory is limited (no reference model needed)
  • You want combined SFT + alignment
  • Simpler training pipeline preferred
  • Starting from a base model (not instruction-tuned)
Choose DPO when:
  • You need fine-grained control
  • Working with already instruction-tuned models
  • Reference model behavior is important

Example: Customer Support

params = LLMTrainingParams(
    model="google/gemma-2-2b",
    data_path="./support_preferences.jsonl",
    project_name="support-bot",

    trainer="orpo",
    prompt_text_column="prompt",       # required for ORPO
    text_column="chosen",
    rejected_text_column="rejected",   # required for ORPO
    dpo_beta=0.15,

    epochs=3,
    batch_size=2,
    gradient_accumulation=4,
    lr=2e-5,

    peft=True,
    lora_r=32,
    lora_alpha=64,

    log="wandb",
)
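
As in the Quick Start, the params object is handed to AutoTrainProject to launch the run locally:

from autotrain.project import AutoTrainProject

project = AutoTrainProject(params=params, backend="local", process=True)
project.create()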

Next Steps