ORPO Training
ORPO combines SFT and preference optimization in a single training phase.
What is ORPO?
ORPO (Odds Ratio Preference Optimization) is a simpler alternative to DPO that doesn’t require a reference model. It optimizes preferences using odds ratios directly, reducing memory usage and training complexity.
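For intuition, here is a minimal sketch of the objective from the ORPO paper: the usual SFT loss on the chosen response plus a weighted odds-ratio term that pushes the model to prefer chosen over rejected completions. The function name and inputs are illustrative, not the library's internals; `beta` corresponds to the `dpo_beta` parameter shown below.

```python
import torch
import torch.nn.functional as F

def orpo_loss_sketch(sft_loss, chosen_logps, rejected_logps, beta=0.1):
    """Illustrative ORPO objective: SFT loss + beta * odds-ratio penalty."""
    # chosen_logps / rejected_logps: average per-token log-probabilities of the
    # chosen and rejected completions under the model being trained.
    # Log-odds of a completion: log(p / (1 - p)), computed stably from log p.
    log_odds_chosen = chosen_logps - torch.log1p(-torch.exp(chosen_logps))
    log_odds_rejected = rejected_logps - torch.log1p(-torch.exp(rejected_logps))
    # Penalize the model when the rejected completion has higher odds than the chosen one.
    odds_ratio_loss = -F.logsigmoid(log_odds_chosen - log_odds_rejected)
    return sft_loss + beta * odds_ratio_loss.mean()
```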
Quick Start
```bash
aitraining llm --train \
  --model google/gemma-2-2b \
  --data-path ./preferences.jsonl \
  --project-name gemma-orpo \
  --trainer orpo \
  --prompt-text-column prompt \
  --text-column chosen \
  --rejected-text-column rejected \
  --peft
```
ORPO requires `--prompt-text-column` and `--rejected-text-column`. `--text-column` defaults to `"text"`, so only specify it if your chosen column has a different name.
Python API
```python
from autotrain.trainers.clm.params import LLMTrainingParams
from autotrain.project import AutoTrainProject

params = LLMTrainingParams(
    model="google/gemma-2-2b",
    data_path="./preferences.jsonl",
    project_name="gemma-orpo",
    trainer="orpo",
    prompt_text_column="prompt",
    text_column="chosen",
    rejected_text_column="rejected",
    dpo_beta=0.1,                # Default: 0.1
    max_prompt_length=128,       # Default: 128
    max_completion_length=None,  # Default: None
    epochs=3,
    batch_size=2,
    lr=5e-5,
    peft=True,
    lora_r=16,
)

project = AutoTrainProject(params=params, backend="local", process=True)
project.create()
```
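Once training completes, the LoRA adapter and tokenizer are saved under the project directory. Below is a minimal sketch of loading them for inference with `peft` and `transformers`; the output path is assumed to match `project_name`, so adjust it to wherever your run actually writes its artifacts.

```python
from peft import AutoPeftModelForCausalLM
from transformers import AutoTokenizer

# Assumed output location; adjust if your run saves elsewhere
output_dir = "gemma-orpo"

model = AutoPeftModelForCausalLM.from_pretrained(output_dir)
tokenizer = AutoTokenizer.from_pretrained(output_dir)

inputs = tokenizer("What is AI?", return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=100)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```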
The data format is the same as for DPO: JSONL preference pairs with `prompt`, `chosen`, and `rejected` fields:
```json
{
  "prompt": "What is AI?",
  "chosen": "AI is artificial intelligence, a field of computer science focused on creating systems that can perform tasks requiring human intelligence.",
  "rejected": "AI is just robots."
}
```
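If your preference pairs live in Python objects, here is a minimal sketch for writing them out as JSON Lines (one object per line); the rows and file name are illustrative.

```python
import json

# Illustrative preference pairs; replace with your own data
pairs = [
    {
        "prompt": "What is AI?",
        "chosen": "AI is artificial intelligence, a field of computer science ...",
        "rejected": "AI is just robots.",
    },
]

# Write one JSON object per line (the JSONL format expected by --data-path)
with open("preferences.jsonl", "w") as f:
    for row in pairs:
        f.write(json.dumps(row) + "\n")
```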
ORPO vs DPO
| Aspect | ORPO | DPO |
|---|---|---|
| Reference model | Not needed | Not needed with PEFT, required for full fine-tuning |
| Memory usage | Lower | Higher (if using reference model) |
| Training speed | Faster | Slower |
| SFT phase | Combined | Separate |
| Complexity | Simpler | More options |
Parameters
| Parameter | Description | Default |
|---|---|---|
| `trainer` | Set to `"orpo"` | Required |
| `dpo_beta` | Odds-ratio weight | `0.1` |
| `max_prompt_length` | Max prompt tokens | `128` |
| `max_completion_length` | Max response tokens | `None` |
When to Use ORPO
Choose ORPO when:
- Memory is limited (no reference model needed)
- You want combined SFT + alignment
- You prefer a simpler training pipeline
- You are starting from a base model (not instruction-tuned)
Choose DPO when:
- You need fine-grained control
- You are working with models that are already instruction-tuned
- Reference model behavior is important
Example: Customer Support
```python
from autotrain.trainers.clm.params import LLMTrainingParams

params = LLMTrainingParams(
    model="google/gemma-2-2b",
    data_path="./support_preferences.jsonl",
    project_name="support-bot",
    trainer="orpo",
    prompt_text_column="prompt",
    text_column="chosen",
    rejected_text_column="rejected",
    dpo_beta=0.15,
    epochs=3,
    batch_size=2,
    gradient_accumulation=4,
    lr=2e-5,
    peft=True,
    lora_r=32,
    lora_alpha=64,
    log="wandb",
)
```
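Launching works the same way as in the Python API example above, continuing from the `params` defined in that block:

```python
from autotrain.project import AutoTrainProject

project = AutoTrainProject(params=params, backend="local", process=True)
project.create()
```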
Next Steps