
PPO Training

Train language models using Proximal Policy Optimization (PPO) for reinforcement learning from human feedback (RLHF).

Overview

PPO training is a two-step process:
  1. Train a Reward Model - Train a model to score responses (see Reward Modeling)
  2. Run PPO Training - Use the reward model to guide policy optimization

Quick Start

aitraining llm --train \
  --model google/gemma-3-270m \
  --data-path ./prompts.jsonl \
  --project-name ppo-model \
  --trainer ppo \
  --rl-reward-model-path ./reward-model

Python API

from autotrain.trainers.clm.params import LLMTrainingParams
from autotrain.project import AutoTrainProject

params = LLMTrainingParams(
    model="google/gemma-3-270m",
    data_path="./prompts.jsonl",
    project_name="ppo-model",

    trainer="ppo",
    rl_reward_model_path="./reward-model",

    # PPO hyperparameters
    rl_gamma=0.99,
    rl_gae_lambda=0.95,
    rl_kl_coef=0.1,
    rl_clip_range=0.2,
    rl_num_ppo_epochs=4,

    epochs=1,
    batch_size=4,
    lr=1e-5,
)

project = AutoTrainProject(params=params, backend="local", process=True)
project.create()

Requirements

PPO training requires either --rl-reward-model-path (path to a trained reward model) or --model-ref (reference model for KL divergence). At least one must be specified.
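
For example, if no reward model is available yet, PPO can be pointed at a reference checkpoint instead (a sketch based on the requirement above; using the base checkpoint as the KL reference is an illustrative choice, not a recommendation from this page):

aitraining llm --train \
  --model google/gemma-3-270m \
  --data-path ./prompts.jsonl \
  --project-name ppo-model \
  --trainer ppo \
  --model-ref google/gemma-3-270m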

Parameters

Core PPO Parameters

Parameter | CLI Flag | Default | Description
rl_reward_model_path | --rl-reward-model-path | None | Path to reward model (required unless --model-ref is set)
rl_gamma | --rl-gamma | 0.99 | Discount factor (0.9-0.99)
rl_gae_lambda | --rl-gae-lambda | 0.95 | GAE lambda for advantage estimation (0.9-0.99)
rl_kl_coef | --rl-kl-coef | 0.1 | KL divergence coefficient (0.01-0.5)
rl_value_loss_coef | --rl-value-loss-coef | 1.0 | Value loss coefficient (0.5-2.0)
rl_clip_range | --rl-clip-range | 0.2 | PPO clipping range (0.1-0.3)
rl_value_clip_range | --rl-value-clip-range | 0.2 | Value function clipping range
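
For reference, rl_clip_range plays the role of epsilon in the standard PPO clipped objective, and rl_gamma / rl_gae_lambda are the gamma and lambda of generalized advantage estimation (standard definitions, shown here only to clarify the parameter names):

L^{\mathrm{CLIP}}(\theta) = \mathbb{E}_t\left[\min\big(r_t(\theta)\,\hat{A}_t,\ \mathrm{clip}(r_t(\theta),\,1-\epsilon,\,1+\epsilon)\,\hat{A}_t\big)\right],
\quad r_t(\theta) = \frac{\pi_\theta(a_t \mid s_t)}{\pi_{\theta_{\mathrm{old}}}(a_t \mid s_t)}

\hat{A}_t = \sum_{l \ge 0} (\gamma\lambda)^l\,\delta_{t+l},
\quad \delta_t = r_t + \gamma V(s_{t+1}) - V(s_t)

The rl_kl_coef term additionally penalizes the divergence between the updated policy and the reference model, keeping generations close to the starting checkpoint.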

Training Parameters

Parameter | CLI Flag | Default | Description
rl_num_ppo_epochs | --rl-num-ppo-epochs | 4 | PPO epochs per batch
rl_chunk_size | --rl-chunk-size | 128 | Training chunk size
rl_mini_batch_size | --rl-mini-batch-size | 8 | Mini-batch size
rl_optimize_device_cache | --rl-optimize-device-cache | True | Memory optimization
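
A sketch of how these fit together in the Python API. The interpretation in the comments (each chunk of rollouts being split into mini-batches and reused for several PPO epochs) is the usual PPO convention and is assumed here, not stated on this page:

from autotrain.trainers.clm.params import LLMTrainingParams

params = LLMTrainingParams(
    model="google/gemma-3-270m",
    data_path="./prompts.jsonl",
    project_name="ppo-model",
    trainer="ppo",
    rl_reward_model_path="./reward-model",
    rl_chunk_size=128,       # rollouts processed per optimization round (assumed meaning)
    rl_mini_batch_size=8,    # each chunk split into mini-batches of this size (assumed)
    rl_num_ppo_epochs=4,     # each chunk reused for this many PPO passes
)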

Generation Parameters

Parameter | CLI Flag | Default | Description
rl_max_new_tokens | --rl-max-new-tokens | 128 | Max tokens to generate
rl_top_k | --rl-top-k | 50 | Top-k sampling
rl_top_p | --rl-top-p | 1.0 | Top-p (nucleus) sampling
rl_temperature | --rl-temperature | 1.0 | Generation temperature
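
For example, to keep early experiments cheaper and less random, you can shorten rollouts and sharpen sampling (the values below are illustrative, not recommendations):

aitraining llm --train \
  --model google/gemma-3-270m \
  --data-path ./prompts.jsonl \
  --project-name ppo-model \
  --trainer ppo \
  --rl-reward-model-path ./reward-model \
  --rl-max-new-tokens 64 \
  --rl-temperature 0.7 \
  --rl-top-p 0.9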

Advanced Parameters

Parameter | CLI Flag | Default | Description
rl_reward_fn | --rl-reward-fn | None | Reward function: default, length_penalty, correctness, custom
rl_multi_objective | --rl-multi-objective | False | Enable multi-objective rewards
rl_reward_weights | --rl-reward-weights | None | JSON weights for multi-objective
rl_env_type | --rl-env-type | None | RL environment type
rl_env_config | --rl-env-config | None | JSON environment config
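
A sketch selecting one of the built-in reward functions listed above (whether it combines with or replaces the reward model's score is not covered on this page):

from autotrain.trainers.clm.params import LLMTrainingParams

params = LLMTrainingParams(
    model="google/gemma-3-270m",
    data_path="./prompts.jsonl",
    project_name="ppo-model",
    trainer="ppo",
    rl_reward_model_path="./reward-model",
    rl_reward_fn="length_penalty",  # one of: default, length_penalty, correctness, custom
)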

Data Format

PPO training uses prompts only (the model generates responses):
{"text": "What is machine learning?"}
{"text": "Explain quantum computing."}
{"text": "Write a haiku about coding."}

RL Environment Types

Three environment types are available:
Environment | Description
text_generation | Standard text generation with reward scoring
multi_objective | Multiple reward components combined
preference_comparison | Compare generated responses
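
The environment is chosen with --rl-env-type; for example (a sketch; text_generation is the plain reward-scoring setup described above):

aitraining llm --train \
  --model google/gemma-3-270m \
  --data-path ./prompts.jsonl \
  --project-name ppo-model \
  --trainer ppo \
  --rl-reward-model-path ./reward-model \
  --rl-env-type text_generation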

Multi-Objective Rewards

Enable multiple reward signals:
params = LLMTrainingParams(
    ...
    trainer="ppo",
    rl_multi_objective=True,
    rl_env_type="multi_objective",
    rl_reward_weights='{"correctness": 1.0, "formatting": 0.1}',
)
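
The same configuration from the CLI (a sketch; it assumes --rl-multi-objective is a plain boolean switch and that your shell preserves the single-quoted JSON):

aitraining llm --train \
  --model google/gemma-3-270m \
  --data-path ./prompts.jsonl \
  --project-name ppo-model \
  --trainer ppo \
  --rl-reward-model-path ./reward-model \
  --rl-multi-objective \
  --rl-env-type multi_objective \
  --rl-reward-weights '{"correctness": 1.0, "formatting": 0.1}'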

Example: Full RLHF Pipeline

Step 1: Train Reward Model

aitraining llm --train \
  --model google/gemma-3-270m \
  --data-path ./preferences.jsonl \
  --project-name reward-model \
  --trainer reward \
  --prompt-text-column prompt \
  --text-column chosen \
  --rejected-text-column rejected
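
Given the column flags above, preferences.jsonl pairs each prompt with a chosen and a rejected response; the rows below are illustrative:

{"prompt": "What is machine learning?", "chosen": "Machine learning is a branch of AI in which models learn patterns from data rather than being explicitly programmed.", "rejected": "It's when computers think."}
{"prompt": "Explain quantum computing.", "chosen": "Quantum computers use qubits, which can exist in superpositions, to tackle certain problems more efficiently.", "rejected": "Quantum computing is just faster computing."}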

Step 2: Run PPO Training

aitraining llm --train \
  --model google/gemma-3-270m \
  --data-path ./prompts.jsonl \
  --project-name ppo-model \
  --trainer ppo \
  --rl-reward-model-path ./reward-model \
  --rl-kl-coef 0.1 \
  --rl-clip-range 0.2

Best Practices

  1. Start with a good base model - Fine-tune with SFT before PPO
  2. Use a well-trained reward model - Quality of rewards determines PPO success
  3. Monitor KL divergence - If KL grows too large, the policy is drifting too far from the original model (see the sketch after this list)
  4. Start with default hyperparameters - Adjust based on training dynamics
  5. Use small learning rates - PPO is sensitive to learning rate (1e-5 to 5e-6)
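
To make point 3 concrete: the quantity being monitored is the per-token gap between the policy's and the reference model's log-probabilities, which rl_kl_coef scales into a penalty. A minimal PyTorch sketch of that estimate (illustrative only, not the trainer's internal code):

import torch

# log-probs of the sampled tokens under the current policy and the frozen reference model
policy_logprobs = torch.tensor([-1.2, -0.8, -2.1])
ref_logprobs = torch.tensor([-1.0, -0.9, -1.5])

# simple per-token KL estimate used in RLHF-style training; rl_kl_coef scales this penalty
kl = policy_logprobs - ref_logprobs
print(f"mean KL estimate: {kl.mean().item():.3f}")  # steadily growing values mean the policy is drifting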

Next Steps