Reward Modeling

Train reward models that score text responses for use in PPO/RLHF training.
Important: Reward models are NOT text generators. They output a scalar score for a given text, which provides the reward signal during PPO training. You cannot use a reward model as a normal LLM for text generation.

Quick Start

aitraining llm --train \
  --model google/gemma-3-270m \
  --data-path ./preferences.jsonl \
  --project-name reward-model \
  --trainer reward \
  --prompt-text-column prompt \
  --text-column chosen \
  --rejected-text-column rejected

Python API

from autotrain.trainers.clm.params import LLMTrainingParams
from autotrain.project import AutoTrainProject

params = LLMTrainingParams(
    model="google/gemma-3-270m",
    data_path="./preferences.jsonl",
    project_name="reward-model",

    trainer="reward",

    # Column mappings (required for reward training)
    prompt_text_column="prompt",
    text_column="chosen",
    rejected_text_column="rejected",

    epochs=1,
    batch_size=4,
    lr=2e-5,
)

project = AutoTrainProject(params=params, backend="local", process=True)
project.create()

Data Format

Reward training requires preference data with three columns:
Column     Description
prompt     The input prompt/question
chosen     The preferred/better response
rejected   The less preferred/worse response

Example Data

{"prompt": "Explain gravity", "chosen": "Gravity is a fundamental force...", "rejected": "gravity makes stuff fall down"}
{"prompt": "What is Python?", "chosen": "Python is a high-level programming language...", "rejected": "its a snake"}
{"prompt": "Write a greeting", "chosen": "Hello! How can I assist you today?", "rejected": "hey"}

Required Parameters

Reward training requires all three column parameters to be specified:
  • --prompt-text-column
  • --text-column (for chosen responses)
  • --rejected-text-column

Parameters

Parameter              CLI Flag                  Default    Description
prompt_text_column     --prompt-text-column      prompt     Column with prompts
text_column            --text-column             text       Column with chosen responses
rejected_text_column   --rejected-text-column    rejected   Column with rejected responses
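
If your dataset uses different column names, map them explicitly with these flags. The sketch below assumes a hypothetical dataset whose columns are named question, good_response, and bad_response:

aitraining llm --train \
  --model google/gemma-3-270m \
  --data-path ./preferences.jsonl \
  --project-name reward-model \
  --trainer reward \
  --prompt-text-column question \
  --text-column good_response \
  --rejected-text-column bad_response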

Output Model

The trained model is an AutoModelForSequenceClassification that:
  • Takes text as input
  • Returns a scalar reward score, where higher scores indicate better responses
  • Serves as the reward source for PPO training via --rl-reward-model-path

Using the Reward Model

With PPO Training

aitraining llm --train \
  --model google/gemma-3-270m \
  --data-path ./prompts.jsonl \
  --project-name ppo-model \
  --trainer ppo \
  --rl-reward-model-path ./reward-model
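
The same run can be configured through the Python API. This is a sketch that assumes the --rl-reward-model-path flag maps to an rl_reward_model_path field on LLMTrainingParams, following the dash-to-underscore convention of the parameters documented above:

from autotrain.trainers.clm.params import LLMTrainingParams
from autotrain.project import AutoTrainProject

params = LLMTrainingParams(
    model="google/gemma-3-270m",
    data_path="./prompts.jsonl",
    project_name="ppo-model",
    trainer="ppo",
    # Assumed field name for --rl-reward-model-path (dash-to-underscore convention)
    rl_reward_model_path="./reward-model",
)

project = AutoTrainProject(params=params, backend="local", process=True)
project.create()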

Direct Inference

from transformers import AutoModelForSequenceClassification, AutoTokenizer
import torch

# Load reward model
model = AutoModelForSequenceClassification.from_pretrained("./reward-model")
tokenizer = AutoTokenizer.from_pretrained("./reward-model")

# Score a response
text = "What is AI? AI is artificial intelligence, a field of computer science..."
inputs = tokenizer(text, return_tensors="pt", truncation=True, max_length=512)

with torch.no_grad():
    outputs = model(**inputs)
    score = outputs.logits.item()

print(f"Reward score: {score}")

Best Practices

  1. Quality preference data - The reward model is only as good as your annotations
  2. Diverse examples - Include varied prompts and response quality levels
  3. Clear preference signals - Chosen should be clearly better than rejected
  4. Balanced dataset - Avoid bias toward certain response types
  5. Sufficient data - Aim for 1,000+ preference pairs minimum

Example: Building Preference Data

# Example script to create preference data
import json

preferences = [
    {
        "prompt": "Summarize machine learning",
        "chosen": "Machine learning is a subset of AI that enables systems to learn from data...",
        "rejected": "ml is computers learning stuff"
    },
    # Add more examples...
]

with open("preferences.jsonl", "w") as f:
    for item in preferences:
        f.write(json.dumps(item) + "\n")
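
A quick validation pass over the file (a sketch, not part of the trainer) helps catch rows that violate the best practices above, such as empty fields or identical chosen/rejected pairs:

import json

with open("preferences.jsonl") as f:
    rows = [json.loads(line) for line in f]

for i, row in enumerate(rows):
    # Every row needs non-empty prompt, chosen, and rejected fields
    assert all(row.get(k, "").strip() for k in ("prompt", "chosen", "rejected")), f"empty field in row {i}"
    # Chosen and rejected should actually differ
    assert row["chosen"] != row["rejected"], f"identical pair in row {i}"

print(f"Validated {len(rows)} preference pairs")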

Next Steps