
GRPO Training

Train language models using Group Relative Policy Optimization (GRPO) with custom reward environments. Instead of a reward model, you provide a Python module with an environment class that runs multi-turn episodes and returns scores.

Overview

GRPO is different from PPO in a key way:
  • PPO requires a pre-trained reward model to score responses
  • GRPO uses a custom environment that you write — it generates multiple completions per prompt, scores them via your environment, and optimizes the policy relative to the group
This makes GRPO ideal for agentic training where rewards come from task execution (tool use, code execution, multi-turn interactions) rather than a static reward model.
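To build intuition for "relative to the group": each prompt's completions are scored by your environment, and every score is compared against the group's mean. The snippet below is only an illustration of that idea, not the trainer's internal code:
# Illustrative only: group-relative advantages for one prompt with four sampled
# completions, each scored 0.0-1.0 by the environment.
rewards = [0.0, 1.0, 0.5, 1.0]

mean = sum(rewards) / len(rewards)
std = (sum((r - mean) ** 2 for r in rewards) / len(rewards)) ** 0.5

# Completions above the group mean get positive advantages and are reinforced;
# those below the mean are pushed down.
advantages = [(r - mean) / (std + 1e-8) for r in rewards]
print(advantages)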

Quick Start

aitraining llm --train \
  --model deepseek-ai/DeepSeek-R1-Distill-Qwen-1.5B \
  --trainer grpo \
  --rl-env-module my_envs.hotel_env \
  --rl-env-class HotelEnv \
  --rl-num-generations 4 \
  --rl-max-new-tokens 256

Python API

from autotrain.trainers.clm.params import LLMTrainingParams
from autotrain.project import AutoTrainProject

params = LLMTrainingParams(
    model="deepseek-ai/DeepSeek-R1-Distill-Qwen-1.5B",
    project_name="grpo-agent",

    trainer="grpo",
    rl_env_module="my_envs.hotel_env",
    rl_env_class="HotelEnv",
    rl_num_generations=4,

    # Shared RL parameters
    rl_max_new_tokens=256,
    rl_temperature=1.0,
    rl_kl_coef=0.1,
    rl_clip_range=0.2,

    epochs=1,
    batch_size=4,
    lr=1e-5,
    peft=True,
    lora_r=16,
    lora_alpha=32,
)

project = AutoTrainProject(params=params, backend="local", process=True)
project.create()

Environment Interface

You implement a Python class with 3 methods:
from datasets import Dataset

class MyEnv:
    def build_dataset(self, tokenizer) -> Dataset:
        """Return a HuggingFace Dataset with a 'prompt' column.

        Can include extra columns (e.g., case_idx) that are passed
        as kwargs to score_episode().
        """
        return Dataset.from_dict({
            "prompt": ["You are a hotel booking agent...", ...],
            "case_idx": [0, 1, ...],
        })

    def score_episode(self, model, tokenizer, completion, case_idx) -> float:
        """Run a multi-turn episode from the completion and return a score.

        Args:
            model: The current model being trained
            tokenizer: The tokenizer
            completion: The model's generated text
            case_idx: Index from the dataset (or any extra column)

        Returns:
            Float between 0.0 and 1.0
        """
        # Your scoring logic here
        return score

    def get_tools(self) -> list[dict]:
        """Return tool schemas for generation (optional).

        Returns:
            List of tool definition dicts (OpenAI function calling format)
        """
        return []
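For agentic tasks, score_episode typically continues the conversation itself before scoring. The sketch below is hypothetical: self.cases, self.max_turns, self.run_tools, and self.task_solved are assumed helpers of your environment, not part of the required interface:
import torch

def score_episode(self, model, tokenizer, completion, case_idx):
    # Replay the original prompt plus the model's first completion.
    messages = [
        {"role": "user", "content": self.cases[case_idx]["prompt"]},
        {"role": "assistant", "content": completion},
    ]

    for _ in range(self.max_turns):
        tool_result = self.run_tools(completion)   # simulate tool execution (assumed helper)
        if tool_result is None:                    # agent finished the task
            break
        messages.append({"role": "user", "content": tool_result})

        # Generate the model's next turn from the running transcript.
        input_ids = tokenizer.apply_chat_template(
            messages, add_generation_prompt=True, return_tensors="pt"
        ).to(model.device)
        with torch.no_grad():
            output = model.generate(input_ids, max_new_tokens=128)
        completion = tokenizer.decode(
            output[0, input_ids.shape[-1]:], skip_special_tokens=True
        )
        messages.append({"role": "assistant", "content": completion})

    # Score the final transcript (assumed helper).
    return 1.0 if self.task_solved(messages, case_idx) else 0.0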

Environment with Configuration

Pass JSON configuration to your environment via --rl-env-config:
aitraining llm --train \
  --trainer grpo \
  --rl-env-module my_envs.hotel_env \
  --rl-env-class HotelEnv \
  --rl-env-config '{"max_turns": 5, "difficulty": "hard"}'
The JSON is parsed and passed as **kwargs to your environment constructor:
class HotelEnv:
    def __init__(self, max_turns=3, difficulty="normal"):
        self.max_turns = max_turns
        self.difficulty = difficulty
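Under the hood the plumbing is roughly the following (an illustrative sketch, not the loader's exact code):
import importlib
import json

config = json.loads('{"max_turns": 5, "difficulty": "hard"}')   # value of --rl-env-config
module = importlib.import_module("my_envs.hotel_env")           # --rl-env-module
env = getattr(module, "HotelEnv")(**config)                     # --rl-env-class, config as kwargs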

Requirements

GRPO training requires both --rl-env-module and --rl-env-class to be specified. These are validated at startup — if either is missing, training will fail with a clear error message.
GRPO uses TRL’s GRPOTrainer (requires TRL >= 0.28.0). The tokenizer padding side is automatically set to left as required by GRPO.
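A quick pre-flight check you can run yourself before launching (a sketch, not the trainer's own validation code):
import trl
from packaging.version import Version
from transformers import AutoTokenizer

assert Version(trl.__version__) >= Version("0.28.0"), "GRPOTrainer needs TRL >= 0.28.0"

# GRPO pads on the left; the trainer sets this automatically, but mirror it
# if you tokenize prompts yourself while debugging your environment.
tokenizer = AutoTokenizer.from_pretrained("deepseek-ai/DeepSeek-R1-Distill-Qwen-1.5B")
tokenizer.padding_side = "left"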

Parameters

GRPO-Specific Parameters

| Parameter | CLI Flag | Default | Description |
|---|---|---|---|
| rl_env_module | --rl-env-module | None | Python module path for the environment (required) |
| rl_env_class | --rl-env-class | None | Class name in the environment module (required) |
| rl_num_generations | --rl-num-generations | 4 | Number of completions per prompt |
| use_vllm | --use-vllm | False | Use vLLM for faster generation |
| vllm_mode | --vllm-mode | colocate | vLLM mode: colocate or server |
| vllm_gpu_memory_utilization | --vllm-gpu-memory-utilization | 0.3 | GPU memory fraction for vLLM (colocate mode) |
| vllm_server_url | --vllm-server-url | None | URL of external vLLM server (server mode) |
| vllm_tensor_parallel_size | --vllm-tensor-parallel-size | 1 | GPUs for vLLM tensor parallelism |
| vllm_server_gpus | --vllm-server-gpus | 1 | GPUs dedicated to vLLM server (subtracted from training) |

Shared RL Parameters (PPO + GRPO)

| Parameter | CLI Flag | Default | Description |
|---|---|---|---|
| rl_kl_coef | --rl-kl-coef | 0.1 | KL divergence penalty (beta in GRPOConfig) |
| rl_clip_range | --rl-clip-range | 0.2 | Clipping range (epsilon in GRPOConfig) |
| rl_env_config | --rl-env-config | None | JSON config passed to environment constructor |
| rl_max_new_tokens | --rl-max-new-tokens | 128 | Max tokens to generate per completion |
| rl_top_k | --rl-top-k | 50 | Top-k sampling |
| rl_top_p | --rl-top-p | 1.0 | Top-p (nucleus) sampling |
| rl_temperature | --rl-temperature | 1.0 | Generation temperature |

vLLM Acceleration

Use vLLM for significantly faster completion generation during GRPO training:
aitraining llm --train --trainer grpo \
  --model deepseek-ai/DeepSeek-R1-Distill-Qwen-1.5B \
  --rl-env-module my_envs.hotel_env \
  --rl-env-class HotelEnv \
  --use-vllm \
  --vllm-gpu-memory-utilization 0.3
Two modes are available:
  • colocate (default) — vLLM shares the GPU with training. Adjust --vllm-gpu-memory-utilization (default 0.3) to control the memory split.
  • server — vLLM runs on dedicated GPUs. Training processes are automatically reduced by --vllm-server-gpus.
# Server mode: 8 GPUs total, 2 for vLLM, 6 for training
aitraining llm --train --trainer grpo \
  --model deepseek-ai/DeepSeek-R1-Distill-Qwen-1.5B \
  --rl-env-module my_envs.hotel_env \
  --rl-env-class HotelEnv \
  --use-vllm \
  --vllm-mode server \
  --vllm-server-gpus 2 \
  --vllm-tensor-parallel-size 2
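The same options are available from the Python API; this sketch assumes the LLMTrainingParams field names match the parameter table above:
from autotrain.trainers.clm.params import LLMTrainingParams

# Python API equivalent of the server-mode command above (sketch).
params = LLMTrainingParams(
    model="deepseek-ai/DeepSeek-R1-Distill-Qwen-1.5B",
    project_name="grpo-vllm",
    trainer="grpo",
    rl_env_module="my_envs.hotel_env",
    rl_env_class="HotelEnv",
    use_vllm=True,
    vllm_mode="server",
    vllm_server_gpus=2,
    vllm_tensor_parallel_size=2,
)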
vLLM requires a separate install: pip install aitraining[vllm] (requires vllm>=0.14.0).
GRPO does not require --data-path — the dataset is built by your environment’s build_dataset() method.

How It Works

  1. Environment loads — Your module is imported via importlib.import_module(), class instantiated with optional config
  2. Dataset built — env.build_dataset(tokenizer) returns prompts
  3. Model generates — GRPO generates rl_num_generations completions per prompt
  4. Environment scores — env.score_episode() is called for each completion, returning 0.0-1.0
  5. GRPO optimizes — Policy is updated relative to the group scores (better completions get higher weight)
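Put together, the data flow looks roughly like this (an illustrative sketch, not the trainer's actual implementation):
import importlib

def grpo_data_flow(model, tokenizer, num_generations=4):
    # 1. Environment loads.
    module = importlib.import_module("my_envs.hotel_env")
    env = getattr(module, "HotelEnv")()               # config kwargs would go here

    # 2. Dataset built.
    dataset = env.build_dataset(tokenizer)

    for example in dataset:
        # 3. Model generates a group of completions for the prompt.
        inputs = tokenizer(example["prompt"], return_tensors="pt").to(model.device)
        outputs = model.generate(
            **inputs, do_sample=True,
            num_return_sequences=num_generations, max_new_tokens=128,
        )
        completions = tokenizer.batch_decode(
            outputs[:, inputs["input_ids"].shape[-1]:], skip_special_tokens=True
        )

        # 4. Environment scores each completion.
        rewards = [
            env.score_episode(model, tokenizer, c, example["case_idx"])
            for c in completions
        ]

        # 5. GRPO turns `rewards` into group-relative advantages and updates
        #    the policy; GRPOTrainer handles this step.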

Example: Hotel Booking Agent

# my_envs/hotel_env.py
from datasets import Dataset

class HotelEnv:
    def __init__(self, max_turns=5):
        self.max_turns = max_turns
        self.cases = [
            {"prompt": "Book a room in Paris for 2 nights", "expected": "paris"},
            {"prompt": "Find a hotel near the airport in Tokyo", "expected": "tokyo"},
        ]

    def build_dataset(self, tokenizer):
        return Dataset.from_dict({
            "prompt": [c["prompt"] for c in self.cases],
            "case_idx": list(range(len(self.cases))),
        })

    def score_episode(self, model, tokenizer, completion, case_idx):
        expected = self.cases[case_idx]["expected"]
        # Simple: check if the expected keyword appears in completion
        if expected.lower() in completion.lower():
            return 1.0
        return 0.0

    def get_tools(self):
        return [{
            "type": "function",
            "function": {
                "name": "search_hotels",
                "description": "Search for hotels in a city",
                "parameters": {
                    "type": "object",
                    "properties": {
                        "city": {"type": "string"},
                        "nights": {"type": "integer"}
                    },
                    "required": ["city"]
                }
            }
        }]
aitraining llm --train \
  --model deepseek-ai/DeepSeek-R1-Distill-Qwen-1.5B \
  --trainer grpo \
  --rl-env-module my_envs.hotel_env \
  --rl-env-class HotelEnv \
  --rl-num-generations 4 \
  --rl-max-new-tokens 256 \
  --peft \
  --lora-r 16 \
  --lr 1e-5 \
  --batch-size 4

GRPO vs PPO

| Feature | PPO | GRPO |
|---|---|---|
| Reward source | Pre-trained reward model | Custom environment (your code) |
| Training signal | Reward model scores | Environment episode scores (0-1) |
| Completions per prompt | 1 | Multiple (rl_num_generations) |
| Best for | General RLHF | Agentic training, tool use, multi-turn |
| Requires | Reward model path | Python env module + class |
| TRL version | >= 0.26.0 | >= 0.28.0 |

Best Practices

  1. Start with simple environments — Validate that scoring works before complex multi-turn logic
  2. Use small rl_num_generations — Start with 4, increase if you need more diversity in completions
  3. Score between 0 and 1 — Use the full range; avoid always returning 0 or 1
  4. Test your environment independently — Make sure build_dataset() and score_episode() work before training (see the sketch after this list)
  5. Use LoRA — GRPO with full fine-tuning requires significant memory; LoRA makes it practical
  6. Small learning rates — Start with 1e-5, same guidance as PPO
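For example, the hotel environment above can be smoke-tested in a few lines before any GPU time is spent (model can be None here because this simple scorer never uses it):
from transformers import AutoTokenizer
from my_envs.hotel_env import HotelEnv

tokenizer = AutoTokenizer.from_pretrained("deepseek-ai/DeepSeek-R1-Distill-Qwen-1.5B")
env = HotelEnv()

dataset = env.build_dataset(tokenizer)
print(dataset.column_names)        # expect ['prompt', 'case_idx']

score = env.score_episode(None, tokenizer, "I booked a hotel in Paris.", case_idx=0)
print(score)                       # expect 1.0 (keyword match)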

Next Steps