GRPO Training
Train language models using Group Relative Policy Optimization (GRPO) with custom reward environments. Instead of a reward model, you provide a Python module with an environment class that runs multi-turn episodes and returns scores.
Overview
GRPO is different from PPO in a key way:
- PPO requires a pre-trained reward model to score responses
- GRPO uses a custom environment that you write — it generates multiple completions per prompt, scores them via your environment, and optimizes the policy relative to the group
Quick Start
Python API
Environment Interface
You implement a Python class with three methods:
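The original interface listing is not reproduced on this page; the sketch below is a minimal, assumed version built from the two methods named later on this page (`build_dataset()` and `score_episode()`). The third method, `run_episode()` here, is a hypothetical name for the multi-turn episode logic, and all signatures are assumptions to verify against the library.

```python
# Minimal sketch of a GRPO environment, NOT the library's exact interface.
# build_dataset() and score_episode() are named elsewhere on this page;
# run_episode() and all signatures here are assumptions for illustration.
from datasets import Dataset


class EchoEnv:
    def build_dataset(self, tokenizer):
        # Return the prompts GRPO will sample completions for
        # (assumed here to be a datasets.Dataset with a "prompt" column).
        prompts = ["Repeat exactly: hello", "Repeat exactly: world"]
        return Dataset.from_list([{"prompt": p} for p in prompts])

    def run_episode(self, prompt, completion):
        # Hypothetical hook for multi-turn interaction (tool calls, follow-up
        # turns, etc.). For a single-turn task it can simply pass through.
        return {"prompt": prompt, "completion": completion}

    def score_episode(self, episode):
        # Return a float in [0.0, 1.0]; higher means a better completion.
        target = episode["prompt"].split("Repeat exactly:")[-1].strip()
        return 1.0 if target in episode["completion"] else 0.0
```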
Environment with Configuration
Pass JSON configuration to your environment via `--rl-env-config`. The parsed JSON is forwarded as `**kwargs` to your environment constructor:
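For example (the `max_turns` and `strict` keys are hypothetical, chosen only to show how a config such as `--rl-env-config '{"max_turns": 5}'` would reach the class):

```python
# Sketch only: the config keys (max_turns, strict) are hypothetical examples.
# Passing --rl-env-config '{"max_turns": 5, "strict": true}' would call the
# constructor as EchoEnv(max_turns=5, strict=True).
class EchoEnv:
    def __init__(self, max_turns=3, strict=False, **kwargs):
        self.max_turns = max_turns
        self.strict = strict
        # Accept and ignore unknown keys so new config options don't break
        # older environments.
        self.extra = kwargs
```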
Requirements
GRPO uses TRL’s `GRPOTrainer` (requires TRL >= 0.28.0). The tokenizer padding side is automatically set to left, as required by GRPO.
Parameters
GRPO-Specific Parameters
| Parameter | CLI Flag | Default | Description |
|---|---|---|---|
| `rl_env_module` | `--rl-env-module` | None | Python module path for the environment (required) |
| `rl_env_class` | `--rl-env-class` | None | Class name in the environment module (required) |
| `rl_num_generations` | `--rl-num-generations` | 4 | Number of completions per prompt |
| `use_vllm` | `--use-vllm` | False | Use vLLM for faster generation |
| `vllm_mode` | `--vllm-mode` | colocate | vLLM mode: `colocate` or `server` |
| `vllm_gpu_memory_utilization` | `--vllm-gpu-memory-utilization` | 0.3 | GPU memory fraction for vLLM (colocate mode) |
| `vllm_server_url` | `--vllm-server-url` | None | URL of external vLLM server (server mode) |
| `vllm_tensor_parallel_size` | `--vllm-tensor-parallel-size` | 1 | GPUs for vLLM tensor parallelism |
| `vllm_server_gpus` | `--vllm-server-gpus` | 1 | GPUs dedicated to the vLLM server (subtracted from training) |
Shared RL Parameters (PPO + GRPO)
| Parameter | CLI Flag | Default | Description |
|---|---|---|---|
| `rl_kl_coef` | `--rl-kl-coef` | 0.1 | KL divergence penalty (`beta` in `GRPOConfig`) |
| `rl_clip_range` | `--rl-clip-range` | 0.2 | Clipping range (`epsilon` in `GRPOConfig`) |
| `rl_env_config` | `--rl-env-config` | None | JSON config passed to the environment constructor |
| `rl_max_new_tokens` | `--rl-max-new-tokens` | 128 | Max tokens to generate per completion |
| `rl_top_k` | `--rl-top-k` | 50 | Top-k sampling |
| `rl_top_p` | `--rl-top-p` | 1.0 | Top-p (nucleus) sampling |
| `rl_temperature` | `--rl-temperature` | 1.0 | Generation temperature |
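The table states that `rl_kl_coef` and `rl_clip_range` map to `beta` and `epsilon` in TRL's `GRPOConfig`. The sketch below shows how the remaining flags could line up with standard `GRPOConfig` fields; the field names are TRL's, but the exact forwarding done by this trainer is an assumption.

```python
# Rough sketch of how the flags above could map onto TRL's GRPOConfig.
# The beta/epsilon mapping is stated in the table; the other correspondences
# shown in the comments are assumptions, not confirmed wiring.
from trl import GRPOConfig

config = GRPOConfig(
    output_dir="outputs",
    beta=0.1,                   # --rl-kl-coef
    epsilon=0.2,                # --rl-clip-range
    num_generations=4,          # --rl-num-generations
    max_completion_length=128,  # --rl-max-new-tokens
    top_k=50,                   # --rl-top-k
    top_p=1.0,                  # --rl-top-p
    temperature=1.0,            # --rl-temperature
)
```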
vLLM Acceleration
Use vLLM for significantly faster completion generation during GRPO training. Two modes are available (see the sketch after this list):
- `colocate` (default) — vLLM shares the GPU with training. Adjust `--vllm-gpu-memory-utilization` (default 0.3) to control the memory split.
- `server` — vLLM runs on dedicated GPUs. Training processes are automatically reduced by `--vllm-server-gpus`.
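The field names below exist in TRL's `GRPOConfig`; how this library forwards the CLI flags to them is an assumption, so treat this as a sketch of the two modes rather than the trainer's actual wiring.

```python
# Sketch: TRL GRPOConfig fields corresponding to the vLLM flags above.
# The CLI-flag-to-field forwarding is assumed, not confirmed.
from trl import GRPOConfig

# colocate (default): vLLM shares the training GPU
colocate_cfg = GRPOConfig(
    output_dir="outputs",
    use_vllm=True,                    # --use-vllm
    vllm_mode="colocate",             # --vllm-mode colocate
    vllm_gpu_memory_utilization=0.3,  # --vllm-gpu-memory-utilization
)

# server: vLLM runs on dedicated GPUs (e.g. started via `trl vllm-serve`)
server_cfg = GRPOConfig(
    output_dir="outputs",
    use_vllm=True,                    # --use-vllm
    vllm_mode="server",               # --vllm-mode server
    vllm_tensor_parallel_size=1,      # --vllm-tensor-parallel-size
)
```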
vLLM requires a separate install: `pip install aitraining[vllm]` (requires vllm>=0.14.0).
GRPO does not require `--data-path` — the dataset is built by your environment’s `build_dataset()` method.
How It Works
- Environment loads — Your module is imported via `importlib.import_module()`, class instantiated with optional config
- Dataset built — `env.build_dataset(tokenizer)` returns prompts
- Model generates — GRPO generates `rl_num_generations` completions per prompt
- Environment scores — `env.score_episode()` is called for each completion, returning 0.0-1.0
- GRPO optimizes — Policy is updated relative to the group scores (better completions get higher weight)
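A rough sketch of the steps above, assuming the hypothetical environment interface sketched earlier; it is not the library's actual implementation, and helper names such as `load_env` and `make_reward_fn` are invented. TRL's `GRPOTrainer` accepts a callable reward function that receives batches of prompts and completions and returns one score per completion, which is how the environment scores could plug in.

```python
# Rough sketch of the flow above; NOT the library's implementation.
import importlib
import json

from trl import GRPOConfig, GRPOTrainer


def load_env(module_path, class_name, env_config_json=None):
    # Environment loads: import the module and instantiate the class,
    # passing any --rl-env-config JSON as keyword arguments.
    module = importlib.import_module(module_path)
    kwargs = json.loads(env_config_json) if env_config_json else {}
    return getattr(module, class_name)(**kwargs)


def make_reward_fn(env):
    # Environment scores: TRL reward functions receive prompts/completions
    # in batches and return one float per completion.
    def reward_fn(prompts, completions, **kwargs):
        return [
            env.score_episode(env.run_episode(p, c))  # run_episode is hypothetical
            for p, c in zip(prompts, completions)
        ]
    return reward_fn


def train(model_name, tokenizer, env):
    dataset = env.build_dataset(tokenizer)      # Dataset built
    trainer = GRPOTrainer(
        model=model_name,
        reward_funcs=make_reward_fn(env),
        args=GRPOConfig(output_dir="outputs", num_generations=4),
        train_dataset=dataset,                  # Model generates + GRPO optimizes
    )
    trainer.train()
```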
Example: Hotel Booking Agent
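The original example code is not reproduced on this page. The sketch below shows roughly what such an environment could look like under the interface assumed earlier; the prompts, scoring criteria, and config keys are illustrative placeholders, not the actual example.

```python
# Illustrative sketch only; not the page's original hotel-booking example.
from datasets import Dataset


class HotelBookingEnv:
    def __init__(self, city="Paris", nights=2, **kwargs):
        # Config keys are hypothetical; they would arrive via --rl-env-config.
        self.city = city
        self.nights = nights

    def build_dataset(self, tokenizer):
        prompts = [
            f"Book a hotel in {self.city} for {self.nights} nights. "
            "Reply with the hotel name, check-in date, and total price.",
        ]
        return Dataset.from_list([{"prompt": p} for p in prompts])

    def run_episode(self, prompt, completion):
        # A real environment might call booking tools here over several turns.
        return {"prompt": prompt, "completion": completion}

    def score_episode(self, episode):
        # Partial credit in [0, 1]: one point per required field mentioned.
        text = episode["completion"].lower()
        required = ["hotel", "check-in", "price"]
        return sum(word in text for word in required) / len(required)
```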
GRPO vs PPO
| Feature | PPO | GRPO |
|---|---|---|
| Reward source | Pre-trained reward model | Custom environment (your code) |
| Training signal | Reward model scores | Environment episode scores (0-1) |
| Completions per prompt | 1 | Multiple (`rl_num_generations`) |
| Best for | General RLHF | Agentic training, tool use, multi-turn |
| Requires | Reward model path | Python env module + class |
| TRL version | >= 0.26.0 | >= 0.28.0 |
Best Practices
- Start with simple environments — Validate that scoring works before adding complex multi-turn logic
- Use small `rl_num_generations` — Start with 4, increase if you need more diversity in completions
- Score between 0 and 1 — Use the full range; avoid always returning 0 or 1
- Test your environment independently — Make sure `build_dataset()` and `score_episode()` work before training (see the sketch after this list)
- Use LoRA — GRPO with full fine-tuning requires significant memory; LoRA makes it practical
- Small learning rates — Start with 1e-5, same guidance as PPO
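A quick way to follow the "test your environment independently" advice is a standalone smoke test like the sketch below; `my_envs.hotel` and `HotelBookingEnv` are placeholder names echoing the sketches above, and the method signatures are the same assumptions.

```python
# Smoke-test an environment before training; module/class names are placeholders.
from transformers import AutoTokenizer

from my_envs.hotel import HotelBookingEnv  # hypothetical module path

env = HotelBookingEnv(city="Berlin", nights=3)
tokenizer = AutoTokenizer.from_pretrained("gpt2")  # any tokenizer works for the test

# build_dataset() should return prompts the trainer can sample from.
dataset = env.build_dataset(tokenizer)
assert len(dataset) > 0 and "prompt" in dataset.column_names

# score_episode() should return a float in [0, 1] for a sample completion.
episode = env.run_episode(
    dataset[0]["prompt"],
    "Hotel Adlon, check-in 2024-05-01, price 240 EUR",
)
score = env.score_episode(episode)
assert 0.0 <= score <= 1.0, score
print(f"ok: {len(dataset)} prompts, sample score={score:.2f}")
```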