
GRPO Training

Train language models using Group Relative Policy Optimization (GRPO) with custom reward environments. Instead of a reward model, you provide a Python module with an environment class that runs multi-turn episodes and returns scores.

Overview

GRPO is different from PPO in a key way:
  • PPO requires a pre-trained reward model to score responses
  • GRPO uses a custom environment that you write — it generates multiple completions per prompt, scores them via your environment, and optimizes the policy relative to the group
This makes GRPO ideal for agentic training where rewards come from task execution (tool use, code execution, multi-turn interactions) rather than a static reward model.
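To build intuition for "relative to the group": each prompt's completions are scored by your environment, and every score is compared against the group's mean. The snippet below is only an illustration of that idea, not the trainer's internal code:
# Illustrative only: group-relative advantages for one prompt with four sampled
# completions, each scored 0.0-1.0 by the environment.
rewards = [0.0, 1.0, 0.5, 1.0]

mean = sum(rewards) / len(rewards)
std = (sum((r - mean) ** 2 for r in rewards) / len(rewards)) ** 0.5

# Completions above the group mean get positive advantages and are reinforced;
# those below the mean are pushed down.
advantages = [(r - mean) / (std + 1e-8) for r in rewards]
print(advantages)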

Quick Start

aitraining llm --train \
  --model deepseek-ai/DeepSeek-R1-Distill-Qwen-1.5B \
  --trainer grpo \
  --rl-env-module my_envs.hotel_env \
  --rl-env-class HotelEnv \
  --rl-num-generations 4 \
  --rl-max-new-tokens 256

Python API

from autotrain.trainers.clm.params import LLMTrainingParams
from autotrain.project import AutoTrainProject

params = LLMTrainingParams(
    model="deepseek-ai/DeepSeek-R1-Distill-Qwen-1.5B",
    project_name="grpo-agent",

    trainer="grpo",
    rl_env_module="my_envs.hotel_env",
    rl_env_class="HotelEnv",
    rl_num_generations=4,

    # Shared RL parameters
    rl_max_new_tokens=256,
    rl_temperature=1.0,
    rl_kl_coef=0.1,
    rl_clip_range=0.2,

    epochs=1,
    batch_size=4,
    lr=1e-5,
    peft=True,
    lora_r=16,
    lora_alpha=32,
)

project = AutoTrainProject(params=params, backend="local", process=True)
project.create()

Environment Interface

You implement a Python class with 3 methods:
from datasets import Dataset

class MyEnv:
    def build_dataset(self, tokenizer) -> Dataset:
        """Return a HuggingFace Dataset with a 'prompt' column.

        Can include extra columns (e.g., case_idx) that are passed
        as kwargs to score_episode().
        """
        return Dataset.from_dict({
            "prompt": ["You are a hotel booking agent...", ...],
            "case_idx": [0, 1, ...],
        })

    def score_episode(self, model, tokenizer, completion, case_idx) -> float:
        """Run a multi-turn episode from the completion and return a score.

        Args:
            model: The current model being trained
            tokenizer: The tokenizer
            completion: The model's generated text
            case_idx: Index from the dataset (or any extra column)

        Returns:
            Float between 0.0 and 1.0
        """
        # Your scoring logic here
        return score

    def get_tools(self) -> list[dict]:
        """Return tool schemas for generation (optional).

        Returns:
            List of tool definition dicts (OpenAI function calling format)
        """
        return []
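For agentic tasks, score_episode typically continues the conversation itself before scoring. The sketch below is hypothetical: self.cases, self.max_turns, self.run_tools, and self.task_solved are assumed helpers of your environment, not part of the required interface:
import torch

def score_episode(self, model, tokenizer, completion, case_idx):
    # Replay the original prompt plus the model's first completion.
    messages = [
        {"role": "user", "content": self.cases[case_idx]["prompt"]},
        {"role": "assistant", "content": completion},
    ]

    for _ in range(self.max_turns):
        tool_result = self.run_tools(completion)   # simulate tool execution (assumed helper)
        if tool_result is None:                    # agent finished the task
            break
        messages.append({"role": "user", "content": tool_result})

        # Generate the model's next turn from the running transcript.
        input_ids = tokenizer.apply_chat_template(
            messages, add_generation_prompt=True, return_tensors="pt"
        ).to(model.device)
        with torch.no_grad():
            output = model.generate(input_ids, max_new_tokens=128)
        completion = tokenizer.decode(
            output[0, input_ids.shape[-1]:], skip_special_tokens=True
        )
        messages.append({"role": "assistant", "content": completion})

    # Score the final transcript (assumed helper).
    return 1.0 if self.task_solved(messages, case_idx) else 0.0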

Environment with Configuration

Pass JSON configuration to your environment via --rl-env-config:
aitraining llm --train \
  --trainer grpo \
  --rl-env-module my_envs.hotel_env \
  --rl-env-class HotelEnv \
  --rl-env-config '{"max_turns": 5, "difficulty": "hard"}'
The JSON is parsed and passed as **kwargs to your environment constructor:
class HotelEnv:
    def __init__(self, max_turns=3, difficulty="normal"):
        self.max_turns = max_turns
        self.difficulty = difficulty
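Under the hood the plumbing is roughly the following (an illustrative sketch, not the loader's exact code):
import importlib
import json

config = json.loads('{"max_turns": 5, "difficulty": "hard"}')   # value of --rl-env-config
module = importlib.import_module("my_envs.hotel_env")           # --rl-env-module
env = getattr(module, "HotelEnv")(**config)                     # --rl-env-class, config as kwargs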

Requirements

GRPO training requires both --rl-env-module and --rl-env-class to be specified. These are validated at startup — if either is missing, training will fail with a clear error message.
GRPO uses TRL’s GRPOTrainer (requires TRL >= 0.28.0). The tokenizer padding side is automatically set to left as required by GRPO.
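A quick pre-flight check you can run yourself before launching (a sketch, not the trainer's own validation code):
import trl
from packaging.version import Version
from transformers import AutoTokenizer

assert Version(trl.__version__) >= Version("0.28.0"), "GRPOTrainer needs TRL >= 0.28.0"

# GRPO pads on the left; the trainer sets this automatically, but mirror it
# if you tokenize prompts yourself while debugging your environment.
tokenizer = AutoTokenizer.from_pretrained("deepseek-ai/DeepSeek-R1-Distill-Qwen-1.5B")
tokenizer.padding_side = "left"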

Parameters

GRPO-Specific Parameters

| Parameter | CLI Flag | Default | Description |
|---|---|---|---|
| rl_env_module | --rl-env-module | None | Python module path for the environment (required) |
| rl_env_class | --rl-env-class | None | Class name in the environment module (required) |
| rl_num_generations | --rl-num-generations | 4 | Number of completions per prompt |
| use_vllm | --use-vllm | False | Use vLLM for faster generation |
| vllm_mode | --vllm-mode | colocate | vLLM mode: colocate or server |
| vllm_gpu_memory_utilization | --vllm-gpu-memory-utilization | 0.3 | GPU memory fraction for vLLM (colocate mode) |
| vllm_server_url | --vllm-server-url | None | URL of external vLLM server (server mode) |
| vllm_tensor_parallel_size | --vllm-tensor-parallel-size | 1 | GPUs for vLLM tensor parallelism |
| vllm_server_gpus | --vllm-server-gpus | 1 | GPUs dedicated to vLLM server (subtracted from training) |

Shared RL Parameters (PPO + GRPO)

| Parameter | CLI Flag | Default | Description |
|---|---|---|---|
| rl_kl_coef | --rl-kl-coef | 0.1 | KL divergence penalty (beta in GRPOConfig) |
| rl_clip_range | --rl-clip-range | 0.2 | Clipping range (epsilon in GRPOConfig) |
| rl_env_config | --rl-env-config | None | JSON config passed to environment constructor |
| rl_max_new_tokens | --rl-max-new-tokens | 128 | Max tokens to generate per completion |
| rl_top_k | --rl-top-k | 50 | Top-k sampling |
| rl_top_p | --rl-top-p | 1.0 | Top-p (nucleus) sampling |
| rl_temperature | --rl-temperature | 1.0 | Generation temperature |

vLLM Acceleration

Use vLLM for significantly faster completion generation during GRPO training:
aitraining llm --train --trainer grpo \
  --model deepseek-ai/DeepSeek-R1-Distill-Qwen-1.5B \
  --rl-env-module my_envs.hotel_env \
  --rl-env-class HotelEnv \
  --use-vllm \
  --vllm-gpu-memory-utilization 0.3
Two modes are available:
  • colocate (default) — vLLM shares the GPU with training. Adjust --vllm-gpu-memory-utilization (default 0.3) to control the memory split.
  • server — vLLM runs on dedicated GPUs. Training processes are automatically reduced by --vllm-server-gpus.
# Server mode: 8 GPUs total, 2 for vLLM, 6 for training
aitraining llm --train --trainer grpo \
  --model deepseek-ai/DeepSeek-R1-Distill-Qwen-1.5B \
  --rl-env-module my_envs.hotel_env \
  --rl-env-class HotelEnv \
  --use-vllm \
  --vllm-mode server \
  --vllm-server-gpus 2 \
  --vllm-tensor-parallel-size 2
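The same options are available from the Python API; this sketch assumes the LLMTrainingParams field names match the parameter table above:
from autotrain.trainers.clm.params import LLMTrainingParams

# Python API equivalent of the server-mode command above (sketch).
params = LLMTrainingParams(
    model="deepseek-ai/DeepSeek-R1-Distill-Qwen-1.5B",
    project_name="grpo-vllm",
    trainer="grpo",
    rl_env_module="my_envs.hotel_env",
    rl_env_class="HotelEnv",
    use_vllm=True,
    vllm_mode="server",
    vllm_server_gpus=2,
    vllm_tensor_parallel_size=2,
)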
vLLM requires a separate install: pip install aitraining[vllm] (requires vllm>=0.14.0).
GRPO does not require --data-path — the dataset is built by your environment’s build_dataset() method.

How It Works

  1. Environment loads — Your module is imported via importlib.import_module(), class instantiated with optional config
  2. Dataset built — env.build_dataset(tokenizer) returns prompts
  3. Model generates — GRPO generates rl_num_generations completions per prompt
  4. Environment scores — env.score_episode() is called for each completion, returning 0.0-1.0
  5. GRPO optimizes — Policy is updated relative to the group scores (better completions get higher weight)
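Put together, the data flow looks roughly like this (an illustrative sketch, not the trainer's actual implementation):
import importlib

def grpo_data_flow(model, tokenizer, num_generations=4):
    # 1. Environment loads.
    module = importlib.import_module("my_envs.hotel_env")
    env = getattr(module, "HotelEnv")()               # config kwargs would go here

    # 2. Dataset built.
    dataset = env.build_dataset(tokenizer)

    for example in dataset:
        # 3. Model generates a group of completions for the prompt.
        inputs = tokenizer(example["prompt"], return_tensors="pt").to(model.device)
        outputs = model.generate(
            **inputs, do_sample=True,
            num_return_sequences=num_generations, max_new_tokens=128,
        )
        completions = tokenizer.batch_decode(
            outputs[:, inputs["input_ids"].shape[-1]:], skip_special_tokens=True
        )

        # 4. Environment scores each completion.
        rewards = [
            env.score_episode(model, tokenizer, c, example["case_idx"])
            for c in completions
        ]

        # 5. GRPO turns `rewards` into group-relative advantages and updates
        #    the policy; GRPOTrainer handles this step.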

Example: Hotel Booking Agent

# my_envs/hotel_env.py
from datasets import Dataset

class HotelEnv:
    def __init__(self, max_turns=5):
        self.max_turns = max_turns
        self.cases = [
            {"prompt": "Book a room in Paris for 2 nights", "expected": "paris"},
            {"prompt": "Find a hotel near the airport in Tokyo", "expected": "tokyo"},
        ]

    def build_dataset(self, tokenizer):
        return Dataset.from_dict({
            "prompt": [c["prompt"] for c in self.cases],
            "case_idx": list(range(len(self.cases))),
        })

    def score_episode(self, model, tokenizer, completion, case_idx):
        expected = self.cases[case_idx]["expected"]
        # Simple: check if the expected keyword appears in completion
        if expected.lower() in completion.lower():
            return 1.0
        return 0.0

    def get_tools(self):
        return [{
            "type": "function",
            "function": {
                "name": "search_hotels",
                "description": "Search for hotels in a city",
                "parameters": {
                    "type": "object",
                    "properties": {
                        "city": {"type": "string"},
                        "nights": {"type": "integer"}
                    },
                    "required": ["city"]
                }
            }
        }]
aitraining llm --train \
  --model deepseek-ai/DeepSeek-R1-Distill-Qwen-1.5B \
  --trainer grpo \
  --rl-env-module my_envs.hotel_env \
  --rl-env-class HotelEnv \
  --rl-num-generations 4 \
  --rl-max-new-tokens 256 \
  --peft \
  --lora-r 16 \
  --lr 1e-5 \
  --batch-size 4

GRPO vs PPO

| Feature | PPO | GRPO |
|---|---|---|
| Reward source | Pre-trained reward model | Custom environment (your code) |
| Training signal | Reward model scores | Environment episode scores (0-1) |
| Completions per prompt | 1 | Multiple (rl_num_generations) |
| Best for | General RLHF | Agentic training, tool use, multi-turn |
| Requires | Reward model path | Python env module + class |
| TRL version | >= 0.26.0 | >= 0.28.0 |

Best Practices

  1. Start with simple environments — Validate that scoring works before complex multi-turn logic
  2. Use small rl_num_generations — Start with 4, increase if you need more diversity in completions
  3. Score between 0 and 1 — Use the full range; avoid always returning 0 or 1
  4. Test your environment independently — Make sure build_dataset() and score_episode() work before training (see the sketch after this list)
  5. Use LoRA — GRPO with full fine-tuning requires significant memory; LoRA makes it practical
  6. Small learning rates — Start with 1e-5, same guidance as PPO
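For example, the hotel environment above can be smoke-tested in a few lines before any GPU time is spent (model can be None here because this simple scorer never uses it):
from transformers import AutoTokenizer
from my_envs.hotel_env import HotelEnv

tokenizer = AutoTokenizer.from_pretrained("deepseek-ai/DeepSeek-R1-Distill-Qwen-1.5B")
env = HotelEnv()

dataset = env.build_dataset(tokenizer)
print(dataset.column_names)        # expect ['prompt', 'case_idx']

score = env.score_episode(None, tokenizer, "I booked a hotel in Paris.", case_idx=0)
print(score)                       # expect 1.0 (keyword match)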

Next Steps