ReGRPO: Reflection-Augmented Policy Optimization for Tool-Using Agents

**Overview of ReGRPO.** (1) A **Structured Reflective Data Engine** synthesizes a near-miss failure from a ground-truth action, executes it to obtain a grounded failure observation, and uses a teacher VLM to annotate a structured `(ErrorType, Evidence, FixPlan)` reflection paired with the corrected action. (2) **ReGRPO training** samples groups of local trajectories (one-shot successes and reflection-based recoveries `a⁰→o⁰→z→a¹→o¹`) and uses group-relative advantages to optimize *both* reflection and correction tokens. (3) **Zero-verifier inference** opens a local reflection–correction block only when a deterministic trigger fires.

| Abstract

Tool-augmented vision–language models (VLMs) can solve multimodal, multi-step tasks by calling external tools, yet they remain fragile in practice. Existing training has two common gaps: supervised fine-tuning (SFT) is built mostly on successful trajectories and offers little signal for recovery after tool failures, while sparse trajectory-level RL rewards provide limited guidance on which step failed and how to repair it.

We introduce ReGRPO (Reflection-augmented Group Relative Policy Optimization), a framework that learns reflection-guided correction in tool-using agents. ReGRPO starts with a structured reflective data engine: we execute near-miss actions to collect grounded failure observations, then build Reflection-of-Thought triplets (ErrorType, Evidence, FixPlan) paired with corrected actions for warm-start SFT. We then optimize reflection tokens and corrective actions jointly within local trajectories using group-relative advantages, and include a reflection-cost term to reduce unnecessary reflection. Experiments on GTA and GAIA show that, under the same backbone and tool suite, ReGRPO consistently outperforms strong open-source baselines and achieves the best results among the compared open-source controllers.

| Approach

Three components turn failures into a learnable recovery skill.

Reflective Data Engine

Perturb a ground-truth tool call into a realistic near-miss, execute it to obtain a grounded failure observation, and annotate a structured (ErrorType, Evidence, FixPlan) reflection together with the corrected action. The result is explicit Error → Reflection → Correction supervision.
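The perturb, execute, annotate loop described here can be sketched in a few lines. This is a minimal illustration, not the authors' implementation: the `RoTRecord` fields, the `drop_argument` perturbation, the `image_qa` tool name, and the `fake_execute` / `fake_annotate` stand-ins (which replace the real tool runtime and the teacher VLM) are all hypothetical.

```python
from dataclasses import dataclass
from typing import Any, Callable, Dict, Tuple

ToolCall = Dict[str, Any]  # assumed shape: {"tool": str, "args": {...}}


@dataclass
class RoTRecord:
    """One Error -> Reflection -> Correction training example (hypothetical schema)."""
    failed_action: ToolCall
    failure_observation: str
    error_type: str
    evidence: str
    fix_plan: str
    corrected_action: ToolCall


def drop_argument(call: ToolCall, name: str) -> ToolCall:
    """One possible near-miss perturbation: omit a single argument (returns a copy)."""
    args = {k: v for k, v in call["args"].items() if k != name}
    return {"tool": call["tool"], "args": args}


def build_record(
    gt_call: ToolCall,
    perturb: Callable[[ToolCall], ToolCall],
    execute: Callable[[ToolCall], str],
    annotate: Callable[[ToolCall, str, ToolCall], Tuple[str, str, str]],
) -> RoTRecord:
    near_miss = perturb(gt_call)        # synthesize the failure
    observation = execute(near_miss)    # ground it by actually running the call
    error_type, evidence, fix_plan = annotate(near_miss, observation, gt_call)
    return RoTRecord(near_miss, observation, error_type, evidence, fix_plan, gt_call)


# Toy stand-ins for the tool runtime and the teacher VLM (assumptions for the demo).
def fake_execute(call: ToolCall) -> str:
    if "image" not in call["args"]:
        return "Error: missing required argument 'image'"
    return "ok"


def fake_annotate(near_miss: ToolCall, observation: str, gt_call: ToolCall):
    return ("MissingArgument", observation, "Re-issue the call with the 'image' argument")


gt_call = {"tool": "image_qa", "args": {"image": "photo.png", "question": "How many cars?"}}
record = build_record(gt_call, lambda c: drop_argument(c, "image"), fake_execute, fake_annotate)
```

Passing the perturbation, executor, and annotator as callables keeps the sketch independent of any particular tool suite or teacher model.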

Reflection-Augmented GRPO

Reflection tokens become part of the optimized trajectory, so group-relative advantages directly scale gradients on diagnostic reflection and corrective actions. A reflection-cost term keeps reflection brief.

Zero-Verifier Inference

A deterministic trigger, driven by hard-failure signals plus a lightweight confidence proxy, opens at most one local reflection–correction block per step. No external verifier is called at deployment.
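A trigger of this kind could look roughly like the sketch below. The source does not specify which hard-failure signals or which confidence proxy are used, so the `tool_error` flag, the `confidence` score, and the `conf_threshold` default are assumptions made purely for illustration.

```python
from dataclasses import dataclass


@dataclass
class StepObservation:
    tool_error: bool    # hypothetical hard-failure signal (e.g. exception, invalid arguments)
    confidence: float   # hypothetical proxy in [0, 1], e.g. mean token probability of the action


def should_reflect(obs: StepObservation, already_reflected: bool,
                   conf_threshold: float = 0.5) -> bool:
    """Deterministic rule: reflect at most once per step, with no verifier call."""
    if already_reflected:
        return False  # enforces "at most one block per step"
    return obs.tool_error or obs.confidence < conf_threshold
```

Because the rule depends only on the step's own observation and a fixed threshold, the same input always produces the same decision, and no extra model call is needed to make it.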

Reflection-aware reward

R(τ) = λ_exec · 1{success} − η · C(τ) + λ_val · V(x, τ)

The first two terms form a complete verifier-free objective (default λ_val=0). C(τ) penalizes reflection length (0 for one-shot success); V is an optional, training-only teacher verifier. Group advantage: A_i^(k) = R(τ_i^(k)) − R̄_i.
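As a worked illustration of the reward and the group advantage, the sketch below evaluates both for a small group of local trajectories. It assumes C(τ) is the raw count of reflection tokens and picks arbitrary values for λ_exec and η; the source does not give the exact form of C(τ) or the coefficient values, and the `LocalTrajectory` container is hypothetical.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class LocalTrajectory:
    success: bool
    reflection_tokens: int       # 0 for a one-shot success
    verifier_score: float = 0.0  # V(x, tau); unused when lam_val == 0


def reward(traj: LocalTrajectory, lam_exec: float = 1.0,
           eta: float = 0.001, lam_val: float = 0.0) -> float:
    """R(tau) = lam_exec * 1{success} - eta * C(tau) + lam_val * V(x, tau)."""
    cost = float(traj.reflection_tokens)  # assumed form of C(tau): token count
    return lam_exec * float(traj.success) - eta * cost + lam_val * traj.verifier_score


def group_advantages(group: List[LocalTrajectory], lam_exec: float = 1.0,
                     eta: float = 0.001, lam_val: float = 0.0) -> List[float]:
    """A^(k) = R(tau^(k)) minus the mean reward of the group."""
    rewards = [reward(t, lam_exec, eta, lam_val) for t in group]
    mean_reward = sum(rewards) / len(rewards)
    return [r - mean_reward for r in rewards]


group = [
    LocalTrajectory(success=True, reflection_tokens=0),     # one-shot success
    LocalTrajectory(success=True, reflection_tokens=100),   # recovery after reflection
    LocalTrajectory(success=False, reflection_tokens=100),  # reflection that did not help
]
advantages = group_advantages(group)
```

With these assumed values, the one-shot success receives the largest advantage, the successful recovery a slightly smaller one (it pays the reflection cost), and the failed reflection a negative one. The advantages sum to zero by construction.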

**Reflection-of-Thought (RoT) data pipeline.** Each clean trajectory step is perturbed into a near-miss action, executed to obtain a grounded failure observation, and paired with a structured reflection triplet and the corrected action. Strict schema, evidence-grounding, and label-leak gates keep the corpus well-formed for vision SFT.
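The three gates are named but not defined in this summary, so the following is only one plausible reading, sketched under stated assumptions: the schema gate checks for non-empty fields and a known error type, the evidence-grounding gate requires the evidence string to appear in the failure observation, and the label-leak gate rejects reflections that contain the task's final answer. The `ALLOWED_ERROR_TYPES` taxonomy and all function names are hypothetical.

```python
from typing import Dict

# Hypothetical error taxonomy; the actual ErrorType vocabulary is not given in the source.
ALLOWED_ERROR_TYPES = {"MissingArgument", "WrongTool", "BadArgumentValue"}


def passes_schema(reflection: Dict[str, str]) -> bool:
    """Schema gate: all three fields present and non-empty, error type recognized."""
    required = ("error_type", "evidence", "fix_plan")
    if any(not reflection.get(key, "").strip() for key in required):
        return False
    return reflection["error_type"] in ALLOWED_ERROR_TYPES


def is_evidence_grounded(reflection: Dict[str, str], observation: str) -> bool:
    """Grounding gate (assumed): the evidence must quote the failure observation."""
    return reflection["evidence"].strip().lower() in observation.lower()


def has_label_leak(reflection: Dict[str, str], answer: str) -> bool:
    """Leak gate (assumed): the reflection must not contain the final answer."""
    needle = answer.strip().lower()
    text = (reflection["evidence"] + " " + reflection["fix_plan"]).lower()
    return bool(needle) and needle in text


def keep_sample(reflection: Dict[str, str], observation: str, answer: str) -> bool:
    return (
        passes_schema(reflection)
        and is_evidence_grounded(reflection, observation)
        and not has_label_leak(reflection, answer)
    )
```

Substring matching is a deliberately crude stand-in here. A real pipeline would likely need fuzzier grounding and leak checks.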

**ReGRPO vs. PPO / DPO / GRPO.** PPO and DPO optimize actions or preferences without treating reflection as a decision variable; GRPO reduces variance via group-relative rewards; ReGRPO further includes reflection in the optimized trajectory, providing stronger recovery-oriented supervision for failed steps.

| Results

Same Qwen2-VL-7B backbone and tool suite for every method; single-path, zero-verifier inference. AnsAcc = answer accuracy.

| Method (controller) | GTA AnsAcc | GAIA AnsAcc |
| --- | --- | --- |
| MAT-Agent / T3-Agent (MAT-Qwen2-VL-7B) | 53.85 | 16.97 |
| SPORT (Tuned-Qwen2-VL-7B) | 60.26 | 20.61 |
| ReGRPO (default, λ_val = 0) | 67.66 | 23.35 |

The verifier-free default is already the strongest among the compared open-source controllers (+7.40 GTA / +2.74 GAIA over SPORT). An optional deterministic verifier reward adds a further +0.83 GTA / +0.66 GAIA.