ECCV 2026

ReGRPO: Reflection-Augmented Policy Optimization for Tool-Using Agents

Teaching vision–language agents to diagnose and recover from their own tool-use failures — not just imitate successes.

Show Lab, National University of Singapore
Overview of ReGRPO. (1) A Structured Reflective Data Engine synthesizes a near-miss failure from a ground-truth action, executes it to obtain a grounded failure observation, and uses a teacher VLM to annotate a structured (ErrorType, Evidence, FixPlan) reflection paired with the corrected action. (2) ReGRPO training samples groups of local trajectories — one-shot successes and reflection-based recoveries a⁰→o⁰→z→a¹→o¹ — and uses group-relative advantages to optimize both reflection and correction tokens. (3) Zero-verifier inference opens a local reflection–correction block only when a deterministic trigger fires.

| Abstract

Tool-augmented vision–language models (VLMs) can solve multimodal, multi-step tasks by calling external tools, yet they remain fragile in practice. Existing training has two common gaps: supervised fine-tuning (SFT) is built mostly on successful trajectories and offers little signal for recovery after tool failures, while sparse trajectory-level RL rewards provide limited guidance on which step failed and how to repair it.

We introduce ReGRPO (Reflection-augmented Group Relative Policy Optimization), a framework that learns reflection-guided correction in tool-using agents. ReGRPO starts with a structured reflective data engine: we execute near-miss actions to collect grounded failure observations, then build Reflection-of-Thought triplets (ErrorType, Evidence, FixPlan) paired with corrected actions for warm-start SFT. We then optimize reflection tokens and corrective actions jointly within local trajectories using group-relative advantages, and include a reflection-cost term to reduce unnecessary reflection. Experiments on GTA and GAIA show that, under the same backbone and tool suite, ReGRPO achieves the best answer accuracy among the compared open-source controllers: 67.66 on GTA and 23.35 on GAIA, versus 60.26 and 20.61 for the strongest baseline, SPORT.

| Approach

Three components turn failures into a learnable recovery skill.

1. Reflective Data Engine

Perturb a ground-truth tool call into a realistic near-miss, execute it for a grounded failure observation, and annotate a structured (ErrorType, Evidence, FixPlan) reflection together with the corrected action. This yields explicit Error → Reflection → Correction supervision.

2. Reflection-Augmented GRPO

Reflection tokens become part of the optimized trajectory, so group-relative advantages directly scale gradients on diagnostic reflection and corrective actions. A reflection-cost term keeps reflection brief.

3. Zero-Verifier Inference

A deterministic trigger opens at most one local reflection–correction block per step, from hard-failure signals plus a lightweight confidence proxy. No external verifier is called at deployment.
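
The zero-verifier trigger can be pictured with a small sketch. Everything below is an illustrative assumption rather than the authors' implementation: the `StepObservation` fields, the two hard-failure signals, the 0.5 threshold, and the choice to combine hard failure and low confidence with a logical OR are all hypothetical. The page only states that the trigger is deterministic, uses hard-failure signals plus a confidence proxy, and opens at most one block per step.

```python
from dataclasses import dataclass


@dataclass
class StepObservation:
    """Hypothetical summary of one tool call's outcome."""
    tool_error: bool     # the tool raised or returned an error status
    output_empty: bool   # the tool returned nothing usable
    confidence: float    # assumed proxy in [0, 1], e.g. mean token probability


def should_reflect(obs: StepObservation,
                   already_reflected: bool,
                   conf_threshold: float = 0.5) -> bool:
    """Deterministic trigger: no sampling and no external verifier call.

    Returns True when a reflection-correction block should be opened.
    """
    # Budget: at most one reflection-correction block per step.
    if already_reflected:
        return False
    hard_failure = obs.tool_error or obs.output_empty
    low_confidence = obs.confidence < conf_threshold
    # Assumption: either signal alone is enough to open the block.
    return hard_failure or low_confidence
```

In this sketch the `already_reflected` flag is what enforces the one-block-per-step budget, and a confident, error-free step never pays the cost of reflection.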

Reflection-aware reward

$$R(\tau) = \lambda_{\text{exec}} \cdot \mathbb{1}\{\text{success}\} - \eta \cdot C(\tau) + \lambda_{\text{val}} \cdot V(x, \tau)$$

The first two terms form a complete verifier-free objective (default $\lambda_{\text{val}} = 0$). $C(\tau)$ penalizes reflection length (0 for one-shot success); $V$ is an optional, training-only teacher verifier. Group advantage: $A_i^{(k)} = R(\tau_i^{(k)}) - \bar{R}_i$.
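
To make the reward and the group advantage concrete, here is a minimal sketch. It assumes that the reflection cost is the raw count of reflection tokens and that the baseline is the mean reward over the sampled group; the page does not spell out either detail. The `LocalTrajectory` class and the coefficient values are hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass
from statistics import fmean


@dataclass
class LocalTrajectory:
    """Hypothetical record of one sampled local trajectory."""
    success: bool                # did the final action succeed?
    reflection_tokens: int       # 0 for a one-shot success
    verifier_score: float = 0.0  # optional training-only V(x, tau)


def reward(traj: LocalTrajectory,
           lam_exec: float = 1.0,
           eta: float = 0.001,
           lam_val: float = 0.0) -> float:
    """R = lam_exec * 1{success} - eta * C + lam_val * V."""
    cost = float(traj.reflection_tokens)  # assumed form of C(tau)
    return (lam_exec * float(traj.success)
            - eta * cost
            + lam_val * traj.verifier_score)


def group_advantages(group: list[LocalTrajectory], **coeffs: float) -> list[float]:
    """A_k = R(tau_k) - mean reward of the group (assumed baseline)."""
    rewards = [reward(t, **coeffs) for t in group]
    baseline = fmean(rewards)
    return [r - baseline for r in rewards]


# One group for a single step: two successes and two failures.
group = [
    LocalTrajectory(success=True, reflection_tokens=0),     # one-shot success
    LocalTrajectory(success=True, reflection_tokens=100),   # recovery via reflection
    LocalTrajectory(success=False, reflection_tokens=100),  # reflected, still failed
    LocalTrajectory(success=False, reflection_tokens=0),    # failed without reflecting
]
advantages = group_advantages(group)
```

With these illustrative coefficients the one-shot success receives the largest advantage, the successful recovery comes second, and a failed trajectory that also spent tokens on reflection receives the lowest. That ordering is the behavior the reflection-cost term is meant to encourage.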

Reflection-of-Thought (RoT) data pipeline. Each clean trajectory step is perturbed into a near-miss action, executed to obtain a grounded failure observation, and paired with a structured reflection triplet and the corrected action. Strict schema, evidence-grounding, and label-leak gates keep the corpus well-formed for vision SFT.
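
The three gates named in the caption can be illustrated with a small filter over one RoT record. This is a sketch under stated assumptions, not the authors' pipeline: the `RoTRecord` fields, the `ERROR_TYPES` vocabulary, the substring test for evidence grounding, and the reading of "label leak" as the corrected action appearing verbatim inside the reflection are all hypothetical, as is the `ocr` tool call in the example.

```python
from dataclasses import dataclass, replace

# Hypothetical closed vocabulary for the ErrorType field.
ERROR_TYPES = {"wrong_tool", "bad_argument", "missing_input"}


@dataclass(frozen=True)
class RoTRecord:
    failed_action: str        # the perturbed near-miss action
    failure_observation: str  # what the tool returned when it was executed
    error_type: str
    evidence: str
    fix_plan: str
    corrected_action: str     # the ground-truth action


def passes_schema(rec: RoTRecord) -> bool:
    """Schema gate: known error type and no empty fields."""
    texts = (rec.failed_action, rec.failure_observation,
             rec.evidence, rec.fix_plan, rec.corrected_action)
    return rec.error_type in ERROR_TYPES and all(t.strip() for t in texts)


def is_evidence_grounded(rec: RoTRecord) -> bool:
    """Grounding gate (assumed): evidence must quote the observation."""
    evidence = rec.evidence.strip().lower()
    return bool(evidence) and evidence in rec.failure_observation.lower()


def has_label_leak(rec: RoTRecord) -> bool:
    """Leak gate (assumed): the reflection must not contain the answer."""
    reflection = f"{rec.evidence} {rec.fix_plan}".lower()
    return rec.corrected_action.strip().lower() in reflection


def keep(rec: RoTRecord) -> bool:
    return passes_schema(rec) and is_evidence_grounded(rec) and not has_label_leak(rec)


good = RoTRecord(
    failed_action='ocr(image="chart.png", lang="fr")',
    failure_observation='OCR returned empty text: unsupported language "fr" for this image',
    error_type="bad_argument",
    evidence='unsupported language "fr"',
    fix_plan="Re-run OCR with the language that matches the text in the image.",
    corrected_action='ocr(image="chart.png", lang="en")',
)
leaky = replace(good, fix_plan='Call ocr(image="chart.png", lang="en") instead.')
ungrounded = replace(good, evidence="the image is blurry")
```

Here `keep(good)` passes all three gates, while `leaky` is rejected because its fix plan copies the corrected action and `ungrounded` is rejected because its evidence does not appear in the failure observation.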
ReGRPO vs. PPO / DPO / GRPO. PPO and DPO optimize actions or preferences without treating reflection as a decision variable; GRPO reduces variance via group-relative rewards; ReGRPO further includes reflection in the optimized trajectory, providing stronger recovery-oriented supervision for failed steps.

| Results

Same Qwen2-VL-7B backbone and tool suite for every method; single-path, zero-verifier inference. AnsAcc = answer accuracy.

| Method (controller) | GTA AnsAcc | GAIA AnsAcc |
|---|---|---|
| MAT-Agent / T3-Agent (MAT-Qwen2-VL-7B) | 53.85 | 16.97 |
| SPORT (Tuned-Qwen2-VL-7B) | 60.26 | 20.61 |
| ReGRPO (default, λval = 0) | 67.66 | 23.35 |

The verifier-free default is already the strongest among the compared open-source controllers (+7.40 GTA / +2.74 GAIA over SPORT). An optional deterministic verifier reward adds a further +0.83 / +0.66.

| BibTeX

@inproceedings{zhang2026regrpo,
  title     = {ReGRPO: Reflection-Augmented Group Relative Policy Optimization for Tool-Using Agents},
  author    = {Zhang, Binjie and Shou, Mike Zheng},
  booktitle = {European Conference on Computer Vision (ECCV)},
  year      = {2026}
}