ReGRPO — Reflection-Augmented RL for Tool-Using Agents

Group-relative policy optimization with structured self-reflection for long-horizon tool use.

ReGRPO is a reflection-augmented variant of group-relative policy optimization for tool-using agents.

  • Injects structured self-reflection into the agent’s multi-modal chain-of-thought during training.
  • Stabilizes credit assignment over long tool-use trajectories through the group-relative signal, while reflection turns failed rollouts into a reusable learning signal.
  • Improves tool-selection reward and sample efficiency on long-horizon, multi-tool tasks. Currently under review at ECCV 2026.
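
To make the "group-relative signal" concrete, here is a minimal sketch of the advantage computation commonly used in group-relative policy optimization: each rollout's reward is normalized against the mean and standard deviation of the other rollouts sampled for the same task. It is an illustration under stated assumptions, not ReGRPO's actual implementation. The function name, the `eps` constant, the example rewards, and the rule that below-average rollouts become reflection candidates are all hypothetical.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards, eps=1e-8):
    """Normalize each rollout's reward against its own sampling group.

    `rewards` holds one scalar reward per rollout sampled for the same task.
    `eps` (an assumed constant) avoids division by zero when every rollout
    in the group receives the same reward.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)  # population standard deviation of the group
    return [(r - mu) / (sigma + eps) for r in rewards]


# Hypothetical group of four rollouts for one tool-use task:
# 1.0 = the task was solved, 0.0 = the task failed.
rewards = [1.0, 0.0, 0.0, 1.0]
advantages = group_relative_advantages(rewards)

# Assumption for illustration only: rollouts that scored below the group
# average are the ones handed to a reflection step.
reflection_candidates = [i for i, a in enumerate(advantages) if a < 0]
```

Because the baseline is the group's own mean, no separate value network is needed, which is one reason this style of signal is attractive for long trajectories. How ReGRPO selects rollouts for reflection and feeds reflections back into training is not specified here, so the last line should be read as a placeholder.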