ReGRPO — Reflection-Augmented RL for Tool-Using Agents
Group-relative policy optimization with structured self-reflection for long-horizon tool use.
ReGRPO extends group-relative policy optimization (GRPO), which scores each rollout against the other rollouts sampled for the same task, with a structured reflection step for tool-using agents.
- Injects structured self-reflection into the agent’s multi-modal chain-of-thought during training.
- The group-relative signal stabilizes credit assignment over long tool-use trajectories, while reflection turns failed rollouts into a reusable learning signal.
- Improves tool-selection reward and sample efficiency on long-horizon, multi-tool tasks. Currently under review at ECCV 2026.
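
The group-relative signal mentioned above can be illustrated with a minimal sketch of the standard GRPO advantage normalization, in which each rollout's reward is centered and scaled by the statistics of its own sampling group. This is an assumption about the general technique, not ReGRPO's actual implementation: the function name `group_relative_advantages`, the `eps` parameter, and the example rewards are hypothetical, and the reflection step is not modeled here.

```python
import statistics

def group_relative_advantages(rewards, eps=1e-8):
    """Normalize each rollout's reward against its own sampling group.

    Hypothetical helper: standard GRPO-style normalization, not ReGRPO's code.
    """
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)  # population std over the group
    # eps keeps the division defined when every rollout gets the same reward
    return [(r - mean) / (std + eps) for r in rewards]

# Illustrative rewards for four rollouts of the same task (made-up values).
rewards = [1.0, 0.0, 0.0, 1.0]
advantages = group_relative_advantages(rewards)
```

Rollouts that beat the group mean receive a positive advantage and the rest a negative one, so no separate value network is needed to provide a baseline. When every rollout in a group receives the same reward, all advantages are zero and that group contributes no gradient.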