TL;DR: a simple recipe for online RL with flow policies, validated on robot locomotion, humanoid motion tracking, and manipulation.
Policy gradient algorithms are the dominant approach for training robot control policies from rewards. These methods typically assume access to a differentiable action likelihood \(\pi_\theta(a_t \mid o_t)\); PPO, for example, uses this to optimize a clipped objective:
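\[
L^{\text{PPO}}(\theta) = \mathbb{E}_t\!\left[\min\!\Big(r_t(\theta)\,\hat{A}_t,\;\operatorname{clip}\big(r_t(\theta),\,1-\epsilon_\text{clip},\,1+\epsilon_\text{clip}\big)\,\hat{A}_t\Big)\right],
\qquad
r_t(\theta) = \frac{\pi_\theta(a_t \mid o_t)}{\pi_{\theta_\text{old}}(a_t \mid o_t)},
\]
where \(\hat{A}_t\) is an advantage estimate, \(\epsilon_\text{clip}\) is the clipping threshold, and the expectation is over data collected with \(\pi_{\theta_\text{old}}\).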
We often want to apply algorithms like PPO to more expressive flow and diffusion policies: for example, to exploit richer action distributions for exploration, or to fine-tune flow policies learned from demonstrations. Prior work in robotics typically addresses this by introducing mechanisms for computing likelihoods in flow models (DPPO, ReinFlow, NCDPO, GenPO, etc.).
Our contribution: we show how a simpler approach based on flow matching policy gradients, which bypasses likelihood computation entirely, can be made effective across a range of robotics tasks.
We introduce FPO++, an updated version of FPO (Flow Policy Optimization) that succeeds in real-world robotics tasks. We found two changes necessary for this.
Flow matching policy gradients. The goal of FPO is to enable policy gradient-style training of flow policies without explicit likelihoods. FPO proposes a surrogate for the likelihood ratio:
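\[
\hat{r}_\theta(a_t \mid o_t) = \exp\!\Big(\hat{\mathcal{L}}_{\text{CFM},\theta_\text{old}}(a_t; o_t) - \hat{\mathcal{L}}_{\text{CFM},\theta}(a_t; o_t)\Big),
\]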
where \(\hat{\mathcal{L}}_{\text{CFM},\theta}\) is a Monte Carlo estimate of the conditional flow matching (CFM) loss. This enables PPO-style training:
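\[
L^{\text{FPO}}(\theta) = \mathbb{E}_t\!\left[\min\!\Big(\hat{r}_\theta(a_t \mid o_t)\,\hat{A}_t,\;\operatorname{clip}\big(\hat{r}_\theta(a_t \mid o_t),\,1-\epsilon_\text{clip},\,1+\epsilon_\text{clip}\big)\,\hat{A}_t\Big)\right],
\]
i.e., the same clipped objective as above, with the surrogate ratio substituted for the likelihood ratio.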
Intuitively, FPO uses CFM loss differences to approximate action log-likelihood differences, then uses advantage estimates to shift probability flow toward higher-reward actions.
Conditional flow matching loss. To estimate CFM losses, we draw \(N_\text{mc}\) pairs of noise \(\epsilon_i \sim \mathcal{N}(0,I)\) and flow steps \(\tau_i \in [0, 1]\). For the linear interpolation \(a_{t}^{\tau_i} = \tau_i a_t + (1 - \tau_i)\epsilon_i\), we compute squared errors for the velocity predictions:
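\[
\ell_i(\theta) = \big\lVert v_\theta\big(a_t^{\tau_i}, \tau_i, o_t\big) - (a_t - \epsilon_i)\big\rVert^2,
\qquad
\hat{\mathcal{L}}_{\text{CFM},\theta}(a_t; o_t) = \frac{1}{N_\text{mc}} \sum_{i=1}^{N_\text{mc}} \ell_i(\theta),
\]
where \(v_\theta\) denotes the policy's velocity prediction and \(a_t - \epsilon_i\) is the target velocity of the linear interpolation path.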
While FPO succeeds on synthetic benchmarks, we found that it requires refinements for more difficult tasks. FPO++ proposes two changes: (1) per-sample ratios and (2) an asymmetric trust region.
Per-sample ratio. Standard FPO produces a single ratio per action by averaging over \((\tau_i, \epsilon_i)\) samples. In FPO++, we instead calculate a separate ratio for each sample:
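\[
\hat{r}_{\theta,i}(a_t \mid o_t) = \exp\!\big(\ell_i(\theta_\text{old}) - \ell_i(\theta)\big), \qquad i = 1, \ldots, N_\text{mc},
\]
using the per-sample errors \(\ell_i\) defined above, evaluated at the same \((\tau_i, \epsilon_i)\) pairs under the old and current parameters.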
This provides a finer-grained trust region by allowing each \((\tau_i, \epsilon_i)\) pair to be clipped independently.
Asymmetric trust region (ASPO). We use PPO clipping for positive-advantage actions; for negative-advantage actions, we adopt the more constrained SPO objective. Instead of zeroing out gradients for samples with ratios that surpass the trust region, SPO provides a gradient signal that pulls ratios back. Applying SPO to negative advantages disincentivizes large CFM loss increases, preserving entropy and stabilizing training.
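To make the structure concrete, here is a minimal JAX sketch of the per-action objective, assuming the per-sample CFM errors \(\ell_i\) have already been computed under the old and current parameters. The function and argument names are ours, and the negative-advantage branch uses a simple quadratic pull-back toward the clip boundary as a schematic stand-in for the exact SPO objective; see the paper for the precise form.

```python
# Illustrative sketch of an FPO++-style objective for one action (not the
# official implementation). Names are our own; the negative-advantage branch
# is a schematic stand-in for the exact SPO objective.
import jax
import jax.numpy as jnp


def fpo_pp_objective(cfm_loss_new, cfm_loss_old, advantage, clip_eps=0.2):
    """Per-action objective (to be maximized) from per-sample CFM losses.

    cfm_loss_new: [N_mc] squared velocity errors under current parameters.
    cfm_loss_old: [N_mc] squared velocity errors under old parameters.
    advantage:    scalar advantage estimate for this action.
    """
    # Per-sample surrogate ratios: exponentiated per-sample CFM loss decrease.
    ratios = jnp.exp(jax.lax.stop_gradient(cfm_loss_old) - cfm_loss_new)

    # Positive advantages: standard PPO clipping, applied per (tau_i, eps_i) sample.
    clipped = jnp.clip(ratios, 1.0 - clip_eps, 1.0 + clip_eps)
    ppo_term = jnp.minimum(ratios * advantage, clipped * advantage)

    # Negative advantages: instead of zeroing gradients outside the trust
    # region, penalize ratios back toward it (schematic SPO-style term).
    overshoot = jnp.maximum(jnp.abs(ratios - 1.0) - clip_eps, 0.0)
    spo_term = ratios * advantage - jnp.abs(advantage) * overshoot**2 / clip_eps

    per_sample = jnp.where(advantage > 0, ppo_term, spo_term)
    return per_sample.mean()  # average over the (tau_i, eps_i) samples
```

In practice this quantity would be averaged over a batch of actions and maximized with a standard optimizer alongside the usual value and entropy terms.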
We validate the effect of these changes in our experiments; see paper for details!
@article{yi2026flow,
  title={Flow Policy Gradients for Robot Control},
  author={Yi, Brent and Choi, Hongsuk and Singh, Himanshu Gaurav and Huang, Xiaoyu and Truong, Takara E. and Sferrazza, Carmelo and Ma, Yi and Duan, Rocky and Abbeel, Pieter and Shi, Guanya and Liu, Karen and Kanazawa, Angjoo},
  journal={arXiv preprint arXiv:2602.02481},
  year={2026}
}