$\pi_{0.7}$: A Steerable Generalist Robotic Foundation Model with Emergent Capabilities
This paper follows up on $\pi_0$: A Vision-Language-Action Flow Model for General Robot Control; $\pi_{0.5}$: A Vision-Language-Action Model with Open-World Generalization; and $\pi_{0.6}$: A VLA That Learns From Experience.
Update: $\pi_{0.7}$ is trained with context prompts that include not only the task language but also subtask instructions, generated subgoal images, the control mode, and episode metadata (speed, quality, and mistake labels).
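To make the prompt components concrete, here is a minimal sketch of how such a context prompt could be represented and rendered to text. Every name in it (`ContextPrompt`, `render_prompt`, the field names, the `<subgoal_image>` placeholder, and the `key: value` layout) is a hypothetical choice for illustration; the notes above do not specify the actual prompt format.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class ContextPrompt:
    """Hypothetical container for the prompt components listed above."""
    task: str
    subtask: Optional[str] = None
    subgoal_image: Optional[object] = None  # image data; enters the model as visual tokens
    control_mode: Optional[str] = None
    speed: Optional[str] = None
    quality: Optional[str] = None
    mistake: Optional[bool] = None


def render_prompt(p: ContextPrompt) -> str:
    """Join whichever components are present into one prompt string."""
    parts = [f"task: {p.task}"]
    if p.subtask is not None:
        parts.append(f"subtask: {p.subtask}")
    if p.subgoal_image is not None:
        # Assumed placeholder marking where the image tokens would be spliced in.
        parts.append("<subgoal_image>")
    if p.control_mode is not None:
        parts.append(f"control mode: {p.control_mode}")
    if p.speed is not None:
        parts.append(f"speed: {p.speed}")
    if p.quality is not None:
        parts.append(f"quality: {p.quality}")
    if p.mistake is not None:
        parts.append(f"mistake: {'yes' if p.mistake else 'no'}")
    return "; ".join(parts)
```

Making every component except the task optional lets the same structure describe both a plain language-conditioned prompt and a fully annotated one.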
History observation encoding: $\pi_{0.7}$ uses a MEM-style history encoder (MEM: Multi-Scale Embodied Memory for Vision Language Action Models). Multiple past observation frames are compressed temporally and spatially into a fixed number of visual tokens, equal to one frame’s token budget. This provides memory without growing the transformer context length in proportion to the history length.
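The key property is that the output token count does not depend on how many frames go in. The sketch below illustrates only that property, using plain mean pooling over time as a stand-in; it is not the MEM encoder, whose actual compression is learned. The function name `compress_history` and the nested-list token layout are assumptions made for the example.

```python
from typing import List

Token = List[float]   # one D-dimensional visual token
Frame = List[Token]   # N tokens for a single observation frame


def compress_history(frames: List[Frame]) -> Frame:
    """Pool T frames of N tokens each into N tokens by averaging over time.

    Stand-in for a learned history encoder: the output always has one
    frame's worth of tokens, regardless of how many frames are given.
    """
    if not frames:
        raise ValueError("need at least one frame")
    n_tokens = len(frames[0])
    dim = len(frames[0][0])
    t = len(frames)
    pooled: Frame = []
    for i in range(n_tokens):
        # Average token i across all T frames, one dimension at a time.
        pooled.append([sum(f[i][d] for f in frames) / t for d in range(dim)])
    return pooled
```

With this shape contract, going from 2 to 20 history frames changes the encoder's input but leaves the number of tokens handed to the transformer unchanged.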
State $q_t$: Unlike $\pi_{0.6}$, which represents the state as discretized text tokens, $\pi_{0.7}$ follows MEM and embeds the state with a linear projection that maps the state dimension to the backbone dimension.
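As a rough illustration, the projection amounts to a single affine map $e_t = W q_t + b$, producing one continuous token of backbone width. The sketch below uses randomly initialized weights in place of learned ones; the helper names (`make_projection`, `embed_state`), the dimensions in the usage note, and the presence of a bias term are all assumptions.

```python
import random
from typing import List, Tuple


def make_projection(state_dim: int, model_dim: int, seed: int = 0) -> Tuple[List[List[float]], List[float]]:
    """Create a (model_dim x state_dim) weight matrix and a zero bias.

    Random initialization stands in for learned parameters.
    """
    rng = random.Random(seed)
    scale = state_dim ** -0.5
    weight = [[rng.gauss(0.0, scale) for _ in range(state_dim)] for _ in range(model_dim)]
    bias = [0.0] * model_dim
    return weight, bias


def embed_state(q_t: List[float], weight: List[List[float]], bias: List[float]) -> List[float]:
    """Map the state vector q_t to one backbone-width embedding: W q_t + b."""
    if any(len(row) != len(q_t) for row in weight):
        raise ValueError("state dimension does not match the projection")
    return [sum(w * x for w, x in zip(row, q_t)) + b for row, b in zip(weight, bias)]
```

For example, `embed_state([0.1, 0.2, 0.3], *make_projection(3, 8))` returns a list of 8 floats, so the state occupies a single token no matter how many digits a text tokenization of the same values would have needed.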
Sampling scheme: We found the following sampling scheme to be effective for selecting the timesteps for the real images: with probability 0.25, we sample the end-of-segment images (consistent with the prediction target for the world model), and with probability 0.75 we sample future images uniformly from 0–4 seconds ahead of the current timestep.
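The scheme above can be sketched as a two-branch sampler. The function name `sample_subgoal_offset` and its parameters are hypothetical, and the sketch does not handle details the text leaves open, such as what happens when the uniformly sampled offset runs past the end of the episode.

```python
import random


def sample_subgoal_offset(
    seconds_to_segment_end: float,
    rng: random.Random,
    p_end: float = 0.25,
    max_horizon_s: float = 4.0,
) -> float:
    """Return how many seconds ahead of the current timestep to take the real image.

    With probability p_end, pick the end of the current segment;
    otherwise pick uniformly between 0 and max_horizon_s seconds ahead.
    """
    if rng.random() < p_end:
        return seconds_to_segment_end
    return rng.uniform(0.0, max_horizon_s)
```

Calling it repeatedly with a seeded generator, for example `sample_subgoal_offset(10.0, random.Random(0))`, yields the end-of-segment offset about a quarter of the time and a value between 0 and 4 seconds otherwise.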