$\pi_0$: A Vision-Language-Action Flow Model for General Robot Control
This paper proposes a novel # Flow Matching architecture built on top of a pre-trained vision-language model (PaliGemma) and details how to build and train on a large, diverse dataset collected from multiple robot platforms. The flow matching algorithm is a variant of diffusion that allows the model to handle high-frequency action chunks and highly dexterous tasks.
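To make the training objective concrete, here is a minimal sketch of a flow matching loss and Euler sampler for an action chunk. It follows one common rectified-flow convention ($\tau = 0$ is pure noise, $\tau = 1$ is data); the paper's exact parameterization, timestep sampling, and conditioning interface may differ, and `policy`, `obs`, and the tensor shapes are illustrative assumptions.

```python
# Minimal sketch of a flow matching loss and Euler sampler for an action chunk.
# Convention here: tau = 0 is pure noise, tau = 1 is data; `policy(obs, noisy, tau)`
# is a hypothetical stand-in for the full VLA (VLM backbone + action expert) that
# predicts the vector field v_theta.
import torch

def flow_matching_loss(policy, obs, actions):
    """actions: (B, H, action_dim) ground-truth action chunk (H = 50 in pi_0)."""
    eps = torch.randn_like(actions)                                   # Gaussian noise sample
    tau = torch.rand(actions.shape[0], 1, 1, device=actions.device)   # flow time per sample
    noisy = tau * actions + (1.0 - tau) * eps                         # straight-line interpolation
    target_v = actions - eps                                          # velocity of the straight path
    pred_v = policy(obs, noisy, tau)                                  # predicted vector field
    return ((pred_v - target_v) ** 2).mean()

@torch.no_grad()
def sample_action_chunk(policy, obs, horizon=50, action_dim=32, steps=10):
    """Integrate the learned vector field from noise (tau = 0) to an action chunk (tau = 1)."""
    a = torch.randn(1, horizon, action_dim)
    dt = 1.0 / steps
    for i in range(steps):
        tau = torch.full((1, 1, 1), i * dt)
        a = a + dt * policy(obs, a, tau)
    return a
```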
Some interesting ideas from the paper:
- The axis along which human intelligence most outpaces machine intelligence is versatility: the ability to solve diverse tasks situated in varied physical environments.
- Three major challenges in developing generalist robot policies: (1) training at very large scale; (2) the right model architecture; (3) the right training recipe.
- The training process contains 2 stages: (1) Pre-training phase for training a base model that exhibits broad capabilities and generalization, but is not necessarily specialized for high performance on any one task; (2) Post-training phase for specific downstream tasks by using high-quality curated data.
- They also mention that a high-level policy (such as # SuSIE: Subgoal Synthesis via Image Editing) can decompose a high-level task into more immediate subtasks, helping the proposed VLA complete more complex tasks.
Training
- They use the PaliGemma VLM as the backbone, with the following differences: (1) additional input and output projections for the robotics-specific tokens, including the robot states and actions, (2) an additional MLP for the flow matching timestep information, (3) a second, smaller set of weights for the action expert.
- The action chunk size is set to 50.
- The model is implemented as a single transformer with two sets of weights (a mixture-of-experts-style design), where each token is routed to one of the two experts; the weights interact only through the transformer's self-attention layers.
- The image and language prompt are routed to the VLM backbone, while the robot state and noisy actions are routed to the action expert.
- The model structure is quite interesting: it contains one Gemma-2B backbone for vision-language reasoning and one 300M-parameter Gemma-style action expert. Image and text tokens are fed to the VLM backbone, while the robot state and noisy actions are fed to the action expert; the two experts only exchange information through the shared self-attention layers (see the sketch after this list).
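Below is a minimal sketch of the "two experts, shared self-attention" idea: prefix tokens (images + language) go through the VLM backbone's projection weights, suffix tokens (state + noisy actions) go through the smaller action expert's weights, and the two only interact inside the shared attention computation. The class name, hidden sizes, and head counts are illustrative assumptions, not the paper's exact implementation.

```python
# Sketch of one attention layer with two sets of projection weights that share
# a single attention computation over the concatenated token sequence.
import torch
import torch.nn as nn

class TwoExpertAttention(nn.Module):
    def __init__(self, d_vlm=2048, d_act=1024, d_head=64, n_heads=8):
        super().__init__()
        d_attn = n_heads * d_head
        # Separate projection weights per expert; attention itself is shared.
        self.qkv_vlm = nn.Linear(d_vlm, 3 * d_attn)
        self.qkv_act = nn.Linear(d_act, 3 * d_attn)
        self.out_vlm = nn.Linear(d_attn, d_vlm)
        self.out_act = nn.Linear(d_attn, d_act)
        self.n_heads, self.d_head = n_heads, d_head

    def forward(self, x_vlm, x_act, attn_mask):
        # x_vlm: (B, N_v, d_vlm) image/text tokens; x_act: (B, N_a, d_act) state/action tokens
        # attn_mask: bool, True where a query token may attend to a key token
        B, N_v, _ = x_vlm.shape
        qkv = torch.cat([self.qkv_vlm(x_vlm), self.qkv_act(x_act)], dim=1)
        q, k, v = qkv.chunk(3, dim=-1)

        def split(t):  # (B, N, d_attn) -> (B, heads, N, d_head)
            return t.view(B, -1, self.n_heads, self.d_head).transpose(1, 2)

        q, k, v = split(q), split(k), split(v)
        # Blockwise-causal mask applied over the full concatenated sequence.
        y = nn.functional.scaled_dot_product_attention(q, k, v, attn_mask=attn_mask)
        y = y.transpose(1, 2).reshape(B, q.shape[2], -1)
        return self.out_vlm(y[:, :N_v]), self.out_act(y[:, N_v:])
```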
Attention Mask
$\pi_0$ uses a blockwise # Causal Attention mask with 3 blocks $[I_t^1, \cdots, I_t^n, l_t]$, $[q_t]$, and $[a_t, \cdots, a_{t+H-1}]$, where $I_t^n$ denotes the image captured by camera $n$ at time step $t$, $l_t$ is the language prompt or task description, $q_t$ is the proprioceptive state at time step $t$, and $[a_t, \cdots, a_{t+H-1}]$ denotes the action chunks from time step $t$ to $t+H-1$.
Within each block there is full bidirectional attention, but tokens in one block cannot attend to tokens in later blocks.
For the modality sequence [img1, img2, text, q, a_1, a_o], the per-token block indicator is [0, 0, 0, 1, 1, 0] (a 1 marks the first token of a new block), which yields the following attention mask (1 = may attend); a construction sketch follows the table:
| modality | img1 | img2 | text | q | a_1 | a_o |
|---|---|---|---|---|---|---|
| img1 | 1 | 1 | 1 | 0 | 0 | 0 |
| img2 | 1 | 1 | 1 | 0 | 0 | 0 |
| text | 1 | 1 | 1 | 0 | 0 | 0 |
| q | 1 | 1 | 1 | 1 | 0 | 0 |
| a_1 | 1 | 1 | 1 | 1 | 1 | 1 |
| a_o | 1 | 1 | 1 | 1 | 1 | 1 |
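A small sketch of how such a mask could be built from the per-token block indicator (1 marks the first token of a new block); with input [0, 0, 0, 1, 1, 0] it reproduces the table above. This is an illustrative helper, not the paper's code.

```python
import numpy as np

def blockwise_causal_mask(block_starts):
    """block_starts[i] = 1 if token i begins a new attention block, else 0."""
    block_id = np.cumsum(block_starts)            # e.g. [0, 0, 0, 1, 2, 2]
    # Token i may attend to token j iff j's block is not later than i's block.
    return (block_id[None, :] <= block_id[:, None]).astype(int)

# Rows: img1, img2, text, q, a_1, a_o. Each row lists which tokens it attends to.
print(blockwise_causal_mask([0, 0, 0, 1, 1, 0]))
```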
Advantages
- The first block includes the input modalities from PaliGemma’s VLM pre-training, which are prevented from attending to future blocks (which include new inputs) to minimize distribution shift from said pre-training.
- The robot state $q_t$ is its own block because it does not change with each flow matching integration step; preventing it from attending to the final block allows its corresponding keys and values to be cached during sampling.
- From the perspective of the causal attention design, the state can gather information from ALL visual and textual tokens.
Disadvantages
- From the perspective of the causal attention design, the state cannot influence how the vision-language features are processed.