FF's Notes

Vision-Language-Action Models

Jul 19, 2024
VLA

From RT-2 (Google DeepMind):

We propose to co-fine-tune state-of-the-art vision-language models on both robotic trajectory data and Internet-scale vision-language tasks, such as visual question answering. In contrast to other approaches, we propose a simple, general recipe to achieve this goal: in order to fit both natural language responses and robotic actions into the same format, we express the actions as text tokens and incorporate them directly into the training set of the model in the same way as natural language tokens. We refer to such category of models as vision-language-action models (VLA) and instantiate an example of such a model, which we call RT-2.

Vision-Language-Action models typically use a pre-trained vision-language model (VLM) as their base and are fine-tuned to predict robot actions. Models that do not rely on a pre-trained VLM are referred to here as Robotic Foundation Models.

VLA Models

We categorize current VLA architectures into two broad families:

  • Single‑system VLAs. A single vision–language model handles perception, reasoning, and action prediction. Continuous motor commands are first discretized by a learned action tokenizer (derived from the text tokenizer), and the model simply generates these tokens in sequence.
  • Dual‑system VLAs. High-level understanding and low-level control are split. A vision–language backbone encodes the current images and instruction, while a dedicated action‑policy network produces continuous commands from those embeddings.
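As a rough illustration of how a single-system VLA can turn continuous commands into tokens, the sketch below uses plain uniform binning with 256 bins per dimension (the scheme RT-2 describes) rather than a learned tokenizer. The function names, the action range, and the toy action are assumptions made for this example only.

```python
def tokenize_action(action, low, high, n_bins=256):
    """Map each continuous action dimension to an integer bin id in [0, n_bins - 1]."""
    tokens = []
    for a, lo, hi in zip(action, low, high):
        a = min(max(a, lo), hi)          # clip to the valid range
        frac = (a - lo) / (hi - lo)      # normalize to [0, 1]
        tokens.append(min(int(frac * n_bins), n_bins - 1))
    return tokens


def detokenize_action(tokens, low, high, n_bins=256):
    """Map bin ids back to the center of each bin."""
    return [lo + (t + 0.5) / n_bins * (hi - lo)
            for t, lo, hi in zip(tokens, low, high)]


# Toy 3-DoF action in an assumed range of [-1, 1] per dimension.
low, high = [-1.0] * 3, [1.0] * 3
action = [0.0, -1.0, 1.0]
tokens = tokenize_action(action, low, high)
recovered = detokenize_action(tokens, low, high)
```

In a real model each bin id would then be mapped onto an entry of the language model's vocabulary, so the action is generated like any other token sequence; the round trip loses at most half a bin width per dimension.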

Two principal mechanisms are used to connect the vision–language backbone to the action head:

  • Special-token-bridged. During VLM fine‑tuning, a reserved token such as <ACT> is appended; its final embedding is passed directly to the policy.
  • Feature-pooling-bridged. The full sequence of hidden states is aggregated—via max-pooling, mean‑pooling, or learned attention—to yield a compact feature vector fed to the policy.
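The two bridging mechanisms differ only in how a single feature vector is read out of the backbone's final hidden states. The sketch below shows both on a toy list-of-lists "hidden state" matrix; the helper names are hypothetical, and `query` stands in for what would be a learned vector in attention pooling.

```python
import math


def special_token_feature(hidden_states, act_index):
    """Special-token bridge: read the final hidden state at the <ACT> position."""
    return hidden_states[act_index]


def mean_pool(hidden_states):
    """Feature-pooling bridge: average over the sequence dimension."""
    n = len(hidden_states)
    return [sum(col) / n for col in zip(*hidden_states)]


def max_pool(hidden_states):
    """Feature-pooling bridge: element-wise maximum over the sequence."""
    return [max(col) for col in zip(*hidden_states)]


def attention_pool(hidden_states, query):
    """Feature-pooling bridge: softmax-weighted average scored against `query`."""
    scores = [sum(q * h for q, h in zip(query, row)) for row in hidden_states]
    m = max(scores)                                   # subtract max for stability
    weights = [math.exp(s - m) for s in scores]
    total = sum(weights)
    return [sum(w / total * row[d] for w, row in zip(weights, hidden_states))
            for d in range(len(query))]


# Toy hidden states: 3 tokens x 2 dims, with <ACT> assumed to be the last token.
hidden = [[1.0, 0.0], [3.0, 2.0], [2.0, 4.0]]
act_feature = special_token_feature(hidden, -1)
pooled_mean = mean_pool(hidden)
pooled_max = max_pool(hidden)
pooled_attn = attention_pool(hidden, query=[0.0, 0.0])  # zero query -> uniform weights
```

Either output is a fixed-size vector that the action-policy network can consume, regardless of how many image and text tokens the backbone processed.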

Single-system VLA

Dual-system VLA

Conditioning Strategy Comparison

How are the vision-language features and the proprioceptive (proprio) state connected to the action head?

See Scalable Diffusion Models with Transformers (the DiT paper) for details on the cross-attention, adaLN, and in-context conditioning variants of DiT.
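To make the three terms concrete, here is a minimal pure-Python sketch of each conditioning route. It is a simplification under stated assumptions: learned Q/K/V projections are left out (treated as identity), and the `shift`/`scale` vectors are passed in directly instead of being regressed from the condition by an MLP. adaLN-Zero additionally applies a zero-initialized gate to each residual branch, which is not shown. All names are illustrative.

```python
import math


def layer_norm(x, eps=1e-6):
    mean = sum(x) / len(x)
    var = sum((v - mean) ** 2 for v in x) / len(x)
    return [(v - mean) / math.sqrt(var + eps) for v in x]


def in_context(cond_tokens, action_tokens):
    """In-context: prepend conditioning tokens; self-attention then mixes them."""
    return cond_tokens + action_tokens


def cross_attention(action_tokens, cond_tokens):
    """Cross-attn: action tokens are queries, conditioning tokens are keys/values."""
    scale = 1.0 / math.sqrt(len(cond_tokens[0]))
    out = []
    for q in action_tokens:
        scores = [scale * sum(a * b for a, b in zip(q, k)) for k in cond_tokens]
        m = max(scores)
        w = [math.exp(s - m) for s in scores]
        z = sum(w)
        out.append([sum(wi / z * v[d] for wi, v in zip(w, cond_tokens))
                    for d in range(len(q))])
    return out


def adaln(token, shift, scale):
    """adaLN: normalize, then modulate with condition-dependent shift and scale."""
    return [n * (1.0 + s) + b for n, s, b in zip(layer_norm(token), scale, shift)]


cond = [[1.0, 0.0], [0.0, 1.0]]   # toy conditioning tokens (e.g. VLM features)
actions = [[0.0, 0.0]]            # one toy noised action token
sequence = in_context(cond, actions)
attended = cross_attention(actions, cond)
modulated = adaln([1.0, 3.0], shift=[0.0, 0.0], scale=[0.0, 0.0])
```

The practical difference is where the condition enters: in-context lengthens the sequence, cross-attention adds a separate attention layer over the condition, and adaLN compresses the condition into per-channel modulation parameters.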

| Model | VLM Features | Proprio State |
|---|---|---|
| π0 | Cross-attn [1] | Cross-attn [2] |
| π0.5 | Cross-attn [3] | Cross-attn |
| ManiFlow | Cross-attn | adaLN |
| FLOWER | Cross-attn [4] | adaLN |
| GR00T N1 | Cross-attn [5] | In-context [6] |
| SmolVLA | Cross-attn | Cross-attn [7] |
| CogACT | In-context [8] | |
| MoLe-VLA | In-context [9] | |
| DeeR-VLA | In-context [10] | In-context |
| OTTER | Cross-attn [11] | Cross-attn |

Note: All models use adaLN-Zero for timestep conditioning.

  1. π0 uses blockwise causal attention: the VLM tokens form the first block, and the action tokens attend to all blocks. This preserves the VLM's pre-training distribution. The π-series VLAs use Gemma as the action expert rather than a DiT.
  2. π0 places the proprio state in a separate block between the VLM tokens and the action tokens, which enables KV caching during flow sampling.
  3. π0.5 uses blockwise causal attention: the VLM tokens and the proprio state tokens (encoded by the text tokenizer) share the first block, and the action tokens attend to all blocks.
  4. FLOWER projects the VLM latent tokens through a linear layer followed by RMSNorm before cross-attention.
  5. GR00T N1 uses middle-layer (12th-layer) LLM embeddings instead of final-layer embeddings for better downstream performance.
  6. GR00T N1 uses a per-embodiment MLP projection to handle heterogeneous robot morphologies.
  7. SmolVLA projects the proprio state to a single token with a linear layer; the action expert then cross-attends to it alongside the vision and language tokens from the VLM backbone.
  8. CogACT passes a special token into the action head as the conditioning token, on the assumption that it summarizes all vision and language information, and combines it with the noised action tokens.
  9. MoLe-VLA uses almost the same architecture as CogACT.
  10. DeeR-VLA does not use a DiT as its action head; it simply combines the pooled VLM features with the proprio state history.
  11. OTTER first selects text-aware vision tokens, combines them with the proprio state token (through an MLP), pools the result (mean pooling, I guess) into a single token, and then cross-attends between that token and the noised action tokens.
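The blockwise causal attention described in notes 1 and 2 can be sketched as a boolean mask: tokens attend bidirectionally within their own block and to every earlier block, but never to later ones. The function name and the toy block sizes below are assumptions for illustration, not taken from any released implementation.

```python
def blockwise_causal_mask(block_sizes):
    """Return mask[i][j] == True when token i may attend to token j.

    Tokens see everything in their own block and in all earlier blocks.
    """
    block_of = [b for b, size in enumerate(block_sizes) for _ in range(size)]
    n = len(block_of)
    return [[block_of[j] <= block_of[i] for j in range(n)] for i in range(n)]


# Toy layout in the spirit of π0: [VLM tokens, proprio state, action tokens].
mask = blockwise_causal_mask([2, 1, 2])
for row in mask:
    print("".join("x" if allowed else "." for allowed in row))
```

Because the rows for the VLM and proprio blocks never include an action column, their keys and values do not depend on the noised actions and can be computed once and reused at every flow sampling step.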

Special-token-bridged

Feature-pooling-bridged

Others

For Humanoid Robots