From RT-2 (Google DeepMind):

We propose to co-fine-tune state-of-the-art vision-language models on both robotic trajectory data and Internet-scale vision-language tasks, such as visual question answering. In contrast to other approaches, we propose a simple, general recipe to achieve this goal: in order to fit both natural language responses and robotic actions into the same format, we express the actions as text tokens and incorporate them directly into the training set of the model in the same way as natural language tokens. We refer to such category of models as vision-language-action models (VLA) and instantiate an example of such a model, which we call RT-2.
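To make the recipe concrete: RT-2 (inheriting RT-1's scheme) discretizes each dimension of the robot action into 256 uniform bins and writes the bin indices as plain number tokens that the VLM's text tokenizer already covers. The sketch below illustrates that round trip; the function names and the [-1, 1] action bounds are assumptions for the example, not the paper's exact interface.

```python
import numpy as np

NUM_BINS = 256  # RT-2 discretizes each action dimension into 256 uniform bins

def action_to_tokens(action, low, high):
    """Map a continuous action vector to a string of bin-index tokens.
    Hypothetical helper: indices are emitted as ordinary number text."""
    action = np.clip(action, low, high)
    bins = np.round((action - low) / (high - low) * (NUM_BINS - 1)).astype(int)
    return " ".join(str(b) for b in bins)

def tokens_to_action(token_str, low, high):
    """Inverse mapping: decode generated tokens back to continuous commands."""
    bins = np.array([float(t) for t in token_str.split()])
    return low + bins / (NUM_BINS - 1) * (high - low)

# Example: a 7-DoF action (xyz delta, rpy delta, gripper), bounds assumed [-1, 1]
low, high = np.full(7, -1.0), np.full(7, 1.0)
tokens = action_to_tokens(np.array([0.1, -0.2, 0.0, 0.3, 0.0, 0.0, 1.0]), low, high)
print(tokens)                               # e.g. "140 102 128 166 128 128 255"
print(tokens_to_action(tokens, low, high))  # approximately recovers the action
```

Because the action vocabulary is just text, a single next-token objective covers both VQA answers and motor commands, which is what makes co-fine-tuning on the two data sources possible.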

Vision-Language-Action models typically use a pre-trained vision-language model (VLM) as their base and are fine-tuned to predict robotic actions. Models that do not build on a pre-trained VLM are referred to here as Robotic Foundation Models.

VLA Models

We categorize current VLA architectures into two broad families:

  • Single‑system VLAs. A single vision–language model handles perception, reasoning, and action prediction. Continuous motor commands are first discretized by a learned action tokenizer (derived from the text tokenizer), and the model simply generates these tokens in sequence, as in the RT-2 sketch above.

  • Dual‑system VLAs. High‑level understanding and low‑level control are split. A vision–language backbone encodes the current images and instruction, while a dedicated action‑policy network produces continuous commands from those embeddings (a minimal sketch of this pattern follows the list).
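To make the split concrete, here is a minimal sketch of the dual-system pattern, assuming a backbone callable that returns per-token hidden states. The class name, layer sizes, action dimensionality, and the choice of the last-token embedding are illustrative assumptions, not any specific paper's design.

```python
import torch
import torch.nn as nn

class DualSystemVLA(nn.Module):
    """Illustrative dual-system VLA: a VLM backbone supplies embeddings,
    and a separate action head regresses continuous motor commands."""

    def __init__(self, vlm_backbone, hidden_dim=4096, action_dim=7):
        super().__init__()
        self.backbone = vlm_backbone       # any VLM returning (B, T, hidden_dim)
        self.action_head = nn.Sequential(  # dedicated action-policy network
            nn.Linear(hidden_dim, 512),
            nn.ReLU(),
            nn.Linear(512, action_dim),    # continuous command, not tokens
        )

    def forward(self, images, instruction_ids):
        hidden = self.backbone(images, instruction_ids)  # (B, T, hidden_dim)
        summary = hidden[:, -1, :]         # one of several ways to summarize
        return self.action_head(summary)   # (B, action_dim)

# Stand-in backbone so the sketch runs end to end
dummy_backbone = lambda imgs, ids: torch.randn(2, 10, 4096)
policy = DualSystemVLA(dummy_backbone)
print(policy(None, None).shape)            # torch.Size([2, 7])
```

Note that the action head outputs real-valued commands directly, so no action tokenizer is needed; the paragraph below describes the two common ways the backbone's features reach that head.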

Two principal mechanisms are used to connect the vision–language backbone to the action head:

  • Special‑token-bridged. During VLM fine‑tuning, a reserved token such as <ACT> is appended; its final embedding is passed directly to the policy.

  • Feature‑pooling-bridged. The full sequence of hidden states is aggregated (via max‑pooling, mean‑pooling, or learned attention) to yield a compact feature vector fed to the policy. Both bridging variants are sketched below.
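Both bridges reduce the backbone's (batch, tokens, dim) hidden states to one feature vector per example; they differ only in where that vector comes from. Below is a minimal sketch of the two variants, assuming exactly one <ACT> token per sequence and using mean-pooling to stand in for the pooling family; the function name and two-branch layout are illustrative.

```python
import torch

def bridge_features(hidden_states, input_ids, act_token_id, mode="special_token"):
    """Reduce VLM hidden states (B, T, D) to policy features (B, D)."""
    if mode == "special_token":
        # Special-token bridge: read the embedding at the reserved <ACT>
        # position (assumed to occur exactly once per sequence)
        act_pos = (input_ids == act_token_id).long().argmax(dim=1)  # (B,)
        batch = torch.arange(hidden_states.size(0))
        return hidden_states[batch, act_pos]
    # Feature-pooling bridge: aggregate every hidden state (mean here;
    # max-pooling or learned attention are common alternatives)
    return hidden_states.mean(dim=1)

# Usage: plant one <ACT> token per sequence, then extract its embedding
ids = torch.full((2, 12), 5)
ids[:, -1] = 999                                   # 999 stands in for <ACT>
feats = bridge_features(torch.randn(2, 12, 768), ids, 999)
print(feats.shape)                                 # torch.Size([2, 768])
```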

Single-system VLA

Dual-system VLA

Special-token-bridged

Feature-pooling-bridged

Others

For Humanoid Robots