Vision-Language-Action Models
From RT-2 (Google DeepMind):
We propose to co-fine-tune state-of-the-art vision-language models on both robotic trajectory data and Internet-scale vision-language tasks, such as visual question answering. In contrast to other approaches, we propose a simple, general recipe to achieve this goal: in order to fit both natural language responses and robotic actions into the same format, we express the actions as text tokens and incorporate them directly into the training set of the model in the same way as natural language tokens. We refer to such category of models as vision-language-action models (VLA) and instantiate an example of such a model, which we call RT-2.
Vision-Language-Action models typically utilize pre-trained vision-language models (VLMs) as their base and are fine-tuned to predict robotic actions. The models that do not rely on pre-trained VLMs are referred to here as Robotic Foundation Models.
VLA Models
We categorize current VLA architectures into two broad families:
- Single‑system VLAs. A single vision–language model handles perception, reasoning, and action prediction. Continuous motor commands are first discretized by a learned action tokenizer (derived from the text tokenizer), and the model simply generates these tokens in sequence.
- Dual‑system VLAs. High-level understanding and low-level control are split. A vision–language backbone encodes the current images and instruction, while a dedicated action‑policy network produces continuous commands from those embeddings.
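As a rough illustration of how a single-system VLA can turn continuous commands into tokens, the sketch below uses plain uniform binning as a stand-in for the action tokenizer. The bin count, the 7-DoF bounds, and the function names are assumptions made for this example, not details taken from any specific model.

```python
import numpy as np

N_BINS = 256  # assumption: number of bins per action dimension


def discretize(action, low, high, n_bins=N_BINS):
    """Map continuous actions in [low, high] to integer bin ids in [0, n_bins - 1]."""
    action = np.clip(action, low, high)
    scaled = (action - low) / (high - low)  # normalise to [0, 1]
    return np.minimum((scaled * n_bins).astype(np.int64), n_bins - 1)


def undiscretize(tokens, low, high, n_bins=N_BINS):
    """Map bin ids back to the centre of each bin."""
    return low + (tokens + 0.5) / n_bins * (high - low)


# Hypothetical 7-DoF action with symmetric bounds.
low, high = np.full(7, -1.0), np.full(7, 1.0)
action = np.array([0.0, 0.5, -1.0, 1.0, 0.25, -0.3, 0.9])

tokens = discretize(action, low, high)       # one token id per action dimension
recovered = undiscretize(tokens, low, high)  # what the robot would execute
```

Each bin id would then be mapped to an entry in the language model's vocabulary, so the model can emit actions with the same decoding loop it uses for text. The round trip loses at most half a bin width per dimension.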
Two principal mechanisms are used to connect the vision–language backbone to the action head:
- Special-token-bridged. During VLM fine‑tuning, a reserved token such as <ACT> is appended; its final embedding is passed directly to the policy.
- Feature-pooling-bridged. The full sequence of hidden states is aggregated—via max-pooling, mean‑pooling, or learned attention—to yield a compact feature vector fed to the policy.
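The two bridging mechanisms differ only in how they reduce the backbone's hidden states to a conditioning vector. A minimal NumPy sketch follows, with random arrays standing in for real VLM outputs; the sizes, the position of the reserved token, and the random query vector are all assumptions for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
seq_len, hidden = 16, 32                            # hypothetical sizes
hidden_states = rng.normal(size=(seq_len, hidden))  # stand-in for the VLM's final hidden states
act_index = seq_len - 1                             # assumed position of the appended <ACT> token

# Special-token bridge: take the embedding at the <ACT> position.
special_token_feature = hidden_states[act_index]

# Feature-pooling bridges: aggregate over the whole sequence.
mean_feature = hidden_states.mean(axis=0)
max_feature = hidden_states.max(axis=0)

# Attention pooling: a query vector (random here, learned in practice) scores each token.
query = rng.normal(size=hidden)
scores = hidden_states @ query / np.sqrt(hidden)
weights = np.exp(scores - scores.max())
weights /= weights.sum()
attn_feature = weights @ hidden_states
```

All four results have shape `(hidden,)`, so the downstream policy sees the same interface regardless of which bridge is used.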
Single-system VLA
- LoHoVLA: A Unified Vision-Language-Action Model for Long-Horizon Embodied Tasks, arxiv, May 31 2025. [Paper]
- UniVLA: Learning to Act Anywhere with Task-centric Latent Actions, The University of Hong Kong, arxiv, May 9 2025. [Paper] [Code]
- 3D-CAVLA: 3D-CAVLA: Leveraging Depth and 3D Context to Generalize Vision–Language Action Models for Unseen Tasks, New York University, arxiv, May 9 2025. [Paper] [Website]
- NORA: NORA: A Small Open-Sourced Generalist Vision Language Action Model for Embodied Tasks, Singapore University of Technology and Design, arxiv, Apr 28 2025. [Paper]
- CoT-VLA: CoT-VLA: Visual Chain-of-Thought Reasoning for Vision-Language-Action Models, NVIDIA & Stanford, arxiv, Mar 27 2025. [Paper] [Website]
- PD-VLA: Accelerating Vision-Language-Action Model Integrated with Action Chunking via Parallel Decoding, HKUST (GZ), arxiv, Mar 4 2025. [Paper]
- VLAS: VLAS: Vision-Language-Action Model With Speech Instructions For Customized Robot Manipulation, Westlake University, arxiv, Feb 21 2025. [Paper] [Code]
- VLA-Cache: Towards Efficient Vision-Language-Action Model via Adaptive Token Caching in Robotic Manipulation, University of Sydney, arxiv, Feb 4 2025. [Paper]
- SpatialVLA: SpatialVLA: Exploring Spatial Representations for Visual-Language-Action Models, Shanghai AI Lab, arxiv, Jan 28 2025. [Paper] [Website] [Code] [Model]
- TraceVLA: TraceVLA: Visual Trace Prompting Enhances Spatial-Temporal Awareness for Generalist Robotic Policies, University of Maryland, arxiv, Dec 25 2024. [Paper]
- OpenVLA: OpenVLA: An Open-Source Vision-Language-Action Model, Stanford University & UC Berkeley & Toyota Research Institute, arxiv, Jun 13 2024. [Website] [Paper] [Code] [Model]
- RT-2: RT-2: Vision-Language-Action Models Transfer Web Knowledge to Robotic Control, Google DeepMind, July 28 2023.
Dual-system VLA
Conditioning Strategy Comparison
How are the vision-language features and proprioceptive (proprio) states connected to the action head?
See Scalable Diffusion Models with Transformers (the DiT paper) for more details on the cross-attention, adaLN, and in-context conditioning variants.
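To make the three terms concrete, here is a heavily simplified NumPy sketch of each conditioning style applied to a few noised action tokens. It omits the learned query/key/value projections, multi-head structure, and MLP blocks of a real DiT layer; the sizes and the random projection `W` are assumptions for illustration only.

```python
import numpy as np

rng = np.random.default_rng(0)
d = 8
x = rng.normal(size=(4, d))  # 4 noised action tokens
c = rng.normal(size=(6, d))  # 6 conditioning tokens (e.g. VLM features)


def softmax(z):
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)


def attend(q, kv):
    """Single-head attention without learned projections (simplification)."""
    return softmax(q @ kv.T / np.sqrt(q.shape[-1])) @ kv


def layer_norm(h):
    return (h - h.mean(-1, keepdims=True)) / np.sqrt(h.var(-1, keepdims=True) + 1e-6)


# Cross-attention: action tokens are queries, conditioning tokens are keys/values.
cross = x + attend(x, c)

# In-context: conditioning tokens are prepended and processed by self-attention;
# only the action positions are kept afterwards.
seq = np.concatenate([c, x], axis=0)
in_context = (seq + attend(seq, seq))[len(c):]

# adaLN: the condition is reduced to one vector that regresses a scale and shift
# applied after layer normalisation.
W = rng.normal(size=(d, 2 * d)) * 0.1  # random stand-in for a learned projection
scale, shift = np.split(c.mean(axis=0) @ W, 2)
ada_ln = layer_norm(x) * (1 + scale) + shift
```

Cross-attention and in-context conditioning keep the full token sequence of the condition, while adaLN compresses it to a single vector, which is one reason low-dimensional inputs such as proprio state are often routed through adaLN.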
| Model | VLM Features | Proprio State |
|---|---|---|
| π0 | Cross-attn [1] | Cross-attn [2] |
| π0.5 | Cross-attn [3] | Cross-attn |
| ManiFlow | Cross-attn | adaLN |
| FLOWER | Cross-attn [4] | adaLN |
| GR00T N1 | Cross-attn [5] | In-context [6] |
| SmolVLA | Cross-attn | Cross-attn [7] |
| CogACT | In-context [8] | N/A |
| MoLe-VLA | In-context [9] | N/A |
| DeeR-VLA | In-context [10] | In-context |
| OTTER | Cross-attn [11] | Cross-attn |
Note: All models use adaLN-Zero for timestep conditioning.
1. π0 uses blockwise causal attention: VLM tokens form the first block, and action tokens attend to all blocks. This preserves the VLM's pre-training distribution. The π-series VLA models use a Gemma-based action expert rather than a DiT.
2. π0 places the proprio state in a separate block between the VLM and action tokens, enabling KV caching during flow sampling.
3. π0.5 uses blockwise causal attention: VLM tokens and proprio state tokens (encoded with the text tokenizer) share the first block, and action tokens attend to all blocks.
4. FLOWER projects VLM latent tokens through a linear layer followed by RMSNorm before cross-attention.
5. GR00T N1 uses middle-layer (12th) LLM embeddings instead of the final layer for better downstream performance.
6. GR00T N1 uses a per-embodiment MLP projection to handle heterogeneous robot morphologies.
7. SmolVLA projects the proprio state to a single token via a linear layer and cross-attends it with the vision and language tokens in the VLM backbone.
8. CogACT feeds a special token, assumed to contain all the vision and language information, into the action head as the conditioning token, where it is combined with the noised action tokens.
9. MoLe-VLA uses almost the same architecture as CogACT.
10. DeeR-VLA does not use a DiT action head; it simply combines the pooled VLM features with the proprio state history.
11. OTTER first selects text-aware vision tokens, combines them with the proprio state token (via an MLP), pools the result (presumably mean pooling) into a single token, and then cross-attends it with the noised action tokens.
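The blockwise causal attention mentioned in the π0 and π0.5 notes can be sketched as a boolean mask in which tokens attend freely within their own block and to every earlier block, but never to later ones. The helper name and block sizes below are hypothetical.

```python
import numpy as np


def blockwise_causal_mask(block_sizes):
    """Return mask[i, j] == True when token i may attend to token j.

    Attention is bidirectional inside a block and causal across blocks.
    """
    block_id = np.repeat(np.arange(len(block_sizes)), block_sizes)
    return block_id[:, None] >= block_id[None, :]


# Hypothetical sizes: 3 VLM tokens, 1 proprio token, 2 action tokens.
mask = blockwise_causal_mask([3, 1, 2])
print(mask.astype(int))
```

Because the VLM block never attends to the proprio or action blocks, its keys and values do not depend on the noised actions and can be computed once and cached across flow-sampling steps.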
Special-token-bridged
- OpenHelix: OpenHelix: A Short Survey, Empirical Analysis, and Open-Source Dual-System VLA Model for Robotic Manipulation, Westlake University, arxiv, May 6 2025. [Paper]
- MoLe-VLA: MoLe-VLA: Dynamic Layer-skipping Vision Language Action Model via Mixture-of-Layers for Efficient Robot Manipulation, Nanjing University & HK PolyU & Peking University, Mar 26 2025. [Paper] [Website] [Code]
- FuSe: Beyond Sight: Finetuning Generalist Robot Policies with Heterogeneous Sensors via Language Grounding, UC Berkeley, arxiv, Jan 8 2025. [Website] [Paper] [Code] [Model]
- Diffusion-VLA: Diffusion-VLA: Scaling Robot Foundation Models via Unified Diffusion and Autoregression, East China Normal University, arxiv, Dec 4 2024. [Website] [Paper]
- CogACT: CogACT: A Foundational Vision-Language-Action Model for Synergizing Cognition and Action in Robotic Manipulation, Tsinghua University, arxiv, Nov 29 2024. [Paper] [Website] [Code] [Model]
Feature-pooling-bridged
- SmolVLA: SmolVLA: A Vision-Language-Action Model for Affordable and Efficient Robotics, Hugging Face, arxiv, Jun 4 2025. [Paper] [Website] [Model]
- $\pi_{0.5}$: $\pi_{0.5}$: a Vision-Language-Action Model with Open-World Generalization, Physical Intelligence, arxiv, Apr 22 2025. [Paper] [Website]
- Hi Robot: Hi Robot: Open-Ended Instruction Following with Hierarchical Vision-Language-Action Models, Physical Intelligence & Stanford University, arxiv, Feb 26 2025. [Paper] [Website]
- ChatVLA: ChatVLA: Unified Multimodal Understanding and Robot Control with Vision-Language-Action Model, Midea Group & East China Normal University, arxiv, Feb 21 2025. [Paper] [Website]
- DexVLA: DexVLA: Vision-Language Model with Plug-In Diffusion Expert for General Robot Control, Midea Group & East China Normal University, arxiv, Feb 9 2025. [Paper] [Website] [Code]
- UP-VLA: A Unified Understanding and Prediction Model for Embodied Agent, Tsinghua University & Shanghai Qi Zhi Institute, arxiv, Feb 3 2025. [Paper]
- iRe-VLA: Improving Vision-Language-Action Model with Online Reinforcement Learning, Tsinghua University & Shanghai Qi Zhi Institute, arxiv, Jan 28 2025. [Paper]
- FAST: FAST: Efficient Action Tokenization for Vision-Language-Action Models, Physical Intelligence & UC Berkeley & Stanford, arxiv, Jan 16 2025. [Website] [Paper] [Tokenizer] [Code]
- $\pi_0$: $\pi_0$: A Vision-Language-Action Flow Model for General Robot Control, Physical Intelligence, arxiv, Oct 31 2024. [Website] [Paper] [Code]
- DeeR-VLA: DeeR-VLA: Dynamic Inference of Multimodal Large Language Models for Efficient Robot Execution, Tsinghua University, NeurIPS 2024. [Paper] [Website] [Code]
Others
- Knowledge Insulating Vision-Language-Action Models: Train Fast, Run Fast, Generalize Better, Physical Intelligence, May 29 2025. [Paper] [Website]
- OFT: Fine-Tuning Vision-Language-Action Models: Optimizing Speed and Success, Stanford, arxiv, Apr 28 2025. [Paper] [Website] [Code] [Model]
- HybridVLA: HybridVLA: Collaborative Diffusion and Autoregression in a Unified Vision-Language-Action Model, Peking University, Mar 13 2025. [Paper] [Website] [Code]
For Humanoid Robots
- GR00T N1: GR00T N1: An Open Foundation Model for Generalist Humanoid Robots, NVIDIA, Mar 27 2025. [Paper] [Website] [Code] [Dataset]
- NaVILA: NaVILA: Legged Robot Vision-Language-Action Model for Navigation, UC San Diego, arxiv, Dec 5 2024. [Website] [Paper]
- Humanoid-VLA: Humanoid-VLA: Towards Universal Humanoid Control with Visual Integration, Westlake University & Zhejiang University, arxiv, Feb 21 2025. [Paper]
- GO-1: AgiBot World Colosseo: Large-scale Manipulation Platform for Scalable and Intelligent Embodied Systems, AgiBot-World (Shanghai AI Lab & AgiBot Inc.), AgiBot World, Mar 10 2025. [Paper] [Website] [Code] [Model]