HybridVLA: Collaborative Diffusion and Autoregression in a Unified Vision-Language-Action Model

Jun 05, 2025

VLA

Although existing autoregressive VLA methods (e.g., VLAS: Vision-Language-Action Model With Speech Instructions For Customized Robot Manipulation) leverage large-scale pretrained knowledge, they disrupt the continuity of actions. Meanwhile, some VLA methods (e.g., Diffusion-VLA: Scaling Robot Foundation Models via Unified Diffusion and Autoregression) add a diffusion head to predict continuous actions, but that head relies solely on VLM-extracted features, which limits its reasoning capabilities. This paper introduces HybridVLA, which integrates the strengths of both autoregressive and diffusion policies within a single large language model, rather than attaching a diffusion policy as an additional head.
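
To make "disrupt the continuity of actions" concrete, here is a minimal sketch of the uniform binning that autoregressive policies commonly use to turn a continuous action into a token. The bin count (256), the action range [-1, 1], and the helper names `discretize` and `undiscretize` are illustrative assumptions, not details taken from the HybridVLA paper.

```python
# Illustrative sketch only: uniform action binning with assumed settings.
# n_bins=256 and the range [-1, 1] are hypothetical, not from the paper.

def discretize(action, low=-1.0, high=1.0, n_bins=256):
    """Map a continuous action value to an integer token id in [0, n_bins - 1]."""
    clipped = min(max(action, low), high)
    width = (high - low) / n_bins
    return min(int((clipped - low) / width), n_bins - 1)


def undiscretize(token, low=-1.0, high=1.0, n_bins=256):
    """Map a token id back to the center of its bin."""
    width = (high - low) / n_bins
    return low + (token + 0.5) * width


# A smooth, slowly changing trajectory for one action dimension.
trajectory = [0.1000, 0.1010, 0.1020, 0.1030]
tokens = [discretize(a) for a in trajectory]
recovered = [undiscretize(t) for t in tokens]

# Rounding error introduced by tokenization, at most half a bin width.
max_error = max(abs(a - r) for a, r in zip(trajectory, recovered))
```

Several distinct values in the smooth trajectory collapse onto the same token, so the recovered actions move in steps. A diffusion policy predicts the continuous values directly and avoids this rounding.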

