Several limitation of current VLAs:
-
The large model size of VLAs makes achieving efficient real-time performance difficult.
-
The end-to-end fine-tuning of pre-trained VLAs on embodied data is challenging due to domain shift and catastrophic forgetting.
What is dual-system VLAs:
From LLMs to Actions: Latent Codes as Bridges in Hierarchical Robot Control first proposed the dual-system VLA structure, which consists of two system: system 1 closely resembles traditional lightweight policy networks, which are efficient but often task-specific, and system 2 is computationally heavy but offer superior generalization capabilities.
Key design of dual-system VLAs:
-
MLLM selection. What kind of MLLM model is lightweight enough yet sufficient to complete robotic tasks?
-
Policy selection. Current general policy models are based on DiT structure and Flow Matching structure. However, dense policy and other new architectures, downstream small models may also see new designs.
-
Latent feature representation selection. Some are choosing last layers hidden embeddings, some are choosing selected hidden embeddings from middle layers. Some are useing maxpooling of the last layer embeddings. Some are introducing special token, such as <ACT> or <Cognition> token.