This appears to be the first paper to propose a specialized action module conditioned on VLM output, rather than directly repurposing the VLM for action prediction via simple action quantization. The authors employ a diffusion transformer (DiT) as the action module, conditioned on the VLM's output through an attention mechanism.
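The conditioning idea can be sketched as cross-attention in which the action module's tokens query the VLM's output tokens. This is a minimal NumPy sketch, not the paper's implementation: the single attention head, the function names, and the toy shapes (8 action tokens, 32 VLM tokens, model dimension 16) are all illustrative assumptions.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def cross_attend(action_tokens, vlm_tokens, Wq, Wk, Wv):
    """Single-head cross-attention: action tokens (queries) attend to
    VLM output tokens (keys/values), injecting the VLM representation
    into the action module."""
    q = action_tokens @ Wq            # queries from the action stream
    k = vlm_tokens @ Wk               # keys from the VLM output
    v = vlm_tokens @ Wv               # values from the VLM output
    scores = q @ k.T / np.sqrt(q.shape[-1])
    return softmax(scores) @ v        # each action token is a convex mix of VLM values

# Toy example: 8 action tokens, 32 VLM tokens, model dim 16 (hypothetical sizes).
rng = np.random.default_rng(0)
d = 16
action_tokens = rng.standard_normal((8, d))   # e.g. noisy action embeddings in a DiT block
vlm_tokens = rng.standard_normal((32, d))     # e.g. VLM output features
Wq, Wk, Wv = (rng.standard_normal((d, d)) / np.sqrt(d) for _ in range(3))
out = cross_attend(action_tokens, vlm_tokens, Wq, Wk, Wv)
print(out.shape)  # (8, 16): one conditioned feature per action token
```

In a DiT-style action module, a block like this would sit alongside self-attention over the action tokens and run at every denoising step, so the VLM forward pass can be done once while the action head iterates.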