This paper introduces explicit visual chain-of-thought (CoT) reasoning into vision-language-action models: the model autoregressively predicts future image frames as visual subgoals, then generates a short action sequence to achieve those subgoals.
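The two-stage decoding can be pictured as a single closed-loop step: first decode the visual-subgoal tokens, then decode the action tokens conditioned on them. The sketch below assumes a hypothetical decoder-only interface `model(tokens) -> logits` over a unified vocabulary of visual and discretized action tokens; the constants and names are illustrative, not taken from the paper's released code.

```python
# Minimal sketch of one visual chain-of-thought step (assumed interface:
# model(tokens) returns [batch, seq_len, vocab] logits).
import torch

NUM_IMAGE_TOKENS = 256      # tokens per subgoal frame (assumed)
ACTION_CHUNK_SIZE = 10      # matches the chunk size noted below
ACTIONS_PER_STEP = 7        # e.g. 6-DoF pose + gripper, discretized (assumed)

@torch.no_grad()
def visual_cot_step(model, prompt_tokens):
    """One closed-loop step: predict a subgoal image, then an action chunk."""
    tokens = prompt_tokens.clone()

    # Stage 1: autoregressively decode the visual subgoal (image tokens).
    for _ in range(NUM_IMAGE_TOKENS):
        logits = model(tokens)[:, -1]              # next-token logits
        next_tok = logits.argmax(-1, keepdim=True)
        tokens = torch.cat([tokens, next_tok], dim=1)
    subgoal_tokens = tokens[:, -NUM_IMAGE_TOKENS:]

    # Stage 2: decode a short action chunk conditioned on prompt + subgoal.
    for _ in range(ACTION_CHUNK_SIZE * ACTIONS_PER_STEP):
        logits = model(tokens)[:, -1]
        next_tok = logits.argmax(-1, keepdim=True)
        tokens = torch.cat([tokens, next_tok], dim=1)
    action_tokens = tokens[:, -(ACTION_CHUNK_SIZE * ACTIONS_PER_STEP):]

    return subgoal_tokens, action_tokens
```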
Training
- Pretrain the base 7B VILA-U model on robot demonstrations and action-less videos. Three components are optimized during training: the LLM backbone, the projector, and the depth transformer; the vision tower is kept frozen (see the training-setup sketch after this list).
- The adaptation phase for downstream closed-loop deployment uses task-specific robot demonstrations and is trained with the same configuration as the pretraining phase.
- The VILA-U model natively has the ability to generate images, which is what makes it suitable for predicting visual subgoals.
- Action chunk size is set to 10. The learning rate is set to , with a cosine-decay scheduler. The global batch size is 2048, and training runs for 10 epochs.
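A minimal sketch of the training setup described in the bullets above, assuming the VILA-U modules are exposed under the attribute names `vision_tower`, `llm`, `projector`, and `depth_transformer`; these names, and the placeholder learning rate, are assumptions since the notes do not record the exact values or module paths.

```python
# Sketch of the trainable/frozen split and optimizer/scheduler configuration.
import torch
from torch.optim.lr_scheduler import CosineAnnealingLR

def build_optimizer(model, steps_per_epoch, base_lr=1e-4, epochs=10):
    # Freeze the vision tower; train LLM backbone, projector, depth transformer.
    for p in model.vision_tower.parameters():
        p.requires_grad_(False)
    trainable = [
        p for module in (model.llm, model.projector, model.depth_transformer)
        for p in module.parameters()
    ]
    # base_lr is a placeholder: the actual value is not given in these notes.
    optimizer = torch.optim.AdamW(trainable, lr=base_lr)
    # Cosine decay over the full run (10 epochs at global batch size 2048).
    scheduler = CosineAnnealingLR(optimizer, T_max=epochs * steps_per_epoch)
    return optimizer, scheduler
```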
Improvements
- It might not be easy for the VLM to generate subgoal images directly, and doing so may add a large training burden.
- Is one subgoal image enough? It might be useful to predict subgoal chunks instead; this would of course add much more compute burden, but it should also provide stronger guidance (a rough sketch of this variant follows at the end of this section).
- “Think visually” means the model first generates a subgoal image, then generates actions conditioned on that subgoal image. Perhaps “thinking visually” is not an efficient way to reason?
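To make the subgoal-chunk idea concrete, the sketch below generalizes the earlier decoding loop to K subgoal frames before the action chunk. The interface and all constants remain hypothetical; the extra K * num_image_tokens decoded tokens per step are where the added burden comes from.

```python
# Sketch of the speculative "subgoal chunk" variant: decode K future frames
# before the action chunk instead of a single subgoal image.
# Assumed interface: model(tokens) -> [batch, seq_len, vocab] logits.
import torch

@torch.no_grad()
def greedy_decode(model, tokens, n):
    """Append n greedily decoded tokens to the context."""
    for _ in range(n):
        next_tok = model(tokens)[:, -1].argmax(-1, keepdim=True)
        tokens = torch.cat([tokens, next_tok], dim=1)
    return tokens

@torch.no_grad()
def subgoal_chunk_step(model, prompt_tokens, k=4,
                       num_image_tokens=256, chunk_size=10, action_dim=7):
    # Stage 1: K visual subgoals instead of one (k * num_image_tokens tokens).
    tokens = greedy_decode(model, prompt_tokens, k * num_image_tokens)
    subgoal_tokens = tokens[:, -k * num_image_tokens:]
    # Stage 2: one action chunk conditioned on the whole subgoal chunk.
    tokens = greedy_decode(model, tokens, chunk_size * action_dim)
    action_tokens = tokens[:, -chunk_size * action_dim:]
    return subgoal_tokens, action_tokens
```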