This paper makes two major contributions:
- Proposes a large-scale dataset containing 217 specific tasks, 87 skills, and 106 scenes. Most of the tasks involve dual-arm manipulation, dexterous hands, and collaboration.
- Proposes a hierarchical Vision-Language-Latent-Action (ViLLA) framework with three training stages: a latent action model trained with a VQ-VAE objective, a latent planner (predicting latent action tokens) built on a VLM backbone, and an action expert (decoding continuous actions); see the sketch after this list.
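
To make stage 1 concrete, below is a minimal PyTorch sketch of a latent action model trained with a VQ-VAE objective. Everything here is an illustrative assumption rather than the paper's implementation: the module names, dimensions, loss weights, and the flat feature vectors standing in for video frames are all hypothetical.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class LatentActionModel(nn.Module):
    """Stage-1 sketch: quantize the transition between two observations
    into a discrete latent action token (hypothetical shapes/sizes)."""
    def __init__(self, obs_dim=512, latent_dim=64, codebook_size=256):
        super().__init__()
        self.encoder = nn.Sequential(
            nn.Linear(2 * obs_dim, 256), nn.ReLU(), nn.Linear(256, latent_dim)
        )
        self.codebook = nn.Embedding(codebook_size, latent_dim)
        self.decoder = nn.Sequential(
            nn.Linear(obs_dim + latent_dim, 256), nn.ReLU(), nn.Linear(256, obs_dim)
        )

    def forward(self, obs_t, obs_next):
        # Encode the (o_t, o_{t+1}) transition into a continuous latent.
        z_e = self.encoder(torch.cat([obs_t, obs_next], dim=-1))
        # Vector quantization: snap to the nearest codebook entry.
        idx = torch.cdist(z_e, self.codebook.weight).argmin(dim=-1)
        z_q = self.codebook(idx)
        # Straight-through estimator so gradients reach the encoder.
        z_st = z_e + (z_q - z_e).detach()
        recon = self.decoder(torch.cat([obs_t, z_st], dim=-1))
        # VQ-VAE loss: reconstruction + codebook + commitment terms.
        loss = (
            F.mse_loss(recon, obs_next)
            + F.mse_loss(z_q, z_e.detach())
            + 0.25 * F.mse_loss(z_e, z_q.detach())  # 0.25 is an assumed weight
        )
        return loss, idx
```

The discrete indices `idx` are what the later stages would build on: stage 2 trains the VLM-backbone latent planner to predict them, and stage 3 trains the action expert to decode continuous actions conditioned on them.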
Some interesting things from the paper:
- A two-billion-parameter model scale has proven effective for robotic tasks.
- The action chunk size is set to 30; a chunking sketch follows below.
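
As a small illustration of that chunk size, here is a sketch of open-loop chunked execution, assuming generic `policy` and `env` interfaces (both hypothetical; the paper's actual inference scheme may differ, e.g. it may re-plan before a chunk finishes):

```python
import torch

CHUNK_SIZE = 30  # chunk size reported in the paper

@torch.no_grad()
def rollout(policy, env, obs, horizon=300):
    """Run the policy for `horizon` steps, 30 actions per forward pass."""
    for _ in range(horizon // CHUNK_SIZE):
        chunk = policy(obs)       # assumed shape: (CHUNK_SIZE, action_dim)
        for action in chunk:      # execute the whole chunk open-loop
            obs = env.step(action)
    return obs
```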