AgiBot World Colosseo: Large-scale Manipulation Platform for Scalable and Intelligent Embodied Systems
This paper has two major contributions:
- Proposes a large dataset containing 217 specific tasks, 87 skills, and 106 scenes. Most of the tasks involve dual-arm manipulation, dexterous hands, or collaboration.
- Proposes a hierarchical Vision-Language-Latent-Action (ViLLA) framework with three training stages: a latent action model trained with a VQ-VAE objective, a latent planner (which predicts latent actions) built on a VLM backbone, and an action expert (which outputs the actual robot actions).
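
To make the VQ-VAE objective in the first stage more concrete, here is a minimal sketch of the quantization step that any VQ-VAE relies on: a continuous encoder output is snapped to its nearest codebook entry, and the index of that entry serves as the discrete latent action token. The codebook values, its size, and the function names are hypothetical toy choices for illustration, not taken from the paper.

```python
from typing import List, Sequence, Tuple


def squared_distance(a: Sequence[float], b: Sequence[float]) -> float:
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def quantize(z: Sequence[float], codebook: List[List[float]]) -> Tuple[int, List[float]]:
    """Return (index, code vector) of the codebook entry nearest to z."""
    index = min(range(len(codebook)), key=lambda i: squared_distance(z, codebook[i]))
    return index, codebook[index]


# Hypothetical 2-D codebook with four entries (toy values, not from the paper).
CODEBOOK = [[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]

# A continuous encoder output is mapped to a discrete token (the index).
index, code = quantize([0.9, 0.2], CODEBOOK)
print(index, code)
```

In a setup like the one described above, the latent planner would then be trained to predict such token indices rather than raw robot actions; this sketch leaves out the encoder, the decoder, and the training losses.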
Some interesting things from the paper:
- A model at the two-billion-parameter scale proved effective for robotic tasks.
- The action chunk size is set to 30.
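
As a rough illustration of what a chunk size of 30 means in a control loop, the sketch below queries a policy once per chunk and executes the returned actions before querying again. The `rollout` and `dummy_policy` names are hypothetical, and running each chunk to completion before replanning is an assumption made for simplicity, not something these notes confirm about the paper.

```python
from typing import Callable, List, Tuple

CHUNK_SIZE = 30  # chunk length reported in the notes above


def rollout(
    policy: Callable[[int, int], List[int]],
    total_steps: int,
    chunk_size: int = CHUNK_SIZE,
) -> Tuple[List[int], int]:
    """Execute total_steps actions, querying the policy once per chunk.

    Returns the executed actions and the number of policy calls.
    """
    executed: List[int] = []
    policy_calls = 0
    while len(executed) < total_steps:
        chunk = policy(len(executed), chunk_size)
        policy_calls += 1
        # Truncate the final chunk so we never exceed total_steps.
        executed.extend(chunk[: total_steps - len(executed)])
    return executed, policy_calls


def dummy_policy(start_step: int, chunk_size: int) -> List[int]:
    """Hypothetical stand-in policy: each 'action' is just its step index."""
    return [start_step + i for i in range(chunk_size)]


actions, calls = rollout(dummy_policy, total_steps=75)
print(len(actions), calls)
```

With 75 steps and chunks of 30, the policy is queried three times instead of 75, which is the main practical appeal of chunking: fewer expensive model calls per executed action.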