This paper makes two major contributions:

  1. Proposes a large-scale dataset covering 217 specific tasks, 87 skills, and 106 scenes. Most tasks involve dual-arm manipulation and dexterous hands, and many are collaborative.

  2. Proposes a hierarchical Vision-Language-Latent-Action (ViLLA) framework with three training stages: a latent action model trained with a VQ-VAE objective to extract latent action tokens, a latent planner built on a VLM backbone that predicts those latent action tokens, and an action expert that decodes the final low-level actions (a minimal sketch follows this list).

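Since the three-stage pipeline is the paper's core idea, below is a minimal PyTorch sketch of how the stages could fit together. All class names, dimensions, and loss weights here are illustrative assumptions of mine, not the paper's architecture; the point is only the data flow: a VQ-VAE extracts discrete latent action tokens from frame pairs, a planner (standing in for the VLM backbone) predicts those tokens, and an action expert decodes a chunk of low-level actions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class VectorQuantizer(nn.Module):
    """Nearest-codebook lookup with a straight-through gradient estimator."""
    def __init__(self, num_codes=32, dim=64):
        super().__init__()
        self.codebook = nn.Embedding(num_codes, dim)

    def forward(self, z):                             # z: (B, dim)
        dists = torch.cdist(z, self.codebook.weight)  # (B, num_codes)
        idx = dists.argmin(-1)                        # discrete latent action tokens
        zq = self.codebook(idx)
        zq = z + (zq - z).detach()                    # straight-through estimator
        commit = F.mse_loss(z, zq.detach())           # commitment term
        return zq, idx, commit

class LatentActionModel(nn.Module):
    """Stage 1: VQ-VAE turning (frame_t, frame_t+k) pairs into latent actions."""
    def __init__(self, obs_dim=128, dim=64):
        super().__init__()
        self.enc = nn.Linear(2 * obs_dim, dim)
        self.vq = VectorQuantizer(dim=dim)
        self.dec = nn.Linear(dim + obs_dim, obs_dim)  # reconstruct the future frame

    def forward(self, frame_t, frame_tk):
        z = self.enc(torch.cat([frame_t, frame_tk], -1))
        zq, idx, commit = self.vq(z)
        recon = self.dec(torch.cat([zq, frame_t], -1))
        return F.mse_loss(recon, frame_tk) + 0.25 * commit, idx

class LatentPlanner(nn.Module):
    """Stage 2: stand-in for the VLM backbone; classifies the latent token."""
    def __init__(self, ctx_dim=128, num_codes=32):
        super().__init__()
        self.head = nn.Linear(ctx_dim, num_codes)

    def forward(self, ctx, target_idx):
        logits = self.head(ctx)
        return F.cross_entropy(logits, target_idx), logits

class ActionExpert(nn.Module):
    """Stage 3: decode a chunk of low-level actions from context + latent plan."""
    def __init__(self, ctx_dim=128, code_dim=64, act_dim=14, chunk=30):
        super().__init__()
        self.chunk, self.act_dim = chunk, act_dim
        self.net = nn.Linear(ctx_dim + code_dim, chunk * act_dim)

    def forward(self, ctx, zq):
        out = self.net(torch.cat([ctx, zq], -1))
        return out.view(-1, self.chunk, self.act_dim)  # (B, 30, act_dim)

B = 8
lam, planner, expert = LatentActionModel(), LatentPlanner(), ActionExpert()
frame_t, frame_tk = torch.randn(B, 128), torch.randn(B, 128)
lam_loss, idx = lam(frame_t, frame_tk)      # stage 1: learn latent actions
ctx = torch.randn(B, 128)                   # stand-in for VLM features
plan_loss, logits = planner(ctx, idx)       # stage 2: predict latent tokens
zq = lam.vq.codebook(logits.argmax(-1))     # look up the predicted latents
actions = expert(ctx, zq)                   # stage 3: (8, 30, 14) action chunk
```

In the real model the encoder, planner, and expert are transformer-based and the context comes from images plus language; the linear layers above just keep the stage boundaries visible.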
Some interesting things from the paper:

  1. A model scale of two billion parameters has proven effective for robotic tasks.

  2. The action chunk size is set to 30, i.e., the policy predicts 30 future actions per inference step (see the sketch below).
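For concreteness, here is a tiny sketch of how a chunk size of 30 might be consumed at control time: the policy predicts 30 actions at once and they are executed before replanning. The `policy` and `step` stubs are hypothetical placeholders, not the paper's API.

```python
import numpy as np

CHUNK = 30      # chunk size reported in the paper
ACT_DIM = 14    # hypothetical action dimension

def policy(obs):
    """Hypothetical stand-in: predicts a whole 30-step action chunk."""
    return np.zeros((CHUNK, ACT_DIM))

def step(obs, action):
    """Hypothetical environment step."""
    return obs

def run_episode(obs, max_steps=300):
    # Receding-horizon execution: replan once per chunk,
    # then run the chunk open-loop.
    for _ in range(max_steps // CHUNK):
        for action in policy(obs):
            obs = step(obs, action)
    return obs

run_episode(np.zeros(10))
```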