By combining VLMs (CLIP, Lang-SAM, OWL-ViT) for object detection, navigation primitives for movement, and a grasping primitive (AnyGrasp) for object manipulation, OK-Robot achieves pick-and-place in unseen scenes without any additional training.

  1. Scanning the scene with an iPhone to collect RGB-D images for map building and object detection (OWL-ViT)
  2. Building a voxel map of objects with CLIP embeddings
  3. Querying this memory to find the target voxel and its position (used only for navigation); a sketch of steps 2-3 follows the list
  4. Navigating to the target
  5. Generating grasp poses with the AnyGrasp network and filtering them with a VLM (GPT-4V); a mask-based filtering sketch also follows the list
  6. Executing the pick-and-place action.
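
Steps 2-3 are the core of the semantic memory: per-voxel CLIP features matched against a text query by cosine similarity. Below is a minimal sketch of that idea, not the authors' code; the `add_observation`/`query` names, the 10 cm voxel size, and the assumption that patch embeddings plus their back-projected 3D points are already available are all my own.

```python
# Minimal sketch (my own, not the OK-Robot implementation) of a CLIP voxel map:
# accumulate patch embeddings per voxel, then query with an open-vocabulary prompt.
import numpy as np
import torch
import open_clip

model, _, _ = open_clip.create_model_and_transforms(
    "ViT-B-32", pretrained="laion2b_s34b_b79k")
tokenizer = open_clip.get_tokenizer("ViT-B-32")

VOXEL_SIZE = 0.10  # assumed grid resolution in meters

def add_observation(voxel_map, points_xyz, patch_embeddings):
    """Accumulate CLIP embeddings of image patches into the voxels they hit."""
    keys = np.floor(points_xyz / VOXEL_SIZE).astype(int)
    for key, emb in zip(map(tuple, keys), patch_embeddings):
        feat, count = voxel_map.get(key, (np.zeros_like(emb), 0))
        voxel_map[key] = (feat + emb, count + 1)  # running sum, averaged at query time

def query(voxel_map, text):
    """Return the center of the voxel whose mean embedding best matches `text`."""
    with torch.no_grad():
        tok = tokenizer([text])
        t = torch.nn.functional.normalize(model.encode_text(tok), dim=-1).numpy()[0]
    keys, feats = zip(*[(k, f / c) for k, (f, c) in voxel_map.items()])
    feats = np.stack([f / np.linalg.norm(f) for f in feats])
    best = int(np.argmax(feats @ t))                    # cosine similarity
    return (np.asarray(keys[best]) + 0.5) * VOXEL_SIZE  # voxel center = navigation target
```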
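
For step 5, one simple way to picture the filtering (whichever model produces the object mask) is to project each grasp candidate into the image and keep only those that land on the segmented target. The `Grasp` container and the pinhole intrinsics below are illustrative assumptions, not the paper's interface.

```python
# Hedged sketch of grasp filtering: keep proposals whose grasp point projects
# inside the target object's 2D segmentation mask, then rank by score.
from dataclasses import dataclass
import numpy as np

@dataclass
class Grasp:                  # hypothetical container, not AnyGrasp's API
    translation: np.ndarray   # 3D grasp point in the camera frame (meters)
    score: float              # confidence returned by the grasp network

def filter_grasps(grasps, object_mask, fx, fy, cx, cy):
    """Keep grasps that land on the segmented object, sorted by score."""
    kept = []
    for g in grasps:
        x, y, z = g.translation
        if z <= 0:
            continue  # behind the camera
        u, v = int(fx * x / z + cx), int(fy * y / z + cy)  # pinhole projection
        in_image = 0 <= v < object_mask.shape[0] and 0 <= u < object_mask.shape[1]
        if in_image and object_mask[v, u]:
            kept.append(g)
    return sorted(kept, key=lambda g: g.score, reverse=True)
```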

This paper combines many large models to do Open-Vocabulary Mobile Manipulation (OVMM). However, I think they did not exploit the power of GPT-4 for subgoal planning.