By combining VLMs (CLIP, Lang-SAM, OWL-ViT) for object detection, navigation primitives for movement, and grasping primitives (AnyGrasp) for object manipulation, OK-Robot can achieve pick-and-place tasks in unseen scenes without any training. The pipeline:
- Scanning the scene with an iPhone to collect RGB-D images for map building and object detection (OWL-ViT)
- Building a VoxelMap of objects with CLIP embeddings (a toy sketch of this memory and its querying follows the list)
- Querying the memory to find the target voxel and its corresponding position (used only for navigation)
- Navigating to the target (a generic planner sketch is also included below)
- Generating grasp poses with the AnyGrasp network and filtering them with a VLM (GPT-4V); see the grasp-filtering sketch below
- Executing the pick-and-place.
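
As a rough illustration of the memory and query steps, here is a minimal numpy sketch of a CLIP voxel map: each detected object's CLIP image embedding is written into the voxels its 3D points occupy, and a text query retrieves the best-matching voxel center as a navigation goal. The class name, voxel size, and data layout are my assumptions, not the paper's code, and the random vectors are stand-ins for real CLIP features.

```python
import numpy as np

VOXEL_SIZE = 0.05  # 5 cm voxels; the resolution here is an assumption

class VoxelMap:
    """Toy semantic voxel memory: one CLIP embedding per occupied voxel."""

    def __init__(self):
        self.embeddings = {}  # (i, j, k) voxel index -> unit-norm CLIP embedding

    def add_observation(self, points_xyz, clip_embedding):
        """Write an object's CLIP image embedding into every voxel its 3D points fall in."""
        emb = clip_embedding / np.linalg.norm(clip_embedding)
        for p in points_xyz:
            self.embeddings[tuple(np.floor(p / VOXEL_SIZE).astype(int))] = emb

    def query(self, text_embedding):
        """Return the center (in meters) of the voxel most similar to the text embedding."""
        t = text_embedding / np.linalg.norm(text_embedding)
        best = max(self.embeddings, key=lambda k: float(self.embeddings[k] @ t))
        return (np.array(best) + 0.5) * VOXEL_SIZE


# Usage with random stand-ins for real CLIP image/text features:
vmap = VoxelMap()
vmap.add_observation(np.random.rand(100, 3), np.random.rand(512))
goal_xyz = vmap.query(np.random.rand(512))  # rough 3D goal for the mobile base
```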
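
Navigation in this kind of setup can be pictured as shortest-path planning on an occupancy grid derived from the scanned point cloud. Below is a generic A* sketch over such a grid; it is not OK-Robot's actual planner, just an illustration of the step.

```python
import heapq, itertools
import numpy as np

def astar(occupancy, start, goal):
    """A* path search on a 2D occupancy grid (True = obstacle).
    Returns a list of (row, col) cells from start to goal, or None."""
    h = lambda c: abs(c[0] - goal[0]) + abs(c[1] - goal[1])  # Manhattan heuristic
    tie = itertools.count()                                   # heap tiebreaker
    frontier = [(h(start), next(tie), 0, start, None)]
    parents, best_g = {}, {start: 0}
    while frontier:
        _, _, g, cell, parent = heapq.heappop(frontier)
        if cell in parents:
            continue                                          # already expanded
        parents[cell] = parent
        if cell == goal:                                      # reconstruct the path
            path = [cell]
            while parents[path[-1]] is not None:
                path.append(parents[path[-1]])
            return path[::-1]
        for dr, dc in ((1, 0), (-1, 0), (0, 1), (0, -1)):
            nxt = (cell[0] + dr, cell[1] + dc)
            if (0 <= nxt[0] < occupancy.shape[0] and 0 <= nxt[1] < occupancy.shape[1]
                    and not occupancy[nxt] and g + 1 < best_g.get(nxt, np.inf)):
                best_g[nxt] = g + 1
                heapq.heappush(frontier, (g + 1 + h(nxt), next(tie), g + 1, nxt, cell))
    return None


# Example: an empty 20x20 map with a short wall in the middle.
grid = np.zeros((20, 20), dtype=bool)
grid[5:15, 10] = True
path = astar(grid, (0, 0), (19, 19))
```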
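
For the grasping step, the filtering can be pictured as: AnyGrasp proposes many scored grasps over the whole scene, and only those landing on the queried object are kept before choosing the top-scored one. The array layout, the 3 cm threshold, and the function name below are assumptions for illustration, not the paper's interface.

```python
import numpy as np

def filter_grasps(grasp_centers, grasp_scores, object_points, max_dist=0.03):
    """Keep grasp candidates whose center lies within max_dist (meters) of the
    target object's points, then return the highest-scoring survivor (or None)."""
    # Distance from every grasp center to its nearest point on the target object.
    dists = np.linalg.norm(
        grasp_centers[:, None, :] - object_points[None, :, :], axis=-1).min(axis=1)
    keep = np.where(dists < max_dist)[0]
    if keep.size == 0:
        return None  # no candidate lands on the queried object
    return grasp_centers[keep[np.argmax(grasp_scores[keep])]]


# Example with random candidates and a random object point cloud:
best = filter_grasps(np.random.rand(50, 3), np.random.rand(50), np.random.rand(200, 3))
```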
This paper combines lots of large models to do Open-Vocabulary Mobile Manipulation (OVMM), though I think they didn't use the power of GPT-4 for subgoal planning.