This paper provides a well-organized survey of current large models for real-world robotics.
Foundation Models
Foundation models, developed mostly in NLP and CV and built on the transformer architecture, are characterized by three main properties. While this paper does not aim to comprehensively cover all foundation models according to these criteria, it focuses on differences in modalities and on classifying foundation models accordingly. The following table lists foundation models for modalities such as language, vision, audio, and 3D representations (point clouds or shapes).
| From | To | Examples |
| --- | --- | --- |
| Language | Language | GPT-3, LLaMA |
| Language | Latent | BERT |
| Vision | Latent | R3M, VC-1 |
| Vision | Recognition | SAM |
| Vision + Language | Latent | CLIP |
| Vision + Language | Language | GPT-4V |
| Vision + Language | Vision | Stable Diffusion |
| Vision + Language | Recognition | OWL-ViT, DINOv2 |
| Audio + Language | Language | Whisper |
| Audio + Vision + Language | Latent | AudioCLIP, CLAP |
| Audio + Vision + Language | Audio | MusicLM, VALL-E |
| Vision + Language | 3D | Point-E |
| 3D + Vision + Language | Latent | ULIP |
| 3D + Vision + Language | Recognition | 3D-LLM |
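As a minimal sketch of the "X → Latent" pattern used by models such as CLIP, ULIP, and AudioCLIP: modality-specific encoders map their inputs into a shared embedding space, and cosine similarity scores cross-modal matches. The random linear projections below are hypothetical stand-ins for real pretrained encoders, not any actual model's weights.

```python
import numpy as np

rng = np.random.default_rng(0)
EMBED_DIM = 64

# Hypothetical stand-ins for pretrained encoders (assumption for illustration).
W_image = rng.standard_normal((2048, EMBED_DIM))  # e.g. pooled image features -> latent
W_text = rng.standard_normal((512, EMBED_DIM))    # e.g. pooled token features -> latent

def embed(features: np.ndarray, projection: np.ndarray) -> np.ndarray:
    """Project modality features into the shared space and L2-normalize."""
    z = features @ projection
    return z / np.linalg.norm(z)

def similarity(image_feat: np.ndarray, text_feat: np.ndarray) -> float:
    """Cosine similarity of the two normalized embeddings (a dot product)."""
    return float(embed(image_feat, W_image) @ embed(text_feat, W_text))

image = rng.standard_normal(2048)
caption = rng.standard_normal(512)
score = similarity(image, caption)
assert -1.0 <= score <= 1.0  # cosine similarity is bounded
```

In real systems the two encoders are trained jointly with a contrastive loss so that matching image-text pairs score higher than mismatched ones; the shared latent space is what downstream robotics pipelines consume.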
Other modalities, such as IMUs, heatmaps, object poses, and skeletal movements (including gestures), have received less attention recently and are not discussed here.
Applications
| Level | Application | Details |
| --- | --- | --- |
| Low | Perception | Feature extraction and scene recognition. |
| Low | Planning | Inverse kinematics (IK). |
| High | Perception | Map construction and reward design. |
| High | Planning | Task planning and code generation. |
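To make the "task planning and code generation" row concrete, here is an illustrative sketch of high-level planning as code generation: a language model (mocked here as a fixed string) emits calls into a small, whitelisted skill library, and the robot runtime parses and executes them. The skill names, plan format, and `execute_plan` helper are all hypothetical, not any particular system's API.

```python
# Mock low-level skills a real robot stack would provide.
def pick(obj: str) -> str:
    return f"picked {obj}"

def place(obj: str, location: str) -> str:
    return f"placed {obj} on {location}"

SKILLS = {"pick": pick, "place": place}  # whitelist of executable skills

# Stand-in for LLM output: one "skill arg1 arg2 ..." call per line.
generated_plan = """\
pick cup
place cup table"""

def execute_plan(plan: str) -> list[str]:
    """Parse each line of the generated plan and run it against the skill library."""
    log = []
    for line in plan.splitlines():
        name, *args = line.split()
        if name not in SKILLS:
            raise ValueError(f"unknown skill: {name}")
        log.append(SKILLS[name](*args))
    return log

print(execute_plan(generated_plan))  # ['picked cup', 'placed cup on table']
```

Restricting execution to a whitelist of named skills, rather than running arbitrary generated code, is a common safety measure in this design.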