This paper provides a well-organized survey of current large models for real-world robotics.
Foundation Models
Foundation models, developed mostly in NLP and CV and built on the transformer architecture, are characterized by three main properties. While this paper does not aim to comprehensively cover all foundation models according to these criteria, it focuses on differences in modalities and on classifying foundation models accordingly. The following table lists foundation models for modalities such as language, vision, audio, and 3D representations (point clouds or shapes).
| From | To | Examples |
| --- | --- | --- |
| Language | Language | GPT-3, LLaMA |
| Language | Latent | BERT |
| Vision | Latent | R3M, VC-1 |
| Vision | Recognition | SAM |
| Vision + Language | Latent | CLIP |
| Vision + Language | Language | GPT-4V |
| Vision + Language | Vision | Stable Diffusion |
| Vision + Language | Recognition | OWL-ViT, DINOv2 |
| Audio + Language | Language | Whisper |
| Audio + Vision + Language | Latent | AudioCLIP, CLAP |
| Audio + Vision + Language | Audio | MusicLM, VALL-E |
| Vision + Language | 3D | Point-E |
| 3D + Vision + Language | Latent | ULIP |
| 3D + Vision + Language | Recognition | 3D-LLM |
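As a minimal sketch of the "X → Latent" pattern used by models such as CLIP, ULIP, and AudioCLIP: modality-specific encoders map their inputs into a shared embedding space, and cosine similarity scores cross-modal matches. The random linear projections below are hypothetical stand-ins for real pretrained encoders, not any actual model's weights.

```python
import numpy as np

rng = np.random.default_rng(0)
EMBED_DIM = 64

# Hypothetical stand-ins for pretrained encoders (assumption for illustration).
W_image = rng.standard_normal((2048, EMBED_DIM))  # e.g. pooled image features -> latent
W_text = rng.standard_normal((512, EMBED_DIM))    # e.g. pooled token features -> latent

def embed(features: np.ndarray, projection: np.ndarray) -> np.ndarray:
    """Project modality features into the shared space and L2-normalize."""
    z = features @ projection
    return z / np.linalg.norm(z)

def similarity(image_feat: np.ndarray, text_feat: np.ndarray) -> float:
    """Cosine similarity of the two normalized embeddings (a dot product)."""
    return float(embed(image_feat, W_image) @ embed(text_feat, W_text))

image = rng.standard_normal(2048)
caption = rng.standard_normal(512)
score = similarity(image, caption)
assert -1.0 <= score <= 1.0  # cosine similarity is bounded
```

In real systems the two encoders are trained jointly with a contrastive loss so that matching image-text pairs score higher than mismatched ones; the shared latent space is what downstream robotics pipelines consume.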
Other modalities, such as IMUs, heatmaps, object poses, and skeletal movements (including gestures), have received less attention recently and are not discussed here.
Applications
| Level | Application | Details |
| --- | --- | --- |
| Low | Perception | Feature extraction and scene recognition. |
| Low | Planning | Inverse kinematics (IK). |
| High | Perception | Map construction and reward design. |
| High | Planning | Task planning and code generation. |
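To make the "task planning and code generation" row concrete, here is an illustrative sketch of high-level planning as code generation: a language model (mocked here as a fixed string) emits calls into a small, whitelisted skill library, and the robot runtime parses and executes them. The skill names, plan format, and `execute_plan` helper are all hypothetical, not any particular system's API.

```python
# Mock low-level skills a real robot stack would provide.
def pick(obj: str) -> str:
    return f"picked {obj}"

def place(obj: str, location: str) -> str:
    return f"placed {obj} on {location}"

SKILLS = {"pick": pick, "place": place}  # whitelist of executable skills

# Stand-in for LLM output: one "skill arg1 arg2 ..." call per line.
generated_plan = """\
pick cup
place cup table"""

def execute_plan(plan: str) -> list[str]:
    """Parse each line of the generated plan and run it against the skill library."""
    log = []
    for line in plan.splitlines():
        name, *args = line.split()
        if name not in SKILLS:
            raise ValueError(f"unknown skill: {name}")
        log.append(SKILLS[name](*args))
    return log

print(execute_plan(generated_plan))  # ['picked cup', 'placed cup on table']
```

Restricting execution to a whitelist of named skills, rather than running arbitrary generated code, is a common safety measure in this design.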