FF's Notes

Real-World Robot Applications of Foundation Models: A Review

Jun 1, 2024

This paper gives a well-organized overview of the large models currently used for robotics in the real world.

Foundation Models

Foundation models, mostly from NLP and CV and based on the Transformer architecture, have three main characteristics:

  1. In-Context Learning
  2. Scaling Law
  3. Homogenization

While this paper does not aim to comprehensively cover all foundation models under the criteria above, it focuses on differences in modalities and on the classification of foundation models. The following table lists foundation models for modalities such as language, vision, audio, and 3D representations (point clouds or shapes).

-------------------------  -----------  ----------------
From                       To           Examples
-------------------------  -----------  ----------------
Language                   Language     GPT-3, LLaMA
Language                   Latent       BERT
Vision                     Latent       R3M, VC-1
Vision                     Recognition  SAM
Vision + Language          Latent       CLIP
Vision + Language          Language     GPT-4V
Vision + Language          Vision       Stable Diffusion
Vision + Language          Recognition  OWL-ViT, DinoV2
Audio + Language           Language     Whisper
Audio + Vision + Language  Latent       AudioCLIP, CLAP
Audio + Vision + Language  Audio        MusicLM, VALLE
Vision + Language          3D           Point-E
3D + Vision + Language     Latent       ULIP
3D + Vision + Language     Recognition  3D-LLM
-------------------------  -----------  ----------------

Other modalities, including IMUs, heatmaps, object poses, and skeletal movements such as gestures, have not received much discussion recently.
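For quick lookups, the modality table above can be kept as a small data structure. The following is only a sketch of one way to do that: the names `Entry`, `entry`, `TABLE`, and `models_for` are hypothetical helpers made up for these notes, not anything defined by the paper, and the rows are copied from the table as written.

```python
from typing import NamedTuple


class Entry(NamedTuple):
    inputs: frozenset  # input modalities (the "From" column)
    output: str        # output type (the "To" column)
    examples: tuple    # example models (the "Examples" column)


def entry(inputs: str, output: str, *examples: str) -> Entry:
    """Build a row from a 'A + B' style modality string (hypothetical helper)."""
    return Entry(frozenset(inputs.split(" + ")), output, examples)


# Rows copied from the table in these notes, in the same order.
TABLE = [
    entry("Language", "Language", "GPT-3", "LLaMA"),
    entry("Language", "Latent", "BERT"),
    entry("Vision", "Latent", "R3M", "VC-1"),
    entry("Vision", "Recognition", "SAM"),
    entry("Vision + Language", "Latent", "CLIP"),
    entry("Vision + Language", "Language", "GPT-4V"),
    entry("Vision + Language", "Vision", "Stable Diffusion"),
    entry("Vision + Language", "Recognition", "OWL-ViT", "DinoV2"),
    entry("Audio + Language", "Language", "Whisper"),
    entry("Audio + Vision + Language", "Latent", "AudioCLIP", "CLAP"),
    entry("Audio + Vision + Language", "Audio", "MusicLM", "VALLE"),
    entry("Vision + Language", "3D", "Point-E"),
    entry("3D + Vision + Language", "Latent", "ULIP"),
    entry("3D + Vision + Language", "Recognition", "3D-LLM"),
]


def models_for(inputs, output=None):
    """Return example models whose input modalities match exactly.

    If `output` is given, also require the "To" column to match.
    """
    wanted = frozenset(inputs)
    return [
        model
        for row in TABLE
        if row.inputs == wanted and (output is None or row.output == output)
        for model in row.examples
    ]
```

For example, `models_for({"Vision", "Language"}, "Latent")` returns the models listed in the "Vision + Language to Latent" row. Matching on the exact set of input modalities is a deliberate simplification here; a subset match would be another reasonable choice.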

Applications

-----  -----------  -----------------------------------------
Level  Application  Details
-----  -----------  -----------------------------------------
Low    Perception   Feature extraction and scene recognition.
Low    Planning     Inverse kinematics (IK).
High   Perception   Map construction and reward design.
High   Planning     Task planning and code generation.
-----  -----------  -----------------------------------------