This paper gives a well-organized overview of the large models currently used for real-world robotics.

Foundation Models

Foundation models, found mostly in NLP and CV and built on the transformer architecture, have three main characteristics:

  1. In-Context Learning
  2. Scaling Law
  3. Homogenization
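To make the first characteristic concrete: in-context learning means a model picks up a task from demonstrations placed in its prompt, without any weight update. The snippet below is a minimal sketch of how such a few-shot prompt could be assembled. The helper name `build_few_shot_prompt`, the "Instruction/Action" layout, and the robot-style examples are all assumptions made for illustration; none of them come from the paper, and no model is actually called.

```python
def build_few_shot_prompt(examples, query):
    """Format (instruction, action) demonstrations followed by an unanswered query.

    Hypothetical helper for illustration only; the prompt layout is an assumption.
    """
    lines = []
    for instruction, action in examples:
        lines.append(f"Instruction: {instruction}")
        lines.append(f"Action: {action}")
        lines.append("")  # blank line between demonstrations
    # The final instruction is left without an action for the model to complete.
    lines.append(f"Instruction: {query}")
    lines.append("Action:")
    return "\n".join(lines)


# Made-up demonstrations; a real system would use its own action vocabulary.
examples = [
    ("pick up the red cup", "pick(red_cup)"),
    ("open the drawer", "open(drawer)"),
]
prompt = build_few_shot_prompt(examples, "pick up the blue block")
print(prompt)
```

The resulting string would be sent to a language model, which is expected to continue the pattern for the final instruction.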

While this paper does not aim to comprehensively cover every foundation model that meets the criteria above, it focuses on how foundation models differ by modality and how they can be classified. The following table lists foundation models for modalities such as language, vision, audio, and 3D representations (point clouds or shapes).

  From                         To             Examples

  Language                     Language       GPT-3, LLaMA
  Language                     Latent         BERT
  Vision                       Latent         R3M, VC-1
  Vision                       Recognition    SAM
  Vision + Language            Latent         CLIP
  Vision + Language            Language       GPT-4V
  Vision + Language            Vision         Stable Diffusion
  Vision + Language            Recognition    OWL-ViT, DinoV2
  Audio + Language             Language       Whisper
  Audio + Vision + Language    Latent         AudioCLIP, CLAP
  Audio + Vision + Language    Audio          MusicLM, VALLE
  Vision + Language            3D             Point-E
  3D + Vision + Language       Latent         ULIP
  3D + Vision + Language       Recognition    3D-LLM
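
The classification in the table above is a mapping from input modalities ("From") and output type ("To") to example models, so it can be written down as a small lookup structure. This is only a sketch for navigating the notes: the names `Entry`, `TAXONOMY`, and `models_for` are made up here, and the rows simply repeat the table as read.

```python
from typing import NamedTuple, Optional


class Entry(NamedTuple):
    source: str        # input modalities ("From" column)
    target: str        # output type ("To" column)
    examples: tuple    # model names listed in the table


# Rows copied from the table; names are spelled as they appear there.
TAXONOMY = [
    Entry("Language", "Language", ("GPT-3", "LLaMA")),
    Entry("Language", "Latent", ("BERT",)),
    Entry("Vision", "Latent", ("R3M", "VC-1")),
    Entry("Vision", "Recognition", ("SAM",)),
    Entry("Vision + Language", "Latent", ("CLIP",)),
    Entry("Vision + Language", "Language", ("GPT-4V",)),
    Entry("Vision + Language", "Vision", ("Stable Diffusion",)),
    Entry("Vision + Language", "Recognition", ("OWL-ViT", "DinoV2")),
    Entry("Audio + Language", "Language", ("Whisper",)),
    Entry("Audio + Vision + Language", "Latent", ("AudioCLIP", "CLAP")),
    Entry("Audio + Vision + Language", "Audio", ("MusicLM", "VALLE")),
    Entry("Vision + Language", "3D", ("Point-E",)),
    Entry("3D + Vision + Language", "Latent", ("ULIP",)),
    Entry("3D + Vision + Language", "Recognition", ("3D-LLM",)),
]


def models_for(source: str, target: Optional[str] = None) -> list:
    """Return example models for an input modality, optionally filtered by output type."""
    return [
        name
        for entry in TAXONOMY
        if entry.source == source and (target is None or entry.target == target)
        for name in entry.examples
    ]


print(models_for("Vision", "Recognition"))
print(models_for("Vision + Language"))
```

Keeping each row as a flat record, rather than nesting by modality, makes it easy to filter on either column.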

Other modalities, including IMUs, heatmaps, object poses, and skeletal movements such as gestures, have received little attention recently.

Applications

  Level   Application   Details

  Low     Perception    Feature extraction and scene recognition.
  Low     Planning      Inverse kinematics (IK).
  High    Perception    Map construction and reward design.
  High    Planning      Task planning and code generation.