FF's Notes

Robotic Foundation Models

Jan 17, 2025

Robotic foundation models are models whose inputs and outputs belong to the robotics domain. They can be grouped into four categories:

  • Pre-trained visual representations for robotics. Some works use pre-trained CNN (e.g. ResNet) or vision transformer backbones to extract features from the inputs, then pass the resulting latent vector to downstream tasks.
  • Vision-language models for robotics. Some works train a multimodal language model that accepts multimodal inputs (image and text) and answers questions about them, such as PaLM-E.
  • Dynamics models, which learn quantities that describe the system, such as Q-values, states, and rewards.
  • End-to-end control policies, which generate actions for the robot to execute directly.

    The main difference between transformer-based and VLA (vision-language-action) models is that transformer-based models do not contain a pre-trained VLM.
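
One way to keep the four categories apart is by what each one takes in and what it returns. The sketch below writes them as hypothetical Python interfaces and adds toy stand-ins for three of them. Every name here (`VisualEncoder`, `MeanPoolEncoder`, `LinearHeadPolicy`, `AdditiveDynamics`, and so on) is an assumption made for illustration and does not come from any particular paper or library. The toy encoder simply averages image rows, standing in for a frozen pre-trained backbone whose latent vector is reused by a downstream policy head.

```python
from typing import List, Protocol, Tuple

Image = List[List[float]]  # toy stand-in for a camera frame
Vector = List[float]       # toy stand-in for a latent, state, or action


class VisualEncoder(Protocol):
    """Pre-trained visual representation: image -> latent vector."""
    def encode(self, image: Image) -> Vector: ...


class VisionLanguageModel(Protocol):
    """Vision-language model: (image, text) -> text answer."""
    def answer(self, image: Image, question: str) -> str: ...


class DynamicsModel(Protocol):
    """Dynamics model: (state, action) -> (next state, reward)."""
    def predict(self, state: Vector, action: Vector) -> Tuple[Vector, float]: ...


class ControlPolicy(Protocol):
    """End-to-end policy: (image, instruction) -> action."""
    def act(self, image: Image, instruction: str) -> Vector: ...


class MeanPoolEncoder:
    """Hypothetical frozen encoder: one feature per image row (its mean)."""
    def encode(self, image: Image) -> Vector:
        return [sum(row) / len(row) for row in image]


class LinearHeadPolicy:
    """Hypothetical downstream policy: a linear head on the encoder's latent."""
    def __init__(self, encoder: VisualEncoder, weights: List[Vector]) -> None:
        self.encoder = encoder
        self.weights = weights  # one weight row per action dimension

    def act(self, image: Image, instruction: str) -> Vector:
        latent = self.encoder.encode(image)
        # The instruction is ignored in this toy; a real policy would condition on it.
        return [sum(w * z for w, z in zip(row, latent)) for row in self.weights]


class AdditiveDynamics:
    """Hypothetical dynamics: the action shifts the state; reward penalizes distance from the origin."""
    def predict(self, state: Vector, action: Vector) -> Tuple[Vector, float]:
        next_state = [s + a for s, a in zip(state, action)]
        reward = -sum(x * x for x in next_state)
        return next_state, reward


image = [[0.0, 1.0], [1.0, 1.0]]
encoder = MeanPoolEncoder()
latent = encoder.encode(image)
policy = LinearHeadPolicy(encoder, weights=[[1.0, 0.0], [0.0, 2.0]])
action = policy.act(image, "pick up the cup")
next_state, reward = AdditiveDynamics().predict([0.0, 0.0], action)
print(latent, action, next_state, reward)
```

The vision-language interface is left without an implementation, since any toy version would say little about how such models work. Under this framing, a VLA model would be a `ControlPolicy` built around a pre-trained `VisionLanguageModel`, while a transformer-based policy would implement `ControlPolicy` without one.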