Robotic foundation models are models whose inputs and outputs live in the robotics domain. They can be grouped into four categories:

  • pre-trained visual representations for robotics. Some works use pre-trained CNN (e.g., ResNet) or vision transformer backbones to extract features from visual inputs, then feed the resulting latent vectors to downstream tasks.

  • vision-language models for robotics. Some works train a multimodal language model that accepts image and text inputs to answer questions, such as PaLM-E.

  • dynamics models (learn quantities describing the system, such as Q-values, state transitions, and rewards)

  • end-to-end control policies (directly generate actions for the robot to execute)

    The main difference between transformer-based models and vision-language-action (VLA) models is that transformer-based models do not contain any pre-trained VLM.
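The first category above can be sketched in a few lines: a frozen encoder maps an image observation to a latent vector, and only a small policy head on top of it is trained on robot data. This is a minimal, hypothetical illustration of the pattern, not any specific system; a real pipeline would use a pre-trained ResNet or ViT backbone (e.g., weights from R3M or ImageNet), and the random projection below merely stands in for that frozen encoder. All dimensions and names here are assumptions for the sketch.

```python
import numpy as np

rng = np.random.default_rng(0)

D_IMAGE = 3 * 32 * 32   # flattened RGB observation (toy resolution)
D_LATENT = 64           # size of the latent feature vector
D_ACTION = 7            # e.g., a 7-DoF arm command (assumed for illustration)

# "Frozen pre-trained backbone" stand-in: a fixed random projection.
W_encoder = rng.standard_normal((D_LATENT, D_IMAGE)) / np.sqrt(D_IMAGE)
# The only trainable part: a lightweight policy head over the latent.
W_policy = rng.standard_normal((D_ACTION, D_LATENT)) / np.sqrt(D_LATENT)

def encode(image):
    """Frozen encoder: image observation -> latent feature vector."""
    return np.tanh(W_encoder @ image.ravel())

def policy(latent):
    """Control head trained on top of the frozen features."""
    return W_policy @ latent

obs = rng.standard_normal((3, 32, 32))   # one dummy camera frame
action = policy(encode(obs))
print(action.shape)                      # (7,)
```

The design point the sketch captures is the division of labor: the visual representation is learned once on large non-robot data and reused, while only the low-dimensional head needs robot-specific training.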