Robotic foundation models are models that consider inputs and outputs in robotic domain. It can be categorized into three:
-
pre-trained visual representations for robotics. Some works use pre-trained CNN (ResNet) and vision transformer backbones to extract features from inputs and then use latent vector for later tasks.
-
vision language models for robotics. Some works train a multimodal language model, which allows multimodal inputs (image and text) to answer questions, such as PaLM.
-
dynamics models (learn system dynamics such as Q-value, state and reward)
-
end-to-end control policies (generate action for robots to execute directly)
The main difference between transformer-based and VLA models are: transformer-based models do not contains any pre-trained VLMs.