This paper proposes a pre-training approach that:
- maps different embodiments, each with its own proprioception (e.g., end-effector pose) and vision inputs (e.g., camera images), into a shared latent space via embodiment-specific tokenizers
- trains a shared transformer trunk on the union of all heterogeneous datasets
- transfers to a new embodiment by learning only a small, new tokenizer at transfer time
The biggest novelty is the first point. Many existing works attach different decoders to fit different embodiments, whereas this paper proposes encoding the different embodiments into a shared latent space, so that unseen embodiments can be fitted well too.
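
To make the separation between per-embodiment tokenizers and the shared trunk concrete, here is a minimal PyTorch sketch of the idea (not the authors' code). It assumes a simple setup: a small MLP tokenizer for proprioception, a patchify tokenizer for images, one shared transformer trunk, and a per-embodiment action head registered at transfer time. All module and method names (ProprioTokenizer, VisionTokenizer, SharedTrunkPolicy, add_embodiment) are illustrative, not taken from the paper.

```python
import torch
import torch.nn as nn


class ProprioTokenizer(nn.Module):
    """Maps an embodiment-specific proprioceptive vector to one latent token."""
    def __init__(self, proprio_dim: int, latent_dim: int):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(proprio_dim, latent_dim), nn.GELU(), nn.Linear(latent_dim, latent_dim)
        )

    def forward(self, proprio):                      # (B, proprio_dim)
        return self.net(proprio).unsqueeze(1)        # (B, 1, latent_dim)


class VisionTokenizer(nn.Module):
    """Patchifies an embodiment-specific camera image into latent tokens."""
    def __init__(self, in_channels: int, latent_dim: int, patch: int = 16):
        super().__init__()
        self.proj = nn.Conv2d(in_channels, latent_dim, kernel_size=patch, stride=patch)

    def forward(self, image):                        # (B, C, H, W)
        tokens = self.proj(image)                    # (B, latent_dim, H/p, W/p)
        return tokens.flatten(2).transpose(1, 2)     # (B, num_patches, latent_dim)


class SharedTrunkPolicy(nn.Module):
    """One trunk shared by all embodiments; tokenizers and heads are per-embodiment."""
    def __init__(self, latent_dim: int = 256, depth: int = 4, heads: int = 8):
        super().__init__()
        layer = nn.TransformerEncoderLayer(d_model=latent_dim, nhead=heads, batch_first=True)
        self.trunk = nn.TransformerEncoder(layer, num_layers=depth)
        self.tokenizers = nn.ModuleDict()   # embodiment name -> (proprio, vision) tokenizers
        self.heads = nn.ModuleDict()        # embodiment name -> action head

    def add_embodiment(self, name, proprio_dim, in_channels, action_dim, latent_dim=256):
        # Called once per embodiment, including a *new* one at transfer time;
        # only these small modules need training while the trunk can stay frozen.
        self.tokenizers[name] = nn.ModuleList(
            [ProprioTokenizer(proprio_dim, latent_dim), VisionTokenizer(in_channels, latent_dim)]
        )
        self.heads[name] = nn.Linear(latent_dim, action_dim)

    def forward(self, name, proprio, image):
        proprio_tok, vision_tok = self.tokenizers[name]
        tokens = torch.cat([proprio_tok(proprio), vision_tok(image)], dim=1)
        latent = self.trunk(tokens)                   # shared latent space
        return self.heads[name](latent.mean(dim=1))   # pooled latent -> action prediction


# Usage: pre-train on the union of embodiments A and B, then transfer to a new
# embodiment by registering only a fresh tokenizer/head pair for it.
policy = SharedTrunkPolicy()
policy.add_embodiment("robot_A", proprio_dim=7, in_channels=3, action_dim=7)
policy.add_embodiment("robot_B", proprio_dim=12, in_channels=3, action_dim=12)
action = policy("robot_A", torch.randn(2, 7), torch.randn(2, 3, 224, 224))
```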