SmolVLA is a compact and efficient VLA model that skips layers in the VLM, uses a minimal number of visual tokens, leverages a small pretrained VLM, and interleaves self-attention layers with lighter cross-attention layers.
It uses SmolVLM-2 as its backbone: SigLIP encodes visual features for the SmolLM2 language decoder, and the visual token count is reduced through a token-shuffling technique for efficiency.
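As a rough illustration of the token-shuffling idea, the sketch below trades spatial token count for channel width by folding each patch neighborhood into the feature dimension; the grid size, ratio, and dimensions are assumptions for illustration, not SmolVLM-2's exact values:

```python
import torch

def shuffle_tokens(x: torch.Tensor, ratio: int = 2) -> torch.Tensor:
    """Fold each (ratio x ratio) patch neighborhood into the channel dim,
    shrinking the visual token count by ratio**2."""
    b, n, d = x.shape
    h = w = int(n ** 0.5)  # assume a square grid of patches
    x = x.view(b, h // ratio, ratio, w // ratio, ratio, d)
    x = x.permute(0, 1, 3, 2, 4, 5)
    return x.reshape(b, (h // ratio) * (w // ratio), d * ratio**2)

tokens = torch.randn(1, 256, 768)          # hypothetical 16x16 patch grid
print(shuffle_tokens(tokens).shape)        # torch.Size([1, 64, 3072])
```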
The proprioceptive state is projected into a single token via a linear layer; the visual, language, and state tokens are then concatenated and passed to the language decoder.
They also use a linear projection to align the VLM features with the action expert's hidden dimension.
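A minimal sketch of these two projections and the prefix concatenation; the layer names, dimensions, and token counts below are hypothetical, not the actual LeRobot code:

```python
import torch
import torch.nn as nn

vlm_dim, expert_dim, state_dim = 960, 576, 14   # hypothetical sizes

state_proj = nn.Linear(state_dim, vlm_dim)      # proprioception -> one token
expert_proj = nn.Linear(vlm_dim, expert_dim)    # VLM features -> expert width

state = torch.randn(1, state_dim)
state_token = state_proj(state).unsqueeze(1)    # (B, 1, vlm_dim)

visual_tokens = torch.randn(1, 64, vlm_dim)
lang_tokens = torch.randn(1, 12, vlm_dim)

# prefix fed to the language decoder: [visual | language | state]
prefix = torch.cat([visual_tokens, lang_tokens, state_token], dim=1)

vlm_features = prefix                           # stand-in for decoder output
expert_kv = expert_proj(vlm_features)           # conditioning for the expert
print(prefix.shape, expert_kv.shape)
```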
For faster inference through layer skipping, they find that using features from half of the VLM's layers offers a good tradeoff between speed and performance.
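Layer skipping can be sketched as simply truncating the decoder stack; the stand-in blocks and sizes here are illustrative only:

```python
import torch
import torch.nn as nn

layers = nn.ModuleList(nn.Linear(64, 64) for _ in range(32))  # stand-in blocks
hidden = torch.randn(1, 10, 64)

# use features from only the first half of the stack
for layer in layers[: len(layers) // 2]:
    hidden = layer(hidden)
```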
They use a conditional flow matching Transformer as the action expert.
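A hedged sketch of conditional flow matching for action chunks, using the common linear-interpolation formulation: train the expert to predict the velocity along a straight path between actions and noise, then integrate from noise back to actions at inference. The `expert(x, t, cond)` signature and the time schedule are assumptions; SmolVLA's exact interface may differ:

```python
import torch

def flow_matching_loss(expert, actions, cond):
    """Regress the velocity (eps - actions) along the straight path
    x_t = t * eps + (1 - t) * actions, conditioned on VLM features."""
    eps = torch.randn_like(actions)
    t = torch.rand(actions.shape[0], 1, 1)      # per-sample time in [0, 1]
    x_t = t * eps + (1 - t) * actions
    v_pred = expert(x_t, t, cond)               # hypothetical signature
    return ((v_pred - (eps - actions)) ** 2).mean()

@torch.no_grad()
def sample_actions(expert, cond, horizon, action_dim, steps=10):
    """Euler integration of the learned flow from noise (t=1) to data (t=0)."""
    x = torch.randn(cond.shape[0], horizon, action_dim)
    dt = 1.0 / steps
    for i in range(steps):
        t = torch.full((cond.shape[0], 1, 1), 1.0 - i * dt)
        x = x - dt * expert(x, t, cond)
    return x
```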
Code Structure
Policy
The policy has a three-level hierarchical design (a skeleton sketch follows the list):
- `SmolVLAPolicy`: top level, policy interface
  - Input/output normalization
  - Action selection and queuing
  - Batch preparation (images, language, state)
  - Framework integration (checkpointing, evaluation)
- `VLAFlowMatching`: middle level, flow matching logic
  - Flow matching forward pass (training)
  - Action sampling (inference)
  - Embedding preparation (prefix/suffix)
  - Noise sampling and denoising steps
- `SmolVLMWithExpert`: bottom level, backbone and action expert
  - Vision and language embedding
  - Cross-attention between VLM and expert
  - Multi-layer transformer processing
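A skeleton of how the three levels delegate to each other; the method names mirror the roles listed above but are simplified stand-ins, not the exact LeRobot signatures:

```python
class SmolVLMWithExpert:
    def forward(self, images, lang_tokens, state, noisy_actions, t):
        # embed vision + language, run the interleaved self-/cross-attention
        # stack, and return the expert's velocity prediction
        ...

class VLAFlowMatching:
    def __init__(self, model: SmolVLMWithExpert):
        self.model = model

    def loss(self, batch):
        # sample noise and time, build prefix/suffix embeddings,
        # and regress the flow matching velocity target
        ...

    def sample_actions(self, batch):
        # iterative denoising: repeatedly call self.model to integrate the flow
        ...

class SmolVLAPolicy:
    def __init__(self, flow: VLAFlowMatching):
        self.flow = flow
        self.action_queue = []

    def select_action(self, obs):
        # normalize inputs, refill the queue via flow.sample_actions,
        # then pop and unnormalize one action
        ...
```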