SmolVLA is a compact and efficient VLA model that skips layers in the VLM, uses a minimal number of visual tokens, leverages a small pretrained VLM, and interleaves self-attention layers with lighter cross-attention layers.
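
To make the interleaving idea concrete, here is a minimal sketch, not SmolVLA's actual code: an expert stack whose layers alternate between cross-attending to VLM features and self-attending over the expert's own tokens (the dims, layer count, and even/odd split are illustrative assumptions).

```python
import torch
import torch.nn as nn

class InterleavedExpertBlock(nn.Module):
    """One expert layer: cross-attends to VLM features when use_cross is
    True, otherwise self-attends over the expert's own tokens."""
    def __init__(self, dim: int, n_heads: int, use_cross: bool):
        super().__init__()
        self.use_cross = use_cross
        self.attn = nn.MultiheadAttention(dim, n_heads, batch_first=True)
        self.norm1 = nn.LayerNorm(dim)
        self.norm2 = nn.LayerNorm(dim)
        self.mlp = nn.Sequential(
            nn.Linear(dim, 4 * dim), nn.GELU(), nn.Linear(4 * dim, dim)
        )

    def forward(self, x, vlm_feats):
        # Keys/values come from the VLM on cross-attention layers,
        # from the expert tokens themselves on self-attention layers.
        kv = vlm_feats if self.use_cross else x
        attn_out, _ = self.attn(self.norm1(x), kv, kv)
        x = x + attn_out
        return x + self.mlp(self.norm2(x))

# Alternate cross-attention (even layers) with self-attention (odd layers).
layers = nn.ModuleList(
    InterleavedExpertBlock(dim=256, n_heads=8, use_cross=(i % 2 == 0))
    for i in range(8)
)
x = torch.randn(1, 50, 256)          # expert (action) tokens
vlm_feats = torch.randn(1, 89, 256)  # prefix features from the VLM
for layer in layers:
    x = layer(x, vlm_feats)
```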

It uses SmolVLM-2 as its backbone, which relies on SigLIP to encode visual features for the SmolLM2 language decoder and reduces the visual token count through a token-shuffling technique for efficiency.
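
The token-shuffling trick folds an (H, W) grid of visual tokens into an (H/r, W/r) grid with r² times wider channels, cutting the token count by r². A hedged sketch of the generic operation (the function name and sizes are illustrative, not SmolVLM's code):

```python
import torch

def shuffle_tokens(tokens: torch.Tensor, h: int, w: int, r: int = 2) -> torch.Tensor:
    """tokens: (batch, h*w, dim) -> (batch, (h//r)*(w//r), dim*r*r)."""
    b, n, d = tokens.shape
    x = tokens.view(b, h, w, d)
    # Split each spatial axis into (coarse, fine) factors ...
    x = x.view(b, h // r, r, w // r, r, d)
    # ... then move the fine factors into the channel dimension.
    x = x.permute(0, 1, 3, 2, 4, 5).reshape(b, (h // r) * (w // r), d * r * r)
    return x

tokens = torch.randn(1, 32 * 32, 768)        # 1024 visual tokens
print(shuffle_tokens(tokens, 32, 32).shape)  # torch.Size([1, 256, 3072])
```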

The proprioceptive states are projected into a single token via a linear layer; the visual, language, and state tokens are then concatenated and passed to the language decoder.
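
A minimal sketch of that prefix construction, assuming hypothetical dimensions and token counts:

```python
import torch
import torch.nn as nn

vlm_dim, state_dim = 960, 14                # illustrative sizes
state_proj = nn.Linear(state_dim, vlm_dim)  # maps the state to one token

img_tokens = torch.randn(1, 64, vlm_dim)    # shuffled visual tokens
lang_tokens = torch.randn(1, 24, vlm_dim)   # embedded instruction tokens
state = torch.randn(1, state_dim)           # e.g. joint angles and gripper

state_token = state_proj(state).unsqueeze(1)               # (1, 1, 960)
prefix = torch.cat([img_tokens, lang_tokens, state_token], dim=1)
print(prefix.shape)  # torch.Size([1, 89, 960]), fed to the decoder
```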

They also use a linear projection to align the VLM features with the action expert's hidden dimension.
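
Sketching this connector under assumed widths (the variable names and sizes are illustrative, not the repository's):

```python
import torch.nn as nn

vlm_dim, expert_dim = 960, 576              # assumed hidden sizes
vlm_to_expert = nn.Linear(vlm_dim, expert_dim)
# VLM hidden states are projected before the expert attends to them:
# kv = vlm_to_expert(vlm_hidden_states)
```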

For faster inference through layer skipping, they find that the features taken at half the VLM's total depth offer a good tradeoff between speed and performance.
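
A sketch of that skipping, assuming `vlm_layers` is a plain list of transformer blocks that map hidden states to hidden states (real HF decoder layers return tuples, so this is schematic):

```python
import torch.nn as nn

def encode_prefix(hidden, vlm_layers: nn.ModuleList):
    n_used = len(vlm_layers) // 2   # stop at half the total depth
    for layer in vlm_layers[:n_used]:
        hidden = layer(hidden)
    return hidden                   # features consumed by the action expert
```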

They use a conditional flow matching Transformer as the action expert.
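
A hedged sketch of a conditional flow matching training step: interpolate between noise and ground-truth actions, then regress the velocity along that path. This follows the generic rectified-flow recipe and an assumed `expert(x_t, t, cond)` signature, not necessarily SmolVLA's exact parameterization.

```python
import torch

def flow_matching_loss(expert, prefix_feats, actions):
    # actions: (batch, horizon, action_dim)
    b = actions.shape[0]
    t = torch.rand(b, 1, 1)                       # time in [0, 1]
    noise = torch.randn_like(actions)
    x_t = t * noise + (1 - t) * actions           # linear interpolation path
    target_velocity = noise - actions             # d x_t / d t
    pred = expert(x_t, t, prefix_feats)           # conditioned on VLM features
    return ((pred - target_velocity) ** 2).mean()
```

At inference, actions are sampled by starting from pure noise and integrating the predicted velocity field back toward t = 0 over a few denoising steps.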

Code Structure

Policy

The policy follows a three-level hierarchical design (see the skeleton after this list):

  1. SmolVLAPolicy: Top level, policy interface

    • Input/output normalization
    • Action selection and queuing
    • Batch preparation (images, language, state)
    • Framework integration (checkpointing, evaluation)
  2. VLAFlowMatching: Middle level, flow matching logic

    • Flow matching forward pass (training)
    • Action sampling (inference)
    • Embedding preparation (prefix/suffix)
    • Noise sampling and denoising steps
  3. SmolVLMWithExpert: Bottom level, VLM backbone and action expert

    • Vision and language embedding
    • Cross-attention between VLM and expert
    • Multi-layer transformer processing
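
An illustrative skeleton of how the three levels compose. The class names come from the codebase, but the method names and bodies here are assumptions sketched from the responsibilities listed above, not the real LeRobot source:

```python
import torch.nn as nn

class SmolVLMWithExpert(nn.Module):      # bottom: backbone + expert
    def forward(self, prefix_embs, suffix_embs):
        ...  # run VLM layers; expert cross-attends to VLM features

class VLAFlowMatching(nn.Module):        # middle: flow matching logic
    def __init__(self):
        super().__init__()
        self.model = SmolVLMWithExpert()
    def loss(self, batch):
        ...  # embed prefix/suffix, sample noise and time, regress velocity
    def sample_actions(self, batch):
        ...  # iterative denoising from pure noise at inference

class SmolVLAPolicy(nn.Module):          # top: user-facing policy API
    def __init__(self):
        super().__init__()
        self.flow = VLAFlowMatching()
    def select_action(self, obs):
        ...  # normalize inputs, pop from the action queue, denormalize
```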