- Where the robot is tells us what to look at: joint instruction-proprioception guidance selects task-relevant visual patches.
- Latency drops from 52 ms to 22 ms by retaining only 15% of visual tokens, while maintaining or improving task performance.
- 4.55 average completed sequence length on CALVIN ABC→D, with 82.1% success on 5-task chains (LH-5).
- Discretizing proprioception into VLM vocabulary tokens outperforms MLP projection by leveraging the pretrained token embedding space.
ThinkProprio introduces proprioception as a first-class modality in VLA pipelines. We text-tokenize the robot's proprioceptive state (joint angles, end-effector pose) by discretizing continuous values into bins and mapping them to VLM vocabulary tokens.
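For illustration, here is a minimal sketch of this tokenization step, assuming uniform bins and a `<bin_k>` text rendering; the bin count, value ranges, and string format are illustrative choices, not the exact configuration used:

```python
import numpy as np

def tokenize_proprio(state, low, high, n_bins=256):
    """Discretize a continuous proprioceptive state vector into bin
    indices, then render them as text so a VLM tokenizer can map them
    onto its pretrained vocabulary. Bin count and ranges are assumed.
    """
    state = np.clip(state, low, high)
    # Uniform binning: map each dimension to an integer in [0, n_bins-1].
    bins = ((state - low) / (high - low) * (n_bins - 1)).round().astype(int)
    # Render as text, e.g. "<bin_17> <bin_203> ..."; the VLM tokenizer
    # then embeds these strings with its existing token embedding table.
    return " ".join(f"<bin_{b}>" for b in bins)

# Example: 7 joint angles (rad) plus a 1-D gripper state (assumed layout).
low  = np.array([-np.pi] * 7 + [0.0])
high = np.array([ np.pi] * 7 + [1.0])
state = np.array([0.1, -0.5, 1.2, -1.8, 0.3, 0.9, -0.2, 0.8])
print(tokenize_proprio(state, low, high))
```

Because the bins are rendered as text, the VLM consumes proprioception through its pretrained embedding space, which is the property the highlights above credit for outperforming an MLP projection.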
The tokenized proprioception, combined with the task instruction, guides physically grounded token selection: a cross-attention mechanism scores visual patches based on their relevance to both the instruction and the robot's current configuration. A vote-based selection retains only the most relevant patches, with a global context token preserving coarse scene information.
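A minimal sketch of this selection step, assuming scaled dot-product attention, a sum-over-guidance-tokens voting rule, and a mean-pooled global token; the dimensions and keep ratio are illustrative:

```python
import torch
import torch.nn.functional as F

def select_patches(visual, guidance, keep_ratio=0.15):
    """Score visual patches by cross-attention against instruction +
    proprioception tokens, then keep only the top-scoring fraction.

    visual:   (N, D) patch embeddings
    guidance: (M, D) instruction + tokenized-proprioception embeddings
    The voting rule (each guidance token 'votes' via its attention
    distribution; votes are summed) is an illustrative assumption.
    """
    d = visual.shape[-1]
    attn = F.softmax(guidance @ visual.T / d**0.5, dim=-1)  # (M, N)
    votes = attn.sum(dim=0)                                 # (N,)
    k = max(1, int(keep_ratio * visual.shape[0]))
    idx = votes.topk(k).indices
    # A single global context token (mean pool) preserves coarse
    # scene information alongside the selected patches.
    global_tok = visual.mean(dim=0, keepdim=True)
    return torch.cat([global_tok, visual[idx]], dim=0)

tokens = select_patches(torch.randn(100, 512), torch.randn(24, 512))
print(tokens.shape)  # torch.Size([16, 512]) -> 1 global + 15 patches
```

With 100 input patches and a 15% keep ratio, this yields one global token plus 15 selected patches, matching the token budget reported in the efficiency table below.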
The compact token set is processed by the VLM, and a flow-matching action head generates continuous action chunks via cross-attention over the fused features.
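A minimal sketch of such an action head, assuming Euler integration of a learned velocity field conditioned through a single cross-attention layer; the dimensions, horizon, and step count are illustrative:

```python
import torch
import torch.nn as nn

class FlowActionHead(nn.Module):
    """Minimal flow-matching action head: a velocity network that
    cross-attends to fused VLM features, integrated with Euler steps.
    """
    def __init__(self, act_dim=7, horizon=8, feat_dim=512, hid=256):
        super().__init__()
        self.proj = nn.Linear(act_dim + 1, hid)   # action + flow time
        self.attn = nn.MultiheadAttention(hid, 4, kdim=feat_dim,
                                          vdim=feat_dim, batch_first=True)
        self.out = nn.Linear(hid, act_dim)
        self.horizon = horizon
        self.act_dim = act_dim

    def velocity(self, a_t, t, feats):
        # Condition each action token on the fused VLM features.
        h = self.proj(torch.cat([a_t, t.expand(*a_t.shape[:2], 1)], -1))
        h, _ = self.attn(h, feats, feats)
        return self.out(h)

    @torch.no_grad()
    def sample(self, feats, steps=10):
        # Integrate noise -> action chunk along the learned flow.
        a = torch.randn(feats.shape[0], self.horizon, self.act_dim)
        for i in range(steps):
            t = torch.full((1,), i / steps)
            a = a + self.velocity(a, t, feats) / steps
        return a

head = FlowActionHead()
actions = head.sample(torch.randn(1, 16, 512))  # (1, 8, 7) action chunk
print(actions.shape)
```

At training time the velocity network would typically be regressed toward the displacement between noise and ground-truth action chunks; only the sampling path is sketched here.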

CALVIN ABC→D: success rate (%) on chains of 1–5 sequential subtasks (LH-1 to LH-5) and average completed sequence length (Avg. Len.).

| Method | LH-1 | LH-2 | LH-3 | LH-4 | LH-5 | Avg. Len. |
|---|---|---|---|---|---|---|
| OpenVLA | 91.3 | 77.8 | 62.0 | 52.1 | 43.5 | 3.27 |
| GR-1 | 85.4 | 71.2 | 59.6 | 49.7 | 40.1 | 3.06 |
| π0 | 70.0 | 48.0 | 37.0 | 28.0 | 18.0 | 2.01 |
| π0.5 | 71.0 | 56.0 | 45.0 | 37.0 | 29.0 | 2.38 |
| VPP | 95.7 | 91.2 | 86.3 | 81.0 | 75.0 | 4.29 |
| Seer | 96.3 | 91.6 | 86.1 | 80.3 | 74.0 | 4.29 |
| FLOWER | 99.3 | 96.0 | 90.3 | 82.3 | 75.5 | 4.44 |
| ThinkProprio (Ours) | 97.7 | 96.1 | 92.2 | 86.7 | 82.1 | 4.55 |

CALVIN ABCD→D: success rate (%) on chains of 1–5 sequential subtasks and average completed sequence length.

| Method | LH-1 | LH-2 | LH-3 | LH-4 | LH-5 | Avg. Len. |
|---|---|---|---|---|---|---|
| Diff-P-CNN | 86.3 | 72.7 | 60.1 | 51.2 | 41.7 | 3.16 |
| RoboFlamingo | 96.4 | 89.6 | 82.4 | 74.0 | 66.0 | 4.09 |
| GR-1 | 94.9 | 89.6 | 84.4 | 78.9 | 73.1 | 4.21 |
| FLOWER | 98.9 | 96.7 | 93.9 | 90.2 | 85.5 | 4.62 |
| FLOWER† | 99.2 | 96.9 | 96.9 | 92.3 | 88.3 | 4.67 |
| ThinkProprio (Ours) | 99.5 | 97.2 | 96.6 | 92.3 | 88.5 | 4.74 |

Additional CALVIN evaluation setting: success rate (%) and average completed sequence length.

| Method | LH-1 | LH-2 | LH-3 | LH-4 | LH-5 | Avg. Len. |
|---|---|---|---|---|---|---|
| MDT | 93.7 | 84.5 | 74.1 | 64.4 | 55.6 | 3.72 |
| RoboUniView | 96.2 | 88.8 | 77.6 | 66.6 | 56.3 | 3.85 |
| FLOWER† | 97.4 | 92.4 | 86.9 | 81.3 | 74.9 | 4.35 |
| ThinkProprio (Ours) | 96.9 | 89.8 | 83.6 | 80.5 | 72.7 | 4.23 |

LIBERO: success rate (%) on the Spatial, Object, Goal, and Long suites, with the overall average.

| Method | Spatial | Object | Goal | Long | Avg. |
|---|---|---|---|---|---|
| OpenVLA | 84.7 | 88.4 | 79.2 | 53.7 | 76.5 |
| OpenVLA-OFT | 97.6 | 98.4 | 97.9 | 94.5 | 97.1 |
| π0 | 96.8 | 98.8 | 95.8 | 85.2 | 94.2 |
| FLOWER | 97.5 | 99.1 | 96.1 | 94.9 | 96.9 |
| LightVLA | 98.4 | 98.4 | 98.2 | 94.6 | 97.4 |
| ThinkProprio (Ours) | 97.6 | 98.4 | 98.0 | 95.2 | 97.3 |

Inference efficiency: retained visual tokens, latency, and VRAM, alongside average completed sequence length on CALVIN ABC→D.

| Method | Visual Tokens | Latency (ms) | VRAM (MB) | Avg. Len. |
|---|---|---|---|---|
| OpenVLA | 256 | 164 | 14574 | 3.27 |
| π0 | 256 | 104 | 6692 | 2.01 |
| FLOWER | 100 | 52 | 1848 | 4.44 |
| ThinkProprio (Ours) | 15 | 22 | 1899 | 4.55 |
Long-horizon task execution (5 sequential subtasks per trajectory) with token selection visualization.
Token retention across four timesteps for two tasks. Heatmaps visualize which visual patches are selected based on joint instruction-proprioception guidance. The selection shifts between object-centric and proprioception-centric focus as the task progresses.

```bibtex
@article{wang2025thinkproprio,
  title   = {Think Proprioceptively: Embodied Visual Reasoning for VLA Manipulation},
  author  = {Wang, Fangyuan and Zhou, Peng and Qi, Jiaming and Lyu, Shipeng and Navarro-Alarcon, David and Guo, Guodong},
  journal = {arXiv preprint},
  year    = {2025}
}
```
This work builds upon several excellent open-source projects: FLOWER, CALVIN, and LIBERO.
Website template inspired by Diffusion Policy.