Think Proprioceptively

Embodied Visual Reasoning for VLA Manipulation

Fangyuan Wang1,2 Peng Zhou3 Jiaming Qi4 Shipeng Lyu1,2 David Navarro-Alarcon1 Guodong Guo2
1The Hong Kong Polytechnic University 2Eastern Institute of Technology 3Great Bay University 4Northeast Forestry University
ThinkProprio overview showing proprioception-guided token selection

ThinkProprio tokenizes proprioception into the VLM's token vocabulary to guide early visual reasoning. This yields strong CALVIN/LIBERO performance while retaining only ~15% of visual tokens, with 58% lower latency than prior VLA policies.

Highlights


Proprioceptive Token Selection

Where the robot is tells us what to look at. Joint instruction-proprioception guidance selects task-relevant visual patches.


58% Lower Inference Latency

Latency drops from 52ms to 22ms by retaining only 15% of visual tokens while maintaining or improving performance.


CALVIN State-of-the-Art

4.55 average completion length on ABC→D with 82.1% success on 5-task chains (LH-5).


Text-Tokenized Proprioception

Discretizing proprioception into VLM tokens outperforms MLP projection, leveraging the pretrained token embedding space.

4.55 Avg. Completion Length (CALVIN ABC→D)
58% Latency Reduction (52ms → 22ms)
15% Visual Tokens Retained (15 of 100 patches)
97.3% LIBERO Average (across all suites)

Method

ThinkProprio architecture

ThinkProprio introduces proprioception as a first-class modality in VLA pipelines. We text-tokenize the robot's proprioceptive state (joint angles, end-effector pose) by discretizing continuous values into bins and mapping them to VLM vocabulary tokens.
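
As a rough illustration of this tokenization step, here is a minimal Python sketch that uniformly bins a proprioceptive state vector and maps each bin to a reserved token string; the bin count, value ranges, and <proprio_k> token naming are illustrative assumptions, not the exact design used in the paper.

    import numpy as np

    # Illustrative sketch of text-tokenized proprioception. The bin count, value
    # ranges, and <proprio_k> token strings are assumptions, not the paper's choices.
    N_BINS = 256

    def tokenize_proprio(state, low, high, n_bins=N_BINS):
        """Discretize a continuous proprioceptive vector into one token per dimension."""
        state = np.clip(state, low, high)
        # Uniform binning of each dimension into [0, n_bins - 1].
        bins = np.round((state - low) / (high - low + 1e-8) * (n_bins - 1)).astype(int)
        # Each bin index maps to a reserved entry in the VLM vocabulary.
        return [f"<proprio_{b}>" for b in bins]

    # Example: 7 joint angles (rad) plus a 1-D gripper opening (m).
    low = np.array([-np.pi] * 7 + [0.0])
    high = np.array([np.pi] * 7 + [0.08])
    state = np.array([0.1, -0.5, 1.2, 0.0, 0.8, -1.1, 0.3, 0.04])
    print(tokenize_proprio(state, low, high))

The resulting token strings can be fed to the VLM alongside the instruction, leveraging its pretrained token embedding space instead of a separate MLP projector.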

The tokenized proprioception, combined with the task instruction, guides physically grounded token selection: a cross-attention mechanism scores visual patches based on their relevance to both the instruction and the robot's current configuration. A vote-based selection retains only the most relevant patches, with a global context token preserving coarse scene information.
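
A minimal PyTorch sketch of this selection step is shown below; tensor shapes, the top-k voting rule, and the mean-pooled global token are illustrative assumptions rather than the exact implementation.

    import torch
    import torch.nn.functional as F

    def select_visual_tokens(visual, guidance, keep=15):
        """Illustrative guidance-based patch selection (voting rule assumed).

        visual:   (B, N, D) visual patch embeddings, e.g. N = 100.
        guidance: (B, Q, D) embeddings of instruction + proprioception tokens.
        keep:     number of patches to retain (15 of 100 in the reported setting).
        """
        B, N, D = visual.shape
        # Cross-attention scores: every guidance token scores every visual patch.
        attn = torch.einsum("bqd,bnd->bqn", guidance, visual).div(D ** 0.5).softmax(dim=-1)
        # Vote-based selection: each guidance token votes for its top patches,
        # and patches are ranked by accumulated votes.
        votes = F.one_hot(attn.topk(keep, dim=-1).indices, N).sum(dim=(1, 2))   # (B, N)
        top_idx = votes.topk(keep, dim=-1).indices                              # (B, keep)
        selected = torch.gather(visual, 1, top_idx.unsqueeze(-1).expand(-1, -1, D))
        # Global context token: mean over all patches keeps coarse scene information.
        global_tok = visual.mean(dim=1, keepdim=True)
        return torch.cat([global_tok, selected], dim=1)                         # (B, keep + 1, D)

    # 100 patches and 12 guidance tokens -> 16 tokens (1 global + 15 selected).
    print(select_visual_tokens(torch.randn(2, 100, 512), torch.randn(2, 12, 512)).shape)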

The compact token set is processed by the VLM, and a flow-matching action head generates continuous action chunks via cross-attention over the fused features.
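
The sketch below shows one way such a head can be wired: noisy action chunks cross-attend to the fused features, and the learned velocity field is integrated with a few Euler steps at inference. Layer sizes, the time conditioning, and the sampler are assumptions for illustration, not the paper's exact architecture.

    import torch
    import torch.nn as nn

    class FlowMatchingActionHead(nn.Module):
        # Illustrative flow-matching head; sizes, conditioning, and the Euler
        # sampler are assumptions, not the exact design.
        def __init__(self, act_dim=7, chunk=10, d_model=512, n_heads=8):
            super().__init__()
            self.proj_in = nn.Linear(act_dim + 1, d_model)   # noisy action + time t
            self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
            self.proj_out = nn.Linear(d_model, act_dim)      # predicted velocity
            self.chunk, self.act_dim = chunk, act_dim

        def velocity(self, a_t, t, context):
            # a_t: (B, chunk, act_dim); t: (B, 1, 1); context: (B, L, d_model) fused features.
            h = self.proj_in(torch.cat([a_t, t.expand(-1, self.chunk, 1)], dim=-1))
            h, _ = self.attn(h, context, context)            # cross-attention over fused features
            return self.proj_out(h)

        @torch.no_grad()
        def sample(self, context, steps=10):
            # Integrate the velocity field from Gaussian noise to an action chunk.
            a = torch.randn(context.shape[0], self.chunk, self.act_dim, device=context.device)
            for i in range(steps):
                t = torch.full((context.shape[0], 1, 1), i / steps, device=context.device)
                a = a + self.velocity(a, t, context) / steps
            return a

    # 16 fused tokens (1 global + 15 selected) conditioning a 10-step action chunk.
    print(FlowMatchingActionHead().sample(torch.randn(2, 16, 512)).shape)  # torch.Size([2, 10, 7])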

Results

CALVIN ABC→D

Method LH-1 LH-2 LH-3 LH-4 LH-5 Avg. Len.
OpenVLA 91.3 77.8 62.0 52.1 43.5 3.27
GR-1 85.4 71.2 59.6 49.7 40.1 3.06
π0 70.0 48.0 37.0 28.0 18.0 2.01
π0.5 71.0 56.0 45.0 37.0 29.0 2.38
VPP 95.7 91.2 86.3 81.0 75.0 4.29
Seer 96.3 91.6 86.1 80.3 74.0 4.29
FLOWER 99.3 96.0 90.3 82.3 75.5 4.44
ThinkProprio (Ours) 97.7 96.1 92.2 86.7 82.1 4.55

CALVIN ABCD→D

Method LH-1 LH-2 LH-3 LH-4 LH-5 Avg. Len.
Diff-P-CNN 86.3 72.7 60.1 51.2 41.7 3.16
RoboFlamingo 96.4 89.6 82.4 74.0 66.0 4.09
GR-1 94.9 89.6 84.4 78.9 73.1 4.21
FLOWER 98.9 96.7 93.9 90.2 85.5 4.62
FLOWER† 99.2 96.9 96.9 92.3 88.3 4.67
ThinkProprio (Ours) 99.5 97.2 96.6 92.3 88.5 4.74

CALVIN D→D

Method LH-1 LH-2 LH-3 LH-4 LH-5 Avg. Len.
MDT 93.7 84.5 74.1 64.4 55.6 3.72
RoboUniView 96.2 88.8 77.6 66.6 56.3 3.85
FLOWER† 97.4 92.4 86.9 81.3 74.9 4.35
ThinkProprio (Ours) 96.9 89.8 83.6 80.5 72.7 4.23

LIBERO Benchmark Suites

Method Spatial Object Goal Long Avg.
OpenVLA 84.7 88.4 79.2 53.7 76.5
OpenVLA-OFT 97.6 98.4 97.9 94.5 97.1
π0 96.8 98.8 95.8 85.2 94.2
FLOWER 97.5 99.1 96.1 94.9 96.9
LightVLA 98.4 98.4 98.2 94.6 97.4
ThinkProprio (Ours) 97.6 98.4 98.0 95.2 97.3

Inference Efficiency (CALVIN ABC→D)

Method Visual Tokens Latency (ms) VRAM (MB) Avg. Len.
OpenVLA 256 164 14574 3.27
π0 256 104 6692 2.01
FLOWER 100 52 1848 4.44
ThinkProprio (Ours) 15 22 1899 4.55

Video Demonstrations

Long-horizon task execution (5 sequential subtasks per trajectory) with token selection visualization.

Trajectory 1: Object Manipulation


Trajectory 2: Environmental Interaction


Token Selection Visualization

Token retention heatmaps across timesteps

Token retention across four timesteps for two tasks. Heatmaps visualize which visual patches are selected based on joint instruction-proprioception guidance. The selection shifts between object-centric and proprioception-centric focus as the task progresses.

Citation

@article{wang2025thinkproprio,
  title={Think Proprioceptively: Embodied Visual Reasoning for VLA Manipulation},
  author={Wang, Fangyuan and Zhou, Peng and Qi, Jiaming and Lyu, Shipeng and Navarro-Alarcon, David and Guo, Guodong},
  journal={arXiv preprint},
  year={2025}
}

Acknowledgements

This work builds upon several excellent open-source projects: FLOWER, CALVIN, and LIBERO.

Website template inspired by Diffusion Policy.