MEM: Multi-Scale Embodied Memory for Vision Language Action Models
Built on top of $\pi_{0.6}$: A VLA That Learns From Experience.
MEM factorizes policy memory into two levels:
- Short-term memory: a video encoder compresses a recent sequence of observations so the policy can handle occlusion, timing, local dynamics, and failure recovery.
- Long-term memory: a high-level policy maintains a natural-language summary of past semantic events, e.g. which substeps of a recipe are done.
So instead of throwing the whole history into one huge context window, MEM stores different timescales in different formats.
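The two-level split can be pictured as a small container: a bounded buffer of recent frames plus a text summary that is rewritten rather than grown. This is only a minimal sketch; the class name `TwoLevelMemory`, its fields, and the window length are hypothetical and not taken from MEM.

```python
from collections import deque
from dataclasses import dataclass, field


@dataclass
class TwoLevelMemory:
    """Hypothetical container; names are illustrative, not from MEM."""

    max_frames: int = 4  # assumed short-term window length
    frames: deque = field(default_factory=deque)  # short-term: recent frames
    summary: str = ""  # long-term: natural-language summary

    def observe(self, frame):
        # Short-term memory keeps only the most recent `max_frames` frames.
        self.frames.append(frame)
        while len(self.frames) > self.max_frames:
            self.frames.popleft()

    def update_summary(self, new_summary):
        # Long-term memory is replaced by an updated summary instead of
        # appending every past event to an ever-growing log.
        self.summary = new_summary


memory = TwoLevelMemory(max_frames=3)
for t in range(5):
    memory.observe(f"frame_{t}")
memory.update_summary("Step 1 done: chopped onions.")
memory.update_summary("Steps 1-2 done: chopped onions, heated pan.")
```

After this run the short-term buffer holds only the last three frames, while the long-term summary stays a single short string regardless of how many steps have passed.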
Efficient ViT-based video encoder for short-horizon memory.
- Patchify each frame separately.
- Add a sinusoidal temporal embedding with $e(0)=0$ so the current frame matches the original pretrained single-image ViT behavior.
- Every 4th ViT layer adds causal temporal attention across timesteps for the same patch location.
- Attention is factorized into space and time instead of full spatiotemporal attention.
- Past-frame patch tokens are dropped in upper layers, so only the current-frame representation is passed downstream.
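The temporal half of the steps above can be sketched in plain Python for a single patch location. This is a toy illustration under stated assumptions, not the authors' implementation. The `(sin, cos - 1)` construction is one assumed way to obtain $e(0)=0$. The attention uses the raw tokens as queries, keys, and values, with no learned projections. The function names are hypothetical.

```python
import math


def temporal_embedding(offset, dim):
    """Sinusoidal-style embedding shifted so that e(0) is all zeros.

    `offset` is how many steps a frame lies in the past (0 = current frame).
    The (sin, cos - 1) pairing is an assumption of this sketch; `dim` is
    assumed to be even.
    """
    emb = []
    for i in range(dim // 2):
        freq = 1.0 / (10000 ** (2 * i / dim))
        emb.append(math.sin(offset * freq))
        emb.append(math.cos(offset * freq) - 1.0)
    return emb


def causal_temporal_attention(tokens):
    """Attention over time for ONE patch location.

    `tokens[t]` is that patch's feature vector at timestep t, oldest first.
    Timestep t attends only to timesteps <= t (causal mask).
    """
    dim = len(tokens[0])
    out = []
    for t, query in enumerate(tokens):
        visible = tokens[: t + 1]  # no access to future timesteps
        scores = [
            sum(q * k for q, k in zip(query, key)) / math.sqrt(dim)
            for key in visible
        ]
        peak = max(scores)  # subtract the max for a numerically stable softmax
        weights = [math.exp(s - peak) for s in scores]
        total = sum(weights)
        out.append([
            sum(w * v[d] for w, v in zip(weights, visible)) / total
            for d in range(dim)
        ])
    return out


# Three timesteps of one patch; the last entry is the current frame.
patch_over_time = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
num_frames = len(patch_over_time)
embedded = [
    [x + e for x, e in zip(tok, temporal_embedding(num_frames - 1 - t, dim=2))]
    for t, tok in enumerate(patch_over_time)
]
mixed = causal_temporal_attention(embedded)
current_frame_token = mixed[-1]  # only the current frame is passed downstream
```

Because the current frame has offset 0, its token is unchanged by the embedding, which is what lets a single-frame input behave like the original image encoder. Keeping only `mixed[-1]` mirrors dropping past-frame tokens in the upper layers.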
Complexity, for $K$ frames with $n$ patch tokens each:
- naive joint space-time attention: ~$O(K^2 n^2)$
- their factorized attention: ~$O(K n^2 + n K^2)$
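To get a feel for the gap, the two expressions can be evaluated by counting query-key pairs, reading $K$ as the number of frames and $n$ as the number of patch tokens per frame. The values `K = 6` and `n = 256` below are illustrative choices, not figures from the paper, and the count ignores constant factors such as causal masking.

```python
def joint_attention_pairs(num_frames, patches_per_frame):
    # Every token attends to every token across all frames: (K * n)^2.
    tokens = num_frames * patches_per_frame
    return tokens ** 2


def factorized_attention_pairs(num_frames, patches_per_frame):
    # Spatial attention within each frame: K * n^2.
    spatial = num_frames * patches_per_frame ** 2
    # Temporal attention across frames for each patch location: n * K^2.
    temporal = patches_per_frame * num_frames ** 2
    return spatial + temporal


K, n = 6, 256  # illustrative values only
joint = joint_attention_pairs(K, n)
factorized = factorized_attention_pairs(K, n)
print(joint, factorized, round(joint / factorized, 2))
# prints: 2359296 402432 5.86
```

Since $n$ is typically much larger than $K$, the $K n^2$ term dominates, so the factorized cost grows roughly linearly in the number of frames rather than quadratically.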
Why is naive memory not enough?
- Just feed more history: too expensive for real-time robot control.
- Just append all past text: causes train-test mismatch, since inference may contain repeated failed attempts that are rare in demonstrations.