FF's Notes

MEM: Multi-Scale Embodied Memory for Vision Language Action Models

Jun 1, 2026

Built on top of $\pi_{0.6}$: A VLA That Learns From Experience.

MEM factorizes policy memory into two levels:

  1. Short-term memory: a video encoder compresses a recent sequence of observations so the policy can handle occlusion, timing, local dynamics, and failure recovery.
  2. Long-term memory: a high-level policy maintains a natural-language summary of past semantic events, e.g. which substeps of a recipe are done.

So instead of throwing the whole history into one huge context window, MEM stores different timescales in different formats.


Efficient ViT-based video encoder for short-horizon memory:

  • Patchify each frame separately.
  • Add a sinusoidal temporal embedding with e(0)=0 so the current frame matches the original pretrained single-image ViT behavior.
  • Every 4th ViT layer adds causal temporal attention across timesteps for the same patch location.
  • Attention is factorized into space and time instead of full spatiotemporal attention.
  • Past-frame patch tokens are dropped in upper layers, so only the current-frame representation is passed downstream.
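The steps above can be sketched in NumPy. This is a minimal illustration under stated assumptions, not the authors' implementation: the sine-only embedding is just one way to get e(0)=0 (the notes do not say how it is constructed), the attention has no learned projections, MLPs, or layer norms, the layer count is arbitrary, and past frames are dropped only at the very end rather than partway through the upper layers. All function names are hypothetical.

```python
import numpy as np

def temporal_embedding(offsets, dim):
    """Sine-only sinusoidal embedding (assumption): offset 0 maps to the zero vector."""
    freqs = 1.0 / (10000.0 ** (np.arange(dim) / dim))
    return np.sin(np.outer(offsets, freqs))  # shape (K, dim)

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def attention(q, k, v, mask=None):
    """Plain scaled dot-product attention, no learned weights (simplification)."""
    scores = q @ np.swapaxes(k, -1, -2) / np.sqrt(q.shape[-1])
    if mask is not None:
        scores = np.where(mask, scores, -np.inf)
    return softmax(scores) @ v

def spatial_attention(x):
    """x: (K, n, d). Patches attend to each other within their own frame."""
    return attention(x, x, x)

def causal_temporal_attention(x):
    """x: (K, n, d). Each patch location attends over its own past timesteps."""
    xt = np.swapaxes(x, 0, 1)                    # (n, K, d): one sequence per patch
    K = xt.shape[1]
    mask = np.tril(np.ones((K, K), dtype=bool))  # timestep t sees only timesteps <= t
    return np.swapaxes(attention(xt, xt, xt, mask), 0, 1)

def encode(frame_tokens, num_layers=8, temporal_every=4):
    """frame_tokens: (K, n, d) patch tokens, oldest frame first, current frame last."""
    K, n, d = frame_tokens.shape
    offsets = np.arange(K - 1, -1, -1)           # current frame has offset 0
    x = frame_tokens + temporal_embedding(offsets, d)[:, None, :]
    for layer in range(1, num_layers + 1):
        x = x + spatial_attention(x)
        if layer % temporal_every == 0:          # every 4th layer mixes across time
            x = x + causal_temporal_attention(x)
    return x[-1]                                 # keep only current-frame tokens

rng = np.random.default_rng(0)
tokens = rng.normal(size=(3, 5, 8))              # K=3 frames, n=5 patches, d=8
print(encode(tokens).shape)                      # (5, 8)
```

Because the embedding at offset 0 is zero, the current frame's patch tokens enter the network unchanged, which is the property that keeps it aligned with the single-image ViT input.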

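Complexity (for $K$ frames with $n$ patch tokens each):

  • naive joint space-time attention: ~$O(K^2 n^2)$, since all $Kn$ tokens attend to each other
  • their factorized attention: ~$O(K n^2 + n K^2)$, i.e. spatial attention within each of the $K$ frames plus temporal attention along each of the $n$ patch locations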
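To get a feel for the gap, the two expressions can be evaluated directly. The sketch below counts query-key pairs per attention layer, taking K as the number of frames and n as the number of patch tokens per frame; the values K=6 and n=256 are made up for illustration and are not from the paper. It ignores constants, the causal mask, and the fact that temporal attention only runs in every 4th layer.

```python
def joint_attention_pairs(K: int, n: int) -> int:
    """All K*n tokens attend to all K*n tokens."""
    return (K * n) ** 2

def factorized_attention_pairs(K: int, n: int) -> int:
    """Spatial attention per frame plus temporal attention per patch location."""
    spatial = K * n * n    # K frames, n x n pairs each
    temporal = n * K * K   # n patch locations, K x K pairs each
    return spatial + temporal

K, n = 6, 256  # hypothetical sizes, for illustration only
print(joint_attention_pairs(K, n))       # 2359296
print(factorized_attention_pairs(K, n))  # 402432
```

With these made-up sizes the factorized form needs a bit under a fifth of the pairs, and the gap widens as K grows because the dominant $K n^2$ term is linear in K rather than quadratic.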

Why is naive memory not enough?

  1. Just feed more history: too expensive for real-time robot control.
  2. Just append all past text: causes train-test mismatch, since inference may contain repeated failed attempts that are rare in demonstrations.
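The second point can be made concrete with a toy comparison. In the sketch below, `update_summary` is a rule-based stand-in for MEM's learned high-level policy, and the `Event` type, the substep names, and the output format are all hypothetical. It only illustrates why a compact summary of completed substeps looks the same whether or not the robot needed several retries, while an append-everything log does not.

```python
from dataclasses import dataclass

@dataclass
class Event:
    substep: str
    success: bool

def append_all(history: list[str], event: Event) -> list[str]:
    """Naive memory: record every attempt, including failures."""
    status = "succeeded" if event.success else "failed"
    return history + [f"{event.substep}: {status}"]

def update_summary(done: list[str], event: Event) -> list[str]:
    """Stand-in for the learned summarizer: keep only completed substeps."""
    if event.success and event.substep not in done:
        return done + [event.substep]
    return done

# An inference-time rollout with repeated failed attempts (hypothetical).
events = [
    Event("crack egg", False),
    Event("crack egg", False),
    Event("crack egg", True),
    Event("whisk", True),
]

log: list[str] = []
summary: list[str] = []
for e in events:
    log = append_all(log, e)
    summary = update_summary(summary, e)

print(len(log))                              # 4
print("Done so far: " + ", ".join(summary))  # Done so far: crack egg, whisk
```

A clean demonstration of the same recipe would produce an identical summary but a log half as long, which is the kind of train-test gap the append-all approach runs into.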