
MEM: Multi-Scale Embodied Memory for Vision Language Action Models

Jun 1, 2026

MEM factorizes policy memory into two levels:

Short-term memory: a video encoder compresses a recent sequence of observations so the policy can handle occlusion, timing, local dynamics, and failure recovery.
Long-term memory: a high-level policy maintains a natural-language summary of past semantic events, e.g. which substeps of a recipe are done.

So instead of throwing the whole history into one huge context window, MEM stores different timescales in different formats.
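
As a rough illustration of the two-level split, here is a minimal sketch. The class name `TwoLevelMemory`, the window size, and the example summaries are hypothetical, not taken from MEM. The short-term side is a fixed-size buffer of recent frames, and the long-term side is a single text summary that gets rewritten rather than appended to.

```python
from collections import deque


class TwoLevelMemory:
    """Hypothetical container; names and sizes are illustrative, not from MEM."""

    def __init__(self, max_frames=8):
        # Short-term memory: only the most recent frames are kept.
        self.frames = deque(maxlen=max_frames)
        # Long-term memory: a natural-language summary of past semantic events.
        self.summary = ""

    def observe(self, frame):
        # Older frames fall off the left end once the window is full.
        self.frames.append(frame)

    def update_summary(self, new_summary):
        # The summary is replaced, so its size stays bounded
        # instead of growing with the length of the episode.
        self.summary = new_summary


memory = TwoLevelMemory(max_frames=3)
for t in range(5):
    memory.observe(f"frame_{t}")
memory.update_summary("Chopped onions. Boiling water.")
memory.update_summary("Chopped onions. Water boiled. Added pasta.")

print(list(memory.frames))
print(memory.summary)
```

In the real system the frames would go through the video encoder and the summary would be produced by the high-level policy; here both are stand-in strings.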

Efficient ViT-based video encoder for short-horizon memory:

Patchify each frame separately.
Add a sinusoidal temporal embedding e(t), where t is the frame's offset from the current frame and e(0)=0, so the current frame receives no added embedding and matches the original pretrained single-image ViT behavior.
Every 4th ViT layer adds causal temporal attention across timesteps for the same patch location.
Attention is factorized into space and time instead of full spatiotemporal attention.
Past-frame patch tokens are dropped in upper layers, so only the current-frame representation is passed downstream.
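
The temporal pieces of the steps above can be sketched in plain Python. This is a toy version under stated assumptions, not the MEM implementation: the way e(0)=0 is obtained (subtracting 1 from the cosine channels) is one possible construction, the learned query/key/value projections are left out, and all function names are hypothetical.

```python
import math


def temporal_embedding(offset, dim):
    """Sinusoidal embedding shifted so that e(0) is all zeros (assumed construction)."""
    half = dim // 2
    freqs = [1.0 / (10000.0 ** (2 * i / dim)) for i in range(half)]
    sin_part = [math.sin(offset * f) for f in freqs]
    cos_part = [math.cos(offset * f) - 1.0 for f in freqs]  # cos(0) = 1, so subtract 1
    return sin_part + cos_part


def causal_temporal_attention(seq):
    """seq: list of T vectors for ONE patch location, oldest first.

    Each timestep attends only to itself and earlier timesteps.
    Learned q/k/v projections are omitted (identity) to keep the sketch short.
    """
    dim = len(seq[0])
    out = []
    for t, query in enumerate(seq):
        keys = seq[: t + 1]  # causal: no access to future frames
        scores = [sum(q * k for q, k in zip(query, key)) / math.sqrt(dim) for key in keys]
        top = max(scores)  # subtract the max for a numerically stable softmax
        weights = [math.exp(s - top) for s in scores]
        total = sum(weights)
        out.append([
            sum(w * key[d] for w, key in zip(weights, keys)) / total
            for d in range(dim)
        ])
    return out


def temporal_layer(frames):
    """frames[t][n] is the token vector of patch n at timestep t (oldest first)."""
    num_steps, num_patches = len(frames), len(frames[0])
    result = [[None] * num_patches for _ in range(num_steps)]
    for n in range(num_patches):  # time attention runs separately per patch location
        attended = causal_temporal_attention([frames[t][n] for t in range(num_steps)])
        for t in range(num_steps):
            result[t][n] = attended[t]
    return result


# Toy input: 3 timesteps, 2 patches, 4 channels.
dim = 4
frames = [[[0.1 * (t + n + d) for d in range(dim)] for n in range(2)] for t in range(3)]
offsets = [2, 1, 0]  # the current frame (last in the list) has offset 0
embedded = [
    [[x + e for x, e in zip(tok, temporal_embedding(off, dim))] for tok in frame]
    for frame, off in zip(frames, offsets)
]
current_tokens = temporal_layer(embedded)[-1]  # only the current frame is passed on
```

Because e(0) is zero, the current frame's tokens enter the layer unchanged, which is what lets the encoder fall back to single-image behavior when no history is present. Spatial attention within each frame would be handled by the ordinary ViT layers in between.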

Complexity:
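
A back-of-the-envelope per-layer comparison follows from the factorization described above. The symbols are introduced here for illustration and are not taken from the source: T is the number of frames and N is the number of patch tokens per frame. Full spatiotemporal attention attends over all TN tokens jointly, while the factorized design runs spatial attention within each frame and temporal attention within each patch location.

```latex
\begin{align*}
\text{full spatiotemporal:} \quad & O\big((TN)^2\big) = O(T^2 N^2) \\
\text{factorized space + time:} \quad & O(T N^2) + O(N T^2) = O\big(TN\,(N + T)\big)
\end{align*}
```

With assumed values T = 8 and N = 256, that is 2048^2 = 4,194,304 token pairs for full attention versus 8·256^2 + 256·8^2 = 540,672 for the factorized form, roughly 8x fewer. The real saving would be larger still, since temporal attention is added only in every 4th layer and past-frame tokens are dropped in the upper layers.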

Why is naive memory not enough?

Just feed more history: too expensive for real-time robot control.
Just append all past text: causes train-test mismatch, since inference may contain repeated failed attempts that are rare in demonstrations.