$\pi_{0.6}$: A VLA That Learns From Experience
This paper is a follow-up to # $\pi_0$: A Vision-Language-Action Flow Model for General Robot Control and # $\pi_{0.5}$: A Vision-Language-Action Model with Open-World Generalization.
Core difference between VLA and VLM
The core problem inherent in VLAs is this: when a VLA trained with imitation controls the robot, it will, like any model, make small mistakes – it might put the gripper in the wrong spot, miss a grasp, or knock over an object. Because the robot is interacting with a real physical environment, each mistake produces a situation slightly different from the situations in the training data, so errors compound as the episode continues.
VLMs don't face this because:
- Each response is essentially independent
- There's no physical state that drifts from training distribution
- Mistakes don't compound over time
# Action Chunking is designed to mitigate exactly this issue in VLAs (a minimal sketch of the idea follows).
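To make this concrete, here is a minimal sketch of an action-chunked control loop: the policy predicts a short sequence of actions per forward pass and executes it before re-querying. The `policy.predict_actions` / `env.step` interfaces and the chunk size are assumptions for illustration, not the paper's implementation.

```python
# Minimal sketch of action chunking. The policy and env interfaces
# (predict_actions, step) and the chunk size are assumptions, not the paper's API.
CHUNK_SIZE = 8  # number of future actions predicted per inference call (assumed)

def run_episode(policy, env, max_steps=200):
    """Predict a chunk of actions per forward pass and execute it open-loop,
    so per-step prediction errors have fewer chances to compound."""
    obs = env.reset()
    step = 0
    while step < max_steps:
        # One forward pass produces a whole chunk of future actions.
        action_chunk = policy.predict_actions(obs, horizon=CHUNK_SIZE)
        for action in action_chunk:
            obs, done = env.step(action)
            step += 1
            if done or step >= max_steps:
                return obs
    return obs
```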
Three-stage training process
Data collection
The robot executes tasks autonomously, with optional human interventions to correct large mistakes. Episodes are labeled with sparse rewards (success/failure); an illustrative episode record is sketched below.
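A minimal sketch of what one such episode record might look like; the field names here are illustrative assumptions, not the paper's actual data schema.

```python
# Illustrative sketch of a collected episode record; field names and structure
# are assumptions, not the paper's data schema.
from dataclasses import dataclass, field
from typing import List

import numpy as np

@dataclass
class Step:
    observation: dict            # camera images, proprioception, language command
    action: np.ndarray           # action (or action chunk) that was executed
    intervened: bool = False     # True if a human expert took over at this step

@dataclass
class Episode:
    task: str                    # natural-language task description
    steps: List[Step] = field(default_factory=list)
    success: bool = False        # sparse label assigned at the end of the episode
```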
Value function training
A multi-task distributional value function is trained to predict the return (see # Reinforcement Learning Return) until successful task completion, using a 670M-parameter VLM backbone.
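Before getting to the reward design, here is a minimal sketch of what "distributional" means in this context: the value head classifies the return into discrete bins and is trained with cross-entropy rather than regressing a scalar. The bin count, value range, and module interface below are assumptions, not the paper's configuration.

```python
# Sketch of a distributional value head: classify the (negative) return into
# discrete bins instead of regressing a scalar. Bin count/range and the
# backbone interface are assumptions.
import torch
import torch.nn as nn
import torch.nn.functional as F

NUM_BINS = 201
V_MIN, V_MAX = -200.0, 0.0  # returns are negative step counts, 0 at success (assumed range)

class DistributionalValueHead(nn.Module):
    def __init__(self, backbone_dim: int):
        super().__init__()
        self.logits = nn.Linear(backbone_dim, NUM_BINS)

    def forward(self, features: torch.Tensor) -> torch.Tensor:
        return self.logits(features)  # unnormalized bin logits

def return_to_bin(ret: torch.Tensor) -> torch.Tensor:
    """Map a scalar return to the index of its discretization bin."""
    ret = ret.clamp(V_MIN, V_MAX)
    frac = (ret - V_MIN) / (V_MAX - V_MIN)
    return (frac * (NUM_BINS - 1)).round().long()

def value_loss(head, features, observed_returns):
    """Cross-entropy between the predicted bin distribution and the observed return."""
    logits = head(features)
    targets = return_to_bin(observed_returns)
    return F.cross_entropy(logits, targets)
```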
The sparse reward design:
- Every timestep gets a reward of $-1$ (except the last)
- The final timestep gets $0$ (success) or $-C_{\text{fail}}$ (failure)
- For a successful episode, the return from timestep $t$ is therefore $-(T - t)$, where $T$ is the final timestep: the negative of the number of steps remaining until success (worked example below)
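A worked sketch of this return computation, with the failure penalty as a placeholder value:

```python
# Worked sketch of the sparse-reward return computation described above.
# C_FAIL is a placeholder failure penalty; its actual value is not given here.
C_FAIL = 50.0

def rewards(num_steps: int, success: bool) -> list:
    """Every timestep gets -1 except the last, which gets 0 (success) or -C_FAIL."""
    return [-1.0] * (num_steps - 1) + [0.0 if success else -C_FAIL]

def returns_to_go(rs: list) -> list:
    """Undiscounted return-to-go at each timestep."""
    out, acc = [], 0.0
    for r in reversed(rs):
        acc += r
        out.append(acc)
    return list(reversed(out))

# For a successful 5-step episode (final timestep T = 4):
#   rewards -> [-1, -1, -1, -1, 0]
#   returns -> [-4, -3, -2, -1, 0]   i.e. -(T - t), the negative steps-to-success
```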
Advantage-conditioned training
The VLA is trained with supervised learning on all data, but with an additional input indicating action quality, derived from advantages computed with the value function. The policy is conditioned on the text input "Advantage: positive" or "Advantage: negative", depending on whether the action's advantage exceeds a task-specific threshold.
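A minimal sketch of how such advantage-conditioned training examples could be constructed. The advantage estimator and prompt layout below are assumptions; only the "Advantage: positive/negative" conditioning text comes from the description above.

```python
# Sketch of building advantage-conditioned training examples. The one-step
# advantage estimate and the exact prompt format are assumptions.
def one_step_advantage(value_fn, obs, next_obs, reward: float) -> float:
    """Assumed advantage form: A(s, a) ~= r + V(s') - V(s)."""
    return reward + value_fn(next_obs) - value_fn(obs)

def conditioned_prompt(task: str, adv: float, threshold: float) -> str:
    """Append the advantage indicator to the language command."""
    label = "positive" if adv > threshold else "negative"
    return f"{task} Advantage: {label}"

# Training then proceeds as ordinary supervised learning on (prompt, actions)
# pairs; at inference time the policy would presumably be prompted with the
# positive label.
```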
Why text conditioning rather than a 0/1 flag?
- Leverages pre-trained representations: The VLA already understands language, so "Advantage: positive/negative" taps into semantic understanding
- Compatibility with the VLA architecture: the advantage indicator appears in the training sequence after the language command $\ell$ but before the (discretized and continuous) actions, so that only the action log-likelihoods are affected
- Enables classifier-free guidance: they can sample with or without the advantage conditioning and interpolate between the two, like CFG in diffusion models (see the sketch below)
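A minimal sketch of what that CFG-style sampling could look like, assuming a generic `policy.predict` interface and an illustrative guidance weight:

```python
# Sketch of classifier-free-guidance-style sampling: extrapolate between the
# advantage-conditioned and unconditioned action predictions. The policy
# interface and guidance weight are assumptions made for illustration.
def guided_prediction(policy, obs, task: str, guidance_weight: float = 2.0):
    """Combine conditioned and unconditioned action predictions, CFG-style."""
    cond = policy.predict(obs, prompt=f"{task} Advantage: positive")
    uncond = policy.predict(obs, prompt=task)  # no advantage indicator
    # guidance_weight = 1.0 recovers the conditioned prediction exactly;
    # larger weights push further in the "positive advantage" direction.
    return uncond + guidance_weight * (cond - uncond)
```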