FLOWER: Democratizing Generalist Robot Policies with Efficient Vision-Language-Action Flow Policies
Existing diffusion-based VLA policies require multi-billion-parameter models and massive datasets to achieve strong performance. FLOWER tackles this efficiency challenge with two contributions: intermediate-modality fusion, which reallocates capacity to the diffusion head by pruning up to 50% of the LLM layers, and action-specific Global-AdaLN conditioning, which cuts parameters by 20% through modular adaptation.
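To make the parameter argument behind Global-AdaLN concrete, the following is a rough counting sketch, not the paper's accounting. It assumes that Global-AdaLN replaces one modulation projection per transformer block with a single projection shared across all blocks, instantiated once per action type, and that each projection emits six modulation vectors as in DiT-style AdaLN. The function names and the values of `d_model`, `n_layers`, and `n_action_types` are hypothetical, so the result is not expected to reproduce the 20% figure.

```python
def adaln_params(d_model, n_mod=6):
    """Weights plus biases of one linear layer that maps a conditioning
    vector of size d_model to n_mod modulation vectors of size d_model.
    n_mod=6 follows the DiT convention (assumption)."""
    return d_model * (n_mod * d_model) + n_mod * d_model


def per_layer_adaln_total(d_model, n_layers):
    """Standard AdaLN: every block owns its own modulation projection."""
    return n_layers * adaln_params(d_model)


def global_adaln_total(d_model, n_action_types):
    """Assumed Global-AdaLN: one projection shared by all blocks,
    duplicated once per action type."""
    return n_action_types * adaln_params(d_model)


# Hypothetical sizes, chosen only for illustration.
d_model, n_layers, n_action_types = 1024, 18, 3

per_layer = per_layer_adaln_total(d_model, n_layers)
shared = global_adaln_total(d_model, n_action_types)
saved = per_layer - shared

print(f"per-layer AdaLN parameters: {per_layer:,}")
print(f"shared AdaLN parameters:    {shared:,}")
print(f"parameters saved:           {saved:,}")
```

Under these assumptions the saving grows with the number of blocks and shrinks with the number of action types, which is why sharing pays off only when the policy is deeper than the number of action spaces it supports.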
FLOWER projects the VLM latent tokens by mapping the VLM hidden states through a linear layer followed by RMSNorm, then injects the projected tokens into the Flow Transformer via cross-attention.
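The conditioning path described above can be sketched in NumPy as follows. This is a minimal single-head illustration, not the authors' implementation: the dimensions, the random weights, and the function names (`project_vlm_tokens`, `cross_attention`) are all assumptions made for the example, and a real Flow Transformer would use multi-head attention with residual connections.

```python
import numpy as np


def rms_norm(x, gain, eps=1e-6):
    """RMSNorm: rescale each token by its root-mean-square, then apply a gain."""
    rms = np.sqrt(np.mean(x * x, axis=-1, keepdims=True) + eps)
    return x / rms * gain


def softmax(x, axis=-1):
    """Numerically stable softmax."""
    x = x - np.max(x, axis=axis, keepdims=True)
    e = np.exp(x)
    return e / np.sum(e, axis=axis, keepdims=True)


def project_vlm_tokens(vlm_hidden, w_proj, gain):
    """Linear layer followed by RMSNorm, as described in the text."""
    return rms_norm(vlm_hidden @ w_proj, gain)


def cross_attention(action_tokens, context, w_q, w_k, w_v):
    """Single-head cross-attention: queries come from the action tokens,
    keys and values come from the projected VLM tokens."""
    q = action_tokens @ w_q
    k = context @ w_k
    v = context @ w_v
    scores = q @ k.T / np.sqrt(q.shape[-1])
    weights = softmax(scores, axis=-1)
    return weights @ v, weights


# Hypothetical sizes: 5 VLM tokens of width 32, 4 action tokens of width 16.
rng = np.random.default_rng(0)
d_vlm, d_flow, n_ctx, n_act = 32, 16, 5, 4

vlm_hidden = rng.normal(size=(n_ctx, d_vlm))
action_tokens = rng.normal(size=(n_act, d_flow))
w_proj = rng.normal(size=(d_vlm, d_flow)) / np.sqrt(d_vlm)
gain = np.ones(d_flow)
w_q, w_k, w_v = (
    rng.normal(size=(d_flow, d_flow)) / np.sqrt(d_flow) for _ in range(3)
)

context = project_vlm_tokens(vlm_hidden, w_proj, gain)
attended, weights = cross_attention(action_tokens, context, w_q, w_k, w_v)

print(context.shape, attended.shape)
```

The projection brings the VLM hidden width down to the Flow Transformer width, and the RMSNorm keeps the scale of the conditioning tokens stable before they serve as keys and values.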