This paper introduces a hierarchical robot control method, Latent Codes as Bridges, which uses a learnable latent code to act as a bridge between an LLM and low-level policies.

They categorize current LLM-based control systems into two types:

  1. calling a set of pre-defined skills or APIs.

    • the skills need to have semantics attached to them that make linguistic sense.
    • this constrains the set of skills to a closed vocabulary and prevents any form of generalization to new skills or capabilities.
    • fine-tuning those pre-defined skills is challenging.
  2. language as interface: instead of directly invoking primitive skills, simple language commands are used as the input to the policy

    • not all high-level tasks can be decomposed into subtasks expressible in simple language.
    • end-to-end fine-tuning might erase reasoning capabilities that the LLM originally had.

This paper proposes a learnable <ACT> token that acts as a learned intermediary: instead of mapping directly from natural language to actions, the token’s context-dependent embedding creates a latent space that can better capture the subtle details necessary for low-level control.
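As a rough illustration of the mechanism (not the authors' exact code), the <ACT> token can be registered as a new special token whose embedding is learned during fine-tuning; the base model name below is a placeholder, since the paper builds on LLaVA:

```python
from transformers import AutoTokenizer, AutoModelForCausalLM

# Placeholder base model; the paper actually builds on LLaVA.
name = "meta-llama/Llama-2-7b-hf"
tokenizer = AutoTokenizer.from_pretrained(name)
model = AutoModelForCausalLM.from_pretrained(name)

# Register <ACT> as a special token and grow the embedding table so its
# (initially random) embedding can be learned during fine-tuning.
tokenizer.add_special_tokens({"additional_special_tokens": ["<ACT>"]})
model.resize_token_embeddings(len(tokenizer))
act_token_id = tokenizer.convert_tokens_to_ids("<ACT>")
```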

Model

The model contains two components: a pre-trained LLM (LLaVA) and a pre-trained action policy (Actor Diffusion). The LLM is trained to output <ACT> tokens. The <ACT> token embeddings, extracted from the last-layer hidden features, are then used as a conditional latent input to the action policy, together with environment observations, to produce actions.
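A minimal sketch of this forward pass, assuming a projector MLP between the LLM and the policy and a policy that accepts a conditioning vector (module names, shapes, and the `cond=` interface are illustrative, not the paper's exact implementation):

```python
import torch

def lcb_forward(llm, policy, projector, input_ids, attention_mask, act_token_id, obs):
    """Sketch: condition a low-level policy on the LLM's <ACT> embedding."""
    # Run the LLM and keep the last layer's hidden states.
    out = llm(input_ids=input_ids, attention_mask=attention_mask,
              output_hidden_states=True)
    hidden = out.hidden_states[-1]                                 # (B, T, d_llm)

    # Locate the <ACT> token (assumes exactly one per sequence) and take its embedding.
    act_pos = (input_ids == act_token_id).int().argmax(dim=1)      # (B,)
    act_emb = hidden[torch.arange(hidden.size(0)), act_pos]        # (B, d_llm)

    # Project into the policy's conditioning space (projector is an assumed MLP).
    latent = projector(act_emb)                                    # (B, d_policy)

    # The policy consumes the latent code together with environment observations.
    return policy(obs, cond=latent)
```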

Training

Two stages:

  1. aligning the <ACT> embeddings produced by the LLM with the feature space of the policy while keeping the action policy frozen;

  2. then fine-tuning all model components.

They use LoRA with a rank of 16; training takes about 8 hours on 8× 80GB A100 GPUs.
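A sketch of the two-stage recipe using PEFT's LoRA (the target modules, learning rates, and the `llm`/`projector`/`policy` handles from the sketch above are assumptions; only the rank of 16 comes from the notes):

```python
import torch
from peft import LoraConfig, get_peft_model

# Wrap the LLM with rank-16 LoRA adapters (target modules are an assumption).
lora_cfg = LoraConfig(r=16, lora_alpha=32, target_modules=["q_proj", "v_proj"])
llm = get_peft_model(llm, lora_cfg)

# Stage 1: align <ACT> embeddings with the policy's feature space.
# The pre-trained action policy stays frozen; only the LLM (LoRA) and projector train.
for p in policy.parameters():
    p.requires_grad = False
stage1_params = [p for p in list(llm.parameters()) + list(projector.parameters())
                 if p.requires_grad]
stage1_opt = torch.optim.AdamW(stage1_params, lr=1e-4)  # illustrative learning rate

# Stage 2: unfreeze the policy and fine-tune all components jointly.
for p in policy.parameters():
    p.requires_grad = True
stage2_params = [p for p in list(llm.parameters()) + list(projector.parameters())
                 + list(policy.parameters()) if p.requires_grad]
stage2_opt = torch.optim.AdamW(stage2_params, lr=1e-5)  # illustrative learning rate
```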

Questions

  • How many tokens/embeddings are used for the action policy?

  • What’s the structure of all MLPs?

  • Is this a VLA?

    Yes. But with a pre-trained action policy.

  • Any improvements?

    • The <ACT> embeddings need to be aligned with the action policy.

    • The information carried in the latent space is not verified or validated; it may contain unnecessary or inaccurate information for the action policy when generating actions.