This paper introduce a hierarchical robot control method, Latent Codes as Bridges, which uses a learnable latent code to act as a bridge between LLMs and low level polices.
They category current LLMs based control system as two:
-
calling a set of pre-defined skills or APIs.
- needs to have semantics attached to them that make linguistic sense.
- constrains the set of skills to a closed vocabulary and prevents any form of generalization to new skills or capability.
- tunes those pre-defined skills are challenging.
-
language as interface: not directly use premitive skills but those simple language command as input of policy
- not all high level tasks can be decomposed into subtasks in simple language.
- end-to-end fine-tuning might erase reasoning capabilities that the LLM originally had.
This paper proposes an learnable <ACT> token which act as a learned intermediary: instead of mapping directly from natural language to actions, the token’s context-dependent embedding creates a latent space that can better capture subtle details necessary for low-level control.
Model
The model contains two components: a pre-trained LLM (LLaVA) and a pre-trained policy (Actor Diffusion). The model is trained to output <ACT> corresponding tokens. Then the <ACT> tokens’ embeddings, which is extracted from the last-layer embedding features, will considered as conditional latent input of the action policy, along with environment observations, to output the action.
Training
Two stages:
-
aligning the embeddings (<ACT>) produced by the LLM with the feature space of the policy by freezing the action policy;
-
then fine-tuning all models.
Theyuse LoRA with a rank of 16, and takes about 8 hours to finish on an 8 80GB A100 GPU.
Questions
-
How many tokens/embeddings used for action policy?
-
What’s the structure of all MLPs?
-
Is this a VLA?
Yes. But with a pre-trained action policy.
-
Any improvements?
-
Need to align the <ACT> with the action policy
-
The latent space information are not verified or proved. It may contains unnessary information or less acurate information to help the action policy to generate actions.
-