Vision-Language Models Provide Promptable Representations for Reinforcement Learning

This paper uses VLMs to extract relevant background knowledge, abstractions, and grounded features that can aid RL policy training.

Task-relevant prompts, which are used to elicit useful representations from the VLM, are phrased as questions that make the VLM attend to and encode the semantic features of the image that help the RL policy learn to solve the task.
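The data flow described above (image plus task-relevant question goes into the VLM, the resulting embedding goes into the policy) can be sketched roughly as follows. This is only a toy illustration, not the paper's implementation: `toy_vlm_embed` is a hypothetical stand-in for a frozen VLM, `LinearPolicy`, `ACTIONS`, `EMBED_DIM`, and the example prompt are all made up for this note, and no RL update is shown.

```python
import hashlib
import math
import random

EMBED_DIM = 8  # assumed embedding size, chosen only for illustration
ACTIONS = ["forward", "left", "right", "attack"]  # hypothetical action set


def toy_vlm_embed(image, prompt):
    """Hypothetical stand-in for a frozen VLM.

    Maps an (image, prompt) pair to a fixed-size vector. The prompt seeds a
    random projection, so different prompts yield different features for the
    same image, mimicking a "promptable" representation.
    """
    digest = hashlib.sha256(prompt.encode("utf-8")).digest()
    rng = random.Random(int.from_bytes(digest[:8], "big"))
    pixels = [p for row in image for p in row]
    embedding = []
    for _ in range(EMBED_DIM):
        weights = [rng.uniform(-1.0, 1.0) for _ in pixels]
        embedding.append(math.tanh(sum(w * p for w, p in zip(weights, pixels))))
    return embedding


class LinearPolicy:
    """Tiny softmax policy that consumes the embedding, not raw pixels."""

    def __init__(self, embed_dim, n_actions, seed=0):
        rng = random.Random(seed)
        self.weights = [
            [rng.gauss(0.0, 0.1) for _ in range(embed_dim)]
            for _ in range(n_actions)
        ]

    def action_probs(self, embedding):
        logits = [sum(w * e for w, e in zip(row, embedding)) for row in self.weights]
        top = max(logits)  # subtract the max for numerical stability
        exps = [math.exp(l - top) for l in logits]
        total = sum(exps)
        return [e / total for e in exps]


def act(policy, image, prompt):
    """Embed the observation under the task prompt, then pick the greedy action."""
    probs = policy.action_probs(toy_vlm_embed(image, prompt))
    best = max(range(len(probs)), key=probs.__getitem__)
    return ACTIONS[best]


# Usage: a 2x2 grayscale "image" and an illustrative task-relevant question.
image = [[0.1, 0.9], [0.4, 0.7]]
prompt = "Is there a spider in this image?"
policy = LinearPolicy(EMBED_DIM, len(ACTIONS))
action = act(policy, image, prompt)
```

In this sketch the VLM stand-in stays fixed and only the policy would be trained, so the choice of prompt is the lever that decides which features of the image the policy gets to see.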

FF's Roam Notes
