This paper uses VLMs to extract relevant background knowledge, abstractions, and grounded features that can aid the training of an RL policy.
The task-relevant prompts, which are used to elicit useful representations from the VLM, are always questions that make the VLM attend to and encode the semantic features of the image that help the RL policy learn to solve the task.
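To make the role of the prompt concrete, the following is a minimal sketch of how a task-relevant question could condition the representation passed to a policy. It is not the paper's implementation: `build_task_prompt`, `stub_vlm_embed`, `linear_policy`, the prompt wording, and `EMBED_DIM` are all hypothetical names and values introduced for illustration, and the "VLM" is a deterministic stub that only mimics the interface of a real model.

```python
import hashlib
import random
from typing import List, Sequence

EMBED_DIM = 8  # hypothetical embedding size, chosen only for illustration


def build_task_prompt(task_description: str) -> str:
    """Phrase the task as a question so the VLM attends to task-relevant features."""
    return (
        f"The task is to {task_description}. "
        "What in this image is relevant to solving it?"
    )


def stub_vlm_embed(image: Sequence[float], prompt: str) -> List[float]:
    """Stand-in for a real VLM (assumption): returns a deterministic pseudo-embedding.

    A real system would run the image and prompt through the VLM and read out
    its internal representations; this stub only reproduces the interface,
    so that the embedding depends on both the image and the prompt.
    """
    digest = hashlib.sha256((repr(list(image)) + prompt).encode("utf-8")).digest()
    rng = random.Random(int.from_bytes(digest[:8], "big"))
    return [rng.uniform(-1.0, 1.0) for _ in range(EMBED_DIM)]


def linear_policy(embedding: Sequence[float], weights: Sequence[Sequence[float]]) -> int:
    """Return the index of the action with the highest linear score."""
    scores = [sum(w * x for w, x in zip(row, embedding)) for row in weights]
    return max(range(len(scores)), key=scores.__getitem__)


# Usage: a toy "image", a task-relevant prompt, and an untrained 3-action policy.
image = [0.1, 0.5, 0.9]
prompt = build_task_prompt("open the door")
embedding = stub_vlm_embed(image, prompt)
weights = [[0.0] * EMBED_DIM for _ in range(3)]  # in practice learned by RL
action = linear_policy(embedding, weights)
```

The point of the sketch is the data flow: the same image yields a different representation under a different question, and the policy consumes only that prompt-conditioned representation. In an actual system the policy weights would be trained with RL on top of embeddings from a real VLM.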