This paper is exactly what I want to do, and it was published recently. That gives me a lot of confidence to continue this work; at least this is not simple or unnecessary work. (Jeffery gave me big help with the partial-order idea!)
This paper consists of two parts. The first decomposes long-horizon tasks (which may contain ordering or sequential constraints) into goal-reaching problems, using a high-level planner to generate subgoals for a low-level policy. The second trains the low-level policy on offline data and fine-tunes it with online data on the shortened task (a subgoal-reaching reinforcement learning problem).
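To make the decomposition concrete, here is a minimal sketch of how I picture the hierarchy. The class names (`HighLevelPlanner`, `GoalConditionedPolicy`), the re-planning interval `k`, and the gym-style `env` interface are all my own placeholders, not details from the paper:

```python
import numpy as np

class HighLevelPlanner:
    """Proposes a subgoal for the low-level policy (placeholder interface)."""
    def propose_subgoal(self, state, final_goal):
        # The paper would plan over candidate subgoals here; as a stand-in,
        # just hand back the final goal.
        return final_goal

class GoalConditionedPolicy:
    """Low-level goal-conditioned policy pi(a | s, g), pretrained offline."""
    def act(self, state, subgoal):
        # Placeholder action; a real policy would be a learned network.
        return np.zeros(2)

def hierarchical_rollout(env, planner, policy, final_goal, horizon=200, k=20):
    """Re-plan a subgoal every k steps; the low-level policy acts toward it."""
    state = env.reset()
    subgoal = planner.propose_subgoal(state, final_goal)
    for t in range(horizon):
        if t % k == 0:
            subgoal = planner.propose_subgoal(state, final_goal)
        action = policy.act(state, subgoal)
        state, reward, done, _ = env.step(action)
        if done:
            break
    return state
```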
Previous methods either propose subgoals from the set of previously seen states, or directly optimize over subgoals, often by utilizing a latent variable model to obtain a concise representation of image-based states.
- Train the low-level policy on offline data. (Q: How do we ensure every goal gets covered during training? ⇒ exploration, distribution of goals; see the relabeling sketch after this list)
- Use a conditional variational autoencoder (CVAE) to sample a set of candidate subgoals, and choose the optimal candidates as the subgoals (see the CVAE sketch after this list)
- For each subgoal, use the low-level policy to generate actions, and fine-tune it with online data
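On the first bullet's question about goal coverage, one standard answer is hindsight goal relabeling (HER-style). The sketch below is my own illustration of that idea, not claimed to be the paper's mechanism, and it assumes goals live in the same space as states:

```python
import numpy as np

def relabel_goals(trajectory, num_relabeled=4, rng=None):
    """HER-style relabeling: treat states reached later in the same trajectory
    as goals, so the offline data covers a broad distribution of goals.

    `trajectory` is a list of (state, action, next_state) tuples.
    """
    if rng is None:
        rng = np.random.default_rng()
    relabeled = []
    for t, (s, a, s_next) in enumerate(trajectory):
        # Sample future states from the same trajectory as hindsight goals.
        future_idx = rng.integers(t, len(trajectory), size=num_relabeled)
        for idx in future_idx:
            goal = trajectory[idx][2]                   # a state actually reached
            reward = float(np.allclose(s_next, goal))   # sparse goal-reaching reward
            relabeled.append((s, a, goal, reward, s_next))
    return relabeled
```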
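For the subgoal proposal step (second bullet), here is a minimal sketch of what CVAE-based sampling plus selection could look like. The network sizes, the `score_subgoal` ranking function (e.g. a learned value estimate), and all names are my assumptions, not the paper's actual code:

```python
import torch
import torch.nn as nn

class SubgoalCVAE(nn.Module):
    """Conditional VAE over subgoals: decode latent z conditioned on the current state."""
    def __init__(self, state_dim, goal_dim, latent_dim=16, hidden=128):
        super().__init__()
        # Encoder q(z | s, g), used during training (not shown in sampling below).
        self.encoder = nn.Sequential(
            nn.Linear(state_dim + goal_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, 2 * latent_dim))  # outputs mean and log-variance
        # Decoder p(g | s, z): proposes a subgoal from the state and a latent code.
        self.decoder = nn.Sequential(
            nn.Linear(state_dim + latent_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, goal_dim))
        self.latent_dim = latent_dim

    def sample_subgoals(self, state, num_samples=32):
        """Sample candidate subgoals by decoding z ~ N(0, I) conditioned on state."""
        z = torch.randn(num_samples, self.latent_dim)
        s = state.unsqueeze(0).expand(num_samples, -1)
        return self.decoder(torch.cat([s, z], dim=-1))

def choose_subgoal(cvae, state, score_subgoal, num_samples=32):
    """Keep the highest-scoring candidate under a scoring function such as V(s, g)."""
    candidates = cvae.sample_subgoals(state, num_samples)
    scores = torch.stack([score_subgoal(state, g) for g in candidates])
    return candidates[scores.argmax()]
```

The idea is that the decoder, conditioned on the current state, maps latent samples to plausible subgoals, and the planner keeps whichever candidate scores best.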
I recommend checking the part on how the set of subgoals is generated, which I think is the most novel part of this paper.