Train a subgoal policy π(s_g ∣ s_t) using imitation learning, that is, by collecting expert demonstrations and fitting the policy to them.
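As a rough illustration, the sketch below fits a subgoal policy by plain supervised regression (behavioral cloning) on expert trajectories. Several things here are assumptions rather than part of the original description: the subgoal label for s_t is taken to be the expert's state k steps later, the policy is a deterministic linear map, the "expert" data is synthetic, and the function names `make_subgoal_pairs`, `fit_linear_subgoal_policy`, and `predict_subgoal` are hypothetical.

```python
import numpy as np

def make_subgoal_pairs(trajectories, k):
    """Pair each state s_t with the expert state k steps later (assumed subgoal label)."""
    states, subgoals = [], []
    for traj in trajectories:
        for t in range(len(traj) - k):
            states.append(traj[t])
            subgoals.append(traj[t + k])
    return np.asarray(states), np.asarray(subgoals)

def fit_linear_subgoal_policy(states, subgoals):
    """Least-squares fit of s_g ~ [s_t, 1] @ W (a linear stand-in for the policy)."""
    X = np.hstack([states, np.ones((len(states), 1))])  # append a bias column
    W, *_ = np.linalg.lstsq(X, subgoals, rcond=None)
    return W

def predict_subgoal(W, state):
    """Return the predicted subgoal for a single state."""
    return np.append(state, 1.0) @ W

# Synthetic "expert" data (an assumption): each trajectory moves +0.1 per step
# along both axes from a random 2-D start.
rng = np.random.default_rng(0)
steps = 0.1 * np.arange(20)[:, None] * np.ones(2)
trajectories = [start + steps for start in rng.normal(size=(5, 2))]

states, subgoals = make_subgoal_pairs(trajectories, k=5)
W = fit_linear_subgoal_policy(states, subgoals)
predicted = predict_subgoal(W, np.array([0.0, 0.0]))
```

In practice the linear map would typically be replaced by a neural network, possibly one that outputs a distribution over subgoals, but the data pipeline (expert trajectories in, state/subgoal pairs out, supervised fit) keeps the same shape.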