Previous diffuser-based subgoal models use the language and image embeddings directly, whereas this paper first generates a skill embedding (from the image and language) and then uses that embedding in the diffusion model.
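
To make the contrast concrete, here is a minimal sketch of the two conditioning paths as I understand them. It is not the paper's implementation: the dimensions, the random linear maps standing in for learned networks, and the names `skill_encoder`, `denoise_direct`, and `denoise_via_skill` are all hypothetical, and the conditioning is simple concatenation for illustration only.

```python
import math
import random

random.seed(0)

# Hypothetical sizes, chosen only for illustration.
IMG_DIM, LANG_DIM, SKILL_DIM, GOAL_DIM = 6, 4, 3, 6


def random_matrix(rows, cols):
    """Random weights standing in for a learned network (assumption)."""
    return [[random.gauss(0.0, 1.0) for _ in range(cols)] for _ in range(rows)]


def matvec(vec, mat):
    """Multiply a row vector of length rows by a rows x cols matrix."""
    return [sum(v * row[j] for v, row in zip(vec, mat)) for j in range(len(mat[0]))]


W_SKILL = random_matrix(IMG_DIM + LANG_DIM, SKILL_DIM)
W_DIRECT = random_matrix(GOAL_DIM + IMG_DIM + LANG_DIM, GOAL_DIM)
W_VIA_SKILL = random_matrix(GOAL_DIM + SKILL_DIM, GOAL_DIM)


def skill_encoder(img_emb, lang_emb):
    """Map image and language embeddings to a compact skill embedding."""
    return [math.tanh(x) for x in matvec(img_emb + lang_emb, W_SKILL)]


def denoise_direct(noisy_goal, img_emb, lang_emb):
    """Prior approach: the denoiser sees the raw embeddings directly."""
    return matvec(noisy_goal + img_emb + lang_emb, W_DIRECT)


def denoise_via_skill(noisy_goal, img_emb, lang_emb):
    """This paper's approach: the denoiser sees only the skill embedding."""
    skill = skill_encoder(img_emb, lang_emb)
    return matvec(noisy_goal + skill, W_VIA_SKILL)


img_emb = [random.gauss(0.0, 1.0) for _ in range(IMG_DIM)]
lang_emb = [random.gauss(0.0, 1.0) for _ in range(LANG_DIM)]
noisy_goal = [random.gauss(0.0, 1.0) for _ in range(GOAL_DIM)]

pred_direct = denoise_direct(noisy_goal, img_emb, lang_emb)
pred_skill = denoise_via_skill(noisy_goal, img_emb, lang_emb)
print(len(pred_direct), len(pred_skill))
```

Both paths produce a prediction of the same size; the only difference is that the second routes the image and language information through an intermediate skill embedding before it reaches the denoiser.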