Previous diffuser-based subgoal models use the language and image embeddings directly, whereas this paper first generates a skill embedding (from the image and language) and then uses that embedding in the diffusion model.
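To make the contrast concrete, here is a minimal NumPy sketch of the two routes. It is a toy illustration and not the paper's architecture. It assumes the embeddings enter the diffusion model as a conditioning vector for the denoiser. The names `direct_condition`, `SkillEncoder`, `ToyDenoiser`, and all dimensions are hypothetical, and the weights are random and untrained.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical sizes, chosen only for illustration.
IMG_DIM, LANG_DIM, SKILL_DIM, GOAL_DIM = 8, 6, 4, 8


def direct_condition(img_emb, lang_emb):
    """Earlier approach (as summarized above): use both embeddings directly."""
    return np.concatenate([img_emb, lang_emb], axis=-1)


class SkillEncoder:
    """Assumed stand-in: maps (image, language) to a compact skill embedding."""

    def __init__(self, rng):
        self.W = rng.normal(scale=0.1, size=(IMG_DIM + LANG_DIM, SKILL_DIM))

    def __call__(self, img_emb, lang_emb):
        joint = np.concatenate([img_emb, lang_emb], axis=-1)
        return np.tanh(joint @ self.W)


class ToyDenoiser:
    """Assumed stand-in for a noise-prediction network eps(x_t, t, cond)."""

    def __init__(self, cond_dim, rng):
        self.W = rng.normal(scale=0.1, size=(GOAL_DIM + cond_dim + 1, GOAL_DIM))

    def __call__(self, x_t, t, cond):
        # Append the (scalar) diffusion time as one extra feature per sample.
        t_feat = np.full(x_t.shape[:-1] + (1,), t, dtype=x_t.dtype)
        return np.concatenate([x_t, cond, t_feat], axis=-1) @ self.W


# A batch of 2 random embeddings and noisy subgoals.
img = rng.normal(size=(2, IMG_DIM))
lang = rng.normal(size=(2, LANG_DIM))
x_t = rng.normal(size=(2, GOAL_DIM))

# Route 1: condition the denoiser on the raw embeddings.
cond_direct = direct_condition(img, lang)
eps_direct = ToyDenoiser(cond_direct.shape[-1], rng)(x_t, 0.5, cond_direct)

# Route 2: first form a skill embedding, then condition on it alone.
skill = SkillEncoder(rng)(img, lang)
eps_skill = ToyDenoiser(SKILL_DIM, rng)(x_t, 0.5, skill)
```

In this sketch the only structural difference is what the denoiser sees as its condition: the concatenated raw embeddings in the first route, or the lower-dimensional skill embedding in the second.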