This paper proposes a transformer-decoder-based network that takes multimodal prompts and historical interactions as inputs and predicts motor commands.
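
To make the input layout in this summary concrete, here is a minimal sketch of how such a decoder's input might be assembled. It is an illustration under my own assumptions, not the paper's implementation: the function name `build_decoder_input`, the token labels, the prompt-then-history ordering, and the plain causal mask are all hypothetical.

```python
def build_decoder_input(prompt_tokens, history_tokens):
    """Concatenate prompt and history tokens and build a causal attention mask.

    Assumption: prompt tokens come first, followed by the interaction history.
    """
    sequence = list(prompt_tokens) + list(history_tokens)
    n = len(sequence)
    # mask[i][j] is True when position i may attend to position j (j <= i).
    mask = [[j <= i for j in range(n)] for i in range(n)]
    return sequence, mask


# Hypothetical token labels, for illustration only.
prompt = ["text:pick", "img:red_block"]
history = ["obs_0", "act_0", "obs_1"]
sequence, mask = build_decoder_input(prompt, history)
```

Under this layout, the output at the last position can attend to every prompt and history token, so that is where a model of this kind would presumably predict the next motor command.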