This paper applies the planning and reasoning capabilities of LLMs to complex loco-manipulation tasks, constructing a hierarchical task graph composed of a series of primitive behaviors to bridge low-level execution and high-level planning. It leverages the interaction of distilled spatial geometry and 2D observations with a VLM to ground knowledge into a robotic morphology selector that chooses appropriate actions.
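For concreteness, here is a minimal sketch of what such a task graph of primitives could look like; the class names, behavior names, and traversal are my own assumptions for illustration, not the paper's implementation:

```python
from dataclasses import dataclass, field

@dataclass
class Primitive:
    """A low-level behavior the controllers can execute directly."""
    name: str                 # e.g. "walk_to", "grasp", "open_door"
    params: dict = field(default_factory=dict)

@dataclass
class TaskNode:
    """One node of the hierarchical task graph: a subtask that either
    maps to a primitive (leaf) or decomposes into ordered children."""
    goal: str
    primitive: Primitive | None = None
    children: list["TaskNode"] = field(default_factory=list)

def to_primitive_sequence(node: TaskNode) -> list[Primitive]:
    """Flatten the graph depth-first into the primitive sequence handed
    to the low-level executor, bridging planning and execution."""
    if node.primitive is not None:
        return [node.primitive]
    seq: list[Primitive] = []
    for child in node.children:
        seq += to_primitive_sequence(child)
    return seq
```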
The basic idea in this paper is almost the same as mine. The differences are:
- The paper does not explain how the current state, i.e., the task status and the observation, is obtained. The inputs to the LLM are: prompts, a 2D image, depth, and a human-provided task state.
- The paper proposes a one-shot planning method, which plans once at the beginning of execution (see the planning sketch after this list).
- Although the paper does process the raw 2D image and depth data, i.e., estimating the target object's pose and building a voxel map (sketched below), it is still challenging for the VLM to obtain a precise current state.
- We also add a reachability map (sketched below).
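To make the one-shot vs. closed-loop distinction concrete, a minimal sketch; `llm_plan`, `llm_plan_next`, `execute`, and `get_observation` are placeholder interfaces I am assuming, not the paper's API:

```python
def run_one_shot(task, initial_obs, llm_plan, execute):
    # Paper's scheme: plan the whole primitive sequence once, up
    # front, then execute it without consulting new observations.
    plan = llm_plan(task, initial_obs)
    for step in plan:
        execute(step)

def run_closed_loop(task, get_observation, llm_plan_next, execute):
    # Alternative: re-query the planner after every step with the
    # latest observation; this requires a reliable current-state estimate.
    done = False
    while not done:
        obs = get_observation()
        step, done = llm_plan_next(task, obs)
        execute(step)
```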
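A rough sketch of how a voxel map can be built from a depth image and camera intrinsics; this is standard back-projection plus discretization, not necessarily the paper's exact pipeline:

```python
import numpy as np

def depth_to_voxels(depth, K, voxel_size=0.05):
    """Back-project a depth image (meters) into a sparse set of occupied
    voxel indices. K is the 3x3 camera intrinsics matrix."""
    h, w = depth.shape
    u, v = np.meshgrid(np.arange(w), np.arange(h))
    z = depth.reshape(-1)
    valid = z > 0                                   # drop invalid depth
    x = (u.reshape(-1)[valid] - K[0, 2]) * z[valid] / K[0, 0]
    y = (v.reshape(-1)[valid] - K[1, 2]) * z[valid] / K[1, 1]
    points = np.stack([x, y, z[valid]], axis=1)     # camera-frame points
    voxels = np.unique(np.floor(points / voxel_size).astype(int), axis=0)
    return voxels
```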
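And a minimal sketch of the reachability map we add: precompute a boolean grid marking which workspace cells the arm can reach; `ik_reachable` is an assumed placeholder for the robot-specific IK feasibility check:

```python
import numpy as np

def build_reachability_map(ik_reachable, bounds, resolution=0.1):
    """Boolean grid over the workspace: True where ik_reachable((x, y, z))
    reports that the end effector can reach the cell center."""
    (x0, y0, z0), (x1, y1, z1) = bounds
    xs = np.arange(x0, x1, resolution)
    ys = np.arange(y0, y1, resolution)
    zs = np.arange(z0, z1, resolution)
    grid = np.zeros((len(xs), len(ys), len(zs)), dtype=bool)
    for i, x in enumerate(xs):
        for j, y in enumerate(ys):
            for k, z in enumerate(zs):
                grid[i, j, k] = ik_reachable((x, y, z))
    return grid, (xs, ys, zs)
```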