Enabling humanoid robots to perform long-horizon mobile manipulation planning in real-world environments based on embodied perception and comprehension abilities has been a longstanding challenge.
With the recent rise of large language models (LLMs), there has been a notable increase in the development of LLM-based planners. These approaches either rely on human-provided textual representations of the real world or depend heavily on prompt engineering to extract such representations; as a result, they lack the capability to quantitatively understand the environment, such as determining whether an object can feasibly be manipulated.
To address these limitations, we present the Instruction-Augmented Long-Horizon Planning (IALP) system, a novel framework that employs LLMs to generate feasible and optimal actions in a closed-loop interaction, based on real-time sensor feedback that includes grounded knowledge of the environment. Distinct from prior works, our approach augments user instructions into PDDL problems by leveraging both the abstract reasoning capabilities of LLMs and grounding mechanisms. Across a range of real-world long-horizon tasks, each composed of seven distinct manipulation skills, our results demonstrate that the IALP system can solve these tasks efficiently with an average success rate exceeding 80%. The proposed method can operate as a high-level planner, equipping robots with substantial autonomy in unstructured environments through the use of multi-modal sensor inputs.
(:action move ; move the robot to the object
:parameters (?obj - locatable)
:precondition (and
(find ?obj)
)
:effect (and
(at ?obj)
)
)
(:action scan ; scan the object
:parameters (?obj - locatable)
:precondition (and
(at ?obj)
(not (detected ?obj))
)
:effect (and
(detected ?obj)
)
)
(:action grasp ; grasp the object
:parameters (?obj - locatable)
:precondition (and
(at ?obj)
(detected ?obj)
(graspable ?obj)
(reachable ?obj)
)
:effect (and
(holding ?obj)
)
)
(:action place ; place the object at the target location
:parameters (?obj - locatable ?pla - locatable)
:precondition (and
(at ?pla)
(holding ?obj)
(placeable ?pla)
)
:effect (and
(on ?obj ?pla)
)
)
(:action pull ; pull the object
:parameters (?obj - locatable)
:precondition (and
(at ?obj)
(graspable ?obj)
(not (opened ?obj))
)
:effect (and
(opened ?obj)
)
)
(:action push ; push the object
:parameters (?obj - locatable)
:precondition (and
(at ?obj)
(graspable ?obj)
(opened ?obj)
)
:effect (and
(not (opened ?obj))
)
)
(:action lift ; lift the object
:parameters (?obj - locatable)
:precondition (and
(at ?obj)
(holding ?obj)
)
:effect (and
)
)
(:action rotate ; rotate the object
:parameters (?obj - locatable)
:precondition (and
(at ?obj)
(holding ?obj)
)
:effect (and
)
)
(:action reach ; reach the pose
:parameters (?pose - locatable)
:precondition (and
(reachable ?pose)
)
:effect (and
)
)
(:action adjust ; adjust the robot if the object is not reachable, graspable, or placeable
:parameters (?obj - locatable)
:precondition (or
(not (reachable ?obj))
(not (graspable ?obj))
(not (placeable ?obj))
)
:effect (and
)
)
(:action alert ; alert when there is an error the robot cannot resolve from the current state
:parameters ()
:precondition (and
)
:effect (and
)
)
(:action stop ; stop the robot when the task is done
:parameters ()
:precondition (and
)
:effect (and
)
)
(define (domain room)
(:requirements :strips :fluents :durative-actions :timed-initial-literals :typing :conditional-effects :negative-preconditions :duration-inequalities :equality :disjunctive-preconditions)
(:types
locatable - object
floor - object
bot table coat_stand cloth box - locatable
robot - locatable
pose - locatable
)
(:predicates
(on ?obj1 - locatable ?obj2 - locatable) ; if the object is on another object
(in ?obj1 - locatable ?obj2 - locatable); if the first object is inside the second
(holding ?obj - locatable) ; if the arm is holding the object
(opened ?obj - locatable) ; if the object is opened
(at ?obj - locatable) ; if the robot is at (near) the object
(find ?obj - locatable); if there is a path from the current position to the object
(detected ?obj - locatable); if the object is detected
(graspable ?obj - locatable); if the object is graspable
(reachable ?obj - locatable); if the object is reachable
(placeable ?obj - locatable); if the object is placeable
)
(:functions
)
; Actions: the :action blocks listed above are placed here in the full domain file
)
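For reference, a minimal problem file that this domain could be paired with is sketched below. The problem name, objects, initial facts, and goal are our own illustrative assumptions, not a recorded output of the system; in the actual pipeline, facts such as reachable or graspable are grounded from sensor feedback rather than written by hand.
(define (problem pick_and_place) ; hypothetical example, not system output
(:domain room)
(:objects
tiago - robot
box - box
black_table - table
)
(:init
(find box) ; a path to the box exists
(find black_table)
(graspable box) ; in the real system these facts come from grounded perception
(reachable box)
(placeable black_table)
)
(:goal (and
(on box black_table)
))
)
The (:goal ...) block corresponds to the goal predicates produced by the goal-generation prompt below.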
Based on the instruction, use the following predicates to generate the goal of the PDDL problem. The robot's name is Tiago.
Instruction: [INSTRUCTION]
The predicate candidates are:
"""
**Check predicate before**
"""
You can use (and ) and (or ) to combine the goal predicates. Please only return answers without any explanation. Do not return markdown code wrappers.
Here's an example:
[user]
Instruction: Pick up the box on the table and place it on the black table.
[assistant]
(on box black_table)
Example finished.
Here's what I give to you:
Instruction: [INSTRUCTION]
Please extract the object name and related object name from the given instruction. The related object name will help humans to find the object. Also, extract other object names that may be relevant to finishing the instruction, such as objects related to the target position. Concatenate multi-word object names with an underscore ("_"); for example, "black table" in the instruction should be converted to "black_table". The answer should be in JSON format without markdown code block triple backticks:
{
"object_name": "str",
"related_object_name": "str",
"other_object_names": [
"str",
"str"
]
}
The instruction: [INSTRUCTION]
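As an illustration only (our assumption of a well-formed response, not a recorded system output), the earlier example instruction "Pick up the box on the table and place it on the black table." would be expected to yield JSON along these lines:
{
"object_name": "box",
"related_object_name": "table",
"other_object_names": [
"black_table"
]
}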
Listing all possible predicates that are necessary to check the current state.
You are required to detect the observation of the current environment by using the predicates and related objects provided below. You need to give all predicates and objects necessary to check the current state so that the robot can choose the best action. You don't need to verify each predicate. Every predicate can be used multiple times. Do NOT return markdown code block triple backticks.
Instruction: [INSTRUCTION]
The possible ?obj could be: [OBJECTS]
The predicate candidates are:
"""
predicates ...
"""
The possible actions are:
"""
actions ...
"""
The output is formatted as:
"""
(on box table)
(holding box)
...
"""
System Role
[user]
You are an excellent interpreter of human instructions for daily tasks. Given an instruction and information about the working environment, you break it down into a sequence of robotic actions. Please do not begin working until I say "Start working." Instead, simply output the message "Waiting for next input." Understood?
[assistant]
Understood. Waiting for the next input.
Environments
[user]
Information about environments, objects, and tasks is given as a PDDL function.
The Planning Domain Definition Language (PDDL) is a domain-specific language designed to provide a standard encoding for Artificial Intelligence (AI) planning problems.
Here's the domain description you used:
"""
**Check domain description**
"""
The =:action= blocks define all the actions/subtasks used for completing the task.
Later, you will receive the task/problem defined by PDDL and the above domain.
Here's an example:
"""
...
"""
The =:init= block defines the current observation of the environment.
The =:goal= block defines the goal of the task.
You need to take action from the current state, not from the start. If the current task is over and no action is needed for the robot, you can use the "stop" action. If the robot doesn't know what to execute in its current state, for example, it cannot find the target object, you can use the "alert" action and stop the robot. If the object is not reachable, graspable, or placeable after the scan, you can use the "adjust" action to adjust the robot's pose to make it reachable, graspable, or placeable.
-------------------------------------------------------
The texts above are part of the overall instruction. Do not start working yet:
[assistant]
Understood. Waiting for the next input.
Observation input
[user]
Start working. Resume from the environment below.
The instruction is as follows:
"""
[INSTRUCTION]
"""
The action executed last time is as follows:
"""
[ACTION]
"""
The observation of the current environment is as follows:
"""
[OBSERVATION]
"""
@article{fangyuan2024instruction,
author = {Fangyuan Wang and Shipeng Lyu and Peng Zhou and Anqing Duan and Guodong Guo and David Navarro-Alarcon},
title = {Instruction-Augmented Long-Horizon Planning: Embedding Grounding Mechanisms in Embodied Mobile Manipulation},
journal = {AAAI},
year = {2024},
}