Introduction
Often, a very simple pattern of extra rewards suffices to render straightforward an otherwise completely intractable problem.
Current Problems
Consider the following example of a bug that can arise. In a system that learns to ride a simulated bicycle to a particular location, the designers tried to speed up learning by providing a positive reward whenever the agent made progress towards the goal. However, no penalty was incurred for riding away from the goal, so the agent could collect the progress reward again and again by riding in a small loop that repeatedly approached and then retreated from the goal.

Hence it became better for the bicycle to ride in circles than to actually reach the goal.
Preliminaries
A finite-state Markov decision process (MDP) is a tuple $M = (S, A, T, \gamma, R)$, where:
- $S$ is a finite set of states;
- $A$ is a set of actions;
- $T = \{P_{sa}(\cdot)\}$ gives the next-state transition probabilities, with $P_{sa}(s')$ the probability of reaching state $s'$ after taking action $a$ in state $s$;
- $\gamma \in [0, 1)$ is the discount factor;
- $R : S \times A \times S \to \mathbb{R}$ is a bounded real-valued function called the reward function.
A policy over a set of states $S$ is a function $\pi : S \to A$.
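As a concrete reference point, the sketch below encodes these objects in Python; the class and field names are my own, and transition probabilities and rewards are stored as dense arrays indexed by state and action.

```python
import numpy as np
from dataclasses import dataclass

@dataclass
class FiniteMDP:
    """A finite MDP M = (S, A, T, gamma, R) with states and actions numbered from 0."""
    n_states: int
    n_actions: int
    P: np.ndarray    # P[s, a, s'] = probability of reaching s' from s under action a
    R: np.ndarray    # R[s, a, s'] = bounded real-valued reward
    gamma: float     # discount factor in [0, 1)

# A deterministic policy pi : S -> A is then just an integer array of length
# n_states with pi[s] in {0, ..., n_actions - 1}.
```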
Thus, for a policy $\pi$ we define:
- the value function $V^\pi(s) = \mathbb{E}\left[\sum_{t=0}^{\infty} \gamma^t R(s_t, a_t, s_{t+1}) \mid s_0 = s, \pi\right]$;
- the Q-function $Q^\pi(s, a) = \mathbb{E}_{s' \sim P_{sa}}\left[R(s, a, s') + \gamma V^\pi(s')\right]$.
Hence, the optimal value function is $V^*(s) = \sup_\pi V^\pi(s)$.
The optimal Q-function is $Q^*(s, a) = \sup_\pi Q^\pi(s, a)$.
An optimal policy is any $\pi^*$ satisfying $\pi^*(s) \in \arg\max_{a \in A} Q^*(s, a)$.
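To make these definitions concrete, here is a minimal value-iteration sketch (the function name and the toy MDP are mine, not from the text) that computes $Q^*$, $V^*$, and a greedy optimal policy for an MDP stored in the array form above.

```python
import numpy as np

def value_iteration(P, R, gamma, tol=1e-8):
    """Compute Q*, V*, and a greedy optimal policy for a finite MDP.

    P[s, a, s'] are transition probabilities, R[s, a, s'] are rewards.
    """
    n_states, n_actions, _ = P.shape
    V = np.zeros(n_states)
    while True:
        # Bellman optimality backup: Q(s, a) = E_{s'}[R(s, a, s') + gamma * V(s')]
        Q = np.einsum("ijk,ijk->ij", P, R + gamma * V[None, None, :])
        V_new = Q.max(axis=1)
        if np.max(np.abs(V_new - V)) < tol:
            return Q, V_new, Q.argmax(axis=1)
        V = V_new

# Tiny two-state, two-action example (made up for illustration).
P = np.array([[[0.9, 0.1], [0.2, 0.8]],
              [[0.0, 1.0], [1.0, 0.0]]])
R = np.zeros((2, 2, 2))
R[:, :, 1] = 1.0                       # reward for landing in state 1
Q_star, V_star, pi_star = value_iteration(P, R, gamma=0.9)
```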
Views
Instead of running our learning algorithm directly on $M = (S, A, T, \gamma, R)$, we will run it on a transformed MDP $M' = (S, A, T, \gamma, R')$, where $R' = R + F$ and $F : S \times A \times S \to \mathbb{R}$ is also a bounded real-valued function, called the shaping reward function.
We are trying to learn a policy for some MDP M, and we wish to help our learning algorithm by giving it additional shaping rewards which will hopefully guide it towards learning a good policy faster.
But for what forms of shaping reward $F$ can we guarantee that $\pi^*_{M'}$, the optimal policy in $M'$, will also be optimal in $M$?
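In the array representation used above, constructing $M'$ amounts to replacing only the reward array; a trivial sketch (names are mine):

```python
import numpy as np

def shaped_reward_array(R, F):
    """Reward array of the transformed MDP M' = (S, A, T, gamma, R + F).

    Both R and F are indexed as [s, a, s']; the states, actions, transition
    probabilities, and discount factor of M are left untouched.
    """
    return R + F
```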
Definition. A shaping reward function $F$ is potential-based if there exists a function $\Phi : S \to \mathbb{R}$ s.t. for all $s \in S$, $a \in A$, $s' \in S$:
$$F(s, a, s') = \gamma \Phi(s') - \Phi(s).$$
Theorem. If $F$ is a potential-based shaping function, then every optimal policy in $M'$ will also be an optimal policy in $M$.
Proof omitted; the key observation is that the shaping rewards telescope: along any trajectory starting at $s_0$, $\sum_{t \geq 0} \gamma^t \left(\gamma \Phi(s_{t+1}) - \Phi(s_t)\right) = -\Phi(s_0)$, so $V^\pi_{M'}(s) = V^\pi_M(s) - \Phi(s)$ for every policy $\pi$ and the relative ordering of policies is unchanged.
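A sketch of this construction in the array form used earlier (names are mine): build $F$ from a state potential $\Phi$, so that $Q^*_{M'}(s, a) = Q^*_M(s, a) - \Phi(s)$ and the greedy optimal policy is unchanged. The commented check reuses `value_iteration` and the toy `P`, `R`, `pi_star` from the earlier sketch.

```python
import numpy as np

def potential_based_F(phi, gamma, n_actions):
    """Build F[s, a, s'] = gamma * phi[s'] - phi[s] from a potential over states."""
    n_states = phi.shape[0]
    F = gamma * phi[None, None, :] - phi[:, None, None]   # shape (S, 1, S)
    return np.broadcast_to(F, (n_states, n_actions, n_states)).copy()

# Quick numerical check, reusing the earlier sketch's value_iteration, P, R:
#   F = potential_based_F(np.array([0.0, 5.0]), 0.9, n_actions=2)
#   Q2, V2, pi2 = value_iteration(P, R + F, 0.9)
#   assert np.array_equal(pi2, pi_star)   # same greedy optimal policy as in M
```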
Conclusion
This suggests that a way to define a good potential function might be to try to approximate $V^*$, the optimal value function of the original MDP $M$.
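For example (an illustration of mine, not from the text), in the bicycle setting a crude stand-in for $V^*$ is the negated distance to the goal. Used as a potential, it rewards genuine progress, but the shaping terms telescope, so the total shaping reward obtainable on any trajectory is bounded by $-\Phi$ of the start state and cannot be farmed by riding in circles.

```python
import numpy as np

def distance_potential(goal_xy):
    """Potential phi(s) = -||position(s) - goal||, a rough proxy for V*."""
    def phi(position_xy):
        return -np.linalg.norm(np.asarray(position_xy) - np.asarray(goal_xy))
    return phi

def shaping_reward(phi, gamma, s, s_next):
    """Potential-based shaping term F(s, a, s') = gamma * phi(s') - phi(s)."""
    return gamma * phi(s_next) - phi(s)

# Moving toward the goal earns a positive shaping term, but no loop can
# accumulate more than -phi(start) in total, so the circling loophole from
# the introduction disappears.
phi = distance_potential(goal_xy=(100.0, 0.0))
print(shaping_reward(phi, gamma=0.99, s=(0.0, 0.0), s_next=(1.0, 0.0)))  # > 0
```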