Introduction
Often, a very simple pattern of extra rewards suffices to render straightforward an otherwise completely intractable problem.
Current Problems
Consider the following example of a bug that can arise. In a system that learns to ride a simulated bicycle to a particular location, the designers tried to speed up learning by providing a positive reward whenever the agent made progress towards the goal. However, no penalty was incurred for riding away from the goal, so the agent could collect the progress reward again and again by riding in a small loop that repeatedly approached and then retreated from the goal.

Hence it became better for the bicycle to ride in circles than to actually reach the goal.
Preliminaries
A finite-state Markov decision process (MDP) is a tuple $M = (S, A, T, \gamma, R)$, where:
- $S$ is a finite set of states;
- $A$ is a set of actions;
- $T = \{P_{sa}(\cdot)\}$ gives the next-state transition probabilities, with $P_{sa}(s')$ the probability of reaching state $s'$ after taking action $a$ in state $s$;
- $\gamma \in [0, 1)$ is the discount factor;
- $R : S \times A \times S \to \mathbb{R}$ is a bounded real-valued function called the reward function.
A policy over a set of states $S$ is a function $\pi : S \to A$.
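As a concrete reference point, the sketch below encodes these objects in Python; the class and field names are my own, and transition probabilities and rewards are stored as dense arrays indexed by state and action.

```python
import numpy as np
from dataclasses import dataclass

@dataclass
class FiniteMDP:
    """A finite MDP M = (S, A, T, gamma, R) with states and actions numbered from 0."""
    n_states: int
    n_actions: int
    P: np.ndarray    # P[s, a, s'] = probability of reaching s' from s under action a
    R: np.ndarray    # R[s, a, s'] = bounded real-valued reward
    gamma: float     # discount factor in [0, 1)

# A deterministic policy pi : S -> A is then just an integer array of length
# n_states with pi[s] in {0, ..., n_actions - 1}.
```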
Thus, for a policy $\pi$ we define:
- the value function $V^\pi(s) = \mathbb{E}\left[\sum_{t=0}^{\infty} \gamma^t R(s_t, a_t, s_{t+1}) \mid s_0 = s, \pi\right]$;
- the Q-function $Q^\pi(s, a) = \mathbb{E}_{s' \sim P_{sa}}\left[R(s, a, s') + \gamma V^\pi(s')\right]$.
Hence, the optimal value function is $V^*(s) = \sup_\pi V^\pi(s)$.
The optimal Q-function is $Q^*(s, a) = \sup_\pi Q^\pi(s, a)$.
An optimal policy is any $\pi^*$ satisfying $\pi^*(s) \in \arg\max_{a \in A} Q^*(s, a)$.
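To make these definitions concrete, here is a minimal value-iteration sketch (the function name and the toy MDP are mine, not from the text) that computes $Q^*$, $V^*$, and a greedy optimal policy for an MDP stored in the array form above.

```python
import numpy as np

def value_iteration(P, R, gamma, tol=1e-8):
    """Compute Q*, V*, and a greedy optimal policy for a finite MDP.

    P[s, a, s'] are transition probabilities, R[s, a, s'] are rewards.
    """
    n_states, n_actions, _ = P.shape
    V = np.zeros(n_states)
    while True:
        # Bellman optimality backup: Q(s, a) = E_{s'}[R(s, a, s') + gamma * V(s')]
        Q = np.einsum("ijk,ijk->ij", P, R + gamma * V[None, None, :])
        V_new = Q.max(axis=1)
        if np.max(np.abs(V_new - V)) < tol:
            return Q, V_new, Q.argmax(axis=1)
        V = V_new

# Tiny two-state, two-action example (made up for illustration).
P = np.array([[[0.9, 0.1], [0.2, 0.8]],
              [[0.0, 1.0], [1.0, 0.0]]])
R = np.zeros((2, 2, 2))
R[:, :, 1] = 1.0                       # reward for landing in state 1
Q_star, V_star, pi_star = value_iteration(P, R, gamma=0.9)
```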
Views
Instead of running our learning algorithm directly on $M = (S, A, T, \gamma, R)$, we will run it on a transformed MDP $M' = (S, A, T, \gamma, R')$, where $R' = R + F$ and $F : S \times A \times S \to \mathbb{R}$ is also a bounded real-valued function, called the shaping reward function.
We are trying to learn a policy for some MDP M, and we wish to help our learning algorithm by giving it additional shaping rewards which will hopefully guide it towards learning a good policy faster.
But for what forms of shaping reward $F$ can we guarantee that $\pi^*_{M'}$, the optimal policy in $M'$, will also be optimal in $M$?
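In the array representation used above, constructing $M'$ amounts to replacing only the reward array; a trivial sketch (names are mine):

```python
import numpy as np

def shaped_reward_array(R, F):
    """Reward array of the transformed MDP M' = (S, A, T, gamma, R + F).

    Both R and F are indexed as [s, a, s']; the states, actions, transition
    probabilities, and discount factor of M are left untouched.
    """
    return R + F
```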
Definition. A shaping reward function $F$ is potential-based if there exists a function $\Phi : S \to \mathbb{R}$ s.t. for all $s \in S$, $a \in A$, $s' \in S$:
$$F(s, a, s') = \gamma \Phi(s') - \Phi(s).$$
Theorem. If $F$ is a potential-based shaping function, then every optimal policy in $M'$ will also be an optimal policy in $M$.
Proof omitted; the key observation is that the shaping rewards telescope: along any trajectory starting at $s_0$, $\sum_{t \geq 0} \gamma^t \left(\gamma \Phi(s_{t+1}) - \Phi(s_t)\right) = -\Phi(s_0)$, so $V^\pi_{M'}(s) = V^\pi_M(s) - \Phi(s)$ for every policy $\pi$ and the relative ordering of policies is unchanged.
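A sketch of this construction in the array form used earlier (names are mine): build $F$ from a state potential $\Phi$, so that $Q^*_{M'}(s, a) = Q^*_M(s, a) - \Phi(s)$ and the greedy optimal policy is unchanged. The commented check reuses `value_iteration` and the toy `P`, `R`, `pi_star` from the earlier sketch.

```python
import numpy as np

def potential_based_F(phi, gamma, n_actions):
    """Build F[s, a, s'] = gamma * phi[s'] - phi[s] from a potential over states."""
    n_states = phi.shape[0]
    F = gamma * phi[None, None, :] - phi[:, None, None]   # shape (S, 1, S)
    return np.broadcast_to(F, (n_states, n_actions, n_states)).copy()

# Quick numerical check, reusing the earlier sketch's value_iteration, P, R:
#   F = potential_based_F(np.array([0.0, 5.0]), 0.9, n_actions=2)
#   Q2, V2, pi2 = value_iteration(P, R + F, 0.9)
#   assert np.array_equal(pi2, pi_star)   # same greedy optimal policy as in M
```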
Conclusion
This suggests that a way to define a good potential function might be to try to approximate $V^*$, the optimal value function of the original MDP $M$.
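For example (an illustration of mine, not from the text), in the bicycle setting a crude stand-in for $V^*$ is the negated distance to the goal. Used as a potential, it rewards genuine progress, but the shaping terms telescope, so the total shaping reward obtainable on any trajectory is bounded by $-\Phi$ of the start state and cannot be farmed by riding in circles.

```python
import numpy as np

def distance_potential(goal_xy):
    """Potential phi(s) = -||position(s) - goal||, a rough proxy for V*."""
    def phi(position_xy):
        return -np.linalg.norm(np.asarray(position_xy) - np.asarray(goal_xy))
    return phi

def shaping_reward(phi, gamma, s, s_next):
    """Potential-based shaping term F(s, a, s') = gamma * phi(s') - phi(s)."""
    return gamma * phi(s_next) - phi(s)

# Moving toward the goal earns a positive shaping term, but no loop can
# accumulate more than -phi(start) in total, so the circling loophole from
# the introduction disappears.
phi = distance_potential(goal_xy=(100.0, 0.0))
print(shaping_reward(phi, gamma=0.99, s=(0.0, 0.0), s_next=(1.0, 0.0)))  # > 0
```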