Preliminaries
Policy gradient methods maximize the expected total reward by repeatedly estimating the gradient $g := \nabla_{\theta}\mathbf{E}\left[\sum_{t=0}^{\infty}r_{t}\right]$. There are several different related expressions for the policy gradient, which have the form
\begin{equation*} g = \mathbf{E}\left[ \sum_{t=0}^{\infty}\Psi_{t}\nabla_{\theta}\log\pi_{\theta}(a_{t}|s_{t}) \right] \end{equation*}
where $\Psi_{t}$ may be one of the following (all of them unbiased):
- $\sum_{t=0}^{\infty}r_{t}$: total reward of the trajectory.
- $\sum_{t'=t}^{\infty}r_{t'}$: rewards following action $a_{t}$.
- $\sum_{t'=t}^{\infty}r_{t'}-b(s_{t})$: baselined version of previous formula.
- $Q^{\pi}(s_{t},a_{t})$: state-action value function.
- $A^{\pi}(s_{t},a_{t})$: advantage function.
- $r_{t}+V^{\pi}(s_{t+1})-V^{\pi}(s_{t})$: TD residual.
The first three formulas are derived in Policy Gradient.
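For concreteness, here is a minimal single-episode sketch (my own illustration, not code from the paper) with $\Psi_{t}$ set to the rewards following action $a_{t}$, for a hypothetical tabular softmax policy parameterized by a matrix `theta` of shape `(num_states, num_actions)`:

```python
# A minimal sketch (not from the paper): score-function policy gradient for one
# episode, with Psi_t = sum of rewards following a_t, for a tabular softmax policy.
import numpy as np

def softmax(x):
    z = x - x.max()
    e = np.exp(z)
    return e / e.sum()

def grad_log_softmax(theta_row, a):
    # Gradient of log pi(a|s) w.r.t. theta[s, :] for a softmax policy:
    # one_hot(a) - pi(.|s)
    pi = softmax(theta_row)
    one_hot = np.zeros_like(pi)
    one_hot[a] = 1.0
    return one_hot - pi

def policy_gradient_estimate(theta, episode):
    """theta: (num_states, num_actions) float array; episode: list of (state, action, reward)."""
    grad = np.zeros_like(theta)
    rewards = np.array([r for (_, _, r) in episode], dtype=float)
    for t, (s, a, _) in enumerate(episode):
        psi_t = rewards[t:].sum()  # rewards following action a_t
        grad[s] += psi_t * grad_log_softmax(theta[s], a)
    return grad
```

Averaging this estimate over many sampled episodes approximates $g$; subtracting a baseline $b(s_{t})$ from `psi_t` (the third formula) keeps it unbiased while reducing variance.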
The latter formulas use the definitions:
\begin{equation*} V^{\pi}(s_t) := \mathbf{E}_{s_{t+1}:\infty,a_{t}:\infty} \left[ \sum_{l=0}^{\infty}r_{t+l} \right] \end{equation*}
\begin{equation*} Q^{\pi}(s_t,a_t) := \mathbf{E}_{s_{t+1}:\infty,a_{t+1}:\infty} \left[ \sum_{l=0}^{\infty}r_{t+l} \right] \end{equation*}
\begin{equation*} A^{\pi}(s_t, a_t) := Q^{\pi}(s_t, a_t) - V^{\pi}(s_t) \end{equation*}
The advantage function measures whether the action is better or worse than the policy’s default behavior. The term $A^{\pi}(s_{t},a_{t})\nabla_{\theta}\log\pi_{\theta}(a_{t}|s_{t})$ points in the direction of increased $\pi_{\theta}(a_{t}|s_{t})$ if and only if $A^{\pi}(s_{t},a_{t}) > 0$.
Then we introduce a parameter $\gamma$ that allows us to reduce variance by downweighting rewards corresponding to delayed effects, at the cost of introducing bias. The discounted value functions are given by:
\begin{equation*} V^{\pi,\gamma}(s_t) := \mathbf{E}_{s_{t+1}:\infty,a_{t}:\infty} \left[ \sum_{l=0}^{\infty}\gamma^{l}r_{t+l} \right] \end{equation*}
\begin{equation*} Q^{\pi,\gamma}(s_t,a_t) := \mathbf{E}_{s_{t+1}:\infty,a_{t+1}:\infty} \left[ \sum_{l=0}^{\infty}\gamma^{l}r_{t+l} \right] \end{equation*}
\begin{equation*} A^{\pi,\gamma}(s_t, a_t) := Q^{\pi,\gamma}(s_t, a_t) - V^{\pi,\gamma}(s_t) \end{equation*}
This discounting yields a biased (but not too biased) estimate of $g$.
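As a concrete reference point (a sketch, not from the paper), the discounted return $\sum_{l=0}^{\infty}\gamma^{l}r_{t+l}$ inside these expectations can be computed for a single finite rollout with one backward pass:

```python
# A sketch (assumption, not from the paper): discounted return sum_l gamma^l r_{t+l}
# for every t of a single finite rollout, computed with one backward pass.
import numpy as np

def discounted_returns(rewards, gamma):
    """rewards: array of r_0, ..., r_{T-1}; the rollout is assumed to terminate at T."""
    returns = np.zeros(len(rewards))
    running = 0.0
    for t in reversed(range(len(rewards))):
        running = rewards[t] + gamma * running  # R_t = r_t + gamma * R_{t+1}
        returns[t] = running
    return returns
```

Averaging `discounted_returns(...)[0]` over many rollouts started from $s_{t}$ under $\pi$ gives a Monte Carlo estimate of $V^{\pi,\gamma}(s_{t})$.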
And the discounted approximation to the policy gradient is
\begin{equation*} g^{\gamma} := \mathbf{E}_{s_{0:\infty},a_{0:\infty}} \left[ \sum_{t=0}^{\infty} A^{\pi,\gamma}(s_t,a_t)\nabla_{\theta}\log\pi_{\theta}(a_t|s_t) \right] \end{equation*}
Here we are concerned with obtaining unbiased estimates of $g^{\gamma}$.
Definition: The estimator $\hat{A}_{t}$ is γ-just if
\begin{equation*} \mathbf{E}_{s_{0:\infty},a_{0:\infty}} \left[ \hat{A}_{t}(s_{0:\infty},a_{0:\infty})\nabla_{\theta}\log\pi_{\theta}(a_t|s_t) \right] = \mathbf{E}_{s_{0:\infty},a_{0:\infty}} \left[ A^{\pi,\gamma}(s_t,a_t)\nabla_{\theta}\log\pi_{\theta}(a_t|s_t) \right] \end{equation*}
It follows immediately that if $\hat{A}_{t}$ is γ-just for all $t$, then
\begin{equation*} \mathbf{E}_{s_{0:\infty},a_{0:\infty}} \left[ \sum_{t=0}^{\infty}\hat{A}_{t}(s_{0:\infty},a_{0:\infty})\nabla_{\theta}\log\pi_{\theta}(a_t|s_t) \right] = g^{\gamma} \end{equation*}
Suppose that $\hat{A}_{t}$ can be written in the form $\hat{A}_{t}(s_{0:\infty},a_{0:\infty}) = Q_{t}(s_{t:\infty},a_{t:\infty}) - b_{t}(s_{0:t},a_{0:t-1})$ such that for all $(s_{t},a_{t})$, $\mathbf{E}_{s_{t+1:\infty},a_{t+1:\infty}|s_{t},a_{t}}\left[Q_{t}(s_{t:\infty},a_{t:\infty})\right] = Q^{\pi,\gamma}(s_{t},a_{t})$; then $\hat{A}_{t}$ is γ-just.
I have not proved the above; I did not fully understand the proof in the original paper.
It’s easy to verify that the following expressions are γ-just advantage estimators for $\hat{A}_{t}$:
- $\sum_{l=0}^{\infty}\gamma^{l}r_{t+l}$
- $A^{\pi,\gamma}(s_{t},a_{t})$
- $Q^{\pi,\gamma}(s_{t},a_{t})$
- $r_{t}+\gamma V^{\pi,\gamma}(s_{t+1})-V^{\pi,\gamma}(s_{t})$
Advantage Function estimation
Our estimator of the policy gradient is
\begin{equation*} \hat{g} = \frac{1}{N}\sum_{n=1}^{N}\sum_{t=0}^{\infty}\hat{A}_{t}^{n}\nabla_{\theta}\log\pi_{\theta}(a_{t}^{n}|s_{t}^{n}) \end{equation*}
where $n$ indexes over a batch of episodes.
Let $V$ be an approximate value function. Define $\delta_{t}^{V} := r_{t}+\gamma V(s_{t+1})-V(s_{t})$, i.e., the TD residual of $V$ with discount $\gamma$.
If we have the correct value function $V = V^{\pi,\gamma}$, then $\delta_{t}^{V^{\pi,\gamma}}$ is an unbiased estimator of $A^{\pi,\gamma}$:
\begin{aligned} \mathbf{E}_{s_{t+1}}\left[\delta_{t}^{V^{\pi,\gamma}}\right] &= \mathbf{E}_{s_{t+1}}\left[ r_t+\gamma V^{\pi,\gamma}(s_{t+1}) - V^{\pi,\gamma}(s_t) \right] \\ &= \mathbf{E}_{s_{t+1}}\left[ Q^{\pi,\gamma}(s_{t}, a_{t}) - V^{\pi,\gamma}(s_t) \right] \\ &= A^{\pi,\gamma}(s_t,a_t) \end{aligned}
However, this estimator is only γ-just for $V = V^{\pi,\gamma}$; otherwise it yields biased policy gradient estimates.
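In code, the TD residuals along a finite rollout can be computed in one vectorized step (a sketch with a hypothetical `td_residuals` helper; the rollout is assumed to end at a terminal state, so $V(s_{T}) = 0$):

```python
# Sketch (hypothetical helper): TD residuals delta_t = r_t + gamma*V(s_{t+1}) - V(s_t)
# along a finite rollout that ends at a terminal state, so V(s_T) is taken to be 0.
import numpy as np

def td_residuals(rewards, values, gamma):
    """rewards: r_0..r_{T-1}; values: V(s_0)..V(s_{T-1}) from the approximate V."""
    rewards = np.asarray(rewards, dtype=float)
    values = np.asarray(values, dtype=float)
    values_next = np.append(values[1:], 0.0)  # V(s_{t+1}), with V(s_T) := 0
    return rewards + gamma * values_next - values
```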
Next, let us consider taking the sum of $k$ of these δ terms, which we will denote by $\hat{A}_{t}^{(k)}$:
\begin{equation*} \hat{A}_{t}^{(k)} := \sum_{l=0}^{k-1}\gamma^{l}\delta_{t+l}^{V} = -V(s_{t}) + r_{t} + \gamma r_{t+1} + \cdots + \gamma^{k-1}r_{t+k-1} + \gamma^{k}V(s_{t+k}) \end{equation*}
Letting $k \to \infty$ gives
\begin{equation*} \hat{A}_{t}^{(\infty)} := \sum_{l=0}^{\infty}\gamma^{l}\delta_{t+l}^{V} = -V(s_{t}) + \sum_{l=0}^{\infty}\gamma^{l}r_{t+l} \end{equation*}
which is simply the empirical returns minus the value function baseline.
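A sketch of $\hat{A}_{t}^{(k)}$ under the same truncation assumption (hypothetical helper, not from the paper):

```python
# Sketch (hypothetical helper): k-step advantage estimator
# A_t^(k) = sum_{l=0}^{k-1} gamma^l delta_{t+l}^V, truncated at the end of the rollout.
import numpy as np

def k_step_advantage(rewards, values, gamma, k):
    rewards = np.asarray(rewards, dtype=float)
    values = np.asarray(values, dtype=float)
    deltas = rewards + gamma * np.append(values[1:], 0.0) - values  # delta_t^V with V(s_T) := 0
    T = len(deltas)
    return np.array([sum(gamma ** l * deltas[t + l] for l in range(min(k, T - t)))
                     for t in range(T)])
```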
The GAE (generalized advantage estimator) is defined as the exponentially-weighted average of these $k$-step estimators:
\begin{aligned} \hat{A}_{t}^{\mathrm{GAE}(\gamma, \lambda)} &:=(1-\lambda)\left(\hat{A}_{t}^{(1)}+\lambda \hat{A}_{t}^{(2)}+\lambda^{2} \hat{A}_{t}^{(3)}+\ldots\right) \\ &=(1-\lambda)\left(\delta_{t}^{V}+\lambda\left(\delta_{t}^{V}+\gamma \delta_{t+1}^{V}\right)+\lambda^{2}\left(\delta_{t}^{V}+\gamma \delta_{t+1}^{V}+\gamma^{2} \delta_{t+2}^{V}\right)+\ldots\right) \\ &=(1-\lambda)\left(\delta_{t}^{V}\left(1+\lambda+\lambda^{2}+\ldots\right)+\gamma \delta_{t+1}^{V}\left(\lambda+\lambda^{2}+\lambda^{3}+\ldots\right)\right.\\ &\left.\quad+\gamma^{2} \delta_{t+2}^{V}\left(\lambda^{2}+\lambda^{3}+\lambda^{4}+\ldots\right)+\ldots\right) \\ =&(1-\lambda)\left(\delta_{t}^{V}\left(\frac{1}{1-\lambda}\right)+\gamma \delta_{t+1}^{V}\left(\frac{\lambda}{1-\lambda}\right)+\gamma^{2} \delta_{t+2}^{V}\left(\frac{\lambda^{2}}{1-\lambda}\right)+\ldots\right) \\ =& \sum_{l=0}^{\infty}(\gamma \lambda)^{l} \delta_{t+l}^{V} \end{aligned}
There are two notable special cases of this formula, obtained by setting $\lambda = 0$ and $\lambda = 1$.
\begin{array}{ll} \operatorname{GAE}(\gamma, 0): & \hat{A}_{t}:=\delta_{t} \quad=r_{t}+\gamma V\left(s_{t+1}\right)-V\left(s_{t}\right) \\ \operatorname{GAE}(\gamma, 1): & \hat{A}_{t}:=\sum_{l=0}^{\infty} \gamma^{l} \delta_{t+l}=\sum_{l=0}^{\infty} \gamma^{l} r_{t+l}-V\left(s_{t}\right) \end{array}
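The sum $\sum_{l=0}^{\infty}(\gamma\lambda)^{l}\delta_{t+l}^{V}$ satisfies the recursion $\hat{A}_{t} = \delta_{t}^{V} + \gamma\lambda\hat{A}_{t+1}$, so for a finite rollout it can be computed with a single backward pass. A minimal sketch (an assumption on my part, not code from the paper), truncating at a terminal state so that $V(s_{T}) = 0$:

```python
# Sketch: generalized advantage estimator via the backward recursion
# A_t = delta_t + gamma * lambda * A_{t+1}, for a rollout that ends at a terminal
# state (so V(s_T) = 0 and the infinite sum truncates).
import numpy as np

def gae_advantages(rewards, values, gamma, lam):
    rewards = np.asarray(rewards, dtype=float)
    values = np.asarray(values, dtype=float)
    deltas = rewards + gamma * np.append(values[1:], 0.0) - values  # delta_t^V
    advantages = np.zeros_like(deltas)
    running = 0.0
    for t in reversed(range(len(deltas))):
        running = deltas[t] + gamma * lam * running
        advantages[t] = running
    return advantages
```

Passing `lam=0.0` or `lam=1.0` recovers the two special cases above.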
$\operatorname{GAE}(\gamma, 0)$ is γ-just for $V = V^{\pi,\gamma}$ and otherwise induces bias, but it typically has much lower variance. $\operatorname{GAE}(\gamma, 1)$ is γ-just regardless of the accuracy of $V$, but it has high variance due to the sum of terms. Empirically, we find that the best value of $\lambda$ is much lower than the best value of $\gamma$, likely because $\lambda$ introduces far less bias than $\gamma$ for a reasonably accurate value function.
Using the generalized advantage estimator, the discounted policy gradient can be estimated as
\begin{equation*} g^{\gamma} \approx \mathbf{E}\left[ \sum_{t=0}^{\infty}\hat{A}_{t}^{\mathrm{GAE}(\gamma,\lambda)}\nabla_{\theta}\log\pi_{\theta}(a_t|s_t) \right] \end{equation*}
where equality holds when $\lambda = 1$.
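As a quick sanity check of the $\lambda = 0$ and $\lambda = 1$ cases, reusing the hypothetical `gae_advantages` sketch above with arbitrary made-up numbers:

```python
# Quick sanity check of the lambda = 0 and lambda = 1 special cases, reusing the
# hypothetical gae_advantages sketch above with arbitrary made-up numbers.
import numpy as np

rewards = np.array([1.0, 0.0, 2.0, 1.0])
values = np.array([0.5, 0.4, 1.0, 0.2])
gamma = 0.99

# lambda = 0: reduces to the one-step TD residual delta_t.
deltas = rewards + gamma * np.append(values[1:], 0.0) - values
assert np.allclose(gae_advantages(rewards, values, gamma, lam=0.0), deltas)

# lambda = 1: reduces to the discounted empirical return minus the baseline V(s_t).
returns = np.array([sum(gamma ** l * rewards[t + l] for l in range(len(rewards) - t))
                    for t in range(len(rewards))])
assert np.allclose(gae_advantages(rewards, values, gamma, lam=1.0), returns - values)
```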