Algorithm
In REINFORCE, we know that the policy gradient is estimated as $\nabla_\theta J(\theta) = \mathbb{E}_\pi \big[ G_t \, \nabla_\theta \ln \pi_\theta(a_t \vert s_t) \big]$, where $G_t$ is the sampled return. The variance of this gradient estimator scales unfavorably with the time horizon, since the effect of an action is confounded with the effects of past and future actions.
It makes a lot of sense to learn the value function in addition to the policy, since knowing the value function can assist the policy update, for example by reducing gradient variance in REINFORCE, and that is exactly what Actor-Critic does.
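Concretely (a standard rewriting, stated here with the same symbols as the algorithm below), the critic's estimate $Q_w(s, a)$ stands in for the sampled return $G_t$ in the score-function estimator:

```latex
% REINFORCE: the score function is weighted by the sampled return G_t (high variance)
\nabla_\theta J(\theta) = \mathbb{E}_{\pi}\left[ G_t \, \nabla_\theta \ln \pi_\theta(a \mid s) \right]

% Actor-critic: the learned critic Q_w(s, a) replaces G_t
\nabla_\theta J(\theta) \approx \mathbb{E}_{\pi}\left[ Q_w(s, a) \, \nabla_\theta \ln \pi_\theta(a \mid s) \right]
```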
Actor-critic methods consist of two components, which may optionally share parameters:
- Critic updates the value function parameters $w$; depending on the algorithm, it could be the action-value $Q_w(a \vert s)$ or the state-value $V_w(s)$.
- Actor updates the policy parameters $\theta$ for $\pi_\theta(a \vert s)$, in the direction suggested by the critic.
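As a minimal sketch of these two components (linear function approximation over a one-hot featurization is an assumption here; the source does not fix a parameterization), the actor and critic can be written as separate heads over a shared representation:

```python
import numpy as np

# One-hot featurization phi(s); the actor and critic heads share this representation.
def features(s, n_states=5):
    phi = np.zeros(n_states)
    phi[s] = 1.0
    return phi

theta = np.zeros((5, 2))  # actor head: parameters of pi_theta(a|s)
w = np.zeros((5, 2))      # critic head: parameters of the action-value Q_w(s, a)

def pi(s):
    """Softmax policy from the actor head."""
    logits = features(s) @ theta
    p = np.exp(logits - logits.max())
    return p / p.sum()

def q(s, a):
    """Action-value estimate from the critic head."""
    return features(s) @ w[:, a]
```

Sharing `features(s)` is the "optionally share parameters" case; with separate featurizations, the actor and critic would be fully independent.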
Implementation

A simple action-value actor-critic algorithm proceeds as follows:
- Initialize the state $s$, the actor parameters $\theta$, and the critic parameters $w$ at random; sample the first action $a \sim \pi_\theta(a \vert s)$;
- For $t = 1, \dots, T$:
  - Sample the reward $r_t \sim R(s, a)$ and the next state $s' \sim P(s' \vert s, a)$;
  - Sample the next action $a' \sim \pi_\theta(a' \vert s')$;
  - Update the policy parameters by $\theta \leftarrow \theta + \alpha_\theta \, Q_w(s, a) \, \nabla_\theta \ln \pi_\theta(a \vert s)$;
  - Compute the TD residual $\delta_t = r_t + \gamma \, Q_w(s', a') - Q_w(s, a)$;
  - Update the critic parameters by $w \leftarrow w + \alpha_w \, \delta_t \, \nabla_w Q_w(s, a)$;
  - Update $a \leftarrow a'$ and $s \leftarrow s'$.

Here $\alpha_\theta$ and $\alpha_w$ are the learning rates for the actor and critic updates, respectively.
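Below is a minimal runnable sketch of these steps, using a tabular softmax policy and a tabular action-value critic; the toy 5-state chain environment, learning rates, and step count are illustrative assumptions, not part of the source algorithm.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy 5-state chain MDP (hypothetical, for illustration only): action 1 moves
# right, action 0 moves left; reaching the rightmost state yields reward 1.
n_states, n_actions, gamma = 5, 2, 0.9

def step(s, a):
    """Sample reward r_t ~ R(s, a) and next state s' ~ P(s'|s, a) (deterministic here)."""
    s_next = min(s + 1, n_states - 1) if a == 1 else max(s - 1, 0)
    r = 1.0 if s_next == n_states - 1 else 0.0
    return r, s_next

theta = np.zeros((n_states, n_actions))   # actor parameters (softmax preferences)
w = np.zeros((n_states, n_actions))       # critic parameters (tabular Q_w)
alpha_theta, alpha_w = 0.05, 0.1          # assumed learning rates

def policy(s):
    """pi_theta(.|s): softmax over action preferences."""
    prefs = theta[s] - theta[s].max()     # subtract max for numerical stability
    p = np.exp(prefs)
    return p / p.sum()

# Initialize s at random; sample a ~ pi_theta(a|s)
s = int(rng.integers(n_states))
a = int(rng.choice(n_actions, p=policy(s)))

for t in range(5000):
    r, s_next = step(s, a)                                 # r_t, s'
    a_next = int(rng.choice(n_actions, p=policy(s_next)))  # a' ~ pi_theta(.|s')

    # Actor: theta <- theta + alpha_theta * Q_w(s, a) * grad_theta ln pi_theta(a|s).
    # For a tabular softmax policy, the gradient of ln pi(a|s) w.r.t. theta[s]
    # is onehot(a) - pi(.|s).
    grad_log_pi = -policy(s)
    grad_log_pi[a] += 1.0
    theta[s] += alpha_theta * w[s, a] * grad_log_pi

    # TD residual: delta_t = r_t + gamma * Q_w(s', a') - Q_w(s, a)
    delta = r + gamma * w[s_next, a_next] - w[s, a]

    # Critic: w <- w + alpha_w * delta_t * grad_w Q_w(s, a); for a tabular
    # critic the gradient is 1 at the visited (s, a) entry and 0 elsewhere.
    w[s, a] += alpha_w * delta

    s, a = s_next, a_next                                  # a <- a', s <- s'

print("learned policy per state:",
      np.array([policy(i) for i in range(n_states)]).round(2))
```

Because the critic is tabular, $\nabla_w Q_w(s, a)$ is an indicator on the visited $(s, a)$ entry, which is why the critic update in the loop reduces to a single-entry increment.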