Algorithm
In REINFORCE, we know that the policy gradient is estimated as $\nabla_\theta J(\theta) = \mathbb{E}_\pi \big[ G_t \, \nabla_\theta \ln \pi_\theta(a_t \vert s_t) \big]$, where $G_t$ is the sampled return. The variance of this gradient estimator scales unfavorably with the time horizon, since the effect of an action is confounded with the effects of past and future actions.
It makes a lot of sense to learn the value function in addition to the policy, since knowing the value function can assist the policy update, for example by reducing gradient variance in REINFORCE, and that is exactly what Actor-Critic does.
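Concretely (a standard rewriting, stated here with the same symbols as the algorithm below), the critic's estimate $Q_w(s, a)$ stands in for the sampled return $G_t$ in the score-function estimator:

```latex
% REINFORCE: the score function is weighted by the sampled return G_t (high variance)
\nabla_\theta J(\theta) = \mathbb{E}_{\pi}\left[ G_t \, \nabla_\theta \ln \pi_\theta(a \mid s) \right]

% Actor-critic: the learned critic Q_w(s, a) replaces G_t
\nabla_\theta J(\theta) \approx \mathbb{E}_{\pi}\left[ Q_w(s, a) \, \nabla_\theta \ln \pi_\theta(a \mid s) \right]
```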
Actor-critic methods consist of two components, which may optionally share parameters:
- Critic updates the value function parameters $w$; depending on the algorithm, it could be the action-value $Q_w(a \vert s)$ or the state-value $V_w(s)$.
- Actor updates the policy parameters $\theta$ for $\pi_\theta(a \vert s)$, in the direction suggested by the critic.
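As a minimal sketch of these two components (linear function approximation over a one-hot featurization is an assumption here; the source does not fix a parameterization), the actor and critic can be written as separate heads over a shared representation:

```python
import numpy as np

# One-hot featurization phi(s); the actor and critic heads share this representation.
def features(s, n_states=5):
    phi = np.zeros(n_states)
    phi[s] = 1.0
    return phi

theta = np.zeros((5, 2))  # actor head: parameters of pi_theta(a|s)
w = np.zeros((5, 2))      # critic head: parameters of the action-value Q_w(s, a)

def pi(s):
    """Softmax policy from the actor head."""
    logits = features(s) @ theta
    p = np.exp(logits - logits.max())
    return p / p.sum()

def q(s, a):
    """Action-value estimate from the critic head."""
    return features(s) @ w[:, a]
```

Sharing `features(s)` is the "optionally share parameters" case; with separate featurizations, the actor and critic would be fully independent.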
Implementation

A simple action-value actor-critic algorithm proceeds as follows:
- Initialize the state $s$, the actor parameters $\theta$, and the critic parameters $w$ at random; sample the first action $a \sim \pi_\theta(a \vert s)$;
- For $t = 1, \dots, T$:
  - Sample the reward $r_t \sim R(s, a)$ and the next state $s' \sim P(s' \vert s, a)$;
  - Sample the next action $a' \sim \pi_\theta(a' \vert s')$;
  - Update the policy parameters by $\theta \leftarrow \theta + \alpha_\theta \, Q_w(s, a) \, \nabla_\theta \ln \pi_\theta(a \vert s)$;
  - Compute the TD residual $\delta_t = r_t + \gamma \, Q_w(s', a') - Q_w(s, a)$;
  - Update the critic parameters by $w \leftarrow w + \alpha_w \, \delta_t \, \nabla_w Q_w(s, a)$;
  - Update $a \leftarrow a'$ and $s \leftarrow s'$.

Here $\alpha_\theta$ and $\alpha_w$ are the learning rates for the actor and critic updates, respectively.
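Below is a minimal runnable sketch of these steps, using a tabular softmax policy and a tabular action-value critic; the toy 5-state chain environment, learning rates, and step count are illustrative assumptions, not part of the source algorithm.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy 5-state chain MDP (hypothetical, for illustration only): action 1 moves
# right, action 0 moves left; reaching the rightmost state yields reward 1.
n_states, n_actions, gamma = 5, 2, 0.9

def step(s, a):
    """Sample reward r_t ~ R(s, a) and next state s' ~ P(s'|s, a) (deterministic here)."""
    s_next = min(s + 1, n_states - 1) if a == 1 else max(s - 1, 0)
    r = 1.0 if s_next == n_states - 1 else 0.0
    return r, s_next

theta = np.zeros((n_states, n_actions))   # actor parameters (softmax preferences)
w = np.zeros((n_states, n_actions))       # critic parameters (tabular Q_w)
alpha_theta, alpha_w = 0.05, 0.1          # assumed learning rates

def policy(s):
    """pi_theta(.|s): softmax over action preferences."""
    prefs = theta[s] - theta[s].max()     # subtract max for numerical stability
    p = np.exp(prefs)
    return p / p.sum()

# Initialize s at random; sample a ~ pi_theta(a|s)
s = int(rng.integers(n_states))
a = int(rng.choice(n_actions, p=policy(s)))

for t in range(5000):
    r, s_next = step(s, a)                                 # r_t, s'
    a_next = int(rng.choice(n_actions, p=policy(s_next)))  # a' ~ pi_theta(.|s')

    # Actor: theta <- theta + alpha_theta * Q_w(s, a) * grad_theta ln pi_theta(a|s).
    # For a tabular softmax policy, the gradient of ln pi(a|s) w.r.t. theta[s]
    # is onehot(a) - pi(.|s).
    grad_log_pi = -policy(s)
    grad_log_pi[a] += 1.0
    theta[s] += alpha_theta * w[s, a] * grad_log_pi

    # TD residual: delta_t = r_t + gamma * Q_w(s', a') - Q_w(s, a)
    delta = r + gamma * w[s_next, a_next] - w[s, a]

    # Critic: w <- w + alpha_w * delta_t * grad_w Q_w(s, a); for a tabular
    # critic the gradient is 1 at the visited (s, a) entry and 0 elsewhere.
    w[s, a] += alpha_w * delta

    s, a = s_next, a_next                                  # a <- a', s <- s'

print("learned policy per state:",
      np.array([policy(i) for i in range(n_states)]).round(2))
```

Because the critic is tabular, $\nabla_w Q_w(s, a)$ is an indicator on the visited $(s, a)$ entry, which is why the critic update in the loop reduces to a single-entry increment.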