Related Work
The basic idea of policy gradient algorithms in continuous action spaces is to represent the policy by a parametric probability distribution $\pi_{\theta}(a \mid s) = \mathbb{P}[a \mid s; \theta]$ that stochastically selects action $a$ in state $s$ according to parameter vector $\theta$.
In the paper we instead consider deterministic policies $a = \mu_{\theta}(s)$.
In the stochastic case, the policy gradient integrates over both the state and action spaces, whereas in the deterministic case it only integrates over the state space. As a result, computing the stochastic policy gradient may require more samples, especially when the action space has many dimensions.
We then introduce an off-policy learning algorithm to ensure that the deterministic policy continues to explore satisfactorily.
Preliminaries
We denote the density at state $s'$ after transitioning for $t$ time steps from state $s$ by $p(s \to s', t, \pi)$, and the (improper) discounted state distribution by $\rho^{\pi}(s') := \int_{\mathcal{S}} \sum_{t=1}^{\infty} \gamma^{t-1} p_{1}(s)\, p(s \to s', t, \pi)\, \mathrm{d}s$, where $p_{1}$ is the initial state distribution.
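With this notation, the performance objective that the agent maximizes can be written (as in the DPG paper, writing $r(s, a)$ for the expected reward) as an expectation over the discounted state distribution:
\begin{equation*} J(\pi_{\theta}) = \int_{\mathcal{S}} \rho^{\pi}(s) \int_{\mathcal{A}} \pi_{\theta}(a \mid s) r(s, a) \mathrm{d} a \mathrm{~d} s = \mathbb{E}_{s \sim \rho^{\pi}, a \sim \pi_{\theta}}\left[r(s, a)\right] \end{equation*}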
Stochastic Policy Gradient Theorem
The basic idea behind these algorithms is to adjust the parameters $\theta$ of the policy in the direction of the performance gradient $\nabla_{\theta} J(\pi_{\theta})$.
Differentiating the performance objective gives the stochastic policy gradient theorem, which writes the gradient as an expectation:
\begin{equation*} \begin{aligned} \nabla_{\theta} J\left(\pi_{\theta}\right) &=\int_{\mathcal{S}} \rho^{\pi}(s) \int_{\mathcal{A}} \nabla_{\theta} \pi_{\theta}(a \mid s) Q^{\pi}(s, a) \mathrm{d} a \mathrm{~d} s \\ &=\mathbb{E}_{s \sim \rho^{\pi}, a \sim \pi_{\theta}}\left[\nabla_{\theta} \log \pi_{\theta}(a \mid s) Q^{\pi}(s, a)\right] \end{aligned} \end{equation*}
Notes:
- $\pi_{\theta}(a \mid s)$ is a probability distribution over the action space. For instance, with discrete actions the policy network outputs one probability per action. The goal is to increase the probability of good actions.
- $Q^{\pi}(s, a)$ is the expected return from the state-action pair $(s, a)$; we use it to estimate how good the chosen action is.
The derivation of the above result is given in Policy Gradient.
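To make the score-function form concrete, here is a minimal sketch (a hypothetical softmax policy over a few discrete actions and a placeholder return estimate, not the paper's setup) of a single-sample estimate of the stochastic policy gradient:

```python
import numpy as np

# Single-sample score-function estimate of the stochastic policy gradient
# for a softmax policy over a small discrete action set.
# Hypothetical toy sizes: 4 state features, 3 actions.

rng = np.random.default_rng(0)
n_features, n_actions = 4, 3
theta = np.zeros((n_features, n_actions))    # policy parameters

def policy(s, theta):
    """Softmax policy pi_theta(a | s)."""
    logits = s @ theta
    exp = np.exp(logits - logits.max())
    return exp / exp.sum()

def grad_log_pi(s, a, theta):
    """Gradient of log pi_theta(a | s) with respect to theta."""
    probs = policy(s, theta)
    grad = -np.outer(s, probs)               # -s * pi(a'|s) for every action a'
    grad[:, a] += s                          # +s for the action actually taken
    return grad

# One Monte Carlo sample of grad_theta log pi_theta(a|s) * Q(s, a).
s = rng.normal(size=n_features)
a = rng.choice(n_actions, p=policy(s, theta))
Q_hat = 1.0                                  # placeholder estimate of Q^pi(s, a)
theta += 0.01 * grad_log_pi(s, a, theta) * Q_hat   # gradient ascent step
```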
Stochastic Actor-Critic Algorithm
The actor-critic is a widely used architecture based on the policy gradient theorem. The actor adjusts the parameters $\theta$ of the stochastic policy $\pi_{\theta}(a \mid s)$ by the equation above. The critic uses an action-value function $Q^{w}(s, a)$ with parameter vector $w$ in place of the unknown true action-value function $Q^{\pi}(s, a)$.
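As a sketch, one common on-policy instantiation with a SARSA-style critic looks like this (learning rates $\alpha_{w}$ and $\alpha_{\theta}$ are assumed):
\begin{equation*} \begin{aligned} \delta_{t} &= r_{t} + \gamma Q^{w}(s_{t+1}, a_{t+1}) - Q^{w}(s_{t}, a_{t}) \\ w_{t+1} &= w_{t} + \alpha_{w} \delta_{t} \nabla_{w} Q^{w}(s_{t}, a_{t}) \\ \theta_{t+1} &= \theta_{t} + \alpha_{\theta} \nabla_{\theta} \log \pi_{\theta}(a_{t} \mid s_{t}) Q^{w}(s_{t}, a_{t}) \end{aligned} \end{equation*}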
Off-Policy Actor-Critic
In the off-policy setting, the performance objective is modified to be the value function of the target policy, averaged over the state distribution of the behaviour policy $\beta(a \mid s)$. Because actions are sampled from $\beta$ rather than $\pi_{\theta}$, the actor update requires an importance-sampling ratio $\pi_{\theta}(a \mid s) / \beta(a \mid s)$.
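A sketch of the resulting objective and gradient, following OffPAC (Degris et al.) as summarized in the DPG paper (the approximation drops a term that depends on $\nabla_{\theta} Q^{\pi}(s, a)$):
\begin{equation*} \begin{aligned} J_{\beta}(\pi_{\theta}) &= \int_{\mathcal{S}} \rho^{\beta}(s) V^{\pi}(s) \mathrm{d} s \\ \nabla_{\theta} J_{\beta}(\pi_{\theta}) &\approx \mathbb{E}_{s \sim \rho^{\beta}, a \sim \beta}\left[\frac{\pi_{\theta}(a \mid s)}{\beta(a \mid s)} \nabla_{\theta} \log \pi_{\theta}(a \mid s) Q^{\pi}(s, a)\right] \end{aligned} \end{equation*}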
Deterministic Policy Gradient Theorem
In continuous action spaces, greedy policy improvement becomes problematic, requiring a global maximization at every step. Instead, a simple and computationally attractive alternative is to move the policy in the direction of the gradient of $Q$, rather than globally maximizing $Q$. Specifically, for each visited state $s$, the policy parameters $\theta^{k+1}$ are updated in proportion to the gradient $\nabla_{\theta} Q^{\mu^{k}}(s, \mu_{\theta}(s))$.
By applying the chain rule we see that the policy improvement may be decomposed into the gradient of the action value with respect to actions, and the gradient of the policy with respect to the policy parameters.
\begin{equation*} \begin{aligned} \theta^{k+1} &= \theta^{k} + \alpha\mathbb{E}_{s\sim\rho^{\mu^{k}}} \left[ \nabla_{\theta} Q^{\mu^{k}} (s,\mu_{\theta}(s)) \right] \\ &= \theta^{k} + \alpha\mathbb{E}_{s\sim\rho^{\mu^{k}}} \left[ \nabla_{\theta}\mu_{\theta}(s) \nabla_{a} Q^{\mu^{k}} (s,a)|_{a=\mu_{\theta}(s)} \right] \\ \end{aligned} \end{equation*}
The deterministic policy gradient theorem is then
\begin{equation*} \nabla_{\theta} J(\mu_{\theta}) = \mathbb{E}_{s\sim\rho^{\mu}} \left[ \nabla_{\theta}\mu_{\theta}(s)\nabla_{a}Q^{\mu}(s,a)|_{a=\mu_{\theta}(s)} \right] \end{equation*}
In fact, the deterministic policy gradient is a limiting case of the stochastic policy gradient. We parameterize stochastic policies $\pi_{\mu_{\theta}, \sigma}$ by a deterministic policy $\mu_{\theta}$ and a variance parameter $\sigma$, such that for $\sigma = 0$ the stochastic policy is equivalent to the deterministic policy; as $\sigma \to 0$, the stochastic policy gradient converges to the deterministic policy gradient.
On-Policy Deterministic Actor-Critic
As in the stochastic actor-critic algorithm, we substitute a differentiable action-value function $Q^{w}(s, a)$ in place of the true action-value function $Q^{\mu}(s, a)$.
For example, in the following deterministic actor-critic algorithm, the critic uses Sarsa updates to estimate the action-value function.
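A sketch of these updates (learning rates $\alpha_{w}$ and $\alpha_{\theta}$ are assumed):
\begin{equation*} \begin{aligned} \delta_{t} &= r_{t} + \gamma Q^{w}(s_{t+1}, a_{t+1}) - Q^{w}(s_{t}, a_{t}) \\ w_{t+1} &= w_{t} + \alpha_{w} \delta_{t} \nabla_{w} Q^{w}(s_{t}, a_{t}) \\ \theta_{t+1} &= \theta_{t} + \alpha_{\theta} \nabla_{\theta} \mu_{\theta}(s_{t}) \nabla_{a} Q^{w}(s_{t}, a_{t})|_{a=\mu_{\theta}(s_{t})} \end{aligned} \end{equation*}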
Off-Policy Deterministic Actor-Critic
We now consider off-policy methods that learn a deterministic target policy $\mu_{\theta}(s)$ from trajectories generated by an arbitrary stochastic behaviour policy $\beta(a \mid s)$.
As before, we modify the performance objective to be the value function of the target policy, averaged over the state distribution of the behaviour policy.
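Concretely, following the paper, the modified objective and its approximate gradient are:
\begin{equation*} \begin{aligned} J_{\beta}(\mu_{\theta}) &= \int_{\mathcal{S}} \rho^{\beta}(s) Q^{\mu}(s, \mu_{\theta}(s)) \mathrm{d} s \\ \nabla_{\theta} J_{\beta}(\mu_{\theta}) &\approx \mathbb{E}_{s \sim \rho^{\beta}}\left[\nabla_{\theta} \mu_{\theta}(s) \nabla_{a} Q^{\mu}(s, a)|_{a=\mu_{\theta}(s)}\right] \end{aligned} \end{equation*}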
Notes:
- We only take the expectation over the state distribution because the action is deterministic: $a = \mu_{\theta}(s)$, so there is no inner integral over actions.
- STILL don't fully understand the approximation: as in OffPAC, the gradient drops a term that depends on $\nabla_{\theta} Q^{\mu_{\theta}}(s, a)$.
We again substitute a differentiable action-value function $Q^{w}(s, a)$ in place of the true action-value function $Q^{\mu}(s, a)$. A critic estimates the action-value function $Q^{w}(s, a) \approx Q^{\mu}(s, a)$, off-policy from trajectories generated by $\beta$, using an appropriate policy evaluation algorithm.
In the following off-policy deterministic actor-critic algorithm, the critic uses Q-learning updates to estimate the action-value function.
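A sketch of the off-policy deterministic actor-critic (OPDAC) updates (learning rates $\alpha_{w}$ and $\alpha_{\theta}$ are assumed; note the Q-learning target uses $\mu_{\theta}(s_{t+1})$ rather than the behaviour policy's action):
\begin{equation*} \begin{aligned} \delta_{t} &= r_{t} + \gamma Q^{w}(s_{t+1}, \mu_{\theta}(s_{t+1})) - Q^{w}(s_{t}, a_{t}) \\ w_{t+1} &= w_{t} + \alpha_{w} \delta_{t} \nabla_{w} Q^{w}(s_{t}, a_{t}) \\ \theta_{t+1} &= \theta_{t} + \alpha_{\theta} \nabla_{\theta} \mu_{\theta}(s_{t}) \nabla_{a} Q^{w}(s_{t}, a_{t})|_{a=\mu_{\theta}(s_{t})} \end{aligned} \end{equation*}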
Because the deterministic policy gradient removes the integral over actions, we can avoid importance sampling in the actor; and by using Q-learning, we can avoid importance sampling in the critic.
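To make this concrete, below is a minimal PyTorch-style sketch (the networks mu and Q, their sizes, and the update function are illustrative assumptions, not the paper's setup; there is no replay buffer or target network). Autograd realizes the chain rule $\nabla_{\theta}\mu_{\theta}(s) \nabla_{a} Q^{w}(s, a)|_{a=\mu_{\theta}(s)}$ simply by backpropagating through $Q^{w}(s, \mu_{\theta}(s))$, and neither update needs an importance weight:

```python
import torch
import torch.nn as nn

# Minimal sketch (assumed architecture) of an off-policy deterministic
# actor-critic step: Q-learning critic, deterministic actor updated by
# backpropagating through Q(s, mu(s)).

state_dim, action_dim, gamma = 3, 1, 0.99

mu = nn.Sequential(nn.Linear(state_dim, 32), nn.ReLU(),
                   nn.Linear(32, action_dim), nn.Tanh())          # actor mu_theta(s)
Q = nn.Sequential(nn.Linear(state_dim + action_dim, 32), nn.ReLU(),
                  nn.Linear(32, 1))                               # critic Q_w(s, a)
actor_opt = torch.optim.SGD(mu.parameters(), lr=1e-3)
critic_opt = torch.optim.SGD(Q.parameters(), lr=1e-3)

def update(s, a, r, s_next):
    """One actor-critic step from an off-policy transition (s, a, r, s_next)."""
    # Critic: the Q-learning target uses the target policy's action mu(s'),
    # so no importance weight is needed even though a came from a behaviour policy.
    with torch.no_grad():
        target = r + gamma * Q(torch.cat([s_next, mu(s_next)], dim=-1))
    critic_loss = (Q(torch.cat([s, a], dim=-1)) - target).pow(2).mean()
    critic_opt.zero_grad()
    critic_loss.backward()
    critic_opt.step()

    # Actor: ascend grad_theta Q_w(s, mu_theta(s)) by minimizing -Q.
    # (The critic grads produced here are discarded by the next zero_grad.)
    actor_loss = -Q(torch.cat([s, mu(s)], dim=-1)).mean()
    actor_opt.zero_grad()
    actor_loss.backward()
    actor_opt.step()

# Example call with a single random transition (batch size 1).
s = torch.randn(1, state_dim)
a = torch.randn(1, action_dim)
r = torch.randn(1, 1)
s_next = torch.randn(1, state_dim)
update(s, a, r, s_next)
```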