Related Work
The basic idea of policy gradient algorithms in continuous action spaces is to represent the policy by a parametric probability distribution $\pi_{\theta}(a \mid s) = \mathbb{P}[a \mid s; \theta]$ that stochastically selects action $a$ in state $s$ according to parameter vector $\theta$.
In the paper we instead consider deterministic policies $a = \mu_{\theta}(s)$.
In the stochastic case, the policy gradient integrates over both the state and action spaces, whereas in the deterministic case it only integrates over the state space. As a result, computing the stochastic policy gradient may require more samples, especially when the action space has many dimensions.
We then introduce an off-policy learning algorithm to ensure that the deterministic policy continues to explore satisfactorily.
Preliminaries
We denote the density at state $s'$ after transitioning for $t$ time steps from state $s$ by $p(s \to s', t, \pi)$, and the (improper) discounted state distribution by $\rho^{\pi}(s') := \int_{\mathcal{S}} \sum_{t=1}^{\infty} \gamma^{t-1} p_{1}(s)\, p(s \to s', t, \pi)\, \mathrm{d}s$, where $p_{1}$ is the initial state distribution.
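With this notation, the performance objective that the agent maximizes can be written (as in the DPG paper, writing $r(s, a)$ for the expected reward) as an expectation over the discounted state distribution:
\begin{equation*} J(\pi_{\theta}) = \int_{\mathcal{S}} \rho^{\pi}(s) \int_{\mathcal{A}} \pi_{\theta}(a \mid s) r(s, a) \mathrm{d} a \mathrm{~d} s = \mathbb{E}_{s \sim \rho^{\pi}, a \sim \pi_{\theta}}\left[r(s, a)\right] \end{equation*}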
Stochastic Policy Gradient Theorem
The basic idea behind these algorithms is to adjust the parameters $\theta$ of the policy in the direction of the performance gradient $\nabla_{\theta} J(\pi_{\theta})$.
Differentiating the performance objective gives the stochastic policy gradient theorem, which writes the gradient as an expectation:
\begin{equation*} \begin{aligned} \nabla_{\theta} J\left(\pi_{\theta}\right) &=\int_{\mathcal{S}} \rho^{\pi}(s) \int_{\mathcal{A}} \nabla_{\theta} \pi_{\theta}(a \mid s) Q^{\pi}(s, a) \mathrm{d} a \mathrm{~d} s \\ &=\mathbb{E}_{s \sim \rho^{\pi}, a \sim \pi_{\theta}}\left[\nabla_{\theta} \log \pi_{\theta}(a \mid s) Q^{\pi}(s, a)\right] \end{aligned} \end{equation*}
Notes:
- $\pi_{\theta}(a \mid s)$ is a probability distribution over the action space. For instance, with discrete actions the policy network outputs one probability per action. The goal is to increase the probability of good actions.
- $Q^{\pi}(s, a)$ is the expected return from the state-action pair $(s, a)$; we use it to estimate how good the chosen action is.
The derivation of the above result is given in Policy Gradient.
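To make the score-function form concrete, here is a minimal sketch (a hypothetical softmax policy over a few discrete actions and a placeholder return estimate, not the paper's setup) of a single-sample estimate of the stochastic policy gradient:

```python
import numpy as np

# Single-sample score-function estimate of the stochastic policy gradient
# for a softmax policy over a small discrete action set.
# Hypothetical toy sizes: 4 state features, 3 actions.

rng = np.random.default_rng(0)
n_features, n_actions = 4, 3
theta = np.zeros((n_features, n_actions))    # policy parameters

def policy(s, theta):
    """Softmax policy pi_theta(a | s)."""
    logits = s @ theta
    exp = np.exp(logits - logits.max())
    return exp / exp.sum()

def grad_log_pi(s, a, theta):
    """Gradient of log pi_theta(a | s) with respect to theta."""
    probs = policy(s, theta)
    grad = -np.outer(s, probs)               # -s * pi(a'|s) for every action a'
    grad[:, a] += s                          # +s for the action actually taken
    return grad

# One Monte Carlo sample of grad_theta log pi_theta(a|s) * Q(s, a).
s = rng.normal(size=n_features)
a = rng.choice(n_actions, p=policy(s, theta))
Q_hat = 1.0                                  # placeholder estimate of Q^pi(s, a)
theta += 0.01 * grad_log_pi(s, a, theta) * Q_hat   # gradient ascent step
```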
Stochastic Actor-Critic Algorithm
The actor-critic is a widely used architecture based on the policy gradient theorem. The actor adjusts the parameters $\theta$ of the stochastic policy $\pi_{\theta}(a \mid s)$ by the equation above. The critic uses an action-value function $Q^{w}(s, a)$ with parameter vector $w$ in place of the unknown true action-value function $Q^{\pi}(s, a)$.
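As a sketch, one common on-policy instantiation with a SARSA-style critic looks like this (learning rates $\alpha_{w}$ and $\alpha_{\theta}$ are assumed):
\begin{equation*} \begin{aligned} \delta_{t} &= r_{t} + \gamma Q^{w}(s_{t+1}, a_{t+1}) - Q^{w}(s_{t}, a_{t}) \\ w_{t+1} &= w_{t} + \alpha_{w} \delta_{t} \nabla_{w} Q^{w}(s_{t}, a_{t}) \\ \theta_{t+1} &= \theta_{t} + \alpha_{\theta} \nabla_{\theta} \log \pi_{\theta}(a_{t} \mid s_{t}) Q^{w}(s_{t}, a_{t}) \end{aligned} \end{equation*}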
Off-Policy Actor-Critic
In the off-policy setting, the performance objective is modified to be the value function of the target policy, averaged over the state distribution of the behaviour policy $\beta(a \mid s)$. Because actions are sampled from $\beta$ rather than $\pi_{\theta}$, the actor update requires an importance-sampling ratio $\pi_{\theta}(a \mid s) / \beta(a \mid s)$.
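A sketch of the resulting objective and gradient, following OffPAC (Degris et al.) as summarized in the DPG paper (the approximation drops a term that depends on $\nabla_{\theta} Q^{\pi}(s, a)$):
\begin{equation*} \begin{aligned} J_{\beta}(\pi_{\theta}) &= \int_{\mathcal{S}} \rho^{\beta}(s) V^{\pi}(s) \mathrm{d} s \\ \nabla_{\theta} J_{\beta}(\pi_{\theta}) &\approx \mathbb{E}_{s \sim \rho^{\beta}, a \sim \beta}\left[\frac{\pi_{\theta}(a \mid s)}{\beta(a \mid s)} \nabla_{\theta} \log \pi_{\theta}(a \mid s) Q^{\pi}(s, a)\right] \end{aligned} \end{equation*}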
Deterministic Policy Gradient Theorem
In continuous action spaces, greedy policy improvement becomes problematic, requiring a global maximization at every step. Instead, a simple and computationally attractive alternative is to move the policy in the direction of the gradient of $Q$, rather than globally maximizing $Q$. Specifically, for each visited state $s$, the policy parameters $\theta^{k+1}$ are updated in proportion to the gradient $\nabla_{\theta} Q^{\mu^{k}}(s, \mu_{\theta}(s))$.
By applying the chain rule we see that the policy improvement may be decomposed into the gradient of the action value with respect to actions, and the gradient of the policy with respect to the policy parameters.
\begin{equation*} \begin{aligned} \theta^{k+1} &= \theta^{k} + \alpha\mathbb{E}_{s\sim\rho^{\mu^{k}}} \left[ \nabla_{\theta} Q^{\mu^{k}} (s,\mu_{\theta}(s)) \right] \\ &= \theta^{k} + \alpha\mathbb{E}_{s\sim\rho^{\mu^{k}}} \left[ \nabla_{\theta}\mu_{\theta}(s) \nabla_{a} Q^{\mu^{k}} (s,a)|_{a=\mu_{\theta}(s)} \right] \\ \end{aligned} \end{equation*}
The deterministic policy gradient theorem is then
\begin{equation*} \nabla_{\theta} J(\mu_{\theta}) = \mathbb{E}_{s\sim\rho^{\mu}} \left[ \nabla_{\theta}\mu_{\theta}(s)\nabla_{a}Q^{\mu}(s,a)|_{a=\mu_{\theta}(s)} \right] \end{equation*}
In fact, the deterministic policy gradient is a limiting case of the stochastic policy gradient. We parameterize stochastic policies $\pi_{\mu_{\theta}, \sigma}$ by a deterministic policy $\mu_{\theta}$ and a variance parameter $\sigma$, such that for $\sigma = 0$ the stochastic policy is equivalent to the deterministic policy; as $\sigma \to 0$, the stochastic policy gradient converges to the deterministic policy gradient.
On-Policy Deterministic Actor-Critic
As in the stochastic actor-critic algorithm, we substitute a differentiable action-value function $Q^{w}(s, a)$ in place of the true action-value function $Q^{\mu}(s, a)$.
For example, in the following deterministic actor-critic algorithm, the critic uses Sarsa updates to estimate the action-value function.
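A sketch of these updates (learning rates $\alpha_{w}$ and $\alpha_{\theta}$ are assumed):
\begin{equation*} \begin{aligned} \delta_{t} &= r_{t} + \gamma Q^{w}(s_{t+1}, a_{t+1}) - Q^{w}(s_{t}, a_{t}) \\ w_{t+1} &= w_{t} + \alpha_{w} \delta_{t} \nabla_{w} Q^{w}(s_{t}, a_{t}) \\ \theta_{t+1} &= \theta_{t} + \alpha_{\theta} \nabla_{\theta} \mu_{\theta}(s_{t}) \nabla_{a} Q^{w}(s_{t}, a_{t})|_{a=\mu_{\theta}(s_{t})} \end{aligned} \end{equation*}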
Off-Policy Deterministic Actor-Critic
We now consider off-policy methods that learn a deterministic target policy $\mu_{\theta}(s)$ from trajectories generated by an arbitrary stochastic behaviour policy $\beta(a \mid s)$.
As before, we modify the performance objective to be the value function of the target policy, averaged over the state distribution of the behaviour policy.
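Concretely, following the paper, the modified objective and its approximate gradient are:
\begin{equation*} \begin{aligned} J_{\beta}(\mu_{\theta}) &= \int_{\mathcal{S}} \rho^{\beta}(s) Q^{\mu}(s, \mu_{\theta}(s)) \mathrm{d} s \\ \nabla_{\theta} J_{\beta}(\mu_{\theta}) &\approx \mathbb{E}_{s \sim \rho^{\beta}}\left[\nabla_{\theta} \mu_{\theta}(s) \nabla_{a} Q^{\mu}(s, a)|_{a=\mu_{\theta}(s)}\right] \end{aligned} \end{equation*}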
Notes:
- We only take the expectation over the state distribution because the action is deterministic: $a = \mu_{\theta}(s)$, so there is no inner integral over actions.
- STILL don't fully understand the approximation: as in OffPAC, the gradient drops a term that depends on $\nabla_{\theta} Q^{\mu_{\theta}}(s, a)$.
We again substitute a differentiable action-value function $Q^{w}(s, a)$ in place of the true action-value function $Q^{\mu}(s, a)$. A critic estimates the action-value function $Q^{w}(s, a) \approx Q^{\mu}(s, a)$, off-policy from trajectories generated by $\beta$, using an appropriate policy evaluation algorithm.
In the following off-policy deterministic actor-critic algorithm, the critic uses Q-learning updates to estimate the action-value function.
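A sketch of the off-policy deterministic actor-critic (OPDAC) updates (learning rates $\alpha_{w}$ and $\alpha_{\theta}$ are assumed; note the Q-learning target uses $\mu_{\theta}(s_{t+1})$ rather than the behaviour policy's action):
\begin{equation*} \begin{aligned} \delta_{t} &= r_{t} + \gamma Q^{w}(s_{t+1}, \mu_{\theta}(s_{t+1})) - Q^{w}(s_{t}, a_{t}) \\ w_{t+1} &= w_{t} + \alpha_{w} \delta_{t} \nabla_{w} Q^{w}(s_{t}, a_{t}) \\ \theta_{t+1} &= \theta_{t} + \alpha_{\theta} \nabla_{\theta} \mu_{\theta}(s_{t}) \nabla_{a} Q^{w}(s_{t}, a_{t})|_{a=\mu_{\theta}(s_{t})} \end{aligned} \end{equation*}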
Because the deterministic policy gradient removes the integral over actions, we can avoid importance sampling in the actor; and by using Q-learning, we can avoid importance sampling in the critic.
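To make this concrete, below is a minimal PyTorch-style sketch (the networks mu and Q, their sizes, and the update function are illustrative assumptions, not the paper's setup; there is no replay buffer or target network). Autograd realizes the chain rule $\nabla_{\theta}\mu_{\theta}(s) \nabla_{a} Q^{w}(s, a)|_{a=\mu_{\theta}(s)}$ simply by backpropagating through $Q^{w}(s, \mu_{\theta}(s))$, and neither update needs an importance weight:

```python
import torch
import torch.nn as nn

# Minimal sketch (assumed architecture) of an off-policy deterministic
# actor-critic step: Q-learning critic, deterministic actor updated by
# backpropagating through Q(s, mu(s)).

state_dim, action_dim, gamma = 3, 1, 0.99

mu = nn.Sequential(nn.Linear(state_dim, 32), nn.ReLU(),
                   nn.Linear(32, action_dim), nn.Tanh())          # actor mu_theta(s)
Q = nn.Sequential(nn.Linear(state_dim + action_dim, 32), nn.ReLU(),
                  nn.Linear(32, 1))                               # critic Q_w(s, a)
actor_opt = torch.optim.SGD(mu.parameters(), lr=1e-3)
critic_opt = torch.optim.SGD(Q.parameters(), lr=1e-3)

def update(s, a, r, s_next):
    """One actor-critic step from an off-policy transition (s, a, r, s_next)."""
    # Critic: the Q-learning target uses the target policy's action mu(s'),
    # so no importance weight is needed even though a came from a behaviour policy.
    with torch.no_grad():
        target = r + gamma * Q(torch.cat([s_next, mu(s_next)], dim=-1))
    critic_loss = (Q(torch.cat([s, a], dim=-1)) - target).pow(2).mean()
    critic_opt.zero_grad()
    critic_loss.backward()
    critic_opt.step()

    # Actor: ascend grad_theta Q_w(s, mu_theta(s)) by minimizing -Q.
    # (The critic grads produced here are discarded by the next zero_grad.)
    actor_loss = -Q(torch.cat([s, mu(s)], dim=-1)).mean()
    actor_opt.zero_grad()
    actor_loss.backward()
    actor_opt.step()

# Example call with a single random transition (batch size 1).
s = torch.randn(1, state_dim)
a = torch.randn(1, action_dim)
r = torch.randn(1, 1)
s_next = torch.randn(1, state_dim)
update(s, a, r, s_next)
```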