Preliminaries

One disadvantage of REINFORCE is its low data utilization: as an on-policy method, it cannot reuse samples collected under earlier versions of the policy.

The advantages of off-policy methods:

  1. Learning about an optimal policy while executing an exploratory policy.
  2. Learning from demonstration.
  3. Learning multiple tasks in parallel.

The objective function is

\begin{equation*} J_{\gamma}(u) = \sum_{s \in S} d^{b}(s) V^{\pi_{u},\gamma}(s) \end{equation*}

where $d^{b}(s) = \lim_{t\to\infty} P(s_t = s \mid s_0, b)$ is the limiting distribution of states under the behavior policy $b$, and $P(s_t = s \mid s_0, b)$ is the probability that $s_t = s$ when starting in $s_0$ and executing $b$. In the off-policy setting, data is obtained according to this behavior distribution $d^{b}$.
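For intuition, $J_{\gamma}(u)$ is the expected value of the target policy over states visited by $b$, so it can be approximated by averaging $V^{\pi_u,\gamma}$ over a log of behavior-policy states. Below is a minimal Python sketch under that assumption; the value table `V_pi` and the `logged_states` array are hypothetical stand-ins, not anything from the paper.

```python
import numpy as np

def estimate_objective(V_pi, behavior_states):
    """Monte Carlo estimate of J_gamma(u) = sum_s d^b(s) V^{pi_u,gamma}(s).

    V_pi:            array of shape (num_states,), value of the target policy pi_u.
    behavior_states: states visited while executing the behavior policy b;
                     their empirical frequencies approximate d^b.
    """
    behavior_states = np.asarray(behavior_states)
    # Averaging V over states drawn from d^b is exactly E_{s ~ d^b}[V(s)].
    return V_pi[behavior_states].mean()

# Hypothetical usage: 5 states, values of pi_u assumed known, states logged from b.
V_pi = np.array([0.0, 1.0, 2.0, 0.5, 1.5])
logged_states = np.random.randint(0, 5, size=10_000)  # stand-in for a long run of b
print(estimate_objective(V_pi, logged_states))
```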

In this paper, we consider the version of Off-PAC that updates its critic weights with Gradient Temporal-Difference (GTD) learning.
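The GTD critic itself is not derived in these notes, but the flavor of the update can be sketched. Below is a simplified, TDC-style gradient-TD update with linear features and importance-sampling ratios; the names (`phi_s`, `h`, `alpha_v`, `beta`) and the one-step form are assumptions for illustration, not the exact GTD(λ) critic used by Off-PAC.

```python
import numpy as np

def gtd_critic_step(w, h, phi_s, phi_s_next, reward, rho, gamma, alpha_v, beta):
    """One TDC-style gradient-TD update of linear critic weights w.

    w:     critic weights, V(s) ~= w @ phi(s)
    h:     auxiliary weights maintained by gradient-TD methods
    rho:   importance-sampling ratio pi(a|s) / b(a|s)
    """
    delta = reward + gamma * (w @ phi_s_next) - (w @ phi_s)  # TD error
    # Main weights: TD update with a correction term (TDC form).
    w = w + alpha_v * rho * (delta * phi_s - gamma * (phi_s @ h) * phi_s_next)
    # Auxiliary weights: track the expected TD error in feature space.
    h = h + beta * rho * (delta - (phi_s @ h)) * phi_s
    return w, h
```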

Off-PAC policy gradient estimation

\begin{equation*}
\begin{aligned}
\nabla_{u} J_{\gamma}(u)
&= \nabla_{u} \left[ \sum_{s\in S} d^{b}(s) \sum_{a\in A} \pi(a|s) Q^{\pi,\gamma}(s,a) \right] \\
&= \sum_{s\in S} d^{b}(s) \sum_{a\in A} \left[ \nabla_{u}\pi(a|s) Q^{\pi,\gamma}(s,a) + \pi(a|s) \nabla_{u} Q^{\pi,\gamma}(s,a) \right] \\
&\approx \sum_{s\in S} d^{b}(s) \sum_{a\in A} \nabla_{u}\pi(a|s) Q^{\pi,\gamma}(s,a) \\
&= \mathbb{E}_{s\sim d^{b}} \left[ \sum_{a\in A} \nabla_{u}\pi(a|s) Q^{\pi,\gamma}(s,a) \right] \\
&= \mathbb{E}_{s\sim d^{b}} \left[ \sum_{a\in A} b(a|s) \frac{\pi(a|s)}{b(a|s)} \frac{\nabla_{u}\pi(a|s)}{\pi(a|s)} Q^{\pi,\gamma}(s,a) \right] \\
&= \mathbb{E}_{s\sim d^{b},\,a\sim b} \left[ \frac{\pi(a|s)}{b(a|s)} \frac{\nabla_{u}\pi(a|s)}{\pi(a|s)} Q^{\pi,\gamma}(s,a) \right] \\
&= \mathbb{E}_{b} \left[ \frac{\pi(a|s)}{b(a|s)} \frac{\nabla_{u}\pi(a|s)}{\pi(a|s)} Q^{\pi,\gamma}(s,a) \right] \\
&= \mathbb{E}_{b} \left[ \frac{\pi(a|s)}{b(a|s)} \nabla_{u}\log\pi(a|s)\, Q^{\pi,\gamma}(s,a) \right]
\end{aligned}
\end{equation*}

The derivation above uses importance sampling: actions are sampled from the behavior policy $b$ and reweighted by the ratio $\frac{\pi(a|s)}{b(a|s)}$ to correct for the mismatch between $\pi$ and $b$. The approximation in the third line drops the term $\pi(a|s)\nabla_{u}Q^{\pi,\gamma}(s,a)$, which is difficult to estimate off-policy.
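To make the final expression concrete, here is a sketch of a single-sample estimate of $\frac{\pi(a|s)}{b(a|s)}\nabla_{u}\log\pi(a|s)\,Q^{\pi,\gamma}(s,a)$ for a tabular softmax target policy. The parameter array `theta`, the behavior probabilities `b_probs`, and the critic estimate `Q_sa` are assumptions for illustration, not the paper's implementation.

```python
import numpy as np

def softmax_policy(theta, s):
    """pi(.|s) for a tabular softmax policy with parameters theta[s]."""
    prefs = theta[s] - theta[s].max()           # shift for numerical stability
    probs = np.exp(prefs)
    return probs / probs.sum()

def offpac_policy_gradient_sample(theta, s, a, Q_sa, b_probs):
    """Single-sample estimate of rho * grad_u log pi(a|s) * Q^{pi,gamma}(s,a).

    theta:   policy parameters, shape (num_states, num_actions)
    s, a:    state and action sampled while executing the behavior policy b
    Q_sa:    critic estimate of Q^{pi,gamma}(s, a)
    b_probs: behavior policy probabilities b(.|s)
    """
    pi = softmax_policy(theta, s)
    rho = pi[a] / b_probs[a]                    # importance-sampling ratio
    grad_log = -pi                              # d log pi(a|s) / d theta[s, .]
    grad_log[a] += 1.0                          # = 1[a' = a] - pi(a'|s)
    grad = np.zeros_like(theta)
    grad[s] = rho * grad_log * Q_sa
    return grad
```

In the full algorithm the actor would ascend this estimate, with the critic supplying the value estimate in place of the true $Q^{\pi,\gamma}(s,a)$.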