Off-Policy Actor-Critic
Preliminaries
One disadvantage of # REINFORCE is low data utilization: it is on-policy, so trajectories must be generated by the current policy and are discarded after a single gradient update.
The advantages of off-policy methods:
- Learning about an optimal policy while executing an exploratory policy.
- Learning from demonstration.
- Learning multiple tasks in parallel.
The objective function is $ J_{\gamma}(u) = \sum_{s \in S} d^{b}(s)V^{\pi_{u},\gamma}(s) $
where $d^{b}(s) = \lim_{t \to \infty} P(s_t = s \mid s_0, b)$ is the limiting distribution of states under $b$, and $P(s_t = s \mid s_0, b)$ is the probability that $s_t = s$ when starting in $s_0$ and executing $b$. In the off-policy setting, data is obtained according to this behavior policy $b$.
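Since states are discrete here, $J_{\gamma}(u)$ is just a $d^{b}$-weighted sum of state values. A tiny sketch with made-up numbers:

```python
import numpy as np

d_b  = np.array([0.5, 0.3, 0.2])   # d^b(s): limiting state distribution under b (made up)
v_pi = np.array([1.0, 0.0, 2.0])   # V^{pi_u,gamma}(s) for each state (made up)

J = d_b @ v_pi                     # J_gamma(u) = sum_s d^b(s) V^{pi_u,gamma}(s)
print(J)                           # 0.9
```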
In this paper, we consider the version of Off-PAC that updates its critic weights using the # Gradient Temporal-Difference method.
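To make the critic step concrete, here is a minimal sketch of a one-step off-policy gradient-TD update in the same family (a TDC-style rule with importance ratios, not the paper's exact GTD($\lambda$) algorithm); the names `theta`, `h`, `phi_s`, and the step sizes are assumptions for illustration.

```python
import numpy as np

def gtd_critic_update(theta, h, phi_s, phi_s2, r, rho,
                      gamma=0.99, alpha=0.01, beta=0.005):
    """One off-policy TDC-style critic update with a linear value
    function V(s) = theta @ phi(s).

    theta : main value weights
    h     : auxiliary weights estimating the expected TD error
    rho   : importance ratio pi(a|s) / b(a|s) for the sampled action
    """
    delta = r + gamma * theta @ phi_s2 - theta @ phi_s   # TD error
    # Main weights: gradient-corrected TD step, reweighted by rho.
    theta = theta + alpha * rho * (delta * phi_s
                                   - gamma * (h @ phi_s) * phi_s2)
    # Auxiliary weights track E[delta | s] by least squares.
    h = h + beta * rho * (delta - h @ phi_s) * phi_s
    return theta, h
```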
Off-PAC policy gradient estimation
\begin{aligned}
&\nabla_{u} J_{\gamma}(u) \\
&= \nabla_{u} \left[ \sum_{s\in S} d^{b}(s) \sum_{a\in A} \pi(a|s) Q^{\pi,\gamma}(s,a) \right] \\
&= \sum_{s\in S}d^{b}(s) \sum_{a\in A} \left[ \nabla_{u}\pi(a|s)Q^{\pi,\gamma}(s,a) + \pi(a|s)\nabla_{u}Q^{\pi,\gamma}(s,a) \right] \\
&\approx \sum_{s\in S}d^{b}(s) \sum_{a\in A} \nabla_{u}\pi(a|s)Q^{\pi,\gamma}(s,a) \\
&= \mathbb{E}_{s\sim d^{b}} \sum_{a\in A} \nabla_{u}\pi(a|s)Q^{\pi,\gamma}(s,a) \\
&= \mathbb{E}_{s\sim d^{b}} \sum_{a\in A} b(a|s) \frac{\pi(a|s)}{b(a|s)}\frac{\nabla_{u}\pi(a|s)}{\pi(a|s)}Q^{\pi,\gamma}(s,a) \\
&= \mathbb{E}_{s\sim d^{b},a\sim b} \frac{\pi(a|s)}{b(a|s)}\frac{\nabla_{u}\pi(a|s)}{\pi(a|s)}Q^{\pi,\gamma}(s,a) \\
&= \mathbb{E}_{b} \frac{\pi(a|s)}{b(a|s)}\frac{\nabla_{u}\pi(a|s)}{\pi(a|s)}Q^{\pi,\gamma}(s,a) \\
&= \mathbb{E}_{b} \frac{\pi(a|s)}{b(a|s)}\nabla_{u}\log\pi(a|s)Q^{\pi,\gamma}(s,a) \\
\end{aligned}
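Reading the last line as a per-sample stochastic update gives the actor step: draw $(s, a)$ under $b$, weight by $\rho = \pi(a|s)/b(a|s)$, and ascend $\nabla_{u}\log\pi(a|s)\,\hat{Q}(s,a)$. Below is a minimal sketch for a softmax policy with linear action preferences; `phi_sa`, `q_hat`, and `alpha_u` are illustrative assumptions, and the paper itself uses the TD error $\delta_t$ with eligibility traces in place of a direct $\hat{Q}$.

```python
import numpy as np

def softmax_probs(u, phi_sa):
    """pi(.|s) for linear preferences; phi_sa is (num_actions, d)."""
    prefs = phi_sa @ u
    prefs -= prefs.max()                 # numerical stability
    e = np.exp(prefs)
    return e / e.sum()

def offpac_actor_step(u, phi_sa, a, rho, q_hat, alpha_u=0.01):
    """One sampled actor update:
        u <- u + alpha_u * rho * grad_u log pi(a|s) * Q_hat(s,a)
    where a was drawn from the behavior policy b.
    """
    probs = softmax_probs(u, phi_sa)
    # grad_u log pi(a|s) = phi(s,a) - sum_a' pi(a'|s) phi(s,a')
    grad_log_pi = phi_sa[a] - probs @ phi_sa
    return u + alpha_u * rho * grad_log_pi * q_hat
```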
The derivation above uses # Importance Sampling to turn the sum over actions into an expectation over actions drawn from $b$. Note also that the $\approx$ step drops the term $\pi(a|s)\nabla_{u}Q^{\pi,\gamma}(s,a)$; the Off-PAC paper justifies this approximation with a policy-improvement argument in the tabular setting.
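As a quick numerical check of that identity: sampling actions from $b$ and reweighting by $\pi/b$ recovers an expectation under $\pi$. The distributions and values below are made up for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

pi = np.array([0.7, 0.2, 0.1])    # target policy pi(a|s) (made up)
b  = np.array([1/3, 1/3, 1/3])    # behavior policy b(a|s) (made up)
f  = np.array([1.0, 5.0, -2.0])   # any per-action quantity, e.g. Q(s,a)

actions = rng.choice(3, size=100_000, p=b)   # data gathered under b
rho = pi[actions] / b[actions]               # importance ratios

print("IS estimate :", np.mean(rho * f[actions]))  # ~ 1.5
print("true E_pi[f]:", pi @ f)                     # 1.5
```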