
Model-Free RL

Nov 19, 2020
rl

There are two main approaches to representing and training agents with model-free RL.

Policy Optimization

Methods in this family represent a policy explicitly as $\pi_{\theta}(a|s)$. They optimize the parameters $\theta$ either directly by gradient ascent on the performance objective $J(\pi_{\theta})$, or indirectly, by maximizing local approximations of $J(\pi_{\theta})$. This optimization is almost always performed on-policy, which means that each update only uses data collected while acting according to the most recent version of the policy.
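As a concrete illustration, here is a minimal sketch of one on-policy policy-gradient update (a REINFORCE-style ascent step on $J(\pi_{\theta})$) in PyTorch. The network sizes, learning rate, and the use of per-step returns as weights are illustrative assumptions, not something specified in this note; in practice the batch of states, actions, and returns would come from rollouts of the current policy.

```python
# A minimal sketch of one on-policy policy-gradient update (REINFORCE-style),
# assuming discrete actions and that states/actions/returns were collected
# with the *current* policy. Sizes and hyperparameters are illustrative.
import torch
import torch.nn as nn

obs_dim, n_actions = 4, 2  # assumed problem sizes

# pi_theta(a|s): a small MLP that maps a state to action logits
policy = nn.Sequential(nn.Linear(obs_dim, 64), nn.Tanh(), nn.Linear(64, n_actions))
optimizer = torch.optim.Adam(policy.parameters(), lr=1e-3)

def policy_gradient_update(states, actions, returns):
    """One gradient-ascent step on J(pi_theta) via log pi_theta(a|s) * return.

    states:  (N, obs_dim) float tensor from the current policy's rollouts
    actions: (N,) long tensor of the actions actually taken
    returns: (N,) float tensor of return estimates for each step
    """
    dist = torch.distributions.Categorical(logits=policy(states))
    log_probs = dist.log_prob(actions)
    # Maximizing E[log pi * R] is done by minimizing its negative
    loss = -(log_probs * returns).mean()
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()

# Toy call with random on-policy-shaped data, just to show the interface
states = torch.randn(32, obs_dim)
actions = torch.randint(0, n_actions, (32,))
returns = torch.randn(32)
policy_gradient_update(states, actions, returns)
```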

Q-Learning (Value Optimization)

Methods in this family learn an approximator $Q_{\theta}(s,a)$ for the optimal action-value function, $Q^*(s,a)$. Typically they use an objective function based on the Bellman equation. This optimization is almost always performed off-policy, which means that each update can use data collected at any point during training, regardless of how the agent was choosing to explore the environment when the data was obtained.
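The sketch below shows one such update in PyTorch: a DQN-style step that minimizes a mean squared Bellman error on a batch sampled from a replay buffer, which is what makes the method off-policy. The agent would act (mostly) greedily with respect to $Q_{\theta}$, i.e. $a = \arg\max_a Q_{\theta}(s,a)$. The target network, discount factor, and layer sizes are assumptions for illustration, not details from this note.

```python
# A minimal sketch of an off-policy Q-learning (DQN-style) update based on the
# Bellman equation; the replay batch can come from any earlier behavior policy.
# Network sizes, gamma, and the target-network setup are illustrative.
import copy
import torch
import torch.nn as nn

obs_dim, n_actions, gamma = 4, 2, 0.99  # assumed problem sizes

# Q_theta(s, a): MLP mapping a state to one value per action
q_net = nn.Sequential(nn.Linear(obs_dim, 64), nn.ReLU(), nn.Linear(64, n_actions))
target_net = copy.deepcopy(q_net)  # periodically synced copy for stable targets
optimizer = torch.optim.Adam(q_net.parameters(), lr=1e-3)

def q_learning_update(states, actions, rewards, next_states, dones):
    """One step of minimizing the mean squared Bellman error on replay data."""
    q_sa = q_net(states).gather(1, actions.unsqueeze(1)).squeeze(1)
    with torch.no_grad():
        # Bellman target: r + gamma * max_a' Q_target(s', a') for non-terminal s'
        target = rewards + gamma * (1 - dones) * target_net(next_states).max(dim=1).values
    loss = nn.functional.mse_loss(q_sa, target)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()

# Toy call with random replay-shaped data, just to show the interface
batch = 32
q_learning_update(
    torch.randn(batch, obs_dim),
    torch.randint(0, n_actions, (batch,)),
    torch.randn(batch),
    torch.randn(batch, obs_dim),
    torch.randint(0, 2, (batch,)).float(),
)
```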

Interpolating Between Policy Optimization and Q-Learning

Algorithms that live on this spectrum are able to carefully trade off the strengths and weaknesses of either side: for example, the stability of on-policy policy optimization against the sample efficiency of off-policy Q-learning.