
Model-Free RL

Nov 19, 2020
rl

There are two main approaches to representing and training agents with model-free RL.

Policy Optimization

Methods in this family represent a policy explicitly as $\pi_{\theta}(a|s)$. They optimize the parameters $\theta$ either directly by gradient ascent on the performance objective $J(\pi_{\theta})$, or indirectly, by maximizing local approximations of $J(\pi_{\theta})$. This optimization is almost always performed on-policy, which means that each update only uses data collected while acting according to the most recent version of the policy.
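As a concrete illustration, here is a minimal sketch of one on-policy policy-gradient update (a REINFORCE-style ascent step on $J(\pi_{\theta})$) in PyTorch. The network sizes, learning rate, and the use of per-step returns as weights are illustrative assumptions, not something specified in this note; in practice the batch of states, actions, and returns would come from rollouts of the current policy.

```python
# A minimal sketch of one on-policy policy-gradient update (REINFORCE-style),
# assuming discrete actions and that states/actions/returns were collected
# with the *current* policy. Sizes and hyperparameters are illustrative.
import torch
import torch.nn as nn

obs_dim, n_actions = 4, 2  # assumed problem sizes

# pi_theta(a|s): a small MLP that maps a state to action logits
policy = nn.Sequential(nn.Linear(obs_dim, 64), nn.Tanh(), nn.Linear(64, n_actions))
optimizer = torch.optim.Adam(policy.parameters(), lr=1e-3)

def policy_gradient_update(states, actions, returns):
    """One gradient-ascent step on J(pi_theta) via log pi_theta(a|s) * return.

    states:  (N, obs_dim) float tensor from the current policy's rollouts
    actions: (N,) long tensor of the actions actually taken
    returns: (N,) float tensor of return estimates for each step
    """
    dist = torch.distributions.Categorical(logits=policy(states))
    log_probs = dist.log_prob(actions)
    # Maximizing E[log pi * R] is done by minimizing its negative
    loss = -(log_probs * returns).mean()
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()

# Toy call with random on-policy-shaped data, just to show the interface
states = torch.randn(32, obs_dim)
actions = torch.randint(0, n_actions, (32,))
returns = torch.randn(32)
policy_gradient_update(states, actions, returns)
```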

Q-Learning (Value Optimization)

Methods in this family learn an approximator $Q_{\theta}(s,a)$ for the optimal action-value function, $Q^*(s,a)$. Typically they use an objective function based on the Bellman equation. This optimization is almost always performed off-policy, which means that each update can use data collected at any point during training, regardless of how the agent was choosing to explore the environment when the data was obtained.
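The sketch below shows one such update in PyTorch: a DQN-style step that minimizes a mean squared Bellman error on a batch sampled from a replay buffer, which is what makes the method off-policy. The agent would act (mostly) greedily with respect to $Q_{\theta}$, i.e. $a = \arg\max_a Q_{\theta}(s,a)$. The target network, discount factor, and layer sizes are assumptions for illustration, not details from this note.

```python
# A minimal sketch of an off-policy Q-learning (DQN-style) update based on the
# Bellman equation; the replay batch can come from any earlier behavior policy.
# Network sizes, gamma, and the target-network setup are illustrative.
import copy
import torch
import torch.nn as nn

obs_dim, n_actions, gamma = 4, 2, 0.99  # assumed problem sizes

# Q_theta(s, a): MLP mapping a state to one value per action
q_net = nn.Sequential(nn.Linear(obs_dim, 64), nn.ReLU(), nn.Linear(64, n_actions))
target_net = copy.deepcopy(q_net)  # periodically synced copy for stable targets
optimizer = torch.optim.Adam(q_net.parameters(), lr=1e-3)

def q_learning_update(states, actions, rewards, next_states, dones):
    """One step of minimizing the mean squared Bellman error on replay data."""
    q_sa = q_net(states).gather(1, actions.unsqueeze(1)).squeeze(1)
    with torch.no_grad():
        # Bellman target: r + gamma * max_a' Q_target(s', a') for non-terminal s'
        target = rewards + gamma * (1 - dones) * target_net(next_states).max(dim=1).values
    loss = nn.functional.mse_loss(q_sa, target)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()

# Toy call with random replay-shaped data, just to show the interface
batch = 32
q_learning_update(
    torch.randn(batch, obs_dim),
    torch.randint(0, n_actions, (batch,)),
    torch.randn(batch),
    torch.randn(batch, obs_dim),
    torch.randint(0, 2, (batch,)).float(),
)
```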

Interpolating Between Policy Optimization and Q-Learning

Algorithms that live on this spectrum are able to carefully trade off the strengths and weaknesses of either side: for example, the stability of on-policy policy optimization against the sample efficiency of off-policy Q-learning.