Reinforcement Learning
Key Concepts

States and Observations
- A state $s$ is a complete description of the state of the world. There is no information about the world which is hidden from the state. An observation $o$ is a partial description of a state, which may omit information.
- When the agent is able to observe the complete state of the environment, we say that the environment is fully observed. When the agent can only see a partial observation, we say that the environment is partially observed.
Actions
- Some environments, like Atari and Go, have discrete action spaces: only a finite number of moves are available to the agent.
- Other environments, like where the agent controls a robot in a physical world, have continuous action spaces.
Policies
- A policy is a rule used by an agent to decide what actions to take.
- Deterministic Policies output a specific action for a given state.
- Stochastic Policies output a distribution over actions, from which an action can be sampled (a minimal sketch of both follows this list).
- # Categorical Policies for discrete action spaces.
- # Diagonal Gaussian Policies for continuous action spaces.
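To make the distinction concrete, here is a minimal NumPy sketch (not from any particular library) of a deterministic argmax policy and a stochastic categorical (softmax) policy over a small discrete action space. The observation size, weight matrix, and function names are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

def deterministic_policy(obs, weights):
    """Deterministic policy: always returns the argmax action for this observation."""
    logits = obs @ weights          # shape: (num_actions,)
    return int(np.argmax(logits))

def categorical_policy(obs, weights):
    """Stochastic (categorical) policy: samples an action from a softmax distribution."""
    logits = obs @ weights
    probs = np.exp(logits - logits.max())
    probs /= probs.sum()
    return int(rng.choice(len(probs), p=probs))

# Toy example: 4-dimensional observation, 3 discrete actions, random weights.
obs = rng.normal(size=4)
weights = rng.normal(size=(4, 3))
print(deterministic_policy(obs, weights), categorical_policy(obs, weights))
```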
Trajectories
A trajectory τ is a sequence of states and actions in the world,
\[ \tau = (s_0, a_0, s_1, a_1, ...), \]
where $s_0 \sim \rho_0(\cdot)$ is sampled from the start-state distribution, $a_t \sim \pi(\cdot|s_t)$, and $s_{t+1} \sim P(\cdot|s_t, a_t)$. (With a deterministic policy or deterministic dynamics, these become $a_t = \mu(s_t)$ and $s_{t+1} = f(s_t, a_t)$.)
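As a rough illustration, the sketch below rolls out one trajectory under toy, made-up dynamics and a toy noisy policy; `sample_initial_state`, `transition`, and `policy` are hypothetical stand-ins for $\rho_0$, $P$, and $\pi$.

```python
import numpy as np

rng = np.random.default_rng(0)

def sample_initial_state():
    """s_0 ~ rho_0: toy start-state distribution."""
    return rng.normal(size=2)

def transition(state, action):
    """s_{t+1} ~ P(. | s_t, a_t): toy stochastic dynamics."""
    return state + 0.1 * action + 0.01 * rng.normal(size=2)

def policy(state):
    """a_t ~ pi(. | s_t): toy stochastic policy centred on -state."""
    return -state + 0.1 * rng.normal(size=2)

def collect_trajectory(horizon=5):
    """Roll out the policy and record (s_0, a_0, s_1, a_1, ...)."""
    state = sample_initial_state()
    tau = []
    for _ in range(horizon):
        action = policy(state)
        tau.extend([state, action])
        state = transition(state, action)
    tau.append(state)  # final state
    return tau

print(collect_trajectory())
```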
Rewards
The reward function $R$ is critically important in reinforcement learning. It depends on the current state of the world, the action just taken, and the next state of the world:
\[ r_t = R(s_t, a_t, s_{t+1}) \]
Return
One kind of return is the finite-horizon undiscounted return, which is just the sum of rewards obtained in a fixed window of steps:
\[ R(\tau) = \sum_{t=0}^T r_t. \]
Another kind is the infinite-horizon discounted return, which weights rewards by how far in the future they are obtained:
\[ R(\tau) = \sum_{t=0}^{\infty} \gamma^t r_t, \]
where $\gamma \in (0,1)$ is the discount factor. This is the $\gamma$ that appears in the Bellman equations below.
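A small sketch of both kinds of return, computed from an assumed list of rewards from one trajectory:

```python
import numpy as np

rewards = np.array([1.0, 0.0, 0.5, 2.0])   # r_0 ... r_T from one trajectory (made up)

# Finite-horizon undiscounted return: plain sum of rewards.
undiscounted_return = rewards.sum()

# Infinite-horizon discounted return (truncated at T): sum of gamma^t * r_t.
gamma = 0.99
discounted_return = np.sum(gamma ** np.arange(len(rewards)) * rewards)

print(undiscounted_return, discounted_return)
```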
Objectives
The goal in RL is to select a policy which maximizes expected return when the agent acts according to it.
Let's suppose that both the environment transitions and the policy are stochastic. In this case, the probability of a T-step trajectory is:
\[ P(\tau|\pi) = \rho_0 (s_0) \prod_{t=0}^{T-1} P(s_{t+1} | s_t, a_t) \pi(a_t | s_t). \]
The expected return, denoted by $J(\pi)$, is then:
\[ J(\pi) = \int_{\tau} P(\tau|\pi) R(\tau) = \mathbb{E}_{\tau\sim \pi}[R(\tau)]. \]
The central optimization problem in RL can then be expressed by
\[ \pi^* = \arg \max_{\pi} J(\pi), \]
with $\pi^*$ being the optimal policy.
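Because the integral over all trajectories is rarely tractable, $J(\pi)$ is in practice usually estimated by averaging the returns of sampled trajectories. The sketch below does exactly that Monte Carlo estimate on a toy, made-up MDP and policy; the dynamics, reward, and names are all illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

def rollout_return(horizon=10):
    """Sample one trajectory from a toy MDP under a toy policy and return R(tau)."""
    state = rng.normal(size=2)                      # s_0 ~ rho_0
    total = 0.0
    for _ in range(horizon):
        action = -state + 0.1 * rng.normal(size=2)  # a_t ~ pi(. | s_t)
        reward = -float(np.sum(state ** 2))         # r_t (toy reward function)
        state = state + 0.1 * action + 0.01 * rng.normal(size=2)  # s_{t+1} ~ P
        total += reward
    return total

# Monte Carlo estimate of J(pi): average return over sampled trajectories.
returns = [rollout_return() for _ in range(1000)]
print(np.mean(returns))
```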
On-Policy Value Function
The On-Policy Value Function, $V^{\pi}(s)$, gives the expected return if you start in state $s$ and always act according to policy $\pi$:
\[ V^{\pi}(s) = \mathbb{E}_{\tau \sim \pi}[{R(\tau)\left| s_0 = s\right.}] \]
On-Policy Action-Value Function
The On-Policy Action-Value Function, $Q^{\pi}(s,a)$, gives the expected return if you start in state $s$, take an arbitrary action $a$ (which may not have come from the policy), and then forever after act according to policy $\pi$:
\[ Q^{\pi}(s,a) = \mathbb{E}_{\tau \sim \pi}[{R(\tau)\left| s_0 = s, a_0 = a\right.}] \]
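Both definitions can be checked numerically on a small tabular MDP by rolling out the policy from a fixed start state (and, for $Q^{\pi}$, a fixed first action). The sketch below uses the discounted return with an assumed $\gamma = 0.9$; the arrays `P`, `R`, and `pi` are made-up toy values, not a reference implementation.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy tabular MDP: 3 states, 2 actions, made-up transitions, rewards, and policy.
P = np.array([  # P[s, a, s'] = probability of next state s'
    [[0.8, 0.2, 0.0], [0.1, 0.6, 0.3]],
    [[0.0, 0.9, 0.1], [0.3, 0.3, 0.4]],
    [[0.5, 0.0, 0.5], [0.0, 0.2, 0.8]],
])
R = np.array([[1.0, 0.0], [0.5, 2.0], [0.0, 1.0]])   # R[s, a]
pi = np.array([[0.5, 0.5], [0.2, 0.8], [0.9, 0.1]])  # pi[s, a] = pi(a|s)
gamma = 0.9

def mc_value(s0, a0=None, horizon=100, episodes=500):
    """Monte Carlo estimate of V^pi(s0) (if a0 is None) or Q^pi(s0, a0)."""
    totals = []
    for _ in range(episodes):
        s, total = s0, 0.0
        for t in range(horizon):
            a = a0 if (t == 0 and a0 is not None) else rng.choice(2, p=pi[s])
            total += gamma ** t * R[s, a]
            s = rng.choice(3, p=P[s, a])
        totals.append(total)
    return np.mean(totals)

print("V(0) ~", mc_value(0), "  Q(0, 1) ~", mc_value(0, a0=1))
```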
Bellman Equation
The Bellman equations for the on-policy value functions are
$\begin{aligned} V^{\pi}(s) &= \mathbb{E}_{a \sim \pi, s'\sim P} [r(s,a) + \gamma V^{\pi}(s')], \\ Q^{\pi}(s,a) &= \mathbb{E}_{s'\sim P} [r(s,a) + \gamma \mathbb{E}_{a'\sim \pi}[Q^{\pi}(s',a')]], \end{aligned}$
where $s' \sim P$ is shorthand for $s' \sim P(\cdot |s,a)$, indicating that the next state $s'$ is sampled from the environment’s transition rules; $a \sim \pi$ is shorthand for $a \sim \pi(\cdot|s)$; and $a' \sim \pi$ is shorthand for $a' \sim \pi(\cdot|s')$.
The Bellman equations for the optimal value functions are
$\begin{aligned} V^*(s) &= \max_a \mathbb{E}_{s'\sim P}[{r(s,a) + \gamma V^*(s')}], \\ Q^*(s,a) &= \mathbb{E}_{s'\sim P}[{r(s,a) + \gamma \max_{a'} Q^*(s',a')}]. \end{aligned}$
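The Bellman equations can be read as fixed-point updates. The sketch below applies them to the same kind of toy tabular MDP as above: iterative policy evaluation for $V^{\pi}$ and value iteration for $V^*$. The arrays are illustrative assumptions.

```python
import numpy as np

# Toy tabular MDP (same made-up values as above): P[s, a, s'], R[s, a], pi[s, a].
P = np.array([
    [[0.8, 0.2, 0.0], [0.1, 0.6, 0.3]],
    [[0.0, 0.9, 0.1], [0.3, 0.3, 0.4]],
    [[0.5, 0.0, 0.5], [0.0, 0.2, 0.8]],
])
R = np.array([[1.0, 0.0], [0.5, 2.0], [0.0, 1.0]])
pi = np.array([[0.5, 0.5], [0.2, 0.8], [0.9, 0.1]])
gamma = 0.9

# Policy evaluation: iterate V(s) <- E_{a~pi, s'~P}[r(s,a) + gamma * V(s')].
V = np.zeros(3)
for _ in range(500):
    V = np.einsum("sa,sa->s", pi, R + gamma * P @ V)

# Value iteration: iterate V*(s) <- max_a E_{s'~P}[r(s,a) + gamma * V*(s')].
V_star = np.zeros(3)
for _ in range(500):
    V_star = np.max(R + gamma * P @ V_star, axis=1)

print("V^pi =", V)
print("V^*  =", V_star)
```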
Advantage
The advantage function $A^{\pi}(s,a)$ corresponding to a policy $\pi$ describes how much better it is to take a specific action $a$ in state $s$, over randomly selecting an action according to $\pi(\cdot|s)$, assuming you act according to $\pi$ forever after. Mathematically, the advantage function is defined by
\[ A^{\pi}(s,a) = Q^{\pi}(s,a) - V^{\pi}(s). \]
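Using the identity $V^{\pi}(s) = \mathbb{E}_{a \sim \pi}[Q^{\pi}(s,a)]$, the advantage can be computed directly from a Q-table and a policy table; the numbers below are made-up toy values for illustration only.

```python
import numpy as np

# Assumed Q-table Q[s, a] and policy table pi[s, a] for a 3-state, 2-action MDP.
Q = np.array([[1.0, 2.0], [0.5, 0.3], [4.0, 1.0]])
pi = np.array([[0.5, 0.5], [0.2, 0.8], [0.9, 0.1]])

V = np.sum(pi * Q, axis=1)   # V^pi(s) = E_{a~pi}[Q^pi(s, a)]
A = Q - V[:, None]           # A^pi(s, a) = Q^pi(s, a) - V^pi(s)
print(A)
```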
Kinds of RL Algorithms
One of the most important branching points in an RL algorithm is the question of whether the agent has access to (or learns) a model of the environment. By a model of the environment, we mean a function which predicts state transitions and rewards. Algorithms which use a model are called # Model Based RL, and those that don’t are called # Model Free RL.
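As a rough sketch of what "a model" means here, the function below predicts the next state and reward from hand-written toy dynamics, and a one-step planner picks the candidate action with the best predicted reward. Everything in it is an illustrative assumption, not an actual model-based algorithm.

```python
import numpy as np

def model(state, action):
    """A model of the environment: predicts the next state and the reward.
    This toy model is hand-written; in model-based RL it is usually learned."""
    next_state = state + 0.1 * action           # assumed toy dynamics
    reward = -float(np.sum(next_state ** 2))    # assumed toy reward
    return next_state, reward

# Model-based planning sketch: pick the action whose predicted outcome looks best.
state = np.array([1.0, -0.5])
candidates = [np.array([1.0, 0.0]), np.array([0.0, 1.0]), np.array([-1.0, 0.5])]
best = max(candidates, key=lambda a: model(state, a)[1])
print(best)
```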