
Reinforcement Learning

Nov 19, 2025
rl

Key Concepts

[Figure: RL diagram]

States and Observations

  • A state $s$ is a complete description of the state of the world. There is no information about the world which is hidden from the state. An observation $o$ is a partial description of a state, which may omit information.
  • When the agent is able to observe the complete state of the environment, we say that the environment is fully observed. When the agent can only see a partial observation, we say that the environment is partially observed.

Actions

  • Some environments, like Atari and Go, have discrete action spaces: only a finite number of moves are available to the agent.
  • Other environments, like where the agent controls a robot in a physical world, have continuous action spaces.

Policies

  • A policy is a rule used by an agent to decide what actions to take.
  • Deterministic policies map each state to one specific action.
  • Stochastic policies output a probability distribution over actions, from which an action is sampled (see the sketch below).
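
A minimal sketch of the two policy types, assuming a small discrete action space and a hand-rolled linear score function (the parameters and functions here are illustrative, not from any particular library):

```python
import numpy as np

rng = np.random.default_rng(0)
N_ACTIONS = 4
W = rng.normal(size=(N_ACTIONS, 3))  # illustrative policy parameters

def deterministic_policy(state: np.ndarray) -> int:
    """Map a state directly to one specific action (here: argmax of a linear score)."""
    scores = W @ state
    return int(np.argmax(scores))

def stochastic_policy(state: np.ndarray) -> int:
    """Map a state to a distribution over actions (softmax of scores) and sample from it."""
    scores = W @ state
    probs = np.exp(scores - scores.max())
    probs /= probs.sum()
    return int(rng.choice(N_ACTIONS, p=probs))

s = np.array([0.1, -0.5, 2.0])
print(deterministic_policy(s))  # always the same action for this state
print(stochastic_policy(s))     # may differ from call to call
```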

Trajectories

A trajectory τ is a sequence of states and actions in the world,

\[ \tau = (s_0, a_0, s_1, a_1, ...), \]

where $s_0 \sim \rho_0(\cdot)$ is drawn from the start-state distribution, and the actions and transitions follow $a_t = \pi(s_t)$ and $s_{t+1} = P(s_t, a_t)$ when the policy and dynamics are deterministic, or $a_t \sim \pi(\cdot|s_t)$ and $s_{t+1} \sim P(\cdot|s_t, a_t)$ when they are stochastic (the form used in the Objectives section below).
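
A sketch of how a trajectory is collected, using a made-up toy MDP with a deterministic policy and deterministic transitions (all functions below are illustrative):

```python
import numpy as np

rng = np.random.default_rng(1)
N_STATES, N_ACTIONS, HORIZON = 5, 2, 10

def rho_0():
    """Sample a start state s_0 ~ rho_0(.)."""
    return int(rng.integers(N_STATES))

def policy(s):
    """A toy deterministic policy a_t = pi(s_t)."""
    return s % N_ACTIONS

def transition(s, a):
    """A toy deterministic transition s_{t+1} = P(s_t, a_t)."""
    return (s + a + 1) % N_STATES

def rollout():
    """Collect tau = (s_0, a_0, s_1, a_1, ...) for a fixed horizon."""
    s = rho_0()
    tau = []
    for _ in range(HORIZON):
        a = policy(s)
        tau.extend([s, a])
        s = transition(s, a)
    return tau

print(rollout())
```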

Rewards

The reward function $R$ is critically important in reinforcement learning. It depends on the current state of the world, the action just taken, and the next state of the world:

\[ r_t = R(s_t, a_t, s_{t+1}) \]

Return

One kind of return is the finite-horizon undiscounted return, which is just the sum of rewards obtained in a fixed window of steps:

\[ R(\tau) = \sum_{t=0}^T r_t. \]
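
A minimal sketch of computing this return from a list of per-step rewards; a discounted variant (using the discount factor $\gamma$ that also appears in the Bellman equations below) is shown alongside for comparison:

```python
def undiscounted_return(rewards):
    """Finite-horizon undiscounted return: R(tau) = sum_{t=0}^{T} r_t."""
    return sum(rewards)

def discounted_return(rewards, gamma=0.99):
    """Discounted return over the same window: sum_t gamma^t * r_t."""
    return sum((gamma ** t) * r for t, r in enumerate(rewards))

rewards = [1.0, 0.0, -0.5, 2.0]
print(undiscounted_return(rewards))     # 2.5
print(discounted_return(rewards, 0.9))  # 1.0 - 0.405 + 1.458 = 2.053
```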

Objectives

The goal in RL is to select a policy which maximizes expected return when the agent acts according to it.

Let's suppose that both the environment transitions and the policy are stochastic. In this case, the probability of a T-step trajectory is:

\[ P(\tau|\pi) = \rho_0 (s_0) \prod_{t=0}^{T-1} P(s_{t+1} | s_t, a_t) \pi(a_t | s_t). \]

The expected return, denoted by $J(\pi)$, is then:

\[ J(\pi) = \int_{\tau} P(\tau|\pi) R(\tau) = \mathbb{E}_{\tau\sim \pi}[R(\tau)]. \]

The central optimization problem in RL can then be expressed by

\[ \pi^* = \arg \max_{\pi} J(\pi), \]

with $\pi^*$ being the optimal policy.
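
The expectation over trajectories is rarely computed exactly; in practice $J(\pi)$ is estimated by sampling trajectories under $\pi$ and averaging their returns. A Monte Carlo sketch on a toy stochastic MDP (all distributions below are illustrative stand-ins):

```python
import numpy as np

rng = np.random.default_rng(2)
N_STATES, N_ACTIONS, HORIZON = 5, 2, 10

def sample_trajectory_return():
    """Sample one trajectory from a toy stochastic MDP and policy, and return R(tau)."""
    s = int(rng.integers(N_STATES))           # s_0 ~ rho_0: uniform toy start distribution
    total = 0.0
    for _ in range(HORIZON):
        a = int(rng.integers(N_ACTIONS))      # a_t ~ pi(.|s_t): uniform toy policy
        s_next = int(rng.integers(N_STATES))  # s_{t+1} ~ P(.|s_t, a_t): uniform toy transitions
        total += float(s == s_next)           # toy reward depending on s_t and s_{t+1}
        s = s_next
    return total

# Monte Carlo estimate of J(pi) = E_{tau ~ pi}[R(tau)]
returns = [sample_trajectory_return() for _ in range(10_000)]
print(np.mean(returns))
```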

On-Policy Value Function

The on-policy value function, $V^{\pi}(s)$, gives the expected return if you start in state $s$ and always act according to policy $\pi$:

\[ V^{\pi}(s) = \mathbb{E}_{\tau \sim \pi}[{R(\tau)\left| s_0 = s\right.}] \]

On-Policy Action-Value Function

The on-policy action-value function, $Q^{\pi}(s,a)$, gives the expected return if you start in state $s$, take an arbitrary action $a$ (which may not have come from the policy), and then forever after act according to policy $\pi$:

\[ Q^{\pi}(s,a) = \mathbb{E}_{\tau \sim \pi}[{R(\tau)\left| s_0 = s, a_0 = a\right.}] \]
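
Both value functions can be estimated the same way as $J(\pi)$, just with the start conditioned: fix $s_0 = s$ (and, for $Q^{\pi}$, also fix $a_0 = a$), roll out $\pi$, and average the returns. A toy Monte Carlo sketch (environment and policy are illustrative):

```python
import numpy as np

rng = np.random.default_rng(3)
N_STATES, N_ACTIONS, HORIZON = 5, 2, 10

def step(s, a):
    """Toy stochastic transition and reward, for illustration only."""
    s_next = int(rng.integers(N_STATES))
    r = float(s_next == 0) + 0.1 * a
    return s_next, r

def rollout_return(s0, a0=None):
    """Return R(tau) with s_0 fixed, and a_0 also fixed when estimating Q."""
    s, total = s0, 0.0
    for t in range(HORIZON):
        a = a0 if (t == 0 and a0 is not None) else int(rng.integers(N_ACTIONS))  # pi: uniform
        s, r = step(s, a)
        total += r
    return total

# V^pi(s) ~ mean of R(tau) with s_0 = s;  Q^pi(s, a) ~ mean with s_0 = s, a_0 = a
V_hat = np.mean([rollout_return(s0=2) for _ in range(5000)])
Q_hat = np.mean([rollout_return(s0=2, a0=1) for _ in range(5000)])
print(V_hat, Q_hat)
```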

Bellman Equation

The Bellman equations for the on-policy value functions are

\[ \begin{aligned} V^{\pi}(s) &= \mathbb{E}_{a \sim \pi, s'\sim P} [r(s,a) + \gamma V^{\pi}(s')], \\ Q^{\pi}(s,a) &= \mathbb{E}_{s'\sim P} [r(s,a) + \gamma \mathbb{E}_{a'\sim \pi}[Q^{\pi}(s',a')]], \end{aligned} \]

where $s' \sim P$ is shorthand for $s' \sim P(\cdot |s,a)$, indicating that the next state $s'$ is sampled from the environment’s transition rules; $a \sim \pi$ is shorthand for $a \sim \pi(\cdot|s)$; and $a' \sim \pi$ is shorthand for $a' \sim \pi(\cdot|s')$.

The Bellman equations for the optimal value functions are

\[ \begin{aligned} V^*(s) &= \max_a \mathbb{E}_{s'\sim P}[r(s,a) + \gamma V^*(s')], \\ Q^*(s,a) &= \mathbb{E}_{s'\sim P}[r(s,a) + \gamma \max_{a'} Q^*(s',a')]. \end{aligned} \]
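
When the transition probabilities of a small tabular MDP are known, the Bellman optimality equation for $V^*$ can be applied directly as an iterative update (standard value iteration): repeatedly replace $V(s)$ with $\max_a \mathbb{E}[r + \gamma V(s')]$. A sketch on a randomly generated toy MDP (all tensors are illustrative):

```python
import numpy as np

rng = np.random.default_rng(4)
N_S, N_A, GAMMA = 4, 2, 0.9

# Toy MDP: P[s, a, s'] = transition probabilities, R[s, a] = expected reward r(s, a)
P = rng.random((N_S, N_A, N_S))
P /= P.sum(axis=2, keepdims=True)
R = rng.random((N_S, N_A))

V = np.zeros(N_S)
for _ in range(1000):
    # Q(s, a) = r(s, a) + gamma * sum_{s'} P(s'|s, a) V(s')
    Q = R + GAMMA * P @ V
    V_new = Q.max(axis=1)  # Bellman optimality backup: V*(s) = max_a Q(s, a)
    if np.max(np.abs(V_new - V)) < 1e-8:
        break
    V = V_new

print(V)                 # approximate V*
print(Q.argmax(axis=1))  # greedy (optimal) action in each state
```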

Advantage

The advantage function $A^{\pi}(s,a)$ corresponding to a policy $\pi$ describes how much better it is to take a specific action $a$ in state $s$, over randomly selecting an action according to $\pi(\cdot|s)$, assuming you act according to $\pi$ forever after. Mathematically, the advantage function is defined by

\[ A^{\pi}(s,a) = Q^{\pi}(s,a) - V^{\pi}(s). \]
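
Given tabular estimates of $Q^{\pi}$ and a policy $\pi$, $V^{\pi}$ follows from $V^{\pi}(s) = \mathbb{E}_{a \sim \pi}[Q^{\pi}(s,a)]$ and the advantage is an elementwise difference. A small sketch with a uniform policy and made-up values:

```python
import numpy as np

rng = np.random.default_rng(5)
N_S, N_A = 4, 2

Q = rng.random((N_S, N_A))            # stand-in for Q^pi(s, a)
pi = np.full((N_S, N_A), 1.0 / N_A)   # uniform policy pi(a|s)

V = (pi * Q).sum(axis=1)              # V^pi(s) = E_{a ~ pi}[Q^pi(s, a)]
A = Q - V[:, None]                    # A^pi(s, a) = Q^pi(s, a) - V^pi(s)

print(A)
print((pi * A).sum(axis=1))           # E_{a ~ pi}[A^pi(s, a)] = 0 in every state
```

Note that the advantage averages to zero under the policy itself, which is why it measures how much better a specific action is than the policy's typical choice.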

Kinds of RL Algorithms

[Figure: RL algorithms]

One of the most important branching points in an RL algorithm is the question of whether the agent has access to (or learns) a model of the environment. By a model of the environment, we mean a function which predicts state transitions and rewards. Algorithms which use a model are called model-based RL, and those that don't are called model-free RL.