Probability Path

Conditional Gaussian Probability Path

The Gaussian conditional probability path forms the foundation of denoising diffusion models and flow matching.

Definition: Let $\alpha_t, \beta_t$ be noise schedulers: two continuously differentiable, monotonic functions with boundary conditions $\alpha_0 = \beta_1 = 0$ and $\alpha_1 = \beta_0 = 1$. We define the conditional probability path as a family of distributions over $\mathbb{R}^d$:

$$p_t(\cdot \mid z) = \mathcal{N}(\alpha_t z,\, \beta_t^2 I_d).$$

Boundary Conditions: The imposed conditions on $\alpha_t$ and $\beta_t$ ensure:

$$p_0(\cdot \mid z) = \mathcal{N}(0, I_d), \qquad p_1(\cdot \mid z) = \delta_z,$$

where $z$ is a data point and $\delta_z$ is the Dirac delta “distribution”: sampling from $\delta_z$ always returns $z$.

Marginal Gaussian Probability Path

Sampling Procedure: For $z \sim p_{\mathrm{data}}$ and $\epsilon \sim \mathcal{N}(0, I_d)$, we can sample from the marginal path:

$$x = \alpha_t z + \beta_t \epsilon \sim p_t.$$

This provides a tractable way to sample from the marginal distribution $p_t$.
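The sampling rule above takes only a few lines of NumPy. As a minimal sketch, the concrete schedulers below ($\alpha_t = t$, $\beta_t = 1 - t$) are one assumed choice satisfying the boundary conditions; any valid pair of schedulers can be passed in:

```python
import numpy as np

def sample_conditional_path(z, t, alpha, beta, rng):
    """Draw x ~ p_t(.|z) = N(alpha(t) z, beta(t)^2 I) via x = alpha(t) z + beta(t) eps."""
    eps = rng.standard_normal(z.shape)
    return alpha(t) * z + beta(t) * eps

# Example schedulers satisfying alpha(0) = beta(1) = 0 and alpha(1) = beta(0) = 1.
alpha = lambda t: t
beta = lambda t: 1.0 - t

rng = np.random.default_rng(0)
z = np.array([2.0, -1.0])
x1 = sample_conditional_path(z, 1.0, alpha, beta, rng)  # t = 1: collapses to z (Dirac delta)
x0 = sample_conditional_path(z, 0.0, alpha, beta, rng)  # t = 0: pure N(0, I) noise
```

At $t = 1$ the noise term vanishes and the sample is exactly $z$, matching the boundary condition $p_1(\cdot \mid z) = \delta_z$.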

Vector Field

Conditional Gaussian Vector Field

Definition: Let $\dot\alpha_t$ and $\dot\beta_t$ denote the respective time derivatives of $\alpha_t$ and $\beta_t$. The conditional Gaussian vector field, given by

$$u_t^{\mathrm{target}}(x \mid z) = \left(\dot\alpha_t - \frac{\dot\beta_t}{\beta_t}\alpha_t\right) z + \frac{\dot\beta_t}{\beta_t} x,$$

is a valid conditional vector field model.

Property: This vector field generates ODE trajectories $X_t$ that satisfy $X_t \sim p_t(\cdot \mid z)$ if $X_0 \sim p_0(\cdot \mid z) = \mathcal{N}(0, I_d)$.
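This property can be checked numerically by integrating the ODE with the Euler method. The sketch below assumes the linear schedulers $\alpha_t = t$, $\beta_t = 1 - t$ and stops just short of $t = 1$, where $\beta_t = 0$ makes the field singular:

```python
import numpy as np

def cond_vector_field(x, z, t, a, b, da, db):
    """Conditional Gaussian vector field u_t(x|z), valid while b(t) > 0."""
    return (da(t) - db(t) / b(t) * a(t)) * z + db(t) / b(t) * x

def euler_simulate(x0, z, n_steps, a, b, da, db, t_end=0.999):
    """Integrate dX/dt = u_t(X|z) from t = 0 to t_end with the Euler method."""
    x = x0
    ts = np.linspace(0.0, t_end, n_steps + 1)
    for t0, t1 in zip(ts[:-1], ts[1:]):
        x = x + (t1 - t0) * cond_vector_field(x, z, t0, a, b, da, db)
    return x

# Assumed concrete schedulers: alpha_t = t, beta_t = 1 - t.
a, b = (lambda t: t), (lambda t: 1.0 - t)
da, db = (lambda t: 1.0), (lambda t: -1.0)

rng = np.random.default_rng(1)
z = np.array([3.0, -2.0])
x0 = rng.standard_normal(2)                     # X_0 ~ N(0, I)
x_end = euler_simulate(x0, z, 1000, a, b, da, db)
# For these schedulers the exact trajectory is X_t = t z + (1 - t) X_0,
# so x_end should land very close to z.
```

For this particular choice the derivative along each trajectory is constant ($z - X_0$), so the Euler trajectory matches the exact one up to round-off.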

Proof

Construct a conditional flow model by defining

$$\psi_t^{\mathrm{target}}(x \mid z) = \alpha_t z + \beta_t x.$$

If $X_t$ is the ODE trajectory of $\psi_t^{\mathrm{target}}$ with $X_0 \sim \mathcal{N}(0, I_d)$, then

$$X_t = \psi_t^{\mathrm{target}}(X_0 \mid z) = \alpha_t z + \beta_t X_0 \sim \mathcal{N}(\alpha_t z,\, \beta_t^2 I_d) = p_t(\cdot \mid z).$$

The conditional vector field is:

$$u_t^{\mathrm{target}}(x \mid z) \overset{(i)}{=} \dot\alpha_t z + \dot\beta_t\, \psi_t^{-1}(x \mid z) \overset{(ii)}{=} \dot\alpha_t z + \dot\beta_t \frac{x - \alpha_t z}{\beta_t} = \left(\dot\alpha_t - \frac{\dot\beta_t}{\beta_t}\alpha_t\right) z + \frac{\dot\beta_t}{\beta_t} x.$$

In (i), we used that the vector field is the time derivative of the flow, $u_t(\psi_t(x' \mid z) \mid z) = \frac{\mathrm{d}}{\mathrm{d}t}\psi_t(x' \mid z) = \dot\alpha_t z + \dot\beta_t x'$, evaluated at $x' = \psi_t^{-1}(x \mid z)$. In (ii), we reparameterized $\psi_t^{-1}(x \mid z) = \frac{x - \alpha_t z}{\beta_t}$.

Score Function for Conditional Gaussian Probability Paths

For the Gaussian path $p_t(\cdot \mid z) = \mathcal{N}(\alpha_t z, \beta_t^2 I_d)$, we can use the form of the Gaussian probability density to obtain the conditional Gaussian score function, which is the gradient of $\log p_t(x \mid z)$:

$$\nabla_x \log p_t(x \mid z) = -\frac{x - \alpha_t z}{\beta_t^2}.$$

This linear relationship is a unique property of Gaussian distributions and is fundamental to efficient training.
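The closed-form score can be sanity-checked against central finite differences of the Gaussian log-density; a small sketch with arbitrary illustrative values of $z$, $x$, $\alpha_t$, $\beta_t$:

```python
import numpy as np

def gaussian_score(x, z, alpha_t, beta_t):
    """Closed-form score of N(alpha_t z, beta_t^2 I): -(x - alpha_t z) / beta_t^2."""
    return -(x - alpha_t * z) / beta_t**2

def log_pt(x, z, alpha_t, beta_t):
    """Log-density of N(alpha_t z, beta_t^2 I)."""
    d = x.size
    sq = np.sum((x - alpha_t * z) ** 2)
    return -0.5 * sq / beta_t**2 - d * np.log(beta_t) - 0.5 * d * np.log(2.0 * np.pi)

# Central finite differences of log p_t should reproduce the analytic score.
z = np.array([1.0, -0.5])
x = np.array([0.3, 0.7])
alpha_t, beta_t, h = 0.6, 0.4, 1e-5
numeric = np.array([
    (log_pt(x + h * e, z, alpha_t, beta_t) - log_pt(x - h * e, z, alpha_t, beta_t)) / (2 * h)
    for e in np.eye(2)
])
```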

Flow Matching for Gaussian Conditional Probability Paths

The conditional flow matching loss is

$$\mathcal{L}_{\mathrm{CFM}}(\theta) = \mathbb{E}_{t \sim \mathrm{Unif}[0,1],\, z \sim p_{\mathrm{data}},\, x \sim p_t(\cdot \mid z)}\left[\left\| u_t^\theta(x) - u_t^{\mathrm{target}}(x \mid z)\right\|^2\right] \overset{(i)}{=} \mathbb{E}_{t,\, z,\, \epsilon \sim \mathcal{N}(0, I_d)}\left[\left\| u_t^\theta(\alpha_t z + \beta_t \epsilon) - \left(\dot\alpha_t z + \dot\beta_t \epsilon\right)\right\|^2\right].$$

In (i) we replaced $x$ by $\alpha_t z + \beta_t \epsilon$, using that $u_t^{\mathrm{target}}(\alpha_t z + \beta_t \epsilon \mid z) = \dot\alpha_t z + \dot\beta_t \epsilon$.

Let us make this even more concrete for the special case of $\alpha_t = t$ and $\beta_t = 1 - t$. The corresponding conditional probability path $p_t(\cdot \mid z) = \mathcal{N}(t z, (1-t)^2 I_d)$ is referred to as the (Gaussian) CondOT probability path. Then we have $\dot\alpha_t = 1$ and $\dot\beta_t = -1$, so that

$$\mathcal{L}_{\mathrm{CFM}}(\theta) = \mathbb{E}_{t,\, z,\, \epsilon}\left[\left\| u_t^\theta(t z + (1 - t)\epsilon) - (z - \epsilon)\right\|^2\right].$$

Models like Stable Diffusion 3 and Meta’s Movie Gen Video use this procedure.
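A Monte Carlo estimate of the CondOT loss is straightforward to write down. The sketch below assumes a generic callable `u_theta(x, t)` as the model interface (an illustrative convention, not any particular library's API); the "oracle" field peeks at $z$ and $\epsilon$, which no real model can do, but it demonstrates that the loss vanishes exactly at the conditional target:

```python
import numpy as np

def condot_cfm_loss(u_theta, z, t, eps):
    """Monte Carlo estimate of E || u_theta(t z + (1-t) eps, t) - (z - eps) ||^2."""
    x = t[:, None] * z + (1.0 - t[:, None]) * eps
    diff = u_theta(x, t) - (z - eps)
    return np.mean(np.sum(diff**2, axis=1))

rng = np.random.default_rng(2)
z = rng.standard_normal((256, 2))
t = rng.uniform(size=256)
eps = rng.standard_normal((256, 2))

# Oracle that outputs the conditional target exactly (not a real model):
loss_oracle = condot_cfm_loss(lambda x, t: z - eps, z, t, eps)
# A zero vector field, for contrast:
loss_zero = condot_cfm_loss(lambda x, t: np.zeros_like(x), z, t, eps)
```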

Training Procedure

Given a dataset of samples $z \sim p_{\mathrm{data}}$ and a vector field network $u_t^\theta$, proceed as follows for each batch of data:

  1. Sample a data example $z$ from the dataset,

  2. Sample a random time $t \sim \mathrm{Unif}[0,1]$,

  3. Sample noise $\epsilon \sim \mathcal{N}(0, I_d)$,

  4. Set $x = t z + (1 - t)\epsilon$,

  5. Compute the loss $\mathcal{L}(\theta) = \left\| u_t^\theta(x) - (z - \epsilon)\right\|^2$,

  6. Update the model parameters $\theta$ by a gradient step on $\mathcal{L}(\theta)$.
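The six steps can be sketched end-to-end with a deliberately tiny model, a linear field $u_t^\theta(x) = W x + b$ trained by plain gradient descent. This is an illustrative stand-in only (real systems use deep networks trained with automatic differentiation); the linear form lets us write the gradient of the squared loss analytically:

```python
import numpy as np

rng = np.random.default_rng(3)
data = rng.standard_normal((1024, 2)) + 2.0      # toy p_data, centred at (2, 2)

# Linear field u_theta(x) = x @ W.T + b: an illustrative stand-in only.
W = np.zeros((2, 2))
b = np.zeros(2)
lr = 0.05
losses = []

for step in range(1000):
    z = data[rng.integers(0, len(data), size=64)]   # 1. data examples
    t = rng.uniform(size=(64, 1))                   # 2. random times
    eps = rng.standard_normal((64, 2))              # 3. noise
    x = t * z + (1.0 - t) * eps                     # 4. corrupt along the CondOT path
    resid = (x @ W.T + b) - (z - eps)               # 5. residual of the CFM loss
    losses.append(np.mean(np.sum(resid**2, axis=1)))
    W -= lr * 2.0 * resid.T @ x / len(x)            # 6. analytic gradient steps
    b -= lr * 2.0 * resid.mean(axis=0)
```

The loss decreases over training; a model this small can only capture coarse structure of the target field, but the batch loop mirrors steps 1–6 exactly.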

Score Matching for Gaussian Probability Paths

The conditional score matching loss is

$$\mathcal{L}_{\mathrm{CSM}}(\theta) = \mathbb{E}_{t,\, z,\, x \sim p_t(\cdot \mid z)}\left[\left\| s_t^\theta(x) - \nabla_x \log p_t(x \mid z)\right\|^2\right] = \mathbb{E}_{t,\, z,\, \epsilon}\left[\left\| s_t^\theta(\alpha_t z + \beta_t \epsilon) + \frac{\epsilon}{\beta_t}\right\|^2\right].$$

Note that $s_t^\theta$ learns to predict $-\frac{\epsilon}{\beta_t}$, i.e. (up to scaling) the noise $\epsilon$ that was used to corrupt a data sample $z$. Therefore, the above training loss is also called denoising score matching. It was soon realized that this loss is numerically unstable for $\beta_t$ close to zero (i.e., denoising score matching only works if a sufficient amount of noise is added).
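The instability for small $\beta_t$ is visible directly in a Monte Carlo estimate of the denoising score matching loss. In this sketch the model is a trivial zero function, purely to expose the scale of the regression target $-\epsilon/\beta_t$, which blows up as $\beta_t \to 0$ (here $t \to 1$):

```python
import numpy as np

def csm_loss(s_theta, z, t, eps, alpha, beta):
    """Monte Carlo denoising score matching loss:
    E || s_theta(alpha_t z + beta_t eps, t) + eps / beta_t ||^2."""
    a, b = alpha(t)[:, None], beta(t)[:, None]
    x = a * z + b * eps
    diff = s_theta(x, t) + eps / b
    return np.mean(np.sum(diff**2, axis=1))

alpha = lambda t: t          # assumed CondOT-style schedulers
beta = lambda t: 1.0 - t

rng = np.random.default_rng(4)
z = rng.standard_normal((512, 2))
eps = rng.standard_normal((512, 2))
s_zero = lambda x, t: np.zeros_like(x)   # trivial stand-in model

loss_mid = csm_loss(s_zero, z, np.full(512, 0.5), eps, alpha, beta)    # beta_t = 0.5
loss_late = csm_loss(s_zero, z, np.full(512, 0.99), eps, alpha, beta)  # beta_t = 0.01
```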

In Denoising Diffusion Probabilistic Models (DDPM), it was proposed to drop the $\frac{1}{\beta_t}$ scaling in the loss and to reparameterize $s_t^\theta$ into a noise-predictor network $\epsilon_t^\theta$ via:

$$s_t^\theta(x) = -\frac{\epsilon_t^\theta(x)}{\beta_t},$$

thus,

$$\mathcal{L}_{\mathrm{DDPM}}(\theta) = \mathbb{E}_{t,\, z,\, \epsilon}\left[\left\| \epsilon_t^\theta(\alpha_t z + \beta_t \epsilon) - \epsilon\right\|^2\right].$$
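The reparameterization $s_t^\theta(x) = -\epsilon_t^\theta(x)/\beta_t$ can be checked pointwise: $\beta_t^2 \left\| s + \frac{\epsilon}{\beta_t} \right\|^2 = \left\| \hat\epsilon - \epsilon \right\|^2$, so the score matching and noise prediction residuals agree up to the $\beta_t^2$ factor. A quick numerical check with placeholder arrays standing in for model outputs:

```python
import numpy as np

rng = np.random.default_rng(5)
beta_t = 0.3
eps = rng.standard_normal((8, 2))       # the noise actually used to corrupt z
eps_hat = rng.standard_normal((8, 2))   # hypothetical noise-predictor outputs

s = -eps_hat / beta_t                   # DDPM reparameterization of the score model

score_residual = np.sum((s + eps / beta_t) ** 2, axis=1)   # score matching integrand
noise_residual = np.sum((eps_hat - eps) ** 2, axis=1)      # noise prediction integrand
# beta_t^2 * score_residual equals noise_residual, term by term.
```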