Diffusion models are latent variable models of the form $p_\theta(x_0) := \int p_\theta(x_{0:T}) \, dx_{1:T}$, where $x_1, \ldots, x_T$ are latents of the same dimensionality as the data $x_0 \sim q(x_0)$. $x_0$ represents true data observations such as natural images, $x_T$ represents pure Gaussian noise, and $x_t$ is an intermediate noisy version of $x_0$. The joint distribution $p_\theta(x_{0:T})$ is called the reverse process, and it is defined as a Markov chain (shown in Fig. 1) with learned Gaussian transitions starting at $p(x_T) = \mathcal{N}(x_T; 0, I)$:

$$p_\theta(x_{0:T}) := p(x_T) \prod_{t=1}^{T} p_\theta(x_{t-1} \mid x_t),$$

where $p_\theta(x_{t-1} \mid x_t) := \mathcal{N}(x_{t-1}; \mu_\theta(x_t, t), \Sigma_\theta(x_t, t))$.
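
To make the reverse chain concrete, here is a minimal NumPy sketch of ancestral sampling from it. The `mu_theta` function and the fixed per-step standard deviations `sigma` are hypothetical stand-ins (in practice a trained network and a chosen schedule), not something prescribed by the text:

```python
import numpy as np

def mu_theta(x_t, t):
    # Placeholder for the learned mean network mu_theta(x_t, t);
    # in practice this is a neural network such as a U-Net.
    return x_t

def reverse_process_sample(shape, T, sigma):
    # Start the Markov chain at pure Gaussian noise, x_T ~ N(0, I).
    x = np.random.randn(*shape)
    # Walk the chain backwards: x_{t-1} ~ N(mu_theta(x_t, t), sigma_t^2 I).
    for t in range(T, 0, -1):
        noise = np.random.randn(*shape) if t > 1 else 0.0  # no noise on the final step
        x = mu_theta(x, t) + sigma[t - 1] * noise
    return x  # an (approximate) sample from p_theta(x_0)

sample = reverse_process_sample(shape=(3, 32, 32), T=1000, sigma=np.full(1000, 0.01))
```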

What distinguishes diffusion models from other types of latent variable models is that the approximate posterior $q(x_{1:T} \mid x_0)$, called the forward process or diffusion process, is fixed to a Markov chain that gradually adds Gaussian noise to the data according to a variance schedule.

Diffusion Process

For a diffusion process from $x_{t-1}$ to $x_t$,

$$x_t = \alpha_t x_{t-1} + \beta_t \epsilon_t, \qquad \epsilon_t \sim \mathcal{N}(0, I),$$

where $\alpha_t, \beta_t > 0$ and $\alpha_t^2 + \beta_t^2 = 1$. Generally, $\beta_t$ is always close to $0$ and can be considered as the noise degree, and $\epsilon_t$ is the Gaussian noise used to diffuse $x_{t-1}$.
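
As an illustration, one forward step can be coded directly from this equation. A minimal NumPy sketch; the linear schedule for $\beta_t^2$ below is a made-up choice for demonstration, not one prescribed here:

```python
import numpy as np

T = 1000
# Hypothetical schedule (not from the text): alpha_t close to 1, so that
# beta_t = sqrt(1 - alpha_t^2) is close to 0, i.e. each step adds little noise.
alpha = np.sqrt(1.0 - np.linspace(1e-4, 2e-2, T))
beta = np.sqrt(1.0 - alpha**2)  # enforces alpha_t^2 + beta_t^2 = 1

def diffuse_one_step(x_prev, t):
    # x_t = alpha_t * x_{t-1} + beta_t * eps_t, with eps_t ~ N(0, I).
    eps_t = np.random.randn(*x_prev.shape)
    return alpha[t - 1] * x_prev + beta[t - 1] * eps_t
```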

Thus we can get:

$$\begin{aligned} x_t &= \alpha_t x_{t-1} + \beta_t \epsilon_t \\ &= \alpha_t (\alpha_{t-1} x_{t-2} + \beta_{t-1} \epsilon_{t-1}) + \beta_t \epsilon_t \\ &= \cdots \\ &= (\alpha_t \cdots \alpha_1) x_0 + (\alpha_t \cdots \alpha_2) \beta_1 \epsilon_1 + \cdots + \alpha_t \beta_{t-1} \epsilon_{t-1} + \beta_t \epsilon_t \end{aligned}$$

Since $\alpha_t^2 + \beta_t^2 = 1$, the squared coefficients above sum to one: $(\alpha_t \cdots \alpha_1)^2 + (\alpha_t \cdots \alpha_2)^2 \beta_1^2 + \cdots + \alpha_t^2 \beta_{t-1}^2 + \beta_t^2 = 1$, and a sum of independent zero-mean Gaussians is again a zero-mean Gaussian whose variance is the sum of the squared coefficients. Thus $x_t = (\alpha_t \cdots \alpha_1) x_0 + \sqrt{1 - (\alpha_t \cdots \alpha_1)^2} \, \bar{\epsilon}_t$ with $\bar{\epsilon}_t \sim \mathcal{N}(0, I)$.

When $t = T$ with $T$ large enough, $\alpha_T \cdots \alpha_1 \approx 0$, so

$$x_T \approx \bar{\epsilon}_T \sim \mathcal{N}(0, I).$$

So $x_t = \bar{\alpha}_t x_0 + \bar{\beta}_t \bar{\epsilon}_t$, where $\bar{\alpha}_t = \prod_{s=1}^{t} \alpha_s$, $\bar{\beta}_t = \sqrt{1 - \bar{\alpha}_t^2}$.
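
This closed form is easy to check numerically: iterating the one-step update $t$ times and jumping directly via $\bar{\alpha}_t$ and $\bar{\beta}_t$ should agree in distribution. A sketch continuing the hypothetical schedule above:

```python
# Cumulative coefficients from the product form:
# alpha_bar_t = prod_{s<=t} alpha_s, beta_bar_t = sqrt(1 - alpha_bar_t^2).
alpha_bar = np.cumprod(alpha)
beta_bar = np.sqrt(1.0 - alpha_bar**2)

def diffuse_iterative(x0, t):
    # Apply the one-step update t times.
    x = x0
    for s in range(1, t + 1):
        x = diffuse_one_step(x, s)
    return x

def diffuse_direct(x0, t):
    # Jump straight to x_t = alpha_bar_t * x0 + beta_bar_t * eps_bar.
    return alpha_bar[t - 1] * x0 + beta_bar[t - 1] * np.random.randn(*x0.shape)

x0 = np.ones(100_000)
for f in (diffuse_iterative, diffuse_direct):
    xt = f(x0, 200)
    print(f.__name__, xt.mean(), xt.std())  # both approx. (alpha_bar_200, beta_bar_200)
```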

Reverse Process

We are trying to minimize the distance between the original image $x_{t-1}$ and the de-noised image $\mu_\theta(x_t)$, where we parameterize $\mu_\theta(x_t) = \frac{1}{\alpha_t}(x_t - \beta_t \epsilon_\theta(x_t, t))$ in analogy with $x_{t-1} = \frac{1}{\alpha_t}(x_t - \beta_t \epsilon_t)$, i.e.:

$$\begin{aligned} \mathcal{L} &= \| x_{t-1} - \mu_\theta(x_t) \|^2 \\ &= \left\| \frac{1}{\alpha_t}(x_t - \beta_t \epsilon_t) - \mu_\theta(x_t) \right\|^2 \\ &= \left\| \frac{1}{\alpha_t}(x_t - \beta_t \epsilon_t) - \frac{1}{\alpha_t}(x_t - \beta_t \epsilon_{\theta}(x_t, t)) \right\|^2 \\ &= \frac{\beta_t^2}{\alpha_t^2} \| \epsilon_t - \epsilon_{\theta}(x_t, t) \|^2 \\ &\approx \| \epsilon_t - \epsilon_{\theta}(\alpha_t x_{t-1} + \beta_t \epsilon_t, t) \|^2 \\ &= \| \epsilon_t - \epsilon_{\theta}(\alpha_t (\bar{\alpha}_{t-1} x_0 + \bar{\beta}_{t-1} \bar{\epsilon}_{t-1}) + \beta_t \epsilon_t, t) \|^2 \\ &= \| \epsilon_t - \epsilon_{\theta}(\bar{\alpha}_t x_0 + \alpha_t \bar{\beta}_{t-1} \bar{\epsilon}_{t-1} + \beta_t \epsilon_t, t) \|^2 \end{aligned}$$
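
Read off the last line, one training step draws $x_0$, $\bar{\epsilon}_{t-1}$, $\epsilon_t$, and $t$, forms $x_t$, and regresses $\epsilon_\theta$ onto $\epsilon_t$. A minimal sketch under the same hypothetical schedule as earlier, with `eps_theta` as a stand-in for the learned network:

```python
import numpy as np

T = 1000
alpha = np.sqrt(1.0 - np.linspace(1e-4, 2e-2, T))  # same hypothetical schedule as above
beta = np.sqrt(1.0 - alpha**2)
alpha_bar = np.cumprod(alpha)
beta_bar = np.sqrt(1.0 - alpha_bar**2)

def eps_theta(x_t, t):
    # Stand-in for the learned noise-prediction network epsilon_theta(x_t, t).
    return np.zeros_like(x_t)

def training_loss(x0):
    t = np.random.randint(1, T + 1)               # timestep t
    eps_bar_prev = np.random.randn(*x0.shape)     # bar-eps_{t-1}
    eps_t = np.random.randn(*x0.shape)            # eps_t, the regression target
    ab_prev = alpha_bar[t - 2] if t > 1 else 1.0  # alpha_bar_{t-1} (alpha_bar_0 = 1)
    bb_prev = beta_bar[t - 2] if t > 1 else 0.0   # beta_bar_{t-1}  (beta_bar_0 = 0)
    # x_t = alpha_t (alpha_bar_{t-1} x0 + beta_bar_{t-1} bar-eps_{t-1}) + beta_t eps_t
    x_t = alpha[t - 1] * (ab_prev * x0 + bb_prev * eps_bar_prev) + beta[t - 1] * eps_t
    return np.mean((eps_t - eps_theta(x_t, t)) ** 2)

print(training_loss(np.zeros(64)))
```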

We sample $x_t$ from $\bar{\epsilon}_{t-1}$ and $\epsilon_t$ separately rather than resampling it as $\bar{\alpha}_t x_0 + \bar{\beta}_t \bar{\epsilon}_t$, because $\bar{\epsilon}_t$ and the regression target $\epsilon_t$ are not independent.
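
The dependence is easy to verify by simulation: since $\bar{\beta}_t \bar{\epsilon}_t = \alpha_t \bar{\beta}_{t-1} \bar{\epsilon}_{t-1} + \beta_t \epsilon_t$, the correlation between $\bar{\epsilon}_t$ and $\epsilon_t$ is $\beta_t / \bar{\beta}_t$, not zero. A quick Monte Carlo check, continuing the sketch above:

```python
t, n = 200, 1_000_000
eps_bar_prev = np.random.randn(n)
eps_t = np.random.randn(n)
# bar-eps_t is the single Gaussian that absorbs both noise terms:
# beta_bar_t * bar-eps_t = alpha_t * beta_bar_{t-1} * bar-eps_{t-1} + beta_t * eps_t
eps_bar_t = (alpha[t - 1] * beta_bar[t - 2] * eps_bar_prev
             + beta[t - 1] * eps_t) / beta_bar[t - 1]
print(np.corrcoef(eps_bar_t, eps_t)[0, 1])  # approx. beta_t / beta_bar_t, clearly nonzero
print(beta[t - 1] / beta_bar[t - 1])
```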

Currently we have four variables to sample: $x_0$, $\bar{\epsilon}_{t-1}$, $\epsilon_t$, and $t$. We can use a trick to reduce the variance of training by combining $\bar{\epsilon}_{t-1}$ and $\epsilon_t$: