As mentioned in the diffusion models introduction, a (denoising) Diffusion model is a kind of generative model that is used to generate data similar to the data it is trained on.
Fundamentally, a diffusion model consists of two prominent processes:
A fixed (pre-defined) forward diffusion process that gradually adds noise (Gaussian in the DDPM case) to an input image until we end up with pure noise.
A learned reverse diffusion process (denoising), in which we train a neural network to gradually denoise an image starting from pure noise; this is equivalent to sampling from the estimated data distribution.
Notations
$q(x_0)$ : the real data distribution
$\bar x_0$ : a data point sampled from the real data distribution
$\bar x_T$ : the final pure Gaussian noise $\mathcal{N}(\bar x_T; 0, \mathbf{I})$ after the forward diffusion process
$q(\bar x_{1:T} \vert \bar x_{0})$ : the forward diffusion process
$\beta_t$ : the fixed variance schedule in the diffusion process
Forward diffusion process
For a sample $\bar x_0$ from the given real distribution, $q(x_0)$, we define a forward diffusion process, $q(\bar x_{1:T} \vert \bar x_{0})$, in which we add a small amount of Gaussian noise to $\bar x_0$ over $T$ steps, producing a sequence of noisy samples $\bar x_1, \bar x_2, \dots, \bar x_T$, according to a pre-defined variance schedule $\{\beta_t \in (0,1)\}_{t=1}^{T}$. The data sample gradually loses its features as the step $t$ approaches $T$, such that $\bar x_T$ is equivalent to isotropic Gaussian noise.
Since the forward process is a Markov chain, the joint distribution factorizes as:
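$$q(\bar x_{1:T} \vert \bar x_0) = \prod_{t=1}^{T} q(\bar x_t \vert \bar x_{t-1}), \qquad q(\bar x_t \vert \bar x_{t-1}) = \mathcal{N}\big(\bar x_t; \sqrt{1-\beta_t}\, \bar x_{t-1}, \beta_t \mathbf{I}\big)$$

A useful property of this process is that $\bar x_t$ can be sampled at any arbitrary timestep directly from $\bar x_0$ in closed form:

$$q(\bar x_t \vert \bar x_0) = \mathcal{N}\big(\bar x_t; \sqrt{\bar \alpha_t}\, \bar x_0, (1-\bar \alpha_t) \mathbf{I}\big)$$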
Here, $\alpha_t=1-\beta_t$ and $\bar \alpha_t = \prod_{i=1}^{t} \alpha_i$.
The closed-form expression for sampling $\bar x_t$ directly from $\bar x_0$ can be derived as follows.
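Using the reparameterization $\bar x_t = \sqrt{\alpha_t}\, \bar x_{t-1} + \sqrt{1-\alpha_t}\, \epsilon_{t-1}$ with $\epsilon_{t-1} \sim \mathcal{N}(0, \mathbf{I})$, and repeatedly expanding the previous step:

$$
\begin{aligned}
\bar x_t &= \sqrt{\alpha_t}\, \bar x_{t-1} + \sqrt{1-\alpha_t}\, \epsilon_{t-1} \\
&= \sqrt{\alpha_t \alpha_{t-1}}\, \bar x_{t-2} + \sqrt{1-\alpha_t \alpha_{t-1}}\, \bar \epsilon_{t-2} \\
&\;\; \vdots \\
&= \sqrt{\bar \alpha_t}\, \bar x_0 + \sqrt{1-\bar \alpha_t}\, \epsilon
\end{aligned}
$$

The second line uses the fact that the sum of two independent zero-mean Gaussians with variances $\alpha_t(1-\alpha_{t-1})$ and $(1-\alpha_t)$ is again a zero-mean Gaussian with variance $1-\alpha_t \alpha_{t-1}$, absorbed into the merged noise term $\bar \epsilon_{t-2}$. This gives exactly $q(\bar x_t \vert \bar x_0) = \mathcal{N}\big(\bar x_t; \sqrt{\bar \alpha_t}\, \bar x_0, (1-\bar \alpha_t) \mathbf{I}\big)$.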
Reverse diffusion process
If we knew the true conditional distribution $q(\bar x_{t-1} \vert \bar x_{t})$, we could reverse the forward process: start from pure noise and gradually "denoise" it until we end up with a sample from the real distribution.
However, this conditional distribution is intractable, since computing it would require knowing the actual data distribution of the images. Hence, we use a neural network to learn an approximation $p_{\theta}(\bar x_{t-1} \vert \bar x_{t})$ of this conditional distribution.
Starting from pure Gaussian noise $p(\bar x_T) = \mathcal{N}(\bar x_T; 0, \mathbf{I})$, and assuming the reverse process to be Gaussian and Markov, the joint distribution $p_{\theta}(\bar x_{0:T})$ is given as follows:
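$$p_\theta(\bar x_{0:T}) = p(\bar x_T) \prod_{t=1}^{T} p_\theta(\bar x_{t-1} \vert \bar x_t), \qquad p_\theta(\bar x_{t-1} \vert \bar x_t) = \mathcal{N}\big(\bar x_{t-1}; \mu_\theta(\bar x_t, t), \Sigma_\theta(\bar x_t, t)\big)$$

The network is trained by minimizing the variational upper bound $L_{ELBO}$ on the negative log-likelihood:

$$-\log p_\theta(\bar x_0) \leq \mathbb{E}_q\left[\log \frac{q(\bar x_{1:T} \vert \bar x_0)}{p_\theta(\bar x_{0:T})}\right] = L_{ELBO}$$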
The following shows how to rewrite $L_{ELBO}$ almost completely in terms of KL divergences.
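$$L_{ELBO} = \mathbb{E}_q\Big[\underbrace{D_{KL}\big(q(\bar x_T \vert \bar x_0) \,\Vert\, p(\bar x_T)\big)}_{L_T} + \sum_{t=2}^{T} \underbrace{D_{KL}\big(q(\bar x_{t-1} \vert \bar x_t, \bar x_0) \,\Vert\, p_\theta(\bar x_{t-1} \vert \bar x_t)\big)}_{L_{t-1}} \underbrace{- \log p_\theta(\bar x_0 \vert \bar x_1)}_{L_0}\Big]$$

Here, $q(\bar x_{t-1} \vert \bar x_t, \bar x_0)$ is the forward process posterior, which is a tractable Gaussian when conditioned on $\bar x_0$.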
$L_{T}$
$L_T$ is a constant and can be ignored during training, since $q$ has no learnable parameters and $\bar x_T$ is Gaussian noise.
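Concretely, $L_T = D_{KL}\big(q(\bar x_T \vert \bar x_0) \,\Vert\, p(\bar x_T)\big)$ simply compares the endpoint of the fixed forward process with the fixed prior $\mathcal{N}(0, \mathbf{I})$.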
$L_0$
The reverse process consists of transformations under continuous conditional Gaussian distributions. However, we want the reverse diffusion process to produce an image with discrete integer pixel values. Therefore, at the last reverse step we need to obtain a discrete (log-)likelihood for each pixel value.
This is done by setting the last transition ($L_0$) in the reverse diffusion chain to an independent discrete decoder.
$$ L_0 = -\log p_\theta(\bar x_0 \vert \bar x_1) $$
First, the authors impose independence across the data dimensions, which allows the following parameterization:
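$$p_\theta(\bar x_0 \vert \bar x_1) = \prod_{i=1}^{D} p_\theta(\bar x_0^i \vert \bar x_1)$$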
Here, $D$ is the dimensionality of the data. Thus, the multivariate Gaussian for the step from $\bar x_1$ to $\bar x_0$ can be written as a product of univariate Gaussians, one for each of the $D$ dimensions.
The probability of a pixel value in $\bar x_0$, given the univariate Gaussian predicted for that pixel from $\bar x_1$, is the area under that Gaussian within the bucket centered at the pixel value. This can be written as follows:
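Following the DDPM paper, with the data scaled linearly to $[-1, 1]$ so that each bucket has width $\frac{2}{255}$:

$$p_\theta(\bar x_0^i \vert \bar x_1) = \int_{\delta_-(\bar x_0^i)}^{\delta_+(\bar x_0^i)} \mathcal{N}\big(x; \mu_\theta^i(\bar x_1, 1), \sigma_1^2\big)\, dx$$

$$\delta_+(x) = \begin{cases} \infty & \text{if } x = 1 \\ x + \frac{1}{255} & \text{if } x < 1 \end{cases} \qquad \delta_-(x) = \begin{cases} -\infty & \text{if } x = -1 \\ x - \frac{1}{255} & \text{if } x > -1 \end{cases}$$

Here, $\mu_\theta^i(\bar x_1, 1)$ is the $i$-th coordinate of the predicted mean at the final reverse step.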
$L_{1:T-1}$
Each remaining term $L_{t-1}$ is a KL divergence between two Gaussians: the forward process posterior $q(\bar x_{t-1} \vert \bar x_t, \bar x_0)$ and the learned reverse transition $p_\theta(\bar x_{t-1} \vert \bar x_t)$. With a fixed reverse process variance, this KL reduces to an MSE loss between the reverse process mean $\mu_\theta$ and the forward process posterior mean $\tilde \mu_t$. Thus, training a neural network for the reverse process simply means predicting the forward process posterior mean.
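Setting the reverse process variance to untrained, time-dependent constants, $\Sigma_\theta(\bar x_t, t) = \sigma_t^2 \mathbf{I}$ (as in the DDPM paper), gives:

$$L_{t-1} = \mathbb{E}_q\left[\frac{1}{2\sigma_t^2} \big\Vert \tilde \mu_t(\bar x_t, \bar x_0) - \mu_\theta(\bar x_t, t) \big\Vert^2\right] + C, \qquad \tilde \mu_t(\bar x_t, \bar x_0) = \frac{\sqrt{\alpha_t}(1-\bar \alpha_{t-1})}{1-\bar \alpha_t}\, \bar x_t + \frac{\sqrt{\bar \alpha_{t-1}}\, \beta_t}{1-\bar \alpha_t}\, \bar x_0$$

where $\tilde \mu_t$ is the forward process posterior mean and $C$ is a constant that does not depend on $\theta$.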
Training the network to predict $\mu_\theta$ directly can lead to unstable training. Since $\bar x_t$ is available as input at training time, we can instead reparameterize the Gaussian noise term and train the network to predict the noise $\epsilon_t$ by using:
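$$\bar x_t = \sqrt{\bar \alpha_t}\, \bar x_0 + \sqrt{1-\bar \alpha_t}\, \epsilon, \qquad \mu_\theta(\bar x_t, t) = \frac{1}{\sqrt{\alpha_t}}\left(\bar x_t - \frac{\beta_t}{\sqrt{1-\bar \alpha_t}}\, \epsilon_\theta(\bar x_t, t)\right)$$

Substituting this parameterization into $L_{t-1}$ turns the loss into a (weighted) MSE between the true noise $\epsilon$ and the predicted noise $\epsilon_\theta(\bar x_t, t)$.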
Empirically, the authors also found that training the diffusion model works better with a simplified objective that ignores the weighting term, leading to:
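$$L_{simple}(\theta) = \mathbb{E}_{t, \bar x_0, \epsilon}\left[\big\Vert \epsilon - \epsilon_\theta\big(\sqrt{\bar \alpha_t}\, \bar x_0 + \sqrt{1-\bar \alpha_t}\, \epsilon,\; t\big) \big\Vert^2\right]$$

where $t$ is sampled uniformly from $\{1, \dots, T\}$.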
The training and sampling algorithms for the Diffusion Model can be summarized as follows:
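Below is a minimal PyTorch-style sketch of the two algorithms. The noise-prediction network `eps_model(x, t)`, the image tensor layout `(B, C, H, W)`, and the precomputed schedule tensors are illustrative assumptions, not the paper's reference code.

```python
import torch

T = 1000
betas = torch.linspace(1e-4, 0.02, T)          # linear beta schedule
alphas = 1.0 - betas
alphas_cumprod = torch.cumprod(alphas, dim=0)  # \bar{alpha}_t


def train_step(eps_model, x0):
    """Algorithm 1: one stochastic gradient step on L_simple."""
    b = x0.shape[0]
    t = torch.randint(0, T, (b,))                          # t ~ Uniform over timesteps (0-indexed here)
    eps = torch.randn_like(x0)                             # eps ~ N(0, I)
    a_bar = alphas_cumprod[t].view(b, 1, 1, 1)
    x_t = a_bar.sqrt() * x0 + (1 - a_bar).sqrt() * eps     # sample x_t via the closed form q(x_t | x_0)
    return ((eps - eps_model(x_t, t)) ** 2).mean()         # L_simple


@torch.no_grad()
def sample(eps_model, shape):
    """Algorithm 2: start from pure noise and denoise step by step."""
    x = torch.randn(shape)                                 # x_T ~ N(0, I)
    for i in reversed(range(T)):
        t = torch.full((shape[0],), i)
        z = torch.randn_like(x) if i > 0 else torch.zeros_like(x)
        a, a_bar, beta = alphas[i], alphas_cumprod[i], betas[i]
        mean = (x - beta / (1 - a_bar).sqrt() * eps_model(x, t)) / a.sqrt()
        x = mean + beta.sqrt() * z                         # sigma_t^2 = beta_t
    return x
```

The sampler uses $\sigma_t^2 = \beta_t$; the paper reports that $\sigma_t^2 = \beta_t$ and $\sigma_t^2 = \tilde \beta_t$ gave similar results.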
Network architecture
The only requirement for the neural network is that its input and output dimensionalities remain identical. Given this restriction, it is perhaps unsurprising that diffusion models are commonly implemented with U-Net-like architectures.
Experimental setup (paper)
Timesteps for the diffusion process: $T = 1000$
Beta schedule for noise: $\beta_1 = 10^{-4}$, increasing linearly to $\beta_T = 0.02$
Model: a modified U-Net with parameters shared across timesteps
Timestep $t$ specified to the network via a sinusoidal position embedding (a minimal sketch follows)
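A minimal sketch of the sinusoidal timestep embedding; the embedding dimension and the Transformer-style frequency scaling are assumptions for illustration. In practice the embedding is further processed (e.g., by a small MLP) and injected into the U-Net's blocks.

```python
import math
import torch

def timestep_embedding(t: torch.Tensor, dim: int = 128) -> torch.Tensor:
    """Map integer timesteps t of shape [B] to sinusoidal embeddings of shape [B, dim]."""
    half = dim // 2
    # Geometrically spaced frequencies, as in the Transformer position embedding
    freqs = torch.exp(-math.log(10000.0) * torch.arange(half, dtype=torch.float32) / half)
    args = t.float()[:, None] * freqs[None, :]                     # [B, half]
    return torch.cat([torch.sin(args), torch.cos(args)], dim=-1)   # [B, dim] (dim assumed even)
```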