Diffusion model
Deep learning algorithm
In machine learning, diffusion models, also known as diffusion probabilistic models or score-based generative models, are a class of latent variable models. They are Markov chains trained using variational inference. [1] The goal of diffusion models is to learn the latent structure of a dataset by modeling the way in which data points diffuse through the latent space. In computer vision, this means that a neural network is trained to denoise images blurred with Gaussian noise by learning to reverse the diffusion process. [2] [3] A diffusion model mainly consists of three major components: the forward process, the reverse process, and the sampling procedure. [4] Three examples of generic diffusion modeling frameworks used in computer vision are denoising diffusion probabilistic models, noise conditioned score networks, and stochastic differential equations. [5]
Diffusion models were introduced in 2015 with a motivation from non-equilibrium thermodynamics . [6]
Diffusion models can be applied to a variety of tasks, including image denoising, inpainting, super-resolution, and image generation. For example, an image generation model would start with a random noise image and, after having been trained to reverse the diffusion process on natural images, would be able to generate new natural images. Announced on 13 April 2022, OpenAI's text-to-image model DALL-E 2 is a recent example. It uses diffusion models for both the model's prior (which produces an image embedding given a text caption) and the decoder that generates the final image. [7]
Mathematical principles
Generating an image in the space of all images
Consider the problem of image generation. Let $x$ represent an image, and let $q(x)$ be the probability distribution over all possible images. If we have $q(x)$ itself, then we can say for certain how likely a certain image is. However, this is intractable in general.

Most often, we are uninterested in knowing the absolute probability of a certain image — when, if ever, are we interested in how likely an image is in the space of all possible images? Instead, we are usually only interested in knowing how likely a certain image is compared to its immediate neighbors: how much more likely is this image of a cat, compared to some small variants of it? Is it more likely if the image contains two whiskers, or three, or with some Gaussian noise added?

Consequently, we are actually quite uninterested in $q(x)$ itself, but rather in the gradient of its log-likelihood, $\nabla_x \ln q(x)$. This has two effects:

- One, we no longer need to normalize $q(x)$, but can use any unnormalized $\tilde q(x) = C q(x)$, where $C > 0$ is any unknown constant that is of no concern to us, since $\nabla_x \ln \tilde q(x) = \nabla_x \ln q(x)$.
- Two, we are comparing $q(x)$ with its neighbors $q(x + dx)$, by
$$\ln \frac{q(x + dx)}{q(x)} \approx \nabla_x \ln q(x) \cdot dx.$$

Let the score function be $s(x) := \nabla_x \ln q(x)$; then consider what we can do with $s(x)$. As it turns out, $s(x)$ allows us to sample from $q(x)$ using stochastic gradient Langevin dynamics, which is essentially an infinitesimal version of Markov chain Monte Carlo. [2]
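To make this concrete, here is a minimal NumPy sketch of Langevin sampling from a score function. It assumes a toy 1-D Gaussian $q(x) = \mathcal{N}(0, 1)$ whose score $-x$ is known analytically; a real diffusion model would substitute a learned score network.

```python
import numpy as np

def score(x, mu=0.0, sigma=1.0):
    # Score of a 1-D Gaussian N(mu, sigma^2): d/dx ln q(x) = -(x - mu) / sigma^2.
    return -(x - mu) / sigma**2

def langevin_sample(score_fn, x0, step=0.01, n_steps=5000, rng=None):
    # Langevin update: x <- x + step * score(x) + sqrt(2 * step) * gaussian noise.
    rng = np.random.default_rng(0) if rng is None else rng
    x = np.array(x0, dtype=float)
    for _ in range(n_steps):
        x = x + step * score_fn(x) + np.sqrt(2 * step) * rng.standard_normal(x.shape)
    return x

# Start many chains far from the mode; after mixing they follow N(0, 1).
samples = langevin_sample(score, x0=np.full(10000, 5.0))
print(samples.mean(), samples.std())  # both near 0 and 1 respectively
```

Note that the update only ever calls the score function, never $q(x)$ itself, which is why an unnormalized density suffices.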
Learning the score function
The score function can be learned by training a network to predict the noise content of a noisy input. [1] The real score can then be recovered from this predicted noise.
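As an illustration, the sketch below assumes the DDPM-style noising $x_t = \sqrt{\bar\alpha_t}\, x_0 + \sqrt{1 - \bar\alpha_t}\, \epsilon$ from Ho et al. [1], under which the score of the noised marginal is $s(x_t) = -\hat\epsilon(x_t) / \sqrt{1 - \bar\alpha_t}$. It uses toy 1-D Gaussian data and a deliberately trivial linear "network" fit by least squares, so the recovered score can be checked against the analytic answer; real models use deep networks trained by gradient descent.

```python
import numpy as np

rng = np.random.default_rng(0)
abar = 0.6  # cumulative noise-schedule term at some fixed step t

# Clean data from a known distribution: x0 ~ N(0, 1). With this choice the
# noised marginal is again N(0, 1), so its true score is  -x_t.
x0 = rng.standard_normal(100000)
eps = rng.standard_normal(100000)
xt = np.sqrt(abar) * x0 + np.sqrt(1 - abar) * eps

# "Train" the simplest possible noise predictor eps_hat = w * x_t
# by closed-form least squares.
w = (xt @ eps) / (xt @ xt)

def learned_score(x):
    # Convert predicted noise back into a score estimate.
    return -(w * x) / np.sqrt(1 - abar)

print(learned_score(1.0))  # close to the true score -1.0
```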
Main variants
Classifier guidance
Suppose we wish to sample not from the entire distribution of images, but conditional on an image description. We don't want to sample a generic image, but an image that fits the description "black cat with red eyes". Generally, we want to sample from the distribution $p(x | y)$, where $x$ ranges over images and $y$ ranges over classes of images (a description "black cat with red eyes" is just a very detailed class, and a class "cat" is just a very vague description).

Taking the perspective of the noisy channel model, we can understand the process as follows: to generate an image $x$ conditional on a description $y$, we imagine that the requester really had in mind an image $x$, but the image is passed through a noisy channel and came out garbled as $y$. Image generation is then nothing but inferring which $x$ the requester had in mind.

In other words, conditional image generation is simply "translating from a textual language into a pictorial language". Then, as in the noisy channel model, we use Bayes' theorem to get
$$p(x | y) \propto p(y | x)\, p(x);$$
in other words, if we have a good model of the space of all images and a good image-to-class translator, we get a class-to-image translator "for free". The SGLD uses
$$\nabla_x \ln p(x | y) = \nabla_x \ln p(x) + \nabla_x \ln p(y | x)$$
where $\nabla_x \ln p(x)$ is the score function, trained as previously described, and $\nabla_x \ln p(y | x)$ is found by using a differentiable image classifier.
With temperature
The classifier-guided diffusion model samples from $p(x | y)$, which is concentrated around the maximum a posteriori estimate $\arg\max_x p(x | y)$. If we want to force the model to move towards the maximum likelihood estimate $\arg\max_x p(y | x)$, we can use
$$p_\gamma(x | y) \propto p(y | x)^\gamma\, p(x)$$
where $\gamma > 0$ is interpretable as inverse temperature. In the context of diffusion models, it is usually called the guidance scale. A high $\gamma$ would force the model to sample from a distribution concentrated around $\arg\max_x p(y | x)$. This often improves the quality of generated images. [8]

This can be done simply by SGLD with
$$\nabla_x \ln p_\gamma(x | y) = \nabla_x \ln p(x) + \gamma \nabla_x \ln p(y | x).$$
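A minimal sketch of classifier-guided SGLD, using a hypothetical 1-D Gaussian model in place of an image model: here $p(x) = \mathcal{N}(0, 1)$ stands in for the unconditional image distribution and $p(y | x) = \mathcal{N}(y; x, 1)$ for a differentiable classifier, so both gradients are analytic and the guided target distribution can be computed exactly for checking.

```python
import numpy as np

# Toy stand-ins (illustrative assumptions, not a real model):
#   unconditional model: p(x)   = N(0, 1)    -> grad_x ln p(x)   = -x
#   "classifier":        p(y|x) = N(y; x, 1) -> grad_x ln p(y|x) = y - x
def score_uncond(x):
    return -x

def grad_log_classifier(x, y):
    return y - x

def guided_score(x, y, scale):
    # grad_x ln p_gamma(x|y) = grad_x ln p(x) + gamma * grad_x ln p(y|x)
    return score_uncond(x) + scale * grad_log_classifier(x, y)

def sgld(x0, y, scale, step=0.01, n_steps=5000, rng=None):
    rng = np.random.default_rng(0) if rng is None else rng
    x = np.array(x0, dtype=float)
    for _ in range(n_steps):
        x = x + step * guided_score(x, y, scale) \
              + np.sqrt(2 * step) * rng.standard_normal(x.shape)
    return x

# For this toy model, p_gamma(x|y) = N(gamma*y / (1+gamma), 1 / (1+gamma)):
# a higher guidance scale pulls samples harder toward the condition y.
samples = sgld(np.zeros(10000), y=2.0, scale=4.0)
print(samples.mean())  # near 4*2 / (1+4) = 1.6
```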
Classifier-free guidance
If we do not have a classifier $p(y | x)$, we could still extract one out of the image model itself: [9]
$$\nabla_x \ln p(y | x) = \nabla_x \ln p(x | y) - \nabla_x \ln p(x).$$
Plugging this into the classifier-guided score gives
$$\nabla_x \ln p_\gamma(x | y) = (1 - \gamma)\, \nabla_x \ln p(x) + \gamma\, \nabla_x \ln p(x | y).$$
Such a model is usually trained by presenting it with both $(x, y)$ and $(x, \varnothing)$ (the conditioning information dropped), allowing it to model both $p(x | y)$ and $p(x)$.
This is an integral part of systems like GLIDE, [10] DALL-E [11] and Google Imagen. [12]
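The classifier-free mixing rule is a one-liner once the conditional and unconditional scores are available. The sketch below uses hypothetical analytic scores for a toy 1-D Gaussian model (the same stand-in assumptions as above, not any real system's API) to show the combination:

```python
# Toy stand-ins (illustrative assumptions):
#   p(x)   = N(0, 1)      -> grad_x ln p(x)   = -x
#   p(x|y) = N(y/2, 1/2)  -> grad_x ln p(x|y) = y - 2*x
def score_uncond(x):
    return -x

def score_cond(x, y):
    return y - 2 * x

def cfg_score(x, y, scale):
    # Classifier-free guidance: mix the two scores the model itself provides.
    # grad_x ln p_gamma(x|y) = (1 - gamma) * grad_x ln p(x) + gamma * grad_x ln p(x|y)
    return (1.0 - scale) * score_uncond(x) + scale * score_cond(x, y)

# scale = 1 recovers the plain conditional score; scale > 1 extrapolates
# away from the unconditional score, sharpening the conditioning.
print(cfg_score(0.5, y=2.0, scale=3.0))  # (1-3)*(-0.5) + 3*(2-1) = 4.0
```

For these toy densities the result agrees term by term with the classifier-guided score $\nabla_x \ln p(x) + \gamma \nabla_x \ln p(y|x)$, as the algebra predicts.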
Further reading
- Guidance: a cheat code for diffusion models. A good overview up to 2022.
References
- Ho, Jonathan; Jain, Ajay; Abbeel, Pieter (19 June 2020). "Denoising Diffusion Probabilistic Models". arXiv:2006.11239 [cs.LG].
- Song, Yang; Sohl-Dickstein, Jascha; Kingma, Diederik P.; Kumar, Abhishek; Ermon, Stefano; Poole, Ben (2021-02-10). "Score-Based Generative Modeling through Stochastic Differential Equations". arXiv:2011.13456 [cs.LG].
- Gu, Shuyang; Chen, Dong; Bao, Jianmin; Wen, Fang; Zhang, Bo; Chen, Dongdong; Yuan, Lu; Guo, Baining (2021). "Vector Quantized Diffusion Model for Text-to-Image Synthesis". arXiv:2111.14822 [cs.CV].
- Chang, Ziyi; Koulieris, George Alex; Shum, Hubert P. H. (2023). "On the Design Fundamentals of Diffusion Models: A Survey". arXiv:2306.04542 [cs.LG].
- Croitoru, Florinel-Alin; Hondru, Vlad; Ionescu, Radu Tudor; Shah, Mubarak (2022). "Diffusion models in vision: A survey". arXiv:2209.04747 [cs.CV].
- Sohl-Dickstein, Jascha; Weiss, Eric; Maheswaranathan, Niru; Ganguli, Surya (2015-06-01). "Deep Unsupervised Learning using Nonequilibrium Thermodynamics". Proceedings of the 32nd International Conference on Machine Learning. PMLR. 37: 2256–2265.
- Ramesh, Aditya; Dhariwal, Prafulla; Nichol, Alex; Chu, Casey; Chen, Mark (2022). "Hierarchical Text-Conditional Image Generation with CLIP Latents". arXiv:2204.06125 [cs.CV].
- Dhariwal, Prafulla; Nichol, Alex (2021-06-01). "Diffusion Models Beat GANs on Image Synthesis". arXiv:2105.05233 [cs.LG].
- Ho, Jonathan; Salimans, Tim (2022-07-25). "Classifier-Free Diffusion Guidance". arXiv:2207.12598 [cs.LG].
- Nichol, Alex; Dhariwal, Prafulla; Ramesh, Aditya; Shyam, Pranav; Mishkin, Pamela; McGrew, Bob; Sutskever, Ilya; Chen, Mark (2022-03-08). "GLIDE: Towards Photorealistic Image Generation and Editing with Text-Guided Diffusion Models". arXiv:2112.10741 [cs.CV].
- Ramesh, Aditya; Dhariwal, Prafulla; Nichol, Alex; Chu, Casey; Chen, Mark (2022-04-12). "Hierarchical Text-Conditional Image Generation with CLIP Latents". arXiv:2204.06125 [cs.CV].
- Saharia, Chitwan; Chan, William; Saxena, Saurabh; Li, Lala; Whang, Jay; Denton, Emily; Ghasemipour, Seyed Kamyar Seyed; Ayan, Burcu Karagol; Mahdavi, S. Sara; Lopes, Rapha Gontijo; Salimans, Tim; Ho, Jonathan; Fleet, David J.; Norouzi, Mohammad (2022-05-23). "Photorealistic Text-to-Image Diffusion Models with Deep Language Understanding". arXiv:2205.11487 [cs.CV].