The variational autoencoder (VAE) is a type of generative model that combines principles from neural networks and probabilistic models to learn the underlying probabilistic distribution of a dataset and generate new data samples similar to the given dataset.
Due to their ability to combine probabilistic modeling with the capacity to learn complex data distributions, VAEs have become a fundamental tool and have had a profound impact on machine learning and deep learning.
In this blog post, we will start with a quick introduction to the architecture of variational autoencoders and a comparison between variational autoencoders and conventional autoencoders. Next, we will use mathematical expressions and graphics to explain the concepts behind the variational autoencoder network design. Lastly, we will provide a step-by-step tutorial on how to build and train a variational autoencoder network using PyTorch. All the code and demonstration scripts can be found in my VAE GitHub repository: https://github.com/JianZhongDev/VariationalAutoencoderPytorch
Variational Autoencoder Structures
A variational autoencoder (VAE) model usually includes an encoder network, a distribution model, and a decoder network.
The encoder network learns how to transform each data point in the dataset into the parameters of a probabilistic distribution in the latent space.
The distribution model uses these parameters to create the probabilistic distribution and draws random variables from this distribution in the latent space.
The decoder network then learns how to transform these latent space random variables back into their corresponding data points in the original dataset space.
A forward pass through the entire VAE works like this: the encoder processes the input data and generates the parameters for the latent space distribution. The distribution model then creates the latent space probabilistic distribution based on these parameters and draws a random variable from it. Finally, the decoder reconstructs the original data from this latent space random variable.
![](./Images/VariationalAutoEncoderStructure.png#center)
Structure of a variational autoencoder. (image credit: Jian Zhong)
Variational Autoencoder and Autoencoder
The variational autoencoder (VAE) has a similar encoder-decoder structure to conventional autoencoders, but the functions of the encoders and decoders are very different.
In conventional autoencoders, the encoder and decoder networks learn direct transformations between the data space and the latent space. The encoder maps data points to a specific point in the latent space, and the decoder maps these points back to the original data space.
![](./Images/ConventionalAutoEncoder.png#center)
Conventional autoencoder. (image credit: Jian Zhong)
In contrast, a VAE learns to transform data into a probabilistic distribution in the latent space, rather than specific points. The encoder maps data points to the parameters of a probabilistic distribution. The distribution model then draws random variables from this distribution. The decoder transforms these random variables back into the data space.
![](./Images/VariationalAutoEncoder.png#center)
Variational autoencoder. (image credit: Jian Zhong)
Math of Variational Autoencoder
This section explains the concepts behind variational autoencoders and walks through their mathematical derivation. There will be a lot of math here. If you prefer a quick understanding of how variational autoencoders work and find this section too math-heavy, feel free to skip ahead to the graphical explanation. Please note that the mathematical expressions in this section are meant to convey the basic ideas behind variational autoencoders, so they might not be very rigorous. For a more detailed and rigorous mathematical explanation, please refer to the variational autoencoder paper.
Problem Description
Let’s consider a dataset \( X = \{x^{(i)}\}_{i=1}^{N} \) consisting of \( N \) independent and identically distributed (i.i.d.) samples. The probabilistic distribution of these samples is described as \( p_{\theta}(x) \) , where \( \theta \) are the parameters defining this distribution. The \( \theta \) here is a general description of the potential parameters for the distribution and may or may not include our model’s parameters.
From a probabilistic perspective, learning is about finding the optimal parameters \( \theta^{*} \) for the distribution \( p_{\theta}(x) \) so that the probability of each data point \( x \) in the learning dataset is maximized.
The variational autoencoder addresses a specific part of this learning problem, where the process of generating data \( x \) depends on an unobserved random variable \( z \) . To generate the random variable \( x \) , a random variable \( z \) is first generated from a prior distribution \( p_{\theta}(z) \) , and then \( x \) is generated from a conditional distribution \( p_{\theta}(x|z) \) . Using Bayes's theorem, the probabilistic distribution \( p_{\theta}(x) \) can be expressed as follows:
$$ p_{\theta}(x) = \frac{p_{\theta}(x|z)}{p_{\theta}(z|x)} p_{\theta}(z) $$
The variational autoencoder aims to learn and model \( p_{\theta}(z) \) , \( p_{\theta}(z|x) \) , and \( p_{\theta}(x|z) \) to maximize \( p_{\theta}(x) \) after the learning process.
Maximizing Probability and Lower Bound
Usually, modeling \( p_{\theta}(z|x) \) exactly is not feasible. So, we use an approximate model \( q_{\phi}(z|x) \) , where \( \phi \) represents the model parameters.
Given a dataset \( X = \{x^{(i)}\}_{i=1}^{N} \) of the random variable \( x \) , maximizing \( p_{\theta}(x) \) means maximizing the likelihood of the data points, \( p_{\theta}(x^{(1)}, ..., x^{(N)} ) \) . This is the same as maximizing \( log(p_{\theta}(x^{(1)}, ..., x^{(N)} )) \) . Since all data points are i.i.d., we have:
$$ log(p_{\theta}(x^{(1)}, ..., x^{(N)} )) = log(\prod_{i=1}^{N} p_{\theta}(x^{(i)})) = \sum_{i=1}^{N} log(p_{\theta}(x^{(i)})) $$
Using the KL divergence between \( q_{\phi}(z|x^{(i)}) \) and \( p_{\theta}(z|x^{(i)}) \) , we can rewrite \( log(p_{\theta}(x^{(i)}) ) \) :
$$ D_{KL}(q_{\phi}(z|x^{(i)})||p_{\theta}(z|x^{(i)})) = \sum_{z}( q_{\phi}(z|x^{(i)}) \cdot log(\frac{q_{\phi}(z|x^{(i)})}{p_{\theta}(z|x^{(i)})}) ) $$
Expanding and rearranging, we get:
$$ D_{KL}(q_{\phi}(z|x^{(i)})||p_{\theta}(z|x^{(i)})) = \sum_{z}( q_{\phi}(z|x^{(i)}) \cdot log(\frac{q_{\phi}(z|x^{(i)}) p_{\theta}(x^{(i)})}{p_{\theta}(x^{(i)}|z) p_{\theta}(z)}) ) $$ $$ = \sum_{z}( q_{\phi}(z|x^{(i)}) \cdot log(\frac{q_{\phi}(z|x^{(i)})}{p_{\theta}(z)})) + \sum_{z}( q_{\phi}(z|x^{(i)}) \cdot log(p_{\theta}(x^{(i)}))) - \sum_{z}( q_{\phi}(z|x^{(i)}) \cdot log(p_{\theta}(x^{(i)}|z))) $$ $$ = \sum_{z}( q_{\phi}(z|x^{(i)}) \cdot log(\frac{q_{\phi}(z|x^{(i)})}{p_{\theta}(z)})) + log(p_{\theta}(x^{(i)})) - \mathbb{E_{z}}( log(p_{\theta}(x^{(i)}|z)) ) $$
So,
$$ log(p_{\theta}(x^{(i)})) = D_{KL}(q_{\phi}(z|x^{(i)})||p_{\theta}(z|x^{(i)})) + \mathcal{L}(\theta, \phi, x^{(i)}) $$
where,
$$ \mathcal{L}(\theta, \phi, x^{(i)}) = - \sum_{z}( q_{\phi}(z|x^{(i)}) \cdot log(\frac{q_{\phi}(z|x^{(i)})}{p_{\theta}(z)})) + \mathbb{E_{z}}( log(p_{\theta}(x^{(i)}|z)) ) $$ $$ = - D_{KL}(q_{\phi}(z|x^{(i)})||p_{\theta}(z)) + \mathbb{E_{z}}( log(p_{\theta}(x^{(i)}|z)) ) $$
According to the definition of KL divergence, we know
$$ D_{KL}(q_{\phi}(z|x^{(i)})||p_{\theta}(z|x^{(i)})) \geq 0 $$
Therefore,
$$ log(p_{\theta}(x^{(i)})) \geq \mathcal{L}(\theta, \phi, x^{(i)}) $$
In other words, \( \mathcal{L}(\theta, \phi, x^{(i)}) \) is the lower bound of \( log(p_{\theta}(x^{(i)})) \) . If we maximize \( \mathcal{L}(\theta, \phi, x^{(i)}) \) for each \( x^{(i)} \) , we effectively maximize \( log(p_{\theta}(x^{(1)}, ..., x^{(N)} )) \) , which is our learning objective.
During training, we typically define a loss function to minimize. Based on the above discussion, the loss function for the variational autoencoder can be set up as:
$$ Loss = \sum_{i=1}^{N}(-\mathcal{L}(\theta, \phi, x^{(i)})) $$
Now, let’s take a closer look at the terms in \( \mathcal{L}(\theta, \phi, x^{(i)}) \) :
$$ \mathcal{L}(\theta, \phi, x^{(i)}) = - D_{KL}(q_{\phi}(z|x^{(i)})||p_{\theta}(z)) + \mathbb{E_{z}}( log(p_{\theta}(x^{(i)}|z)) ) $$
Within this expression, the probabilistic distributions \( q_{\phi}(z|x^{(i)}) \) , \(p_{\theta}(z) \) , and \( p_{\theta}(x^{(i)}|z) \) need to be modeled and learned. In a variational autoencoder, \( q_{\phi}(z|x^{(i)}) \) is typically learned by the encoder network with a predefined probabilistic model. \(p_{\theta}(z) \) is usually given by prior knowledge or assumptions. \( p_{\theta}(x^{(i)}|z) \) is learned by the decoder network.
The term \( \mathbb{E_{z}}( log(p_{\theta}(x^{(i)}|z)) ) \) is the expected log-probability of sample \( x^{(i)} \) over the latent variable \( z \) , which indicates how well the decoder's reconstruction matches the original data. This term corresponds to the reconstruction error term in the autoencoder's loss function. The term \( D_{KL}(q_{\phi}(z|x^{(i)})||p_{\theta}(z)) \) measures how close \( q_{\phi}(z|x^{(i)}) \) is to \(p_{\theta}(z) \) and acts as a regularization term in the total loss function.
Adding Prior Knowledge or Assumptions About the Distributions
Now that we’ve established the relationships within the loss function and the components of variational autoencoders, we’re ready to incorporate prior knowledge or assumptions into the model and proceed with the learning process.
For instance, we can introduce the following Gaussian-based assumptions to the variational autoencoder:
For \( q_{\phi}(z|x^{(i)}) \) , we model it as a diagonal multivariate Gaussian distribution \(\mathcal{N}(\mu^{(i)}, \sigma^{(i)}) \) for each input data point \( x^{(i)} \) . The encoder learns to map each sample from the dataset space to the parameters (i.e., \( \mu^{(i)} \) and \( \sigma^{(i)} \) ) of its corresponding Gaussian distribution in the latent space.
\(p_{\theta}(z) \) is simply modeled as a unit normal distribution \(\mathcal{N}(0, 1) \)
\( p_{\theta}(x^{(i)}|z) \) is modeled as a Gaussian distribution \(\mathcal{N}(D_{\theta}(z), \sigma) \) , where \( \hat{x} = D_{\theta}(z) \) is the decoder's output, i.e., the reconstructed data in the dataset space.
With this prior knowledge, the KL divergence term using diagonal Gaussian and unit Gaussian can be expressed as:
$$ D_{KL}(q_{\phi}(z|x^{(i)})||p_{\theta}(z)) = \frac{1}{2} \sum_{j=1}^{J}(-1 - log( (\sigma_{j}^{(i)})^{2}) + (\mu_{j}^{(i)})^{2} + (\sigma_{j}^{(i)})^{2}) $$
(NOTE: refer to the Examples section of the KL divergence Wikipedia page for the expression above)
Here, \( j \) indexes the dimensions of the latent space Gaussian distribution, and \( J \) specifies the dimensionality, which is set when building the variational autoencoder model.
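As a quick sanity check (not part of the original derivation), the closed-form expression above can be compared against PyTorch's built-in KL divergence between two Gaussian distributions. The example values of \( \mu^{(i)} \) and \( \sigma^{(i)} \) below are arbitrary:

```python
import torch
from torch.distributions import Normal, kl_divergence

mu = torch.tensor([0.5, -1.0])     # example mu^(i) for a 2-D latent space
sigma = torch.tensor([0.8, 1.2])   # example sigma^(i)

# Closed-form expression from above, summed over the latent dimensions.
closed_form = 0.5 * torch.sum(-1.0 - torch.log(sigma ** 2) + mu ** 2 + sigma ** 2)

# PyTorch's KL divergence between N(mu, sigma) and the unit Gaussian N(0, 1).
builtin = kl_divergence(Normal(mu, sigma), Normal(torch.zeros(2), torch.ones(2))).sum()

print(closed_form.item(), builtin.item())  # the two values agree
```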
The term \( \mathbb{E_{z}}( log(p_{\theta}(x^{(i)}|z)) ) \) can be written as:
$$ \mathbb{E_{z}}( log(p_{\theta}(x^{(i)}|z)) ) = \mathbb{E_{z}}(log( A \cdot exp( - \frac{(x^{(i)} - D_{\theta}(z))^{2}}{2 \sigma^{2}} ) )) $$ $$ = log(A) -\frac{1}{2 \sigma^{2}} \mathbb{E_{z}}((x^{(i)} - D_{\theta}(z))^{2}) $$ $$ \simeq log(A) -\frac{1}{2 \sigma^{2}} (x^{(i)} - \hat{x}^{(i)})^{2} $$
Here, \( A \) is the normalization constant of the proposed Gaussian distribution, which is independent of the model parameters and can be disregarded during optimization. \( \sigma \) is a value specified when constructing the variational autoencoder model; it adjusts how distinct each data reconstruction should be and balances the weights of the reconstruction error loss term and the normal distribution regularization term.
Furthermore, the lower bound \( \mathcal{L}(\theta, \phi, x^{(i)}) \) can be approximated as:
$$ \mathcal{L}(\theta, \phi, x^{(i)}) \simeq \frac{1}{2} \sum_{j=1}^{J}(1 + log( (\sigma_{j}^{(i)})^{2}) - (\mu_{j}^{(i)})^{2} - (\sigma_{j}^{(i)})^{2}) -\frac{1}{2 \sigma^{2}} (x^{(i)} - \hat{x}^{(i)})^{2} $$
Finally, the loss function can be defined as:
$$ Loss = \sum_{i=1}^{N}(-\mathcal{L}(\theta, \phi, x^{(i)})) $$ $$ = \sum_{i=1}^{N}(\frac{1}{2 \sigma^{2}} (x^{(i)} - \hat{x}^{(i)})^{2}) + \sum_{i=1}^{N}(\frac{1}{2} \sum_{j=1}^{J}(-1 - log( (\sigma_{j}^{(i)})^{2}) + (\mu_{j}^{(i)})^{2} + (\sigma_{j}^{(i)})^{2}) ) $$
These assumptions help solidify each component of the variational autoencoder. The encoder is a neural network that takes data as input and outputs the parameters \( \mu^{(i)} \) and \( \sigma^{(i)} \) of a diagonal Gaussian distribution in the latent space. In the latent space, a diagonal multivariate Gaussian distribution is created from the encoder's parameters, and a random variable \( z^{(i)} \) is sampled from this distribution. The decoder, another neural network, takes the latent space variable as input and produces a reconstruction of the data in the dataset space.
Graphical Explanation of Variational Autoencoder Learning Process
A variational autoencoder comprises an encoder, a distribution model, and a decoder. In processing each sample from the dataset during a forward pass, the encoder first transforms the data into a corresponding probability distribution in the latent space. Then, a random variable is drawn from this distribution in the latent space. Finally, the decoder uses this random variable to reconstruct the data in the dataset space.
During training, each dataset sample serves both as the input to the variational autoencoder and as the target for measuring the difference between the decoder's output and the original input data.
Through this process, the encoder adjusts by pushing the predicted distributions of latent space apart for samples with distinct features and pulling them closer together for samples with similar features. Consequently, random variables drawn from input samples with similar features tend to be close to each other in the latent space, while those from dissimilar samples are farther apart.
Moreover, as the learning progresses, latent space variables from similar input features move closer together, leading to similar reconstructed representations in the data space. Conversely, variables from different input features move farther apart, resulting in distinct representations in the data space.
![](./Images/VariationalAutoEncoderLearning.png#center)
Learning process of a variational autoencoder. (image credit: Jian Zhong)
Building a Variational Autoencoder with PyTorch
From this point onward, we will use the Gaussian modeling assumptions discussed earlier to demonstrate how to build and train a variational autoencoder using PyTorch.
Please refer to the TrainSimpleGaussFCVAE notebook in my GitHub repository for the complete training notebook.
As mentioned earlier, here is how you can define the encoder, distribution model, and decoder:
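In the repository, these networks are built with the `VGGStackedLinear` helper (described below); the sketch here uses plain `nn.Sequential` stacks with illustrative layer sizes instead, so all three components can be shown in a self-contained way. The latent dimensionality, hidden sizes, and class names below are assumptions for illustration, not the repository's exact code.

```python
import torch
import torch.nn as nn

LATENT_DIM = 2  # dimensionality J of the latent space (illustrative choice)

class GaussianEncoder(nn.Module):
    """Maps a flattened image to the mean and log-variance of a diagonal Gaussian."""
    def __init__(self, in_features=28 * 28, latent_dim=LATENT_DIM):
        super().__init__()
        self.backbone = nn.Sequential(
            nn.Linear(in_features, 512), nn.ReLU(),
            nn.Linear(512, 128), nn.ReLU(),
        )
        self.to_mu = nn.Linear(128, latent_dim)      # predicts mu^(i)
        self.to_logvar = nn.Linear(128, latent_dim)  # predicts log((sigma^(i))^2)

    def forward(self, x):
        h = self.backbone(x)
        return self.to_mu(h), self.to_logvar(h)

class GaussianSampler(nn.Module):
    """Distribution model: draws z ~ N(mu, sigma^2) via the standard reparameterization trick."""
    def forward(self, mu, logvar):
        std = torch.exp(0.5 * logvar)
        eps = torch.randn_like(std)
        return mu + eps * std

class Decoder(nn.Module):
    """Maps a latent variable z back to a reconstruction in the data space."""
    def __init__(self, out_features=28 * 28, latent_dim=LATENT_DIM):
        super().__init__()
        self.layers = nn.Sequential(
            nn.Linear(latent_dim, 128), nn.ReLU(),
            nn.Linear(128, 512), nn.ReLU(),
            nn.Linear(512, out_features), nn.Sigmoid(),
        )

    def forward(self, z):
        return self.layers(z)
```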
The VGGStackedLinear module creates several fully-connected networks based on the input layer descriptors. For a detailed explanation, please refer to my blog post on building and training a VGG network with PyTorch.
And here’s how you can implement a forward pass of the autoencoder:
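A minimal sketch of the forward pass, assuming the stand-in modules above (the repository wraps this logic inside its own model class):

```python
# Instantiate the three components (names and sizes from the sketch above).
encoder = GaussianEncoder()
sampler = GaussianSampler()
decoder = Decoder()

def vae_forward(x):
    # x: (batch_size, 1, 28, 28) image batch, flattened for the fully-connected encoder.
    x_flat = x.view(x.size(0), -1)
    mu, logvar = encoder(x_flat)   # encoder -> latent distribution parameters
    z = sampler(mu, logvar)        # distribution model -> sample z from N(mu, sigma^2)
    x_hat = decoder(z)             # decoder -> reconstruction in the data space
    return x_hat, mu, logvar
```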
Training and Evaluating a Variational Autoencoder with PyTorch
Loss function
Based on the discussion of the loss function, we can easily implement the reconstruction loss for the decoder's output like this:
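Under the Gaussian assumption on \( p_{\theta}(x^{(i)}|z) \) , the reconstruction term reduces to a scaled squared error, so a mean-squared-error loss is a natural choice. A minimal sketch (the function names here are illustrative, not the repository's exact code):

```python
import torch.nn as nn

# Reconstruction loss: mean squared error between the decoder output and the
# original input. Under the Gaussian assumption this corresponds to
# -E_z[log p(x|z)] up to the 1/(2*sigma^2) scale factor and an additive constant.
mse_loss_fn = nn.MSELoss(reduction="mean")

def reconstruction_loss(x_hat, x):
    return mse_loss_fn(x_hat, x.view(x.size(0), -1))
```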
We can also implement the Gaussian prior regularization term like this:
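Using the closed-form KL divergence derived above, a sketch of the Gaussian prior regularization term might look like the following, with the per-sample sum averaged over the batch for the reason noted below:

```python
import torch

def gaussian_kl_loss(mu, logvar):
    # Closed-form KL( N(mu, sigma^2) || N(0, I) ) per sample:
    #   0.5 * sum_j( -1 - log(sigma_j^2) + mu_j^2 + sigma_j^2 )
    # where logvar = log(sigma^2). Per-sample values are averaged over the batch.
    kl_per_sample = 0.5 * torch.sum(-1.0 - logvar + mu.pow(2) + logvar.exp(), dim=1)
    return kl_per_sample.mean()
```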
Note: In the theoretical derivation of the loss function, we used the sum of all samples. In the code, however, we use the average to avoid large numbers and to maintain consistent loss values for datasets of different sizes.
Train and validate one epoch
The script to train and validate the autoencoder model for one epoch can be implemented as follows:
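A condensed sketch of one epoch, assuming the modules and loss functions sketched above and an illustrative `kl_weight` factor that balances the two loss terms (the repository's notebook contains the full version):

```python
import torch

def train_one_epoch(dataloader, encoder, sampler, decoder, optimizer,
                    kl_weight=1.0, device="cpu"):
    encoder.train()
    decoder.train()
    total_loss = 0.0
    for images, _ in dataloader:
        x = images.to(device).view(images.size(0), -1)

        # Forward pass: encoder -> distribution model -> decoder.
        mu, logvar = encoder(x)
        z = sampler(mu, logvar)
        x_hat = decoder(z)

        # Total loss = reconstruction error + weighted Gaussian prior regularization.
        loss = reconstruction_loss(x_hat, x) + kl_weight * gaussian_kl_loss(mu, logvar)

        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
        total_loss += loss.item()
    return total_loss / len(dataloader)

@torch.no_grad()
def validate_one_epoch(dataloader, encoder, sampler, decoder,
                       kl_weight=1.0, device="cpu"):
    encoder.eval()
    decoder.eval()
    total_loss = 0.0
    for images, _ in dataloader:
        x = images.to(device).view(images.size(0), -1)
        mu, logvar = encoder(x)
        z = sampler(mu, logvar)
        x_hat = decoder(z)
        total_loss += (reconstruction_loss(x_hat, x)
                       + kl_weight * gaussian_kl_loss(mu, logvar)).item()
    return total_loss / len(dataloader)
```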
Learning Results of a Variational Autoencoder
Finally, we can examine the results produced by the variational autoencoder. Here is the distribution of the latent space random variables drawn from the encoder's predicted Gaussian distributions for the MNIST test dataset:
![](./Images/VAE_LatentSpace.png#center)
Learned latent space distribution. (image credit: Jian Zhong)
When we compare this to the latent space distribution from a conventional autoencoder (check my autoencoder blog post for the comparison result), we see that the variational autoencoder’s latent space distribution is more Gaussian. This is expected because we included Gaussian distribution modeling as prior knowledge when building the variational autoencoder.
We can also visualize the manifold learned by the decoder by sweeping the latent space variable values continuously and using the decoder to produce the corresponding reconstructions in the dataset space, as sketched below.
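A sketch of how such a manifold sweep could be generated, assuming the 2-D latent decoder sketched earlier and 28x28 image outputs; the grid range of -2 to 2 matches the figure below:

```python
import torch
import matplotlib.pyplot as plt

@torch.no_grad()
def plot_decoder_manifold(decoder, grid_size=15, z_range=2.0, img_size=28):
    # Sweep a 2-D latent space on a regular grid and decode each grid point.
    zs = torch.linspace(-z_range, z_range, grid_size).tolist()
    canvas = torch.zeros(grid_size * img_size, grid_size * img_size)
    for row, zy in enumerate(zs):
        for col, zx in enumerate(zs):
            z = torch.tensor([[zx, zy]], dtype=torch.float32)
            x_hat = decoder(z).view(img_size, img_size)
            canvas[row * img_size:(row + 1) * img_size,
                   col * img_size:(col + 1) * img_size] = x_hat
    plt.imshow(canvas.numpy(), cmap="gray")
    plt.axis("off")
    plt.show()
```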
![](./Images/VAE_Manifold_X-2_2_Y_-2_2.png#center)
Learned manifold. (image credit: Jian Zhong)
Reference
[1] Diederik P. Kingma and Max Welling. 2013. Auto-encoding variational Bayes. arXiv preprint arXiv:1312.6114 (2013).
Citation
If you found this article helpful, please cite it as:
Zhong, Jian (July 2024). A Gentle Introduction to Variational Autoencoders: Concept and PyTorch Implementation Guide. Vision Tech Insights. https://jianzhongdev.github.io/VisionTechInsights/posts/gentle_introduction_to_variational_autoencoders/.
Or
@article{zhong2024GentleIntroVAE,
title = "A Gentle Introduction to Variational Autoencoders: Concept and PyTorch Implementation Guide",
author = "Zhong, Jian",
journal = "jianzhongdev.github.io",
year = "2024",
month = "July",
url = "https://jianzhongdev.github.io/VisionTechInsights/posts/gentle_introduction_to_variational_autoencoders/"
}