An autoencoder is a type of artificial neural network that learns to create efficient codings, or representations, of unlabeled data, making it useful for unsupervised learning. Autoencoders can be used for tasks like reducing the number of dimensions in data, extracting important features, and removing noise. They’re also important for building semi-supervised learning models and generative models. The concept of autoencoders has inspired many advanced models.
In this blog post, we’ll start with a simple introduction to autoencoders. Then, we’ll show how to build an autoencoder using a fully-connected neural network. We’ll explain what sparsity constraints are and how to add them to neural networks. After that, we’ll go over how to build autoencoders with convolutional neural networks. Finally, we’ll talk about some common uses for autoencoders.
You can find all the source code and tutorial scripts mentioned in this blog post in my GitHub repository (URL: https://github.com/JianZhongDev/AutoencoderPyTorch/tree/main).
Autoencoder Network
Redundancy of Data Representation
The key idea behind autoencoders is to reduce redundancy in data representation. Often, data is represented in a way that isn’t very efficient, leading to higher dimensions than necessary. This means many parts of the data are redundant. For example, the MNIST dataset contains 28x28 pixel images of handwritten digits from 0 to 9. Ideally, we only need one variable to represent these digits, but the image representation uses 784 (28x28) grayscale values.
Autoencoders work by compressing the features as the neural network processes the data and then reconstructing the original data from this compressed form. This process helps the network learn a more efficient way to represent the input data.
Typical Structure of an Autoencoder Network
An autoencoder network typically has two parts: an encoder and a decoder. The encoder compresses the input data into a smaller, lower-dimensional form. The decoder then takes this smaller form and reconstructs the original input data. This smaller form, created by the encoder, is often called the latent space or the “bottleneck.” The latent space usually has fewer dimensions than the original input data.

Architecture of autoencoder. (image credit: Jian Zhong)
Fully-Connected Autoencoder
Implementing an autoencoder using a fully connected network is straightforward. For the encoder, we use a fully connected network where the number of neurons decreases with each layer. For the decoder, we do the opposite, using a fully connected network where the number of neurons increases with each layer. This creates a “bottleneck” structure in the middle of the network.
Here is a code example demonstrating how to implement the encoder and decoder of a simple autoencoder network using fully-connected neural networks.
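As a minimal sketch of the 784-64 encoder and 64-784 decoder, the layout can be written with plain `torch.nn` layers. This assumes `nn.Sequential` as a stand-in for the repository's `SimpleFCNetwork` and `VGGStackedLinear` helpers, so it is not the repository code itself:

```python
import torch
from torch import nn

# Assumption: nn.Sequential replaces the repository's SimpleFCNetwork wrapper.
encoder = nn.Sequential(
    nn.Flatten(),            # (B, 1, 28, 28) -> (B, 784)
    nn.Linear(784, 64),      # compress to the 64-dimensional latent space
    nn.LeakyReLU(0.01),
)
decoder = nn.Sequential(
    nn.Linear(64, 784),      # expand back to 784 pixel values
    nn.LeakyReLU(0.01),
)

x = torch.rand(8, 1, 28, 28)          # a batch of MNIST-sized images
code = encoder(x)                     # latent representation
recon = decoder(code).view_as(x)      # reconstruction with the input's shape
print(code.shape, recon.shape)
```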
The VGGStackedLinear module creates a stack of fully-connected layers based on the input layer descriptors. For a detailed explanation, please refer to my blog post on building and training a VGG network with PyTorch.
Here’s how the architecture of the encoder and decoder defined above looks:
Encoder:
SimpleFCNetwork(
  (network): VGGStackedLinear(
    (network): Sequential(
      (0): Linear(in_features=784, out_features=64, bias=True)
      (1): LeakyReLU(negative_slope=0.01)
    )
  )
)
Decoder:
SimpleFCNetwork(
  (network): VGGStackedLinear(
    (network): Sequential(
      (0): Linear(in_features=64, out_features=784, bias=True)
      (1): LeakyReLU(negative_slope=0.01)
    )
  )
)
After training the fully-connected network, here are the results for an example data input/output, the latent representation of data in a batch of 512 samples, and the learned feature dictionary:

Training results of a simple fully-connected autoencoder (encoder: 784-64, decoder 64-784). a, example data input/output. b, latent representation of data in a batch of 512 samples. c, the learned (decoder) feature dictionary. (image credit: Jian Zhong)
Without additional constraints, each sample typically contains numerous non-zero latent features of similar amplitudes, and the learned feature dictionary tends to be highly localized.
For a comprehensive understanding of how the above network was implemented and trained, please refer to the TrainSimpleFCAutoencoder Jupyter notebook in my GitHub repository.
Sparsity and Sparse Autoencoder
In machine learning, sparsity refers to the observation that in many high-dimensional datasets, only a small number of features or variables are meaningful or non-zero for each observation. In a well-chosen representation space, most features are either zero or negligible.
In the context of autoencoders, a sparse latent representation of the data is often preferred. This sparse representation can be achieved by incorporating sparsity constraints into the network. Adding these constraints helps the autoencoder focus on learning more meaningful features.
Hard Sparsity in Latent Representation
Implementing hard sparsity in the latent space involves adding a sparsity layer at the end of the encoder network along the feature dimension. To create a hard sparsity layer, we specify a number k of features to retain in the latent space. During the forward pass, this layer keeps only the top k largest features of the encoded representation for each sample, setting the rest to 0. During backward propagation, the hard sparsity layer only propagates gradients for these top k features.
Here’s how the hard sparsity layer is implemented:
First, we created our own operation FeatureTopKFunction for hard sparsity and defined its functions for both forward and backward passes. During the forward pass, a mask is generated to identify the top k features of each input sample, which is then stored for later use in the backward pass. This mask ensures that only the top k values are kept, while the rest are set to zero for both value and gradient calculations. In the hard sparsity layer, we specify the number k and incorporate the hard sparsity operation into the forward() method.
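The description above can be sketched with a custom `torch.autograd.Function`. The name `FeatureTopKFunction` comes from the text; the layer name `TopKSparsity` and the exact masking details are assumptions for illustration, not the repository code:

```python
import torch
from torch import nn


class FeatureTopKFunction(torch.autograd.Function):
    """Keep the k largest features of each sample and zero the rest."""

    @staticmethod
    def forward(ctx, x, k):
        # x: (batch, features); find the top-k entries along the feature axis
        _, idx = torch.topk(x, k, dim=1)
        mask = torch.zeros_like(x).scatter_(1, idx, 1.0)
        ctx.save_for_backward(mask)   # reused in the backward pass
        return x * mask

    @staticmethod
    def backward(ctx, grad_output):
        (mask,) = ctx.saved_tensors
        # Gradients flow only through the retained features; k gets none.
        return grad_output * mask, None


class TopKSparsity(nn.Module):  # hypothetical layer name
    def __init__(self, k):
        super().__init__()
        self.k = k

    def forward(self, x):
        return FeatureTopKFunction.apply(x, self.k)


# Tiny check on a hand-written input: keep the 2 largest of 4 features.
z = torch.tensor([[1.0, 3.0, 2.0, 0.5]], requires_grad=True)
out = FeatureTopKFunction.apply(z, 2)
out.sum().backward()

# The sparsity layer sits at the end of the encoder.
encoder = nn.Sequential(nn.Linear(784, 64), nn.LeakyReLU(0.01), TopKSparsity(k=16))
code = encoder(torch.randn(4, 784))
```

In the small check, `out` keeps only the values 3.0 and 2.0, and `z.grad` is non-zero only at those two positions.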
To implement hard sparsity in an autoencoder, simply add a hard sparsity layer at the end of the encoder network as follows:
After training the fully-connected network with these hard sparsity constraints, here are the outcomes for a sample data input/output, the latent representations of data in a batch of 512 samples, and the learned feature dictionary:

Training results of a simple fully-connected autoencoder with hard sparsity (encoder: 784-64-sparsity, decoder 64-784). a-c, results of autoencoder trained with top 16 sparsity. d-f, results of autoencoder trained with top 5 sparsity. a,d, example data input/output. b,e, latent representation of data in a batch of 512 samples. c,f, the learned (decoder) feature dictionary. (image credit: Jian Zhong)
From the results above, we observe that increasing the required sparsity with hard constraints reduces the number of non-zero features in the latent space. This encourages the network to learn more global features.
For a detailed understanding of how this network was implemented and trained, please refer to the TrainSimpleSparseFCAutoencoder Jupyter notebook in my GitHub repository.
Soft Sparsity in Latent Representation
We can also encourage sparsity in the encoded features of the latent space by applying a soft constraint. This involves adding an additional penalty term to the loss function. The modified loss function with the sparsity penalty appears as follows:
$$ H_{\theta}(pred,gt) = J_{\theta}(pred,gt) + \lambda \cdot L_{\theta}(code) $$
Here, \(\theta\), \(pred\), and \(gt\) represent the parameters of the autoencoder network, the output prediction of the autoencoder, and the ground truth data, respectively, and \(code\) is the encoded representation in the latent space. \(H_{\theta}(pred,gt)\) is the loss function with sparsity constraints, where \(J_{\theta}(pred,gt)\) is the original loss function, which measures the difference between the network prediction and the ground truth. \(L_{\theta}(code)\) denotes the penalty term for enforcing sparsity. The parameter \(\lambda\) controls the strength of this penalty.
The L1 loss of the encoded features is commonly used as a sparsity loss. This loss function is readily available in PyTorch.
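As a rough illustration of the combined loss, the sketch below adds an L1 penalty on the code to an MSE reconstruction loss using `nn.MSELoss` and `nn.L1Loss`. The network sizes and the value of `lam` are illustrative assumptions:

```python
import torch
from torch import nn

encoder = nn.Sequential(nn.Linear(784, 64), nn.LeakyReLU(0.01))
decoder = nn.Sequential(nn.Linear(64, 784), nn.LeakyReLU(0.01))

recon_loss = nn.MSELoss()     # J: reconstruction term
sparsity_loss = nn.L1Loss()   # L: mean absolute value of the code
lam = 1e-3                    # lambda, illustrative value only

x = torch.rand(16, 784)
code = encoder(x)
pred = decoder(code)

# L1Loss against an all-zero target is the mean |code|.
penalty = sparsity_loss(code, torch.zeros_like(code))
total = recon_loss(pred, x) + lam * penalty
total.backward()
```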
Another approach to implementing sparsity loss is through a penalty based on KL divergence. The penalty term for this KL divergence-based sparsity can be defined as follows:
$$ L_{\theta} = \frac{1}{s} \sum^{s}_{j=1} KL(\rho||\hat{\rho}_j) $$
Here, \(s\) represents the number of features in the encoded representation, which corresponds to the dimension of the latent space, and \(j\) indexes the features in the latent space. \(KL(\rho||\hat{\rho}_j)\) is calculated as follows:
$$ KL(\rho||\hat{\rho}_j) = \rho \cdot \log\left(\frac{\rho}{\hat{\rho}_j}\right) + (1 - \rho) \cdot \log\left(\frac{1-\rho}{1-\hat{\rho}_j}\right) $$
Here, \(\rho\) is a sparsity parameter, typically a small value close to zero that is provided during training. \(\hat{\rho}_j\) is computed from the j-th latent feature of the samples within the mini-batch as follows:
$$ \hat{\rho}_j = \frac{1}{m} \sum^{m}_{i=1} l_{i,j} $$
Here, \(m\) denotes the batch size, \(i\) indexes the samples within the mini-batch, \(j\) indexes the features within the latent space, and \(l_{i,j}\) is the value of the j-th latent feature for the i-th sample.
Note that for the KL divergence expression, the values of \(\rho\) and \(\hat{\rho}_j\)  must fall within the range \((0,1)\) . This range should be ensured by using suitable activation functions (such as sigmoid) for the output layer of the encoder, or by appropriately normalizing the latent space features before computing the sparsity loss.
Below is the PyTorch code implementation for the KL-divergence based sparsity loss:
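A minimal sketch of this penalty, following the equations above, could look like the function below. The function name `kl_sparsity_loss` and the `eps` clamp are assumptions added for illustration and numerical safety:

```python
import torch


def kl_sparsity_loss(code, rho=0.05, eps=1e-6):
    """KL-divergence sparsity penalty for a (batch, features) code in (0, 1)."""
    rho_hat = code.mean(dim=0)                # mean activation of each feature
    rho_hat = rho_hat.clamp(eps, 1.0 - eps)   # keep the logarithms finite
    kl = (rho * torch.log(rho / rho_hat)
          + (1.0 - rho) * torch.log((1.0 - rho) / (1.0 - rho_hat)))
    return kl.mean()                          # average over the s features


# Sigmoid keeps the code inside (0, 1), as the KL expression requires.
code = torch.sigmoid(torch.randn(512, 64))
penalty = kl_sparsity_loss(code, rho=0.05)
```

The penalty is zero when every feature's mean activation equals \(\rho\) and grows as the mean activations move away from it.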
After training a basic fully-connected autoencoder model with soft sparsity constraints, the results are as follows:

Training results of a simple fully-connected autoencoder with soft sparsity (encoder: 784-64, decoder 64-784, KL-divergence soft sparsity loss \(\rho = 0.05\) ). a-c, results of autoencoder trained with \(\lambda = 10^{-2}\) . d-f, results of autoencoder trained with \(\lambda = 10^{-1}\) . a,d, example data input/output. b,e, latent representation of data in a batch of 512 samples. c,f, the learned (decoder) feature dictionary. (image credit: Jian Zhong)
Increasing the strength of the sparsity penalty decreases the number of non-zero features in the latent space.
For a comprehensive understanding of how this network was implemented and trained, please refer to the TrainSimpleFCAutoencoderWithSparseLoss Jupyter notebook in my GitHub repository.
Lifetime (Winner-Takes-All) Sparsity
Unlike conventional sparsity constraints that aim to increase sparsity within each individual sample, lifetime sparsity enforces sparsity across minibatch samples for each feature. Here’s how lifetime sparsity can be implemented:
During training, in the forward propagation phase, for each feature in the latent space, we retain the top k largest values across all minibatch samples and set the remaining values of that feature to zero. During backward propagation, gradients are propagated only for these k non-zero values.
During testing, we disable the lifetime sparsity constraints, allowing the encoder network to output the final representation of the input. The implementation of lifetime sparsity operations is as follows:
In the forward pass, we create a mask that identifies the top k values across the minibatch dimension for each feature in the latent space. This mask is saved for use during the backward pass. During both forward and backward passes, this mask ensures that only the top k values of each feature are retained, while the rest are set to zero.
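The operation described above can be sketched as a custom autograd function that ranks values along the batch dimension. The name `LifetimeTopKFunction` is hypothetical, and `k` is assumed to be an integer count rather than a percentage:

```python
import torch


class LifetimeTopKFunction(torch.autograd.Function):  # hypothetical name
    """For each feature, keep its k largest values across the mini-batch."""

    @staticmethod
    def forward(ctx, x, k):
        # x: (batch, features); rank along dim 0, i.e. across samples
        _, idx = torch.topk(x, k, dim=0)
        mask = torch.zeros_like(x).scatter_(0, idx, 1.0)
        ctx.save_for_backward(mask)
        return x * mask

    @staticmethod
    def backward(ctx, grad_output):
        (mask,) = ctx.saved_tensors
        # Only the retained entries receive gradient; k gets none.
        return grad_output * mask, None


# Three samples, two features, keep the single largest value per feature.
x = torch.tensor([[1.0, 5.0], [3.0, 2.0], [2.0, 9.0]], requires_grad=True)
out = LifetimeTopKFunction.apply(x, 1)
out.sum().backward()
```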
With these lifetime sparsity operations, we can implement a neural network layer that enforces lifetime sparsity as follows:
In the lifetime sparsity layer, we store the value of k within the layer object. During training, this layer applies the lifetime sparsity operation. During testing, the layer simply passes the input directly to the output.
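A compact sketch of such a layer is given below. It assumes a hypothetical class name `LifetimeSparsity` and uses multiplication by a constant mask, which yields the same masked gradients as a custom autograd operation:

```python
import torch
from torch import nn


class LifetimeSparsity(nn.Module):  # hypothetical layer name
    def __init__(self, k):
        super().__init__()
        self.k = k  # number of samples kept per feature

    def forward(self, x):
        if not self.training:
            return x                       # testing: pass the input through
        k = min(self.k, x.shape[0])        # guard against small batches
        _, idx = torch.topk(x, k, dim=0)   # rank across the mini-batch
        mask = torch.zeros_like(x).scatter_(0, idx, 1.0)
        return x * mask                    # autograd masks the gradient too


layer = LifetimeSparsity(k=4)
x = torch.randn(32, 64)

layer.train()
train_out = layer(x)   # at most 4 non-zero values per feature column

layer.eval()
eval_out = layer(x)    # unchanged input
```

Because the layer checks `self.training`, calling `model.train()` or `model.eval()` on the enclosing network switches the constraint on and off.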
To implement lifetime sparsity in an autoencoder, we add the lifetime sparsity layer at the end of the encoder network as follows:
After training a simple fully-connected autoencoder model with a lifetime sparsity layer, the results are as follows:

Training results of a simple fully-connected autoencoder with lifetime sparsity (encoder: 784-64-sparsity, decoder 64-784). a-c, results of autoencoder trained with top 25% sparsity. d-f, results of autoencoder trained with top 5% sparsity. a,d, example data input/output. b,e, latent representation of data in a batch of 512 samples. c,f, the learned (decoder) feature dictionary. (image credit: Jian Zhong)
Increasing the strength of the lifetime sparsity constraint reduces the number of non-zero features in the latent space. This encourages the network to learn more global features.
For detailed insights into how this network was implemented and trained, please refer to the TrainSimpleSparseFCAutoencoder Jupyter notebook in my GitHub repository.
Convolutional Autoencoder
For image data, the encoder network can also be implemented using a convolutional network, where the feature dimensions decrease as the encoder becomes deeper. Max pooling layers can be added to further reduce feature dimensions and induce sparsity in the encoded features. Here’s an example of a convolutional encoder network:
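A minimal sketch of such an encoder, written with plain `torch.nn` layers rather than the repository's `VGGStacked2DConv` helper, might look like this. The layer sizes follow the printed architecture, and the 128-784 decoder with a LeakyReLU output is an assumption based on the figure caption:

```python
import torch
from torch import nn

encoder = nn.Sequential(
    nn.Conv2d(1, 8, kernel_size=6), nn.LeakyReLU(0.01),   # 28x28 -> 23x23
    nn.Conv2d(8, 8, kernel_size=6), nn.LeakyReLU(0.01),   # 23x23 -> 18x18
    nn.Conv2d(8, 8, kernel_size=6), nn.LeakyReLU(0.01),   # 18x18 -> 13x13
    nn.Conv2d(8, 8, kernel_size=6), nn.LeakyReLU(0.01),   # 13x13 -> 8x8
    nn.MaxPool2d(kernel_size=2, stride=2),                # 8x8 -> 4x4
    nn.Flatten(),                                         # 8 * 4 * 4 = 128
)
# Assumption: a single fully-connected layer as the decoder.
decoder = nn.Sequential(nn.Linear(128, 784), nn.LeakyReLU(0.01))

x = torch.rand(8, 1, 28, 28)
code = encoder(x)
recon = decoder(code).view_as(x)
```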
The VGGStacked2DConv module creates a stack of convolutional layers based on the input layer descriptors. For a detailed explanation, please refer to my blog post on building and training a VGG network with PyTorch.
Here’s the printed architecture of the encoder described above:
Encoder:
SimpleCovEncoder(
  (network): Sequential(
    (0): VGGStacked2DConv(
      (network): Sequential(
        (0): Conv2d(1, 8, kernel_size=(6, 6), stride=(1, 1))
        (1): LeakyReLU(negative_slope=0.01)
        (2): Conv2d(8, 8, kernel_size=(6, 6), stride=(1, 1))
        (3): LeakyReLU(negative_slope=0.01)
        (4): Conv2d(8, 8, kernel_size=(6, 6), stride=(1, 1))
        (5): LeakyReLU(negative_slope=0.01)
        (6): Conv2d(8, 8, kernel_size=(6, 6), stride=(1, 1))
        (7): LeakyReLU(negative_slope=0.01)
      )
    )
    (1): MaxPool2d(kernel_size=2, stride=2, padding=0, dilation=1, ceil_mode=False)
    (2): Flatten(start_dim=1, end_dim=-1)
  )
)
After training the convolutional autoencoder, here are the results for a sample data input/output, the latent representation of data in a batch of 512 samples, and the learned feature dictionary:

Training results of a simple autoencoder with convolutional encoder and fully-connected decoder (encoder: Conv6x6-Conv6x6-Conv6x6-Conv6x6-MaxPool2x2, decoder 128-784). a, example data input/output. b, latent representation of data in a batch of 512 samples. c, the learned (decoder) feature dictionary. (image credit: Jian Zhong)
For a detailed understanding of how this network was implemented and trained, please see the TrainSimpleConvAutoencoder Jupyter notebook in my GitHub repository.
Training and Validation
During training, the optimal encoding of input data is generally unknown. In an autoencoder network, the encoder and decoder are therefore trained concurrently. The encoder processes input data to generate compressed representations, and the decoder reconstructs the input from these representations. The objective of training is to minimize the discrepancy between the decoder’s output and the original input data. Typically, Mean Squared Error (MSE) loss is selected as the loss function for this purpose.
Training Dataset
When training an autoencoder with image datasets, both the input data and the ground truth are images. Depending on the application of the autoencoder, the input data and ground truth images may not necessarily be identical.
In this blog post, we will use the MNIST dataset for our demonstration. In PyTorch, the MNIST dataset provides handwritten digit images as input data and the corresponding digits as ground truth. To train the autoencoder with MNIST and potentially apply various transformations to both input and ground truth images, we implement the following dataset class. This class converts conventional supervised learning datasets into datasets suitable for autoencoder training.
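One way to sketch such a wrapper is shown below. The class name `AutoencoderDataset` and its argument names are hypothetical, and a small in-memory list stands in for MNIST so the example runs without downloading data:

```python
import torch
from torch.utils.data import Dataset


class AutoencoderDataset(Dataset):  # hypothetical name
    """Wrap an (image, label) dataset so it yields (input, ground truth) images."""

    def __init__(self, base, input_transform=None, target_transform=None):
        self.base = base
        self.input_transform = input_transform
        self.target_transform = target_transform

    def __len__(self):
        return len(self.base)

    def __getitem__(self, idx):
        image, _label = self.base[idx]   # the class label is discarded
        x = self.input_transform(image) if self.input_transform else image
        y = self.target_transform(image) if self.target_transform else image
        return x, y


# Stand-in for MNIST: three random 1x28x28 "images" with dummy labels.
base = [(torch.rand(1, 28, 28), label) for label in (3, 1, 4)]
dataset = AutoencoderDataset(base, input_transform=lambda img: img * 0.0)
x, y = dataset[0]
```

With real data, `base` would typically be `torchvision.datasets.MNIST` with a `ToTensor()` transform, and `input_transform` would add noise for the denoising case.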
Training and Validation Process
The training process for one epoch is implemented as follows:
Mean Squared Error (MSE) loss is typically used as the loss function during training. For sparse autoencoder training, where a sparsity penalty needs to be incorporated into the loss function, the function that trains the network for one epoch also accepts the sparsity penalty and its weight as inputs.
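A minimal sketch of such a function is given below. The function name, argument names, and the random stand-in data are assumptions for illustration, and the optional `sparsity_fn` is applied to the latent code:

```python
import torch
from torch import nn
from torch.utils.data import DataLoader, TensorDataset


def train_one_epoch(encoder, decoder, loader, optimizer, loss_fn,
                    sparsity_fn=None, sparsity_weight=0.0):
    """Train for one epoch and return the mean loss per sample."""
    encoder.train()
    decoder.train()
    total, count = 0.0, 0
    for x, gt in loader:
        optimizer.zero_grad()
        code = encoder(x)
        pred = decoder(code)
        loss = loss_fn(pred, gt)
        if sparsity_fn is not None:
            # Soft sparsity: add the weighted penalty on the latent code.
            loss = loss + sparsity_weight * sparsity_fn(code)
        loss.backward()
        optimizer.step()
        total += loss.item() * x.shape[0]
        count += x.shape[0]
    return total / count


data = torch.rand(64, 784)   # random stand-in for flattened images
loader = DataLoader(TensorDataset(data, data), batch_size=16)
encoder = nn.Sequential(nn.Linear(784, 64), nn.LeakyReLU(0.01))
decoder = nn.Sequential(nn.Linear(64, 784), nn.LeakyReLU(0.01))
params = list(encoder.parameters()) + list(decoder.parameters())
optimizer = torch.optim.Adam(params, lr=1e-3)

avg_loss = train_one_epoch(encoder, decoder, loader, optimizer, nn.MSELoss(),
                           sparsity_fn=lambda c: c.abs().mean(),
                           sparsity_weight=1e-3)
```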
The validation process for one epoch can be implemented as follows:
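A corresponding validation pass can be sketched as follows. The function name and the random stand-in data are assumptions; the key differences from training are evaluation mode and disabled gradient tracking:

```python
import torch
from torch import nn
from torch.utils.data import DataLoader, TensorDataset


@torch.no_grad()   # no gradients are needed for validation
def validate_one_epoch(encoder, decoder, loader, loss_fn):
    """Return the mean reconstruction loss per sample."""
    encoder.eval()
    decoder.eval()
    total, count = 0.0, 0
    for x, gt in loader:
        pred = decoder(encoder(x))
        total += loss_fn(pred, gt).item() * x.shape[0]
        count += x.shape[0]
    return total / count


data = torch.rand(32, 784)   # random stand-in for flattened images
loader = DataLoader(TensorDataset(data, data), batch_size=16)
encoder = nn.Sequential(nn.Linear(784, 64), nn.LeakyReLU(0.01))
decoder = nn.Sequential(nn.Linear(64, 784), nn.LeakyReLU(0.01))

val_loss = validate_one_epoch(encoder, decoder, loader, nn.MSELoss())
```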
Tying and Untying Layer Weights
When training a fully-connected network with symmetrical encoder and decoder structures, it is recommended to initially share the same weight matrix between corresponding layers of the encoder and decoder, with the decoder layer using the transpose of the encoder layer’s weight matrix. Later, for fine-tuning, the weight matrices are separated. Sharing the weights is referred to as ‘tying the weights’, and separating them as ‘untying the weights’.
In PyTorch, we can implement the operations to tie and untie the encoder-decoder matrices as follows:
When tying a decoder layer to an encoder layer, we create a dummy linear layer that uses the (transposed) weight of the corresponding encoder layer for forward and backward propagation. When untying the decoder layer, we create a new linear layer and update its weight and bias based on the dummy linear layer.
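The idea can be sketched as below. The names `TiedLinear`, `tie`, and `untie` are hypothetical, and holding the encoder layer in a plain list (so its parameters are not registered twice) is one possible design choice rather than the repository's approach:

```python
import torch
from torch import nn
import torch.nn.functional as F


class TiedLinear(nn.Module):  # hypothetical "dummy" decoder layer
    def __init__(self, src: nn.Linear):
        super().__init__()
        # A plain list keeps the encoder layer out of this module's parameters.
        self._src = [src]
        self.bias = nn.Parameter(torch.zeros(src.in_features))

    def forward(self, x):
        # Use the transposed encoder weight: (out, in) -> (in, out).
        return F.linear(x, self._src[0].weight.t(), self.bias)


def tie(src: nn.Linear) -> TiedLinear:
    return TiedLinear(src)


def untie(tied: TiedLinear) -> nn.Linear:
    """Create an independent linear layer initialized from the tied layer."""
    src = tied._src[0]
    layer = nn.Linear(src.out_features, src.in_features)
    with torch.no_grad():
        layer.weight.copy_(src.weight.t())
        layer.bias.copy_(tied.bias)
    return layer


enc = nn.Linear(784, 64)
tied_dec = tie(enc)             # decoder layer shares the encoder weight
z = torch.rand(4, 64)
out_tied = tied_dec(z)

free_dec = untie(tied_dec)      # same outputs, but separate parameters
out_free = free_dec(z)
```

After untying, updates to the encoder weight no longer affect the decoder layer, which is what allows the two to diverge during fine-tuning.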
Using these tying and untying functions, we can tie and untie corresponding linear layers in the encoder and decoder as follows:
Training Deep Autoencoder
For deeper autoencoder networks, unsupervised training can be done in a greedy, layer-wise manner. We start by training the first layer of the encoder and the last layer of the decoder using the input and ground truth images. Once these layers are trained, we freeze them (disable their weight updates) and add the second layer of the encoder and the second-to-last layer of the decoder. We then train these new layers. This process is repeated until all the layers in the encoder and decoder have been trained. Finally, we fine-tune the entire network by training with weight updates enabled for all layers.
The layer state update, freezing, and unfreezing operations can be implemented using the following:
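A minimal sketch of these operations is shown below, assuming a hypothetical helper `set_trainable` and illustrative layer sizes (784-128 growing to 784-128-64). The layer state is copied with `load_state_dict`, and freezing is done by toggling `requires_grad`:

```python
import torch
from torch import nn


def set_trainable(module, trainable):
    """Freeze (False) or unfreeze (True) all parameters of a module."""
    for p in module.parameters():
        p.requires_grad_(trainable)


# A shallow encoder that is assumed to have been trained already.
shallow_enc = nn.Sequential(nn.Linear(784, 128), nn.LeakyReLU(0.01))

# A deeper encoder with one additional, untrained layer.
deep_enc = nn.Sequential(
    nn.Linear(784, 128), nn.LeakyReLU(0.01),
    nn.Linear(128, 64), nn.LeakyReLU(0.01),
)

# Layer state update: copy the trained first layer into the deeper network.
deep_enc[0].load_state_dict(shallow_enc[0].state_dict())

# Freeze the copied layer and optimize only the new layer.
set_trainable(deep_enc[0], False)
trainable = [p for p in deep_enc.parameters() if p.requires_grad]
n_trainable_frozen = len(trainable)
optimizer = torch.optim.Adam(trainable, lr=1e-3)

# Later, unfreeze everything to fine-tune the whole network.
set_trainable(deep_enc, True)
n_trainable_all = sum(p.requires_grad for p in deep_enc.parameters())
```

The same pattern applies to the decoder, where the trained layer is the last one and new layers are inserted before it.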
Using these functions, we can update a deep autoencoder network from a shallower pretrained autoencoder network and manage the freezing and unfreezing of layers as follows:
The complete script for training the deep autoencoder can be found in the TrainDeepSimpleFCAutoencoder notebook in my GitHub repository.
Tips for Autoencoder Training
- Choosing the right activation function is crucial. Without careful optimization, ReLU can suffer from the ‘dead ReLU’ problem, leaving inactive neurons in the autoencoder.
- Avoid a high learning rate during training, even with a scheduler (especially for autoencoders with lifetime sparsity constraints). A large learning rate can cause gradient updates to overshoot in the initial epochs, potentially leading to undesired local minima.
- For training deep autoencoder networks, especially those with sparsity constraints, it’s beneficial to adopt a layer-by-layer iterative training approach. Training all of the stacked layers at once can result in too few meaningful features in the latent space.
Applications
Compression and Dimension Reduction
The dimension reduction application of the autoencoder network is straightforward. We use the encoder network to convert high-dimensional input data into low-dimensional representations. The decoder network then reconstructs the encoded information.
After dimension reduction using the encoder, we can analyze the distribution of data in the latent space.

The two-dimensional codes found by a 784-128-64-32-2 fully-connected autoencoder. (image credit: Jian Zhong)
Denoising
Pixel-level noise and defects cannot efficiently be represented in the much lower-dimensional latent space, so autoencoders can also be applied for noise reduction and correcting pixel defects. To train an autoencoder network for denoising, we use images with added noise as input and clean images as ground truth.
For denoising with autoencoders, we apply Gaussian noise and masking noise as data transformations in PyTorch.
The Gaussian noise transformation can be implemented as follows:
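A sketch of such a transform, written as a callable class in the style of torchvision transforms, is given below. The class name `AddGaussianNoise` and its default parameters are assumptions, and the output is not clamped to the original pixel range:

```python
import torch


class AddGaussianNoise:  # hypothetical transform name
    def __init__(self, mean=0.0, std=0.1):
        self.mean = mean
        self.std = std

    def __call__(self, img):
        # Add independent Gaussian noise to every pixel.
        return img + torch.randn_like(img) * self.std + self.mean


img = torch.rand(1, 28, 28)
noisy = AddGaussianNoise(mean=0.0, std=0.3)(img)    # noisy network input
same = AddGaussianNoise(mean=0.0, std=0.0)(img)     # zero noise leaves it unchanged
```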
Here’s an example of denoising Gaussian noise using an autoencoder:

Gaussian denoise result of a simple fully-connected autoencoder (encoder: 784-64, decoder 64-784). (image credit: Jian Zhong)
Masking noise involves randomly setting a fraction of pixels in the input image to zero.
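Masking noise can be sketched in the same style. The class name `AddMaskingNoise` is hypothetical, and this version zeroes each pixel independently with probability `fraction`, so the masked share is only approximately equal to `fraction` for any single image:

```python
import torch


class AddMaskingNoise:  # hypothetical transform name
    def __init__(self, fraction=0.3):
        self.fraction = fraction   # probability of zeroing each pixel

    def __call__(self, img):
        # Keep a pixel when its uniform random draw is at least `fraction`.
        keep = (torch.rand_like(img) >= self.fraction).to(img.dtype)
        return img * keep


img = torch.rand(1, 28, 28) + 0.1   # strictly positive pixels for the demo
masked = AddMaskingNoise(fraction=0.3)(img)
untouched = AddMaskingNoise(fraction=0.0)(img)
blank = AddMaskingNoise(fraction=1.0)(img)
```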
Here’s an example of using a simple fully-connected autoencoder to remove masking noise:

Mask denoise result of a simple fully-connected autoencoder (encoder: 784-64, decoder 64-784). (image credit: Jian Zhong)
Refer to the TrainSimpleDenoiseFCAutoencoder Jupyter notebook in my GitHub repository for more details.
Feature Extraction and Semi-Supervised Learning
When training an autoencoder to transform input data into a low-dimensional space, the encoder and decoder learn to map input data to a latent space and reconstruct it back. The encoder and decoder inherently capture essential features from the data through these transformations.
This feature extraction capability of autoencoders makes them highly effective for semi-supervised learning scenarios. In semi-supervised learning for classification networks, for instance, we can first train an autoencoder using the abundant unlabeled data. Subsequently, we connect a shallow fully-connected network after the encoder of the autoencoder. We then use the limited labeled data to fine-tune this shallow network.
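The semi-supervised setup described above can be sketched as follows. The encoder here is a randomly initialized stand-in for a pretrained one, and the head sizes, learning rate, and random labels are illustrative assumptions:

```python
import torch
from torch import nn

# Stand-in for an encoder pretrained on unlabeled data.
encoder = nn.Sequential(nn.Linear(784, 64), nn.LeakyReLU(0.01))
for p in encoder.parameters():
    p.requires_grad_(False)          # keep the learned features fixed

# Shallow fully-connected classifier placed after the encoder.
head = nn.Sequential(nn.Linear(64, 32), nn.LeakyReLU(0.01), nn.Linear(32, 10))
classifier = nn.Sequential(encoder, head)

# Only the head's parameters are optimized with the labeled data.
optimizer = torch.optim.Adam(head.parameters(), lr=1e-3)
loss_fn = nn.CrossEntropyLoss()

x = torch.rand(16, 784)              # a small labeled batch (random stand-in)
y = torch.randint(0, 10, (16,))

optimizer.zero_grad()
logits = classifier(x)
loss = loss_fn(logits, y)
loss.backward()
optimizer.step()
```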
References
[1] Hinton, G. E. & Salakhutdinov, R. R. Reducing the Dimensionality of Data with Neural Networks. Science 313, 504–507 (2006).
[2] Kramer, M. A. Nonlinear principal component analysis using autoassociative neural networks. AIChE Journal 37, 233–243 (1991).
[3] Masci, J., Meier, U., Cireşan, D. & Schmidhuber, J. Stacked Convolutional Auto-Encoders for hierarchical feature extraction. in Lecture notes in computer science 52–59 (2011). doi:10.1007/978-3-642-21735-7_7.
[4] Makhzani, A. & Frey, B. J. A Winner-Take-All method for training sparse convolutional autoencoders. arXiv (Cornell University) (2014).
[5] Ng, A. Sparse autoencoder. CS294A Lecture notes 72 (2011).
Citation
If you found this article helpful, please cite it as:
Zhong, Jian (June 2024). Autoencoders with PyTorch: Full Code Guide. Vision Tech Insights. https://jianzhongdev.github.io/VisionTechInsights/posts/autoencoders_with_pytorch_full_code_guide/.
Or
@article{zhong2024buildtrainAutoencoderPyTorch,
  title   = "Autoencoders with PyTorch: Full Code Guide",
  author  = "Zhong, Jian",
  journal = "jianzhongdev.github.io",
  year    = "2024",
  month   = "June",
  url     = "https://jianzhongdev.github.io/VisionTechInsights/posts/autoencoders_with_pytorch_full_code_guide/"
}
![[cover image] Architecture of Autoencoder (image credit: Jian Zhong)](https://jianzhongdev.github.io/VisionTechInsights/images/autoencoders_with_pytorch_full_code_guide/AutoencoderCoverImage.png)