High-fidelity synthetic images using GANs: Generative Adversarial Networks
Maarten Fish
ML Engineer
When I first started my machine-learning journey, adversarial architectures immediately caught my attention. This fascinating setup lets two or more models compete in opposing roles, producing supplementary data that enhances the deep learning process. The style is often unsupervised, as the generator never gets to see the actual training images. Instead, it analyzes and interprets learned patterns to synthesize new data closely resembling the target data distribution. In our search for robust anomaly detection here at Faktion, we came across the AnoGAN architecture, which utilizes an informed generator to create a near-perfect synthetic version of the target image. The real magic happens when the Generative Adversarial Network (GAN) is trained on a dataset containing exclusively healthy samples. This causes the generator to ignore any damage when presented with an image, reproducing a healthy sample that looks almost identical, aside from the removed anomaly. We can use this transformation to measure how much any subject deviates from the norm in comparison to its perfect synthetic counterpart. Obtaining high-fidelity generated images is therefore the number one priority when building effective anomaly detection models.
A short introduction to deep learning & computer vision.
To better understand image generation, let’s begin with a simple encoder-decoder flow for image processing while covering some crucial techniques. This transformation compresses data to a smaller representation in terms of dimensionality, captures essential features through image convolution, and then intelligently reconstructs the original or, alternatively, an interpretation of it. Image convolution is a feature extraction technique that moves a small 2D filter, called the kernel, across visual data and performs a function on the underlying pixels. In each new position the filter’s values are multiplied with the corresponding pixels found within the window’s bounds. The sum of these products then sets a single value on the destination image – the result of our convolution. A kernel’s stride refers to the number of pixels moved each step while scanning the entire surface. Depending on the convolution’s kernel, padding and channels (I/O), this transforms the image’s dimensionality and size. Convolution was traditionally used for image processing such as edge detection, which combines the responses of a horizontal and a vertical edge-detection kernel. Deep learning has since parameterized these kernels, resulting in filter values and patterns that are trainable – specializing over time.
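To make this concrete, here is a minimal sketch of image convolution with fixed Sobel edge-detection kernels in PyTorch; the image is random placeholder data and the kernel values, stride and padding are illustrative choices, not part of the original text.

```python
import torch
import torch.nn.functional as F

image = torch.rand(1, 1, 64, 64)            # (batch, channels, height, width)

sobel_x = torch.tensor([[-1., 0., 1.],      # kernel for the gradient along x
                        [-2., 0., 2.],
                        [-1., 0., 1.]]).view(1, 1, 3, 3)
sobel_y = sobel_x.transpose(2, 3)           # transposed kernel: gradient along y

edges_x = F.conv2d(image, sobel_x, stride=1, padding=1)   # padding keeps the output the same size
edges_y = F.conv2d(image, sobel_y, stride=1, padding=1)
edges = (edges_x ** 2 + edges_y ** 2).sqrt()              # combine both responses into one edge map

print(edges.shape)                          # torch.Size([1, 1, 64, 64])
```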
Neural networks traditionally apply an activation function to each neuron’s calculated value before passing it to the next layer. This weighted sum is squished to a certain range, allowing individual neurons to switch on or off based on their trained bias. This helps the model achieve non-linearity by shaping and folding the hyper-space, a principle that extends to the learnable parameters associated with convolution filters – bringing activation to the 2nd dimension. These signals can be further regularized through dropout layers designed to combat overfitting during training, deactivating random pixels on the intermediate feature-maps. Batch-normalization is an additional trick used to keep individual neurons and filters within normal expectations, centered around the current batch’s mean and variance. Another fundamental technique is max pooling; much like image convolution, this function walks across the data but passes on only the maximum value found within each pool. This transformation has no additional deep learning parameters; it is therefore significantly cheaper to perform and is used for size reduction. In a later section we’ll further discuss why pooling is not ideal for preserving the quality of information.
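As a small illustration, the sketch below applies each of these layers to a random batch of feature-maps in PyTorch; the channel count, dropout rate and pool size are arbitrary example values.

```python
import torch
import torch.nn as nn

features = torch.randn(8, 16, 32, 32)        # (batch, channels, height, width)

batch_norm = nn.BatchNorm2d(16)              # center each channel around the batch's mean & variance
activation = nn.ReLU()                       # squish: negatives switch off, positives pass through
dropout    = nn.Dropout(p=0.25)              # deactivate random pixels during training
pooling    = nn.MaxPool2d(kernel_size=2)     # keep only the maximum of each 2x2 pool

out = pooling(dropout(activation(batch_norm(features))))
print(out.shape)                             # torch.Size([8, 16, 16, 16]) – width and height halved
```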
Auto-encoders: Combining these techniques into building blocks and modules.
Applying these techniques, the auto-encoder is an image-to-image (I2I) transformation model that turns 2D images into intermediate 1D feature-vectors using an encoder module. This compressed information can be reconstructed by the decoder module or stored for other downstream tasks such as image classification, or simply used as data compression. Reusable building blocks incorporate all previously mentioned techniques into handy abstractions that we can combine into the encoder and decoder modules. These units apply convolution, normalization, activation, pooling and dropout to the data in sequential order. Arranging an assortment of these units together allows us to move data from the first unit’s input all the way through to the final one’s output. This dictates the network’s inherent structure and flow, laying the foundation for a convolutional neural net. Let’s further configure how each block transforms the data within the network and solidify the auto-encoder’s objective – image compression and reconstruction using kernel-driven feature-map stacking.
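A hedged sketch of such a reusable building block, and of an encoder stacked from it, could look as follows; the kernel sizes, channel counts and dropout rate are arbitrary example choices, not tuned values.

```python
import torch.nn as nn

def conv_block(in_channels: int, out_channels: int, p_drop: float = 0.2) -> nn.Sequential:
    # One reusable unit: convolution, normalization, activation, pooling and dropout in sequence.
    return nn.Sequential(
        nn.Conv2d(in_channels, out_channels, kernel_size=3, padding=1),  # same-size convolution
        nn.BatchNorm2d(out_channels),
        nn.ReLU(),
        nn.MaxPool2d(kernel_size=2),                                     # halve width and height
        nn.Dropout(p_drop),
    )

# Stacking blocks gives the encoder its structure: channels grow while the surface shrinks.
encoder = nn.Sequential(conv_block(4, 64), conv_block(64, 128), conv_block(128, 256))
```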
The Conv2D kernels are configured for same-size image transformation, with each layer in the network increasing or decreasing the number of channels in the resulting activation-maps as data is encoded or decoded respectively. Most will be familiar with the RGB channels in digital images, which split pictures into three overlapping filters for red, green and blue light – RGBA adding an extra channel for transparency. It’s the increase in this breadth from 4 up to 64/256/512/2^n, alongside the shrinking surface’s width and height, that enhances our model’s deeper understanding of the underlying patterns. Applying 2x2 max pooling with a stride of two halves the feature-maps with each layer, effectively reducing stacked grids of pixels into a large stack of individual activations. The resulting N channels can then easily be flattened or further encoded to obtain the latent-vector Z – now existing in an N-dimensional space for the model to map and explore. Next up, the decoder is presented with the encoded image in compressed form. Its task is the inverse of the encoder’s: transforming the latent vector back into an image with the initial dimensions. The pooling function is replaced by up-sampling, doubling the size rather than halving it.
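Continuing the sketch above, a matching decoder might look like this; it assumes a latent feature stack like the one the example encoder would produce for a 64x64 input, up-samples it back to the original width and height, and maps the channels back down to RGBA (again with arbitrary example sizes).

```python
import torch
import torch.nn as nn

def deconv_block(in_channels: int, out_channels: int) -> nn.Sequential:
    return nn.Sequential(
        nn.Upsample(scale_factor=2),                                    # double width and height
        nn.Conv2d(in_channels, out_channels, kernel_size=3, padding=1),
        nn.BatchNorm2d(out_channels),
        nn.ReLU(),
    )

decoder = nn.Sequential(
    deconv_block(256, 128),
    deconv_block(128, 64),
    nn.Upsample(scale_factor=2),
    nn.Conv2d(64, 4, kernel_size=3, padding=1),                         # back to 4 channels (RGBA)
    nn.Sigmoid(),                                                       # map pixel values into [0, 1]
)

z = torch.randn(1, 256, 8, 8)                  # encoded 64x64 image after three halving steps
latent_vector = z.flatten(start_dim=1)         # the 1D feature-vector Z, here 256 * 8 * 8 values
reconstruction = decoder(z)                    # torch.Size([1, 4, 64, 64]) – original dimensions
```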
The real input image and the generated fake require matching shapes – width, height and channels – after the forward pass through the network. Error calculation then looks at the differences between the reconstructed image and the ground truth or adjusted target, estimating their likeness through the mean squared error (MSE). This calculates the squared difference between individual pixel values and returns the error across the entire image. When this loss is propagated back-to-front, from output to input in a process called backpropagation, it tells each component how to perform better next time around. The gradient is calculated proportionally for each filter value in each convolution kernel in each layer of the network; squaring the error prioritizes large mistakes and simplifies the partial derivatives for gradient descent during backpropagation.
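A minimal, self-contained sketch of this reconstruction objective: a single trainable convolution stands in for the whole network, and random tensors stand in for the real and reconstructed images.

```python
import torch
import torch.nn as nn

net       = nn.Conv2d(4, 4, kernel_size=3, padding=1)   # toy stand-in for the full auto-encoder
target    = torch.rand(1, 4, 64, 64)                     # ground-truth image
generated = net(torch.rand(1, 4, 64, 64))                # reconstruction with matching shape

loss = nn.MSELoss()(generated, target)                   # mean of squared per-pixel differences
loss.backward()                                          # backpropagation: gradients flow output-to-input
print(net.weight.grad.shape)                             # every kernel value received its own gradient
```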
How is a GAN different? Original DCGAN trials and common issues.
The GAN architecture does things a little differently by reorganizing the same key components into an adversarial framework. In this setup the encoder and decoder take on the roles of Discriminator and Generator respectively. While the Discriminator is tasked with distinguishing between real images and generated fakes, the Generator’s score depends on how authentically it can synthesize images in an attempt to fool its opponent. This friendly competition forces both models to try and outplay each other in a zero-sum game. Unlike with the auto-encoder, this pipeline starts by drawing random samples from our familiar latent-vector Z. This randomized list of features guides the Generator to uniformly map the latent space during training; the sampling consequently severs the direct connection to the training data – images are no longer digested directly into the feature-vector. Learning is now fully unsupervised, so we’ll need to find an appropriate replacement for the loss function. The Discriminator then takes this batch of synthetic samples together with authentic training data for comparison. Instead of producing a feature-vector, the output comes in the form of a single value, ranging from 0 (FAKE) to 1 (REAL).
The GAN-loss can be simplified as follows: the Generator aims to trick the Discriminator as much as possible; ideally all synthetic fakes pass as authentic images and are indistinguishable from the real data. The Discriminator’s task is therefore twofold: rejecting these fakes by scoring them towards zero and accurately classifying the real samples with a high score. The Discriminator maximizes this objective, expressed through D(x) and 1 - D(G(z)); the latter term also contains the Generator’s objective, which it tries to minimize. The complete expression encapsulates the GAN-loss, capable of fitting our model to the provided data distribution, extensively mapped during training. Each randomized sample is a new coordinate in a vast high-dimensional space, iteratively translated to the latent manifold Z – our model’s abstraction aimed at capturing the underlying data structure found within the training dataset.
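Sketched with toy one-dimensional networks, the minimax objective V(D, G) = E[log D(x)] + E[log(1 - D(G(z)))] might be computed roughly like this; the layer sizes and data are placeholders rather than a real image pipeline.

```python
import torch
import torch.nn as nn

G = nn.Sequential(nn.Linear(16, 32), nn.ReLU(), nn.Linear(32, 8))               # latent z -> fake sample
D = nn.Sequential(nn.Linear(8, 32), nn.ReLU(), nn.Linear(32, 1), nn.Sigmoid())  # sample -> [0, 1]

real = torch.randn(64, 8)                        # stand-in for a batch of real data
z    = torch.randn(64, 16)                       # random draws feeding the latent space

d_loss = -(torch.log(D(real)).mean() + torch.log(1 - D(G(z))).mean())   # Discriminator: maximize V
g_loss = torch.log(1 - D(G(z))).mean()                                  # Generator: minimize 1 - D(G(z)) term
```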
The DCGAN paper specifies replacing all pooling operations with strided Conv2D layers for image downscaling instead – adding scale to the scope of the featurization. This drastically improves the model’s grasp on the data, as all transformations now have trainable parameters. For the Generator’s up-sampling, transposed convolution gradually increases feature-map size, moving into increasingly fewer but larger image channels and reconstructing the original RGBA. This requires some additional engineering on our part, as the Conv2D layers need to be configured for specific down-sizing operations. Luckily there are some handy formulas that calculate the resulting sizes of intermediate feature-maps or output signals/images.
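A hedged sketch of those size formulas, assuming square inputs, square kernels and no dilation:

```python
def conv_out_size(size: int, kernel: int, stride: int, padding: int) -> int:
    # Strided Conv2D (downscaling): floor((size + 2*padding - kernel) / stride) + 1
    return (size + 2 * padding - kernel) // stride + 1

def conv_transpose_out_size(size: int, kernel: int, stride: int, padding: int) -> int:
    # Transposed Conv2D (upscaling): (size - 1) * stride - 2*padding + kernel
    return (size - 1) * stride - 2 * padding + kernel

print(conv_out_size(64, kernel=4, stride=2, padding=1))            # 64 -> 32 (Discriminator step)
print(conv_transpose_out_size(32, kernel=4, stride=2, padding=1))  # 32 -> 64 (Generator step)
```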
Stabilized GAN training using Wasserstein Distance and Gradient Penalty
When testing the original DCGAN implementation, the results were exciting but far from ideal. The delicate minimax dance is easily disturbed during training, often refusing to converge. Whenever the Discriminator maintains a significant lead, it fails to propagate valuable new information to the Generator – feedback that would normally reveal exactly how it gets fooled. Conversely, when the Generator vastly outperforms the Discriminator, this also stops them both from progressing. Confused, its opponent can no longer discriminate between right and wrong, and the game comes to a stale ending. An imbalance in the delicate updates to these adversarial networks additionally risks exploding gradients and can result in total mode collapse. Once collapsed, the Generator stops trying and stalls training like a one-trick pony – run ruined.
To balance out these shortcomings, researchers have proposed various enhancements to the GAN framework. The Wasserstein GAN introduces clipping of the Critic’s weights, which aims at preventing exploding gradients. A measure of distance between the real and generated data distributions replaces the previous binary output, providing more stable training dynamics. The WGAN Critic’s loss now has a deeper meaning, as it can be trained towards 0 – indicating both distributions overlap. The Generator’s training is also given less priority through critic-iterations: by training the Critic multiple times in a row, it gets a catch-up mechanic and sends even better feedback to the Generator when it’s next in turn to produce new samples.
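A minimal, self-contained sketch of that schedule with toy 1D networks: several clipped Critic updates for every Generator update. Layer sizes are placeholders; the optimizer, learning rate, clip value and critic-iteration count follow the WGAN paper’s suggested defaults.

```python
import torch
import torch.nn as nn

latent_dim = 16
generator = nn.Sequential(nn.Linear(latent_dim, 32), nn.ReLU(), nn.Linear(32, 8))
critic    = nn.Sequential(nn.Linear(8, 32), nn.ReLU(), nn.Linear(32, 1))    # a score, not a probability
gen_opt    = torch.optim.RMSprop(generator.parameters(), lr=5e-5)
critic_opt = torch.optim.RMSprop(critic.parameters(), lr=5e-5)
real = torch.randn(64, 8)                                    # stand-in for a batch of real samples

n_critic, clip_value = 5, 0.01
for _ in range(n_critic):                                    # critic-iterations: the Critic catches up first
    fake = generator(torch.randn(64, latent_dim)).detach()
    critic_loss = critic(fake).mean() - critic(real).mean()  # approximates the (negative) Wasserstein distance
    critic_opt.zero_grad(); critic_loss.backward(); critic_opt.step()
    for p in critic.parameters():
        p.data.clamp_(-clip_value, clip_value)                # clip the Critic's weights

gen_loss = -critic(generator(torch.randn(64, latent_dim))).mean()   # Generator raises the Critic's score
gen_opt.zero_grad(); gen_loss.backward(); gen_opt.step()
```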
The WGAN clipping technique, however, has some unwanted consequences for the generation of realistic images, which led to the improved WGAN-GP (Wasserstein GAN with Gradient Penalty), further enforcing the Lipschitz continuity of the Critic. This continuity is a mathematical property that describes how quickly a function is allowed to change, incorporating a constant that binds the ratio of the function’s change to the change in its corresponding input. Interpolated images randomly mix real and fake data, creating a difficult but informative task for the Critic – perfect for solidifying what distinguishes generated fakes. The gradient penalty term is added to the Critic’s loss function, encouraging its gradient norms not to deviate significantly from one.
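A hedged sketch of that penalty term on flat toy samples (for image batches the mixing factor would broadcast over channel and spatial dimensions); the weight `lambda_gp = 10` follows the WGAN-GP paper’s suggestion.

```python
import torch
import torch.nn as nn

def gradient_penalty(critic, real, fake, lambda_gp=10.0):
    eps = torch.rand(real.size(0), 1)                        # random mixing factor per sample
    interpolated = (eps * real + (1 - eps) * fake).requires_grad_(True)
    scores = critic(interpolated)
    grads = torch.autograd.grad(outputs=scores, inputs=interpolated,
                                grad_outputs=torch.ones_like(scores),
                                create_graph=True)[0]
    return lambda_gp * ((grads.norm(2, dim=1) - 1) ** 2).mean()   # keep gradient norms close to 1

critic = nn.Sequential(nn.Linear(8, 32), nn.ReLU(), nn.Linear(32, 1))
penalty = gradient_penalty(critic, torch.randn(64, 8), torch.randn(64, 8))  # added to the Critic's loss
```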
The W-GAN Critic is ideal for semi-supervised learning: Conditional-GANs.
Now that we’ve thoroughly established a stable baseline for our image generator, let’s further enhance the model to produce high-fidelity images. The Critic’s role evolves beyond binary discrimination, now measuring the distance between the training dataset and the internal model of the overall data distribution extracted from generated synths. In place of a single distance measure in the output layer, let’s split the data into distinct sub-classes and expand the output to match this class count. The Critic additionally learns which characteristics distinguish the subjects in each category. This updates the learning to a semi-supervised approach, as the additional class embeddings force the model to focus on class-specific features and patterns found within a subset of the total data distribution. Previously our synthetic images looked rather uncanny, displaying features unusually combined on single entities; the results now better reflect goal-oriented learning. The focus has shifted to secondary featurization, as the primary feature is handled through data engineering and categorization. A third level of abstraction can go even further with a GANsemble of networks, each focused on a different group of visually similar subjects with their own sub-types and Conv2D filters specializing on more relevant features.
To imbue this class-specific information into the data for both the Generator and Critic, we can use a parameterized embeddings-lookup that trains additional noise. This allows us to include a representation of the class into each module’s inputs. Label in, noise out – adding a memory component to each module that allows it to store primary features. The Generator takes in the latent feature-vector, which we will now double in length by concatenating this class embedding to it. For the Critic we can have the dictionary produce values totalling the number of pixels, which we then restructure as an additional channel overlaid on the RGBA – a fifth layer of information that encodes the class-specific insights.
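Sketched below with two PyTorch embedding tables (class count, latent size and image size are arbitrary example values): one embedding doubles the Generator’s latent input, the other is reshaped into the Critic’s fifth channel.

```python
import torch
import torch.nn as nn

n_classes, latent_dim, height, width = 10, 128, 64, 64

gen_embed    = nn.Embedding(n_classes, latent_dim)        # label -> trainable "noise" vector
critic_embed = nn.Embedding(n_classes, height * width)    # label -> one full image plane

labels = torch.randint(0, n_classes, (8,))                # a batch of class labels
z      = torch.randn(8, latent_dim)                       # a batch of latent feature-vectors

gen_input = torch.cat([z, gen_embed(labels)], dim=1)      # latent vector doubled in length: (8, 256)

rgba         = torch.rand(8, 4, height, width)                    # a batch of RGBA images
label_plane  = critic_embed(labels).view(8, 1, height, width)     # reshape into an extra channel
critic_input = torch.cat([rgba, label_plane], dim=1)              # 5 channels: RGBA + class plane
```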
The Encoder’s guidance to the perfect match on the manifold.
High fidelity at last: our generated fakes are almost indistinguishable from the real deal. With near-zero Critic loss we have successfully fitted our model to the training data. Supplying a desired label and an accompanying feature-vector prompts the Generator to create an appropriate synthetic response. Certain visual aspects of the image might vary with each randomized request, but the main subject should be consistent with the provided class label. A fun exercise is manually searching the Z-space for interesting coordinates, exploring which features correspond to specific aesthetics. Some dice are beautifully lit, with a split sheen on their surface, while others wear a crescent moon. Interpolating between these two styles creates satisfying transitional animations.
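Such an animation could be sketched as a simple linear walk between two hand-picked latent coordinates; this assumes a trained `generator` module and a latent size of 128, both of which are placeholders here.

```python
import torch

z_sheen = torch.randn(1, 128)     # hand-picked coordinate: the "split sheen" style (placeholder values)
z_moon  = torch.randn(1, 128)     # hand-picked coordinate: the "crescent moon" style

with torch.no_grad():             # walk the straight line between the two styles, frame by frame
    frames = [generator(torch.lerp(z_sheen, z_moon, t)) for t in torch.linspace(0, 1, steps=30)]
```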
An Encoder component can be taught where to find visual similarity in a secondary round of training on top of the pretrained GAN – as proposed by the f-AnoGAN paper. This third network guides the Generator through the latent space, in search of the perfect feature-vector Z. Convolution is once again applied, mapping the data’s features during training, while the GAN modules are locked at this stage. The original AnoGAN paper defines the anomaly score as a composite: a residual loss for pixel-wise differences added to an additional feature-wise loss that uses the Critic’s feature-maps for direct comparison. Both are calculated not through MSE but with MAE, the mean absolute error, boosting the model’s spatial awareness and local anomaly detection. Aside from measuring the Wasserstein distance directly, we’ll access the Critic’s intermediate feature-layer just preceding the output – unleashing a vast collection of tiny feature-maps used in the Critic’s final decision-making process. This helps the Encoder understand the subject better by guiding the Generator not only visually but also structurally, inheriting the relevant feature-maps and mimicking the Critic’s way of thinking.
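Putting that into a hedged sketch: the composite score below assumes pretrained `generator`, `encoder` and `critic_features` modules (the Critic truncated just before its output layer), and `kappa` is an illustrative weighting between the two terms rather than a value from the papers.

```python
import torch

def anomaly_score(x, generator, encoder, critic_features, kappa=1.0):
    x_hat = generator(encoder(x))                             # the healthy synthetic counterpart
    residual_loss = torch.abs(x - x_hat).mean()               # pixel-wise MAE between real and synthetic
    feature_loss  = torch.abs(critic_features(x) - critic_features(x_hat)).mean()  # feature-wise MAE
    return residual_loss + kappa * feature_loss               # high score = strong deviation from the norm
```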
Anomaly Detection: Putting it all together.
Data engineering is the final and most crucial step in the pursuit of robust anomaly detection. When carefully curated, the dataset presents our GAN with an ideal representation of the subjects during training. Using only images depicting healthy samples – those without any apparent anomalies – singles out what the subjects should look like. Any unwanted dirt, defects and other irregularities are completely ignored as the Generator reproduces the only thing it knows – inpainting any anomalous zones. The secondary phase, where we train the Encoder’s kernels, solidifies this further as it learns to guide the Generator through the anomaly-free dataset.
Unlike damage detection in other computer vision models, this architecture doesn’t require annotated training data for each type of irregularity we want to pick up. Trained on healthy tissue data, the model can scan medical samples for dangerous anomalies. Applied to a production line’s quality assessment, it can spot defects that haven’t even been discovered yet. GAN-driven anomaly detection offers a wide range of use-cases, so don’t hesitate to contact us here at Faktion. Our team of engineers is here to help you explore how these techniques can be applied to your ideas and initiatives.
Appendix
- Generative Adversarial Networks
- Unsupervised Representation Learning with Deep Convolutional Generative Adversarial Networks
- Wasserstein GAN
- Improved Training of Wasserstein GANs
- Unsupervised Anomaly Detection with Generative Adversarial Networks to Guide Marker Discovery
- f-AnoGAN: Fast unsupervised anomaly detection with generative adversarial networks