Project 5: Fun with Diffusion Models

Albert Wang

In this project, I implemented and deployed diffusion models for image generation.

Part A: The Power of Diffusion Models

Part 0: Setup

First, we load the DeepFloyd IF diffusion model from Hugging Face. DeepFloyd was trained as a text-to-image model, so we also load some precomputed text embeddings. The samples below were generated with a seed of 42, and I also experimented with the number of inference steps (50 instead of 20). Here are a few sample images generated from those prompts:

Part 0 Sample Images

Part 1.1: Implementing the Forward Process

First, we implement the forward process of diffusion, which takes a clean image and adds noise to it. The process is defined by the following equations:

Forward Process Equations
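
A minimal sketch of this step is shown below, assuming alphas_cumprod holds the precomputed cumulative products of the noise schedule (the helper name and interface are mine, not DeepFloyd's):

```python
import torch

def forward_process(x0: torch.Tensor, t: int, alphas_cumprod: torch.Tensor) -> torch.Tensor:
    """Add noise to a clean image x0 at timestep t:
    x_t = sqrt(alpha_bar_t) * x0 + sqrt(1 - alpha_bar_t) * eps,  eps ~ N(0, I)."""
    alpha_bar_t = alphas_cumprod[t]
    eps = torch.randn_like(x0)
    return alpha_bar_t.sqrt() * x0 + (1 - alpha_bar_t).sqrt() * eps
```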

Here are images of the Campanile at noise levels t = 250, 500, and 750:

Noisy Images

Part 1.2: Classical Denoising

Next, we will try to use Gaussian blur filtering to remove the noise. We can see that it is not very effective.

Gaussian Blur Denoising
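
For reference, the classical baseline is just a low-pass filter; a minimal sketch using torchvision's Gaussian blur (the kernel size and sigma here are illustrative choices, not the exact values used):

```python
import torch
import torchvision.transforms.functional as TF

def blur_denoise(x_t: torch.Tensor, kernel_size: int = 5, sigma: float = 2.0) -> torch.Tensor:
    """Classical 'denoising': low-pass filter the noisy image with a Gaussian blur.
    This suppresses high-frequency noise but also destroys image detail."""
    return TF.gaussian_blur(x_t, kernel_size=kernel_size, sigma=sigma)
```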

Part 1.3: One-Step Denoising

Next, we use a pretrained diffusion model to denoise the image. For each of the three noisy images from 1.2, we denoise the image using the U-Net, which predicts the noise in the image so that we can remove it. Compared to Gaussian blur, this approach yields much better results, though it struggles at higher noise levels.

One-Step Denoising Results
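
A minimal sketch of the clean-image estimate, assuming eps_hat is the noise predicted by the stage-1 UNet at timestep t and alphas_cumprod is the noise schedule as before:

```python
import torch

def one_step_denoise(x_t, t, eps_hat, alphas_cumprod):
    """Invert the forward process in one shot given the UNet's noise estimate:
    x0_hat = (x_t - sqrt(1 - alpha_bar_t) * eps_hat) / sqrt(alpha_bar_t)."""
    alpha_bar_t = alphas_cumprod[t]
    return (x_t - (1 - alpha_bar_t).sqrt() * eps_hat) / alpha_bar_t.sqrt()
```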

Part 1.4: Iterative Denoising

Single-step denoising was effective, but we can take it further by implementing an iterative denoising process: starting from a highly noisy image, we gradually reduce the noise over multiple steps.

We use this equation:

Iterative Denoising Equation

where:

Noise Scaling Coefficients

At each step, the update blends the current clean-image estimate with the noisy image according to these noise scaling coefficients. We denoise from t = 990 to t = 0 with a stride of 30. The results are significantly better, especially for heavily noised images.

Iterative Denoising Part 1 Iterative Denoising Part 2
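
A sketch of one stride of this update (t → t′, with t′ < t), written in terms of the cumulative products; the added-variance term is omitted for brevity, and eps_hat is assumed to be the UNet's noise estimate at timestep t:

```python
import torch

def iterative_denoise_step(x_t, t, t_prev, eps_hat, alphas_cumprod):
    """One stride of iterative denoising (t -> t_prev), interpolating between
    the current noisy image x_t and the clean-image estimate x0_hat."""
    a_bar_t, a_bar_prev = alphas_cumprod[t], alphas_cumprod[t_prev]
    alpha = a_bar_t / a_bar_prev          # effective alpha for this stride
    beta = 1 - alpha
    # Clean-image estimate from the noise prediction (same as one-step denoising).
    x0_hat = (x_t - (1 - a_bar_t).sqrt() * eps_hat) / a_bar_t.sqrt()
    # Weighted combination of x0_hat and x_t (learned-variance term omitted).
    return (a_bar_prev.sqrt() * beta / (1 - a_bar_t)) * x0_hat \
         + (alpha.sqrt() * (1 - a_bar_prev) / (1 - a_bar_t)) * x_t
```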

Part 1.5: Diffusion Model Sampling

After understanding denoising, I used the diffusion model to generate images from scratch. Starting with pure noise, I iteratively denoised to create "high-quality photos". Here are the results:

High-Quality Photos from Diffusion Sampling

Part 1.6: Classifier-Free Guidance (CFG)

Next, we use classifier-free guidance (CFG), which combines conditional and unconditional noise estimates to improve image quality. By emphasizing details relevant to the prompt while retaining some randomness, CFG enhances the coherence and structure of the generated images; intuitively, the noise estimate is pushed further in the direction of the conditioning prompt. We apply CFG on top of iterative denoising, so the noise is progressively refined until we get a structured image. Here are the results:

Classifier-Free Guidance Results
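
The CFG combination itself is a one-liner; here eps_cond and eps_uncond are assumed to come from two UNet passes, one with the prompt embedding and one with the null-prompt embedding, and a guidance scale around gamma = 7 is a common choice:

```python
def cfg_noise(eps_cond, eps_uncond, gamma: float = 7.0):
    """Classifier-free guidance: push the noise estimate further in the
    direction of the conditional prediction. gamma > 1 strengthens the prompt."""
    return eps_uncond + gamma * (eps_cond - eps_uncond)
```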

Part 1.7: Image-to-Image Translation

Recall that in Part 1.4 we took a real image, added noise to it, and denoised it; this effectively lets us make edits to the image, and the more noise we add, the more the diffusion model "hallucinates" new content. Here, we run a similar process, this time with CFG, following the SDEdit algorithm: we run the forward process to get a noisy test image, then run iterative_denoise_cfg with starting indices of 1, 3, 5, 7, 10, and 20. Here are the results:

Image-to-Image Translation Series 1 Image-to-Image Translation Series 2 Image-to-Image Translation Series 3
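
A minimal sketch of this SDEdit procedure, assuming iterative_denoise_cfg is the CFG denoising loop from Part 1.6 passed in as a callable and timesteps is the strided schedule used for iterative denoising:

```python
import torch

def sdedit(x_orig, i_start, timesteps, alphas_cumprod, iterative_denoise_cfg):
    """SDEdit: push the input toward the natural-image manifold by noising it
    to timesteps[i_start] and then running CFG iterative denoising from there."""
    t = timesteps[i_start]
    a_bar = alphas_cumprod[t]
    x_t = a_bar.sqrt() * x_orig + (1 - a_bar).sqrt() * torch.randn_like(x_orig)
    return iterative_denoise_cfg(x_t, i_start=i_start)
```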

1.7.1 Edits and Variations

Next, I tried this on some web images and some hand-drawn images:

Corgi Series Panda Series Tree Series

1.7.2 Inpainting

Using binary masks, I inpainted missing regions in images by combining the forward process and diffusion denoising loop. At every step, after obtaining x_t, we ensure that x_t has the same pixels as x_{origin} in areas that are not masked out. This technique fills in the gaps from the mask while preserving unmasked areas. Here are the results:

Campanile Inpaint Goku Inpaint Dino Inpaint
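
A sketch of the mask step applied after each denoising update; the helper name is mine, and the original image is re-noised to the current timestep before being pasted back outside the mask:

```python
import torch

def apply_inpaint_mask(x_t, x_orig, mask, t, alphas_cumprod):
    """Keep generated pixels only inside the mask and force the pixels outside
    it back to a freshly noised copy of the original:
    x_t <- mask * x_t + (1 - mask) * forward(x_orig, t)."""
    a_bar = alphas_cumprod[t]
    noised_orig = a_bar.sqrt() * x_orig + (1 - a_bar).sqrt() * torch.randn_like(x_orig)
    return mask * x_t + (1 - mask) * noised_orig
```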

1.7.3 Text-Conditional Image-to-Image Translation

Next, we do the same thing as SDEdit, but guide the projection with a text prompt, so the outputs gradually transform from the prompt into the input image:

Rocket to Campanile Dog to Pickle Hat to Cloud

Part 1.8: Visual Anagrams

Next, we create images that flip between two interpretations depending on orientation. For example, one image shows "an oil painting of people around a campfire" but reveals "an oil painting of an old man" when flipped upside down.

To do this, at each step we denoise the image x_t normally with the prompt "an oil painting of an old man" to obtain noise estimate ϵ₁. At the same time, we flip the image upside down and denoise it with the prompt "an oil painting of people around a campfire" to get noise estimate ϵ₂. We then flip ϵ₂ back right-side up, average the two noise estimates, and perform the reverse/denoising diffusion step with the averaged estimate. The algorithm is as follows:

Visual Anagrams Algorithm
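
A sketch of the combined noise estimate, assuming predict_noise(x, t, embeds) wraps the (CFG) UNet call; the flip is along the height dimension:

```python
import torch

def anagram_noise(x_t, t, predict_noise, embeds_1, embeds_2):
    """Average a normal noise estimate with an upside-down one so the image
    reads as prompt 1 right-side up and prompt 2 when flipped."""
    eps_1 = predict_noise(x_t, t, embeds_1)                         # e.g. "old man"
    eps_2 = predict_noise(torch.flip(x_t, dims=[-2]), t, embeds_2)  # e.g. "campfire", on flipped input
    eps_2 = torch.flip(eps_2, dims=[-2])                            # flip the estimate back upright
    return (eps_1 + eps_2) / 2
```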

Here are the results:

Old Man Campfire
Old Man / Campfire
Waterfall Skull
Waterfall / Skull
Dog Man with Hat
Dog / Man with Hat

Part 1.9: Hybrid Images

Just like in project 2, we can also create hybrid images using frequency-based blending. We create a composite noise estimate by estimating the noise with two different text prompts and then combining low frequencies from one noise estimate with high frequencies of the other. The algorithm is as follows:

Hybrid Images Algorithm
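
A sketch of the composite noise estimate, using a Gaussian blur as the low-pass filter; the kernel size and sigma here are assumptions, and predict_noise(x, t, embeds) again stands in for the UNet call:

```python
import torchvision.transforms.functional as TF

def hybrid_noise(x_t, t, predict_noise, embeds_low, embeds_high,
                 kernel_size: int = 33, sigma: float = 2.0):
    """Composite noise estimate for a hybrid image: low frequencies from one
    prompt's estimate, high frequencies from the other's."""
    eps_low = predict_noise(x_t, t, embeds_low)    # prompt seen from far away
    eps_high = predict_noise(x_t, t, embeds_high)  # prompt seen up close
    low = TF.gaussian_blur(eps_low, kernel_size=kernel_size, sigma=sigma)
    high = eps_high - TF.gaussian_blur(eps_high, kernel_size=kernel_size, sigma=sigma)
    return low + high
```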

Here are the results:

Skull and Waterfall Skull / Waterfall
Skull and Dog Dog / Skull
Man and Waterfall Old Man / Waterfall

Part B: Diffusion Models From Scratch

In this part, I will train my own diffusion model on MNIST.

Part 1: Training a Single-Step Denoising Net

First, we start by training a one-step denoiser. In other words, this model will map a noisy image to a clean image by minimizing an L2 loss.

1.1: Implementing the UNet

To implement the UNet, we use downsampling and upsampling blocks with skip connections. This diagram describes the procedure:

UNet Architecture Diagram

It uses a number of standard tensor operations defined as follows:

Standard Tensor Operations

Our UNet architecture utilizes convolutional layers with GELU activation functions in the encoder for downsampling and transposed convolutional layers in the decoder for upsampling. Skip connections are integrated between the encoder and decoder layers. The model is trained on image pairs using MSE loss and optimized with the Adam optimizer.
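
As a rough illustration (not the exact architecture in the diagram), here is a heavily simplified single-level version of this UNet; the real model uses more levels and additional blocks, and the channel counts here are placeholders:

```python
import torch
import torch.nn as nn

class ConvBlock(nn.Module):
    """Two 3x3 convolutions with GELU activations."""
    def __init__(self, in_ch, out_ch):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(in_ch, out_ch, 3, padding=1), nn.GELU(),
            nn.Conv2d(out_ch, out_ch, 3, padding=1), nn.GELU(),
        )
    def forward(self, x):
        return self.net(x)

class TinyUNet(nn.Module):
    """Single-level sketch of the denoising UNet: one strided-conv downsample,
    one transposed-conv upsample, and a skip connection between them."""
    def __init__(self, in_ch=1, hidden=128):
        super().__init__()
        self.enc = ConvBlock(in_ch, hidden)
        self.down = nn.Conv2d(hidden, hidden, 3, stride=2, padding=1)          # downsample
        self.mid = ConvBlock(hidden, hidden)
        self.up = nn.ConvTranspose2d(hidden, hidden, 4, stride=2, padding=1)   # upsample
        self.dec = ConvBlock(2 * hidden, hidden)                               # after skip concat
        self.out = nn.Conv2d(hidden, in_ch, 3, padding=1)
    def forward(self, x):
        e = self.enc(x)                      # skip-connection features
        m = self.up(self.mid(self.down(e)))  # bottleneck path
        return self.out(self.dec(torch.cat([e, m], dim=1)))
```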

1.2: Using the UNet to Train a Denoiser

Here is a visualization of the different noising processes over sigma values of [0.0, 0.2, 0.4, 0.5, 0.6, 0.8, 1.0]:

Noising Process Visualization

1.2.1: Training

Next, we train on noisy images with sigma = 0.5, adding noise to each image batch on the fly, using a batch size of 256 for 5 epochs. We define our model with hidden dimension 128 and use a fixed learning rate of 0.0001. This yields the following loss curve:

Training Loss Curve
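
A sketch of this training loop under those hyperparameters; the data-loading details are assumptions, but the key point is that fresh noise is sampled for every batch rather than pre-noising the dataset:

```python
import torch
from torch.utils.data import DataLoader
from torchvision import datasets, transforms

def train_denoiser(model, sigma=0.5, epochs=5, batch_size=256, lr=1e-4, device="cuda"):
    """Train the UNet as a one-step denoiser: noise each batch on the fly with
    z = x + sigma * eps and regress the clean image under MSE loss."""
    data = datasets.MNIST("./data", train=True, download=True,
                          transform=transforms.ToTensor())
    loader = DataLoader(data, batch_size=batch_size, shuffle=True)
    opt = torch.optim.Adam(model.parameters(), lr=lr)
    model.to(device).train()
    losses = []
    for _ in range(epochs):
        for x, _ in loader:
            x = x.to(device)
            z = x + sigma * torch.randn_like(x)   # noise added per batch, per epoch
            loss = torch.nn.functional.mse_loss(model(z), x)
            opt.zero_grad()
            loss.backward()
            opt.step()
            losses.append(loss.item())
    return losses
```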

We visualize the denoised results on the test set for the model after the first and fifth epoch:

Denoised Results After Epoch 1 Denoised Results After Epoch 5

1.2.2: Out-of-Distribution Testing

Our model was trained on MNIST digits noised with sigma = 0.5. Next, we test the performance of our denoiser on sigmas = [0.0, 0.2, 0.4, 0.5, 0.6, 0.8, 1.0]. Here are the results:

Out-of-Distribution Testing Results

Part 2: Training a Diffusion Model

2.1: Adding Time Conditioning to UNet

Next, we train a time-conditioned UNet for iterative denoising within a DDPM framework. Instead of predicting clean images, the model predicts noise, using a variance schedule whose beta values increase from 0.0001 to 0.02 over 300 timesteps. By conditioning the UNet on the timestep, a single model can handle varying noise levels during the diffusion process. The training objective minimizes the mean squared error between the predicted and actual noise at randomly sampled timesteps.

Time-Conditioned UNet Architecture Variance Schedule and Loss Function
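
A minimal sketch of this variance schedule and its cumulative products, which the forward process and sampler both use:

```python
import torch

def ddpm_schedule(num_ts: int = 300, beta_1: float = 1e-4, beta_T: float = 0.02):
    """Linear DDPM variance schedule: betas rise from beta_1 to beta_T, and
    alpha_bar is the running product used by the forward process."""
    betas = torch.linspace(beta_1, beta_T, num_ts)
    alphas = 1.0 - betas
    alpha_bars = torch.cumprod(alphas, dim=0)
    return betas, alphas, alpha_bars
```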

2.2: Training the UNet

We use a batch size of 128 for 20 epochs, a learning rate of 0.001, and a hidden dimension of 64. We follow this training algorithm:

Training Algorithm
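
A sketch of a single training step of this algorithm, assuming the UNet takes the normalized timestep t/T as its conditioning input (how exactly the timestep is fed in is an implementation choice):

```python
import torch

def ddpm_train_step(model, x0, alpha_bars, opt, num_ts=300, device="cuda"):
    """One training step: pick random timesteps, noise the clean batch with the
    forward process, and regress the injected noise under MSE loss."""
    x0 = x0.to(device)
    t = torch.randint(0, num_ts, (x0.shape[0],), device=device)
    a_bar = alpha_bars.to(device)[t].view(-1, 1, 1, 1)
    eps = torch.randn_like(x0)
    x_t = a_bar.sqrt() * x0 + (1 - a_bar).sqrt() * eps
    # The UNet is conditioned on the (normalized) timestep.
    loss = torch.nn.functional.mse_loss(model(x_t, t.float() / num_ts), eps)
    opt.zero_grad()
    loss.backward()
    opt.step()
    return loss.item()
```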

This generates the following time-conditioned UNet training loss graph:

Time-Conditioned UNet Training Loss

2.3: Sampling from the Diffusion Model

Next, we sample from our diffusion model. Here are the results after 1, 5, and 20 epochs of training:

Samples at Epoch 1 Samples at Epoch 5 Samples at Epoch 20
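
A sketch of the sampling loop, using the standard DDPM reverse update; the schedule tensors come from the ddpm_schedule helper above, and the model is again assumed to take a normalized timestep:

```python
import torch

@torch.no_grad()
def ddpm_sample(model, betas, alphas, alpha_bars, shape=(16, 1, 28, 28),
                num_ts=300, device="cuda"):
    """Start from pure noise and apply the DDPM reverse update for
    t = T-1, ..., 0, re-adding noise at every step except t = 0."""
    betas, alphas, alpha_bars = (s.to(device) for s in (betas, alphas, alpha_bars))
    x = torch.randn(shape, device=device)
    for t in range(num_ts - 1, -1, -1):
        t_batch = torch.full((shape[0],), t, device=device, dtype=torch.float) / num_ts
        eps_hat = model(x, t_batch)
        z = torch.randn_like(x) if t > 0 else torch.zeros_like(x)
        x = (x - (1 - alphas[t]) / (1 - alpha_bars[t]).sqrt() * eps_hat) / alphas[t].sqrt() \
            + betas[t].sqrt() * z
    return x
```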

2.4: Adding Class-Conditioning to UNet

Next, to add class-conditioning to our model, we add two additional FCBlocks that process the class label (as a one-hot vector) alongside the time conditioning. Training is the same as in the time-only case, except that we also pass in the conditioning vector c and periodically drop it to train for unconditional generation. We follow this algorithm:

Class-Conditioned Training Algorithm
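
A sketch of how the class-conditioning vector can be built and randomly dropped during training; the drop probability here is an assumption:

```python
import torch
import torch.nn.functional as F

def class_condition(labels, num_classes=10, p_uncond=0.1):
    """Build the one-hot class-conditioning vector c, zeroing it out for a
    random fraction of the batch so the model also learns unconditional
    generation (p_uncond is an assumed drop probability)."""
    c = F.one_hot(labels, num_classes).float()
    keep = (torch.rand(labels.shape[0], 1, device=labels.device) > p_uncond).float()
    return c * keep
```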

Here is the class-conditioned UNet training loss graph:

Class-Conditioned UNet Training Loss

2.5: Sampling from the Class-Conditioned UNet

Here are the results of sampling from the class-conditioned UNet after 5 and 20 epochs:

Class-Conditioned Samples at Epoch 5 Epoch 5
Class-Conditioned Samples at Epoch 20 Epoch 20