In two parts, this project explores the use and implementation of diffusion models. Specifically, I experimented with diffusion sampling loops and built a class-conditioned, time-conditioned denoising UNet.
Part A
Part 0: Setup
For the first portion of this project, I mainly used DeepFloyd IF, a two-stage model trained by Stability AI.
The first stage generates a small 64×64 image through iterative denoising, while the second stage upsamples
that output to 256×256. In short, the model is trained to nudge a noisy image onto the manifold of real images.
For three text prompts:
'an oil painting of a snowy mountain village'
'a man wearing a hat'
'a rocket ship'
I sampled with num_inference_steps = 20, 50, and 100 to see how the model behaves with more inference
iterations.
num_inference_steps = 20
num_inference_steps = 50
num_inference_steps = 100
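For reference, here is a minimal sketch of how such samples can be generated with the first DeepFloyd stage through Hugging Face diffusers. The model ID and call pattern follow the diffusers documentation, but exact arguments may vary by version:

```python
import torch
from diffusers import DiffusionPipeline

# Load stage 1 of DeepFloyd IF (the 64x64 denoising stage).
stage_1 = DiffusionPipeline.from_pretrained(
    "DeepFloyd/IF-I-XL-v1.0", variant="fp16", torch_dtype=torch.float16
).to("cuda")

prompt = "an oil painting of a snowy mountain village"
prompt_embeds, negative_embeds = stage_1.encode_prompt(prompt)

# Vary num_inference_steps (20 / 50 / 100) to reproduce the comparison above.
image = stage_1(
    prompt_embeds=prompt_embeds,
    negative_prompt_embeds=negative_embeds,
    num_inference_steps=20,
    generator=torch.manual_seed(180),
).images[0]
```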
From this experiment, I noticed that more inference iterations led the model to generate more texture and detail,
whereas fewer iterations gave the outputs more of a cartoonish appearance.
Part 1: Sampling Loops
1.1 Implementing the Forward Process
Here, I implemented the forward process of a diffusion model, which adds noise to a clean image according to a noise schedule.
Starting from a picture of the Campanile, I added noise at timesteps t = 250, 500, and 750.
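As a sketch, the forward process can be implemented in a few lines, assuming `alphas_cumprod` holds the schedule's cumulative alpha products:

```python
import torch

def forward(x0, t, alphas_cumprod):
    """Sample x_t ~ q(x_t | x_0): x_t = sqrt(abar_t) * x0 + sqrt(1 - abar_t) * eps."""
    abar_t = alphas_cumprod[t]
    eps = torch.randn_like(x0)  # fresh Gaussian noise
    return torch.sqrt(abar_t) * x0 + torch.sqrt(1 - abar_t) * eps, eps
```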
1.2 Classical Denoising
Next, I tried the classical approach to denoising: applying a Gaussian blur filter. Specifically, I used a
kernel of size 5 and a sigma of 3. Applying this filter to the three noised images above gives the following:
1.3 One-Step Denoising
Next, I used the pre-trained diffusion model to see the difference. Since it requires a prompt, I used
"a high quality photo". I then had the UNet estimate the noise and removed it from the noisy image in a single step by inverting the forward equation with the schedule's cumulative alpha (alpha bar).
These results were much better than the Gaussian blur but still looked very smooth.
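Concretely, with `eps_hat = unet(x_t, t)` as the model's noise estimate, the one-step clean-image estimate just inverts the forward equation above:

```python
# One-step denoising: solve x_t = sqrt(abar_t) x0 + sqrt(1 - abar_t) eps for x0.
abar_t = alphas_cumprod[t]
x0_hat = (x_t - torch.sqrt(1 - abar_t) * eps_hat) / torch.sqrt(abar_t)
```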
1.4 Iterative Denoising
Even though the one-step results looked fairly natural, it seemed possible to sharpen them by applying the
noise removal iteratively rather than all at once. Specifically, I ran about 30 denoising iterations over strided
timesteps, which achieved a much better denoising result:
As a point of comparison, here are the various method results side by side:
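For reference, here is a sketch of the per-iteration update I used, moving from timestep t to an earlier strided timestep t_prev (variable names are illustrative):

```python
def denoise_step(x_t, x0_hat, t, t_prev, alphas_cumprod):
    """One strided DDPM update from t to t_prev (t_prev < t): interpolate
    between the current noisy image and the clean-image estimate x0_hat.
    The full algorithm also adds a small variance term, omitted here."""
    abar_t, abar_prev = alphas_cumprod[t], alphas_cumprod[t_prev]
    alpha = abar_t / abar_prev  # effective one-step alpha over the stride
    beta = 1 - alpha
    return (torch.sqrt(abar_prev) * beta / (1 - abar_t)) * x0_hat \
         + (torch.sqrt(alpha) * (1 - abar_prev) / (1 - abar_t)) * x_t
```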
1.5 Diffusion Model Sampling
In the previous part, the sampling started from a noisy image of the Campanile. Here, I instead started from
pure noise to see what kind of realistic image the model would try to produce. Here are 5 sampled images:
1.6 Classifier-Free Guidance (CFG)
In the previous part, the quality still wasn't great, and the samples didn't quite look like real images. To
improve image quality, I implemented classifier-free guidance (CFG), which computes an unconditional noise
estimate alongside the conditional one ("a high quality photo") and extrapolates past the conditional estimate,
trading sample diversity for quality. With this addition, the results look better and resemble coherent
objects:
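The CFG blend itself is tiny; a sketch, where `unet`, `prompt_embeds`, `null_embeds`, and the guidance scale `gamma` are illustrative names:

```python
# Run the UNet twice per step: once conditioned on the text, once on the
# empty prompt, then extrapolate past the conditional estimate (gamma > 1).
eps_cond = unet(x_t, t, prompt_embeds)   # "a high quality photo"
eps_uncond = unet(x_t, t, null_embeds)   # empty prompt
eps = eps_uncond + gamma * (eps_cond - eps_uncond)
```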
1.7 Image-to-image Translation
Next, it is possible to introduce artificial noise into an existing image and generate new images from
that point, guided only by the generic prompt. For example, starting from the Campanile picture, here are the results for i_start values of
[1, 3, 5, 7, 10, 20], respectively. As i_start approaches 20, less of the image is noised, so the output most closely resembles the original tower.
At i_start = 1, on the other end, far more of the original image is noised, and without stronger text guidance the model is more likely to generate something completely
unrelated:
For fun, I also tried this on a Spam musubi and a soft drink cup:
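The procedure itself (often called SDEdit) is a thin wrapper around the earlier pieces; a sketch, where `iterative_denoise` is a hypothetical helper wrapping the loop from 1.4:

```python
def sdedit(x0, i_start, strided_ts, alphas_cumprod):
    """Noise the source image to the timestep at index i_start, then run the
    usual iterative denoising loop from that point onward."""
    t = strided_ts[i_start]
    x_t, _ = forward(x0, t, alphas_cumprod)  # forward process from 1.1
    return iterative_denoise(x_t, i_start, strided_ts)
```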
1.7.1 Editing Hand-Drawn and Web Images
Next, I tried source images that are more cartoonish and clipart-esque, nudging them back into the realm of realism. Again using
various starting points (and thus noise levels), I created the following from an image of tulips:
Then, I tried some stick-figure drawings to see how close it can get to a realistic image:
1.7.2 Inpainting
Here, I experimented with masking out a portion of the original image so that the denoiser is tasked
with filling in the hole while leaving the rest untouched. In the original Campanile photo, I masked out the top clock portion to generate the following
result:
It seemed pretty cool, so I also ran it on some dim sum and a cow in the woods:
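The inpainting trick is a one-line change to the sampling loop; a sketch, where `m` is a binary mask that is 1 where new content should be generated:

```python
# After each denoising step, force the pixels outside the mask back to the
# original image (noised to the current timestep), so the model only ever
# invents content inside the masked region.
x_orig_t, _ = forward(x_orig, t, alphas_cumprod)
x_t = m * x_t + (1 - m) * x_orig_t
```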
1.7.3 Text-Conditional Image-to-image Translation
Wrapping up image-to-image translation, I also tried a condition other than the generic "a high quality photo". First,
here are the different starting levels (i_start = 1, 3, 5, 7, 10, 20) at which the Campanile begins to look like a rocket:
Next, I tried turning the Spam musubi into a waterfall and the soft drink cup into a pencil:
1.8 Visual Anagrams
In this part, I created some cool visual anagrams, where flipping the image reveals a totally new picture.
To do this, I ran the denoising model twice at each step: once with one prompt on the upright image and once with the other prompt
on a flipped copy. Averaging the two noise estimates (after un-flipping the second) makes it possible to sample a visual anagram; a sketch of the update appears at the end of this section. For example, here
is a visual anagram of "an oil painting of people around a campfire" and "an oil painting of an old man":
Result:
Upscaled:
Next, I created anagrams from "a rocket ship" and "a photo of a man", and from "a photo of the amalfi cost" (sic) and "an oil painting of a snowy mountain village",
respectively:
Result:
Upscaled:
Result:
Upscaled:
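Here is the promised sketch of the anagram update (names illustrative); the only change from normal sampling is how the noise estimate is formed:

```python
# Estimate noise for prompt 1 on the upright image and for prompt 2 on a
# vertically flipped copy, un-flip the second estimate, then average.
eps1 = unet(x_t, t, embeds_prompt1)
eps2 = torch.flip(unet(torch.flip(x_t, dims=[-2]), t, embeds_prompt2), dims=[-2])
eps = (eps1 + eps2) / 2
```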
1.9 Hybrid Images
For this last part, I used much the same idea as before, but instead of a flip, I used frequency filtering
to combine the two noise estimates: the low frequencies from one prompt and the high frequencies from the other.
Summing the filtered estimates keeps the fine details of one prompt and the coarse structure of the other, so a viewer interprets the image
differently depending on their distance from the screen, since the eye resolves fine details better at close range. A sketch of the blend appears below.
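A sketch of the blend, assuming a Gaussian blur as the low-pass filter (the kernel size and sigma here are illustrative):

```python
import torchvision.transforms.functional as TF

# Low-pass one prompt's noise estimate, high-pass the other's, then sum.
eps_far = unet(x_t, t, embeds_far)    # prompt seen from a distance
eps_near = unet(x_t, t, embeds_near)  # prompt seen up close
low = TF.gaussian_blur(eps_far, kernel_size=33, sigma=2.0)
high = eps_near - TF.gaussian_blur(eps_near, kernel_size=33, sigma=2.0)
eps = low + high
```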
First, I used the prompts "a lithograph of a skull" and "a lithograph of waterfalls":
Then, I paired "a rocket ship" with "an oil painting of an old man", and "a rocket ship" with "a photo of a hipster barista", respectively:
Part B
Part 1: Training a Single-Step Denoising UNet
1.1 Implementing the UNet
This portion was initially daunting since I hadn't had much experience with neural nets before. However, with the
discussion section's PyTorch tutorial, it was very manageable, and I ended up with a trainable denoising UNet.
1.2 Using the UNet to Train a Denoiser
To decide on a noise level at which to train the denoiser, I implemented a simple forward process that adds
Gaussian noise scaled by a factor sigma. From left to right, sigma = 0.0 to sigma = 1.0:
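Unlike Part A's scheduled noising, this forward process is a single Gaussian perturbation; a minimal sketch:

```python
import torch

def add_noise(x, sigma):
    """Noise a clean image x with a fixed factor sigma in [0, 1]."""
    return x + sigma * torch.randn_like(x)
```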
1.2.1 Training
Choosing a medium noise level of sigma = 0.5, I achieved the following training losses over 5 epochs:
To sanity check the results, here are some samples after the first and last epochs (left column: the noised image;
middle column: the predicted noise; right column: the UNet's denoised result):
After the last epoch:
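For reference, a minimal sketch of the training loop (the optimizer and hyperparameters here are illustrative, and `unet`/`train_loader` are assumed to exist):

```python
import torch
import torch.nn.functional as F

opt = torch.optim.Adam(unet.parameters(), lr=1e-4)
for epoch in range(5):
    for x, _ in train_loader:              # MNIST digits; labels unused here
        z = x + 0.5 * torch.randn_like(x)  # fixed noise level sigma = 0.5
        loss = F.mse_loss(unet(z), x)      # regress the clean image
        opt.zero_grad()
        loss.backward()
        opt.step()
```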
1.2.2 Out-of-Distribution Testing
Since the model was only trained with a noise level of 0.5, it was interesting to see how it handles less or more noise in the input images.
The first row shows images with noise levels from 0.0 to 1.0 added; the second row shows the model's outputs:
Part 2: Training a Diffusion Model
After the warm-up, I implemented a full diffusion model (DDPM) with its scheduling, forward, and sampling algorithms. For the most part,
the network is the same UNet as before. However, the objective changes: the model now predicts the noise rather than the
underlying clean digit.
2.1 Adding Time Conditioning to UNet
One major change is the addition of time conditioning, since Part A showed that single-step denoising generally performs
worse than iterative denoising. To support iterative sampling here, I used small fully connected blocks to inject the timestep
into each forward call of the model, as sketched below.
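A minimal sketch of the idea (the exact block structure of my network differs; names here are illustrative):

```python
import torch.nn as nn

class FCBlock(nn.Module):
    """Map a normalized scalar timestep to a feature vector."""
    def __init__(self, in_ch, out_ch):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(in_ch, out_ch), nn.GELU(), nn.Linear(out_ch, out_ch)
        )
    def forward(self, t):
        return self.net(t)

# Inside the UNet's forward pass, with t of shape (B, 1) normalized to [0, 1]:
# feat = feat + self.t_embed(t)[:, :, None, None]  # broadcast over H and W
```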
2.2 Training the UNet
The training loop is also slightly modified at this point: the loss is computed against the injected noise, and the timestep is normalized by the
total number of steps in the iterative denoising algorithm before being fed to the network. In the end, the time-conditioned UNet produced the following training loss curve:
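A sketch of one training step under these changes (T = 300 and the variable names are assumptions carried over from earlier sketches):

```python
import torch
import torch.nn.functional as F

t = torch.randint(1, T, (x.shape[0],))         # random timestep per image
abar = alphas_cumprod[t][:, None, None, None]  # broadcast over C, H, W
eps = torch.randn_like(x)
x_t = torch.sqrt(abar) * x + torch.sqrt(1 - abar) * eps
loss = F.mse_loss(unet(x_t, t[:, None].float() / T), eps)  # normalize t by T
```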
2.3 Sampling from the UNet
I also sampled some results at several points during training (after the 1st, 5th, and 20th epochs). The sampling algorithm runs from t = 299 down to t = 1 and iteratively
refines the noise estimate, with some variance re-injected at each step according to the beta schedule; a sketch follows.
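A sketch of that loop, assuming precomputed `betas`, `alphas`, and `alphas_cumprod` tensors and T = 300:

```python
x = torch.randn(n, 1, 28, 28)                     # start from pure noise
for t in range(T - 1, 0, -1):
    eps_hat = unet(x, torch.full((n, 1), t / T))  # normalized timestep
    abar, abar_prev = alphas_cumprod[t], alphas_cumprod[t - 1]
    alpha, beta = alphas[t], betas[t]
    x0_hat = (x - torch.sqrt(1 - abar) * eps_hat) / torch.sqrt(abar)
    z = torch.randn_like(x) if t > 1 else torch.zeros_like(x)
    x = (torch.sqrt(abar_prev) * beta / (1 - abar)) * x0_hat \
        + (torch.sqrt(alpha) * (1 - abar_prev) / (1 - abar)) * x \
        + torch.sqrt(beta) * z                    # re-inject variance
```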
After the first epoch:
After the 5th epoch:
After the 20th epoch:
2.4 Adding Class-Conditioning to UNet
Even with the time conditioning (which produced much more legible results), there was still no way to specify which digit a sample should become.
Thus, I added a second conditioning signal to the model: the class label. The class is also injected via an FC block, and the model was trained on the MNIST
training set. This training loop produced the following training loss curve:
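A sketch of the class conditioning during training (the dropout probability `p_uncond` is an assumption; dropping the label occasionally keeps an unconditional mode available for guidance at sampling time):

```python
import torch
import torch.nn.functional as F

c = F.one_hot(labels, num_classes=10).float()   # digit label -> one-hot
drop = (torch.rand(c.shape[0], 1) < p_uncond).float()
c = c * (1 - drop)                              # dropped rows = null class
eps_hat = unet(x_t, t_norm, c)                  # class injected via FC block
```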
2.5 Sampling from the Class-Conditioned UNet
Using the same sampling technique as before, but now conditioning on class labels (four rows of the digits 0-9), I got the following results.
After the first epoch:
After the 5th epoch:
After the 20th epoch:
Conclusion
This was a super fun project and I learned a ton about neural nets and diffusion models. I wish I had more time to experiment with the bells & whistles!