In two parts, this project explores the use and implementation of diffusion models. Specifically, I experimented with diffusion sampling loops and built a class-conditioned, time-conditioned denoising UNet.
Part A
Part 0: Setup
For the first portion of this project, I mainly used DeepFloyd IF, a two-stage model trained by Stability AI.
The first stage generates a small 64×64 image through iterative denoising, while the second stage upsamples
that output to 256×256. In short, the model is trained to nudge a noisy image onto the manifold of real images.
For three text prompts:
'an oil painting of a snowy mountain village'
'a man wearing a hat'
'a rocket ship'
I sampled with num_inference_steps = 20, 50, and 100 to see how the model behaves with more inference
iterations.
num_inference_steps = 20
num_inference_steps = 50
num_inference_steps = 100
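For reference, here is a minimal sketch of how such samples can be generated with the first DeepFloyd stage through Hugging Face diffusers. The model ID and call pattern follow the diffusers documentation, but exact arguments may vary by version:

```python
import torch
from diffusers import DiffusionPipeline

# Load stage 1 of DeepFloyd IF (the 64x64 denoising stage).
stage_1 = DiffusionPipeline.from_pretrained(
    "DeepFloyd/IF-I-XL-v1.0", variant="fp16", torch_dtype=torch.float16
).to("cuda")

prompt = "an oil painting of a snowy mountain village"
prompt_embeds, negative_embeds = stage_1.encode_prompt(prompt)

# Vary num_inference_steps (20 / 50 / 100) to reproduce the comparison above.
image = stage_1(
    prompt_embeds=prompt_embeds,
    negative_prompt_embeds=negative_embeds,
    num_inference_steps=20,
    generator=torch.manual_seed(180),
).images[0]
```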
From this experiment, I noticed that more inference iterations led the model to generate more texture and detail,
whereas fewer iterations gave the outputs more of a cartoonish appearance.
Part 1: Sampling Loops
1.1 Implementing the Forward Process
Here, I implemented the forward process of a diffusion model, which adds noise to a clean image according to a noise schedule.
Starting from a picture of the Campanile, I added noise at timesteps t = 250, 500, and 750.
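As a sketch, the forward process can be implemented in a few lines, assuming `alphas_cumprod` holds the schedule's cumulative alpha products:

```python
import torch

def forward(x0, t, alphas_cumprod):
    """Sample x_t ~ q(x_t | x_0): x_t = sqrt(abar_t) * x0 + sqrt(1 - abar_t) * eps."""
    abar_t = alphas_cumprod[t]
    eps = torch.randn_like(x0)  # fresh Gaussian noise
    return torch.sqrt(abar_t) * x0 + torch.sqrt(1 - abar_t) * eps, eps
```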
1.2 Classical Denoising
Next, I tried the classical approach to denoising: applying a Gaussian blur filter. Specifically, I used a
kernel of size 5 and a sigma of 3. Applying this filter to the three noised images above gives the following:
1.3 One-Step Denoising
Next, I used the pre-trained diffusion model to see the difference. Since it requires a prompt, I used
"a high quality photo". I then had the UNet estimate the noise and removed it from the noisy image in a single step by inverting the forward equation with the schedule's cumulative alpha (alpha bar).
These results were much better than the Gaussian blur but still looked very smooth.
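Concretely, with `eps_hat = unet(x_t, t)` as the model's noise estimate, the one-step clean-image estimate just inverts the forward equation above:

```python
# One-step denoising: solve x_t = sqrt(abar_t) x0 + sqrt(1 - abar_t) eps for x0.
abar_t = alphas_cumprod[t]
x0_hat = (x_t - torch.sqrt(1 - abar_t) * eps_hat) / torch.sqrt(abar_t)
```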
1.4 Iterative Denoising
Even though the one-step results looked fairly natural, it seemed possible to sharpen them by applying the
noise removal iteratively rather than all at once. Specifically, I ran about 30 denoising iterations over strided
timesteps, which achieved a much better denoising result:
As a point of comparison, here are the various method results side by side:
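For reference, here is a sketch of the per-iteration update I used, moving from timestep t to an earlier strided timestep t_prev (variable names are illustrative):

```python
def denoise_step(x_t, x0_hat, t, t_prev, alphas_cumprod):
    """One strided DDPM update from t to t_prev (t_prev < t): interpolate
    between the current noisy image and the clean-image estimate x0_hat.
    The full algorithm also adds a small variance term, omitted here."""
    abar_t, abar_prev = alphas_cumprod[t], alphas_cumprod[t_prev]
    alpha = abar_t / abar_prev  # effective one-step alpha over the stride
    beta = 1 - alpha
    return (torch.sqrt(abar_prev) * beta / (1 - abar_t)) * x0_hat \
         + (torch.sqrt(alpha) * (1 - abar_prev) / (1 - abar_t)) * x_t
```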
1.5 Diffusion Model Sampling
In the previous part, the sampling started from a noisy image of the Campanile. Here, I instead started from
pure noise to see what kind of realistic image the model would try to produce. Here are 5 sampled images:
1.6 Classifier-Free Guidance (CFG)
In the previous part, the quality still wasn't great, and the samples didn't quite look like real images. To
improve image quality, I implemented classifier-free guidance (CFG), which computes an unconditional noise
estimate alongside the conditional one ("a high quality photo") and extrapolates past the conditional estimate,
trading sample diversity for quality. With this addition, the results look better and resemble coherent
objects:
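The CFG blend itself is tiny; a sketch, where `unet`, `prompt_embeds`, `null_embeds`, and the guidance scale `gamma` are illustrative names:

```python
# Run the UNet twice per step: once conditioned on the text, once on the
# empty prompt, then extrapolate past the conditional estimate (gamma > 1).
eps_cond = unet(x_t, t, prompt_embeds)   # "a high quality photo"
eps_uncond = unet(x_t, t, null_embeds)   # empty prompt
eps = eps_uncond + gamma * (eps_cond - eps_uncond)
```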
1.7 Image-to-image Translation
Next, it is possible to introduce artificial noise into an existing image and generate new images from
that point, guided only by the generic prompt. For example, starting from the Campanile picture, here are the results for i_start values of
[1, 3, 5, 7, 10, 20], respectively. As i_start approaches 20, less of the image is noised, so the output most closely resembles the original tower.
At i_start = 1, on the other end, far more of the original image is noised, and without stronger text guidance the model is more likely to generate something completely
unrelated:
For fun, I also tried this on a Spam musubi and a soft drink cup:
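The procedure itself (often called SDEdit) is a thin wrapper around the earlier pieces; a sketch, where `iterative_denoise` is a hypothetical helper wrapping the loop from 1.4:

```python
def sdedit(x0, i_start, strided_ts, alphas_cumprod):
    """Noise the source image to the timestep at index i_start, then run the
    usual iterative denoising loop from that point onward."""
    t = strided_ts[i_start]
    x_t, _ = forward(x0, t, alphas_cumprod)  # forward process from 1.1
    return iterative_denoise(x_t, i_start, strided_ts)
```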
1.7.1 Editing Hand-Drawn and Web Images
Next, I tried source images that are more cartoonish and clipart-esque, nudging them back into the realm of realism. Again using
various starting points (and thus noise levels), I created the following from an image of tulips:
Then, I tried some stick-figure drawings to see how close it can get to a realistic image:
1.7.2 Inpainting
Here, I experimented with masking out a portion of the original image so that the denoiser is tasked
with filling in the hole while leaving the rest untouched. In the original Campanile photo, I masked out the top clock portion to generate the following
result:
It seemed pretty cool, so I also ran it on some dim sum and a cow in the woods:
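The inpainting trick is a one-line change to the sampling loop; a sketch, where `m` is a binary mask that is 1 where new content should be generated:

```python
# After each denoising step, force the pixels outside the mask back to the
# original image (noised to the current timestep), so the model only ever
# invents content inside the masked region.
x_orig_t, _ = forward(x_orig, t, alphas_cumprod)
x_t = m * x_t + (1 - m) * x_orig_t
```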
1.7.3 Text-Conditional Image-to-image Translation
Wrapping up image-to-image translation, I also tried a condition other than the generic "a high quality photo". First,
here are the different starting levels (i_start = 1, 3, 5, 7, 10, 20) at which the Campanile begins to look like a rocket:
Next, I tried turning the Spam musubi into a waterfall and the soft drink cup into a pencil:
1.8 Visual Anagrams
In this part, I created some cool visual anagrams, where flipping the image reveals a totally new picture.
To do this, I ran the denoising model twice at each step: once with one prompt on the upright image and once with the other prompt
on a flipped copy. Averaging the two noise estimates (after un-flipping the second) makes it possible to sample a visual anagram; a sketch of the update appears at the end of this section. For example, here
is a visual anagram of "an oil painting of people around a campfire" and "an oil painting of an old man":
Result:
Upscaled:
Next, I created anagrams from "a rocket ship" and "a photo of a man", and from "a photo of the amalfi cost" (sic) and "an oil painting of a snowy mountain village",
respectively:
Result:
Upscaled:
Result:
Upscaled:
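Here is the promised sketch of the anagram update (names illustrative); the only change from normal sampling is how the noise estimate is formed:

```python
# Estimate noise for prompt 1 on the upright image and for prompt 2 on a
# vertically flipped copy, un-flip the second estimate, then average.
eps1 = unet(x_t, t, embeds_prompt1)
eps2 = torch.flip(unet(torch.flip(x_t, dims=[-2]), t, embeds_prompt2), dims=[-2])
eps = (eps1 + eps2) / 2
```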
1.9 Hybrid Images
For this last part, I used much the same idea as before, but instead of a flip, I used frequency filtering
to combine the two noise estimates: the low frequencies from one prompt and the high frequencies from the other.
Summing the filtered estimates keeps the fine details of one prompt and the coarse structure of the other, so a viewer interprets the image
differently depending on their distance from the screen, since the eye resolves fine details better at close range. A sketch of the blend appears below.
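A sketch of the blend, assuming a Gaussian blur as the low-pass filter (the kernel size and sigma here are illustrative):

```python
import torchvision.transforms.functional as TF

# Low-pass one prompt's noise estimate, high-pass the other's, then sum.
eps_far = unet(x_t, t, embeds_far)    # prompt seen from a distance
eps_near = unet(x_t, t, embeds_near)  # prompt seen up close
low = TF.gaussian_blur(eps_far, kernel_size=33, sigma=2.0)
high = eps_near - TF.gaussian_blur(eps_near, kernel_size=33, sigma=2.0)
eps = low + high
```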
First, I used the prompts "a lithograph of a skull" and "a lithograph of waterfalls":
Then, I paired "a rocket ship" with "an oil painting of an old man", and "a rocket ship" with "a photo of a hipster barista", respectively:
Part B
Part 1: Training a Single-Step Denoising UNet
1.1 Implementing the UNet
This portion was initially daunting since I hadn't had much experience with neural nets before. However, with the
discussion section's PyTorch tutorial, it was very manageable, and I ended up with a trainable denoising UNet.
1.2 Using the UNet to Train a Denoiser
To decide on a noise level at which to train the denoiser, I implemented a simple forward process that adds
Gaussian noise scaled by a factor sigma. From left to right, sigma = 0.0 to sigma = 1.0:
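Unlike Part A's scheduled noising, this forward process is a single Gaussian perturbation; a minimal sketch:

```python
import torch

def add_noise(x, sigma):
    """Noise a clean image x with a fixed factor sigma in [0, 1]."""
    return x + sigma * torch.randn_like(x)
```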
1.2.1 Training
Choosing a medium noise level of sigma = 0.5, I achieved the following training losses over 5 epochs:
To sanity check the results, here are some samples after the first and last epochs (left column: the noised image;
middle column: the predicted noise; right column: the UNet's denoised result):
After the last epoch:
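For reference, a minimal sketch of the training loop (the optimizer and hyperparameters here are illustrative, and `unet`/`train_loader` are assumed to exist):

```python
import torch
import torch.nn.functional as F

opt = torch.optim.Adam(unet.parameters(), lr=1e-4)
for epoch in range(5):
    for x, _ in train_loader:              # MNIST digits; labels unused here
        z = x + 0.5 * torch.randn_like(x)  # fixed noise level sigma = 0.5
        loss = F.mse_loss(unet(z), x)      # regress the clean image
        opt.zero_grad()
        loss.backward()
        opt.step()
```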
1.2.2 Out-of-Distribution Testing
Since the model was only trained with a noise level of 0.5, it was interesting to see how it handles less or more noise in the input images.
The first row shows images with noise levels from 0.0 to 1.0 added; the second row shows the model's outputs:
Part 2: Training a Diffusion Model
After the warm-up, I implemented a full diffusion model (DDPM) with its scheduling, forward, and sampling algorithms. For the most part,
the network is the same UNet as before. However, the objective changes: the model now predicts the noise rather than the
underlying clean digit.
2.1 Adding Time Conditioning to UNet
One major change is the addition of time conditioning, since Part A showed that single-step denoising generally performs
worse than iterative denoising. To support iterative sampling here, I used small fully connected blocks to inject the timestep
into each forward call of the model, as sketched below.
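A minimal sketch of the idea (the exact block structure of my network differs; names here are illustrative):

```python
import torch.nn as nn

class FCBlock(nn.Module):
    """Map a normalized scalar timestep to a feature vector."""
    def __init__(self, in_ch, out_ch):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(in_ch, out_ch), nn.GELU(), nn.Linear(out_ch, out_ch)
        )
    def forward(self, t):
        return self.net(t)

# Inside the UNet's forward pass, with t of shape (B, 1) normalized to [0, 1]:
# feat = feat + self.t_embed(t)[:, :, None, None]  # broadcast over H and W
```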
2.2 Training the UNet
The training loop is also slightly modified at this point: the loss is computed against the injected noise, and the timestep is normalized by the
total number of steps in the iterative denoising algorithm before being fed to the network. In the end, the time-conditioned UNet produced the following training loss curve:
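A sketch of one training step under these changes (T = 300 and the variable names are assumptions carried over from earlier sketches):

```python
import torch
import torch.nn.functional as F

t = torch.randint(1, T, (x.shape[0],))         # random timestep per image
abar = alphas_cumprod[t][:, None, None, None]  # broadcast over C, H, W
eps = torch.randn_like(x)
x_t = torch.sqrt(abar) * x + torch.sqrt(1 - abar) * eps
loss = F.mse_loss(unet(x_t, t[:, None].float() / T), eps)  # normalize t by T
```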
2.3 Sampling from the UNet
I also sampled some results at several points during training (after the 1st, 5th, and 20th epochs). The sampling algorithm runs from t = 299 down to t = 1 and iteratively
refines the noise estimate, with some variance re-injected at each step according to the beta schedule; a sketch follows.
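A sketch of that loop, assuming precomputed `betas`, `alphas`, and `alphas_cumprod` tensors and T = 300:

```python
x = torch.randn(n, 1, 28, 28)                     # start from pure noise
for t in range(T - 1, 0, -1):
    eps_hat = unet(x, torch.full((n, 1), t / T))  # normalized timestep
    abar, abar_prev = alphas_cumprod[t], alphas_cumprod[t - 1]
    alpha, beta = alphas[t], betas[t]
    x0_hat = (x - torch.sqrt(1 - abar) * eps_hat) / torch.sqrt(abar)
    z = torch.randn_like(x) if t > 1 else torch.zeros_like(x)
    x = (torch.sqrt(abar_prev) * beta / (1 - abar)) * x0_hat \
        + (torch.sqrt(alpha) * (1 - abar_prev) / (1 - abar)) * x \
        + torch.sqrt(beta) * z                    # re-inject variance
```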
After the first epoch:
After the 5th epoch:
After the 20th epoch:
2.4 Adding Class-Conditioning to UNet
Even with the time conditioning (which produced much more legible results), there was still no way to specify which digit a sample should become.
Thus, I added a second conditioning signal to the model: the class label. The class is also injected via an FC block, and the model was trained on the MNIST
training set. This training loop produced the following training loss curve:
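A sketch of the class conditioning during training (the dropout probability `p_uncond` is an assumption; dropping the label occasionally keeps an unconditional mode available for guidance at sampling time):

```python
import torch
import torch.nn.functional as F

c = F.one_hot(labels, num_classes=10).float()   # digit label -> one-hot
drop = (torch.rand(c.shape[0], 1) < p_uncond).float()
c = c * (1 - drop)                              # dropped rows = null class
eps_hat = unet(x_t, t_norm, c)                  # class injected via FC block
```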
2.5 Sampling from the Class-Conditioned UNet
Using the same sampling technique as before, but now conditioning on class labels (four rows of the digits 0-9), I got the following results.
After the first epoch:
After the 5th epoch:
After the 20th epoch:
Conclusion
This was a super fun project and I learned a ton about neural nets and diffusion models. I wish I had more time to experiment with the bells & whistles!