CS180 Project 5

Artem Shumay

Project tasks:

A) Generate various images with diffusion models, including visual anagrams and hybrid images.

B) Create our own diffusion model and train it on the MNIST dataset.

Part A0: Generating images with a given prompt

I chose the seed 508312 for this project. These are some example images generated with the given model and num_inference_steps=20:

The same prompts generated with num_inference_steps=50 and with num_inference_steps=100:

As we can see, the higher num_inference_steps is, the better the generated results are. We can also see a rough correspondence between the level of detail in a picture and the length of the prompt: the longer the prompt, the more detailed the image.

Part A1.1: Implementing the Forward Process.

The forward process adds noise to an image up to some noise level t. I implemented it with the formula x_t = sqrt(alpha_bar_t) * x_0 + sqrt(1 - alpha_bar_t) * eps, where eps ~ N(0, I). Below is a small sketch of the implementation, followed by some results of the Campanile at different noise levels:
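A minimal sketch of this step, assuming `alphas_cumprod` is the precomputed tensor of cumulative products alpha_bar (indexed by timestep) and `x_0` is an image tensor:

```python
import torch

def forward_process(x_0, t, alphas_cumprod):
    """Add noise to a clean image x_0 up to timestep t:
    x_t = sqrt(alpha_bar_t) * x_0 + sqrt(1 - alpha_bar_t) * eps, eps ~ N(0, I)."""
    alpha_bar_t = alphas_cumprod[t]
    eps = torch.randn_like(x_0)  # fresh Gaussian noise
    x_t = alpha_bar_t.sqrt() * x_0 + (1 - alpha_bar_t).sqrt() * eps
    return x_t, eps
```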

Part A1.2: Classical denoising.

I tried to denoise the noisy images by applying a Gaussian filter. A sketch is below, followed by some (bad-looking) results:
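A minimal sketch of this baseline using torchvision's Gaussian blur (the kernel size and sigma here are illustrative, not necessarily the exact values I used):

```python
import torchvision.transforms.functional as TF

def gaussian_denoise(x_t, kernel_size=5, sigma=2.0):
    # Blurring suppresses the high-frequency noise, but it also
    # destroys high-frequency image detail, hence the bad results.
    return TF.gaussian_blur(x_t, kernel_size=kernel_size, sigma=sigma)
```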

Part A1.3: One step denoising.

Here we remove the noise predicted by the model from the noisy image to get a decent-looking denoised result. To remove the noise we can rearrange the equation above to solve for x_0: x_0 = (x_t - sqrt(1 - alpha_bar_t) * eps) / sqrt(alpha_bar_t).

Notice how at large noise levels the Campanile starts to look like a completely different tower.
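A minimal sketch of this clean-image estimate, assuming `eps_pred` is the noise predicted by the UNet at timestep t:

```python
def estimate_x0(x_t, eps_pred, t, alphas_cumprod):
    """Invert the forward process:
    x_0 = (x_t - sqrt(1 - alpha_bar_t) * eps) / sqrt(alpha_bar_t)."""
    alpha_bar_t = alphas_cumprod[t]
    return (x_t - (1 - alpha_bar_t).sqrt() * eps_pred) / alpha_bar_t.sqrt()
```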

Part A1.4: Iterative denoising.

By implementing the following equation we get iterative denoising, where x_0 is the current prediction of the clean image:

x_{t'} = (sqrt(alpha_bar_{t'}) * beta_t / (1 - alpha_bar_t)) * x_0 + (sqrt(alpha_t) * (1 - alpha_bar_{t'}) / (1 - alpha_bar_t)) * x_t + v_sigma,

where t' < t is the next (less noisy) timestep, alpha_t = alpha_bar_t / alpha_bar_{t'}, and beta_t = 1 - alpha_t. Now we can go through our strided_timesteps, a list of noise levels t = [990, 960, ..., 30, 0], to progressively denoise the image and get a clean result:

As we can see, iterative denoising produces the cleanest and most detailed result; however, it has also hallucinated some new content that wasn't in the original image.
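A minimal sketch of a single denoising step under the equation above (the added variance term v_sigma is omitted here for brevity):

```python
def iterative_denoise_step(x_t, x_0_pred, t, t_prime, alphas_cumprod):
    """Step from timestep t to the less-noisy timestep t' < t."""
    abar_t, abar_tp = alphas_cumprod[t], alphas_cumprod[t_prime]
    alpha = abar_t / abar_tp  # effective alpha for this stride
    beta = 1 - alpha
    x_tp = (abar_tp.sqrt() * beta / (1 - abar_t)) * x_0_pred \
         + (alpha.sqrt() * (1 - abar_tp) / (1 - abar_t)) * x_t
    return x_tp
```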

Part A1.5: Model sampling.

Now if we denoise starting not from an image with noise applied to it but from pure noise, we can get some entirely new generated images like those below:

Part A1.6: Classifier-free guidance.

The previous results were alright, but we can make them better by reducing image diversity. In the previous part we only used the conditional noise estimate; now we can mix in an unconditional noise estimate to get better-looking results. I used gamma = 7 in the following equation to compute the noise estimate: noise = uncond_noise + gamma * (cond_noise - uncond_noise). A sketch is below, followed by the higher-quality generated images:
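A minimal sketch of the guided noise estimate, assuming `unet(x, t, emb)` returns the predicted noise for a given prompt embedding:

```python
def cfg_noise(unet, x_t, t, cond_emb, uncond_emb, gamma=7.0):
    """Classifier-free guidance: push the conditional prediction
    away from the unconditional one by a factor gamma."""
    cond_noise = unet(x_t, t, cond_emb)
    uncond_noise = unet(x_t, t, uncond_emb)
    return uncond_noise + gamma * (cond_noise - uncond_noise)
```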

Part A1.7.0: Image-to-image Translation.

If we add noise to an original image, we can make edits to that image by iteratively denoising it; the amount of noise we add determines how much new content gets generated, i.e. how far the denoised image ends up from the original. A sketch of the procedure is below.
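A minimal sketch of the procedure, where `denoise_from(x_t, i_start)` stands in for the iterative denoising loop from part A1.4 (a hypothetical helper name, not the exact code I used):

```python
import torch

def img2img(x_orig, i_start, strided_timesteps, denoise_from, alphas_cumprod):
    """SDEdit-style editing: noise the original image to the level of
    strided_timesteps[i_start], then iteratively denoise from there."""
    t = strided_timesteps[i_start]
    abar_t = alphas_cumprod[t]
    x_t = abar_t.sqrt() * x_orig + (1 - abar_t).sqrt() * torch.randn_like(x_orig)
    return denoise_from(x_t, i_start)
```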

Notice how the image at i_start=1 doesn't match the original at all, because at i_start=1 we are denoising what is almost pure noise.

img2img when run on me:

img2img when run on a Lego figurine:

Part A1.7.1: Hand-drawn image to image.

We can provide a sketch of an image and denoise it using CFG; the model will create an image that roughly tries to match the sketch.

img2img when run on an image from the web (from a research paper):

img2img when run on a hand-drawn image:

img2img when run on another hand-drawn image:

Part A1.7.2: Inpainting.

Now if we mask out some area of the image and run the denoising loop so that only that area is regenerated, we can effectively redraw just that part of the image. Below is a sketch of the masked update, followed by some of my results doing that:
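A minimal sketch of the masked update applied after every denoising step, where `mask` is 1 inside the region we want to regenerate:

```python
import torch

def inpaint_step(x_t, x_orig, mask, t, alphas_cumprod):
    """Force the pixels outside the mask back to the original image
    (noised to the current level), so only the masked region is generated."""
    abar_t = alphas_cumprod[t]
    x_orig_t = abar_t.sqrt() * x_orig + (1 - abar_t).sqrt() * torch.randn_like(x_orig)
    return mask * x_t + (1 - mask) * x_orig_t
```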

Test image:

Me (lol):

Lego figurine:

Part A1.7.3: Text conditional image to image.

We can generate images starting from some image but conditioned on a prompt, so that the result looks somewhat like the prompt:

when run on the test image with the rocket ship prompt:

when run on the flower with the skull prompt:

when run on the Lego figurine with the man in a hat prompt:

Part A1.8: Visual anagrams.

We can compute two noise estimates and combine them in a way that produces a visual anagram. One estimate is computed normally for the first prompt; the other is computed on the vertically flipped image for the second prompt, then flipped back, and the two are averaged. One estimate steers the upright image toward the first prompt while the other steers the flipped image toward the second, and we end up with a nice visual anagram. A sketch of the combined estimate is below, followed by my results:
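A minimal sketch of the combined noise estimate, assuming `unet(x, t, emb)` returns the predicted noise for a given prompt embedding:

```python
import torch

def anagram_noise(unet, x_t, t, emb_1, emb_2):
    """Average a normal noise estimate for prompt 1 with a flipped
    noise estimate for prompt 2, so the image works both ways up."""
    eps_1 = unet(x_t, t, emb_1)  # prompt 1, upright image
    # prompt 2: flip the image vertically, predict noise, flip it back
    eps_2 = torch.flip(unet(torch.flip(x_t, dims=[-2]), t, emb_2), dims=[-2])
    return (eps_1 + eps_2) / 2
```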

Anagram of "an oil painting of people around a campfire" and "an oil painting of an old man"

Anagram of "an oil painting of a snowy mountain village" and "an oil painting of an old man"

Anagram of "a photo of a man" and "a photo of a dog"

Part A1.9: Hybrid images.

By doing the same thing as in the previous part, but running one noise estimate through a low-pass filter and the other through a high-pass filter before combining them, we can create a hybrid-image effect. A sketch is below, followed by my results:
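A minimal sketch of the combined noise estimate, reusing torchvision's Gaussian blur as the low-pass filter (the kernel size and sigma are illustrative):

```python
import torchvision.transforms.functional as TF

def hybrid_noise(unet, x_t, t, emb_low, emb_high, kernel_size=33, sigma=2.0):
    """Low-pass one noise estimate and high-pass the other before summing,
    so the image reads as one prompt up close and the other from far away.
    `unet(x, t, emb)` is assumed to return the predicted noise."""
    eps_low = unet(x_t, t, emb_low)
    eps_high = unet(x_t, t, emb_high)
    low = TF.gaussian_blur(eps_low, kernel_size=kernel_size, sigma=sigma)
    high = eps_high - TF.gaussian_blur(eps_high, kernel_size=kernel_size, sigma=sigma)
    return low + high
```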

Hybrid of a lithograph of a skull and a lithograph of a waterfall

Hybrid of a man and a dog

Hybrid of a pencil and a rocket

Hybrid of a dog and a hipster barista

Part B: Creating our own diffusion model.

Part B1: Training a single-step denoiser.

By implementing the UNet given in the spec we can create a one-step denoiser and train it on the MNIST dataset: we grab an image from the dataset, noise it to the sigma=0.5 level, and have the model predict the clean image, using the L2 norm as our loss. A sketch of one training step is below.
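A minimal sketch of one training step, assuming `unet` is the UNet from the spec mapping a noisy image to a denoised one:

```python
import torch
import torch.nn.functional as F

def train_step(unet, optimizer, x, sigma=0.5):
    """Noise a clean MNIST batch x to level sigma and regress
    the clean image with an L2 (MSE) loss."""
    z = x + sigma * torch.randn_like(x)  # noisy input
    x_hat = unet(z)                      # predicted clean image
    loss = F.mse_loss(x_hat, x)          # L2 objective
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```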

Here is what the noise process for dataset images looks like:

Now we train on sigma=0.5:

Here is how it performs on in-distribution tests at the 1st epoch

Here is how it performs on in-distribution tests at the 5th epoch

Here is how it performs on out-of-distribution tests at the 5th epoch

Part B2: Time conditioning.

Now we implement time conditioning: using the equations from part A of the project we train the model to predict the noise, and we then sample the same way we did in part A. A sketch of one training step is below.
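A minimal sketch of one training step, assuming `unet(x_t, t)` takes the normalized timestep as conditioning and `alphas_cumprod` is the precomputed noise schedule (num_ts=300 here is illustrative):

```python
import torch
import torch.nn.functional as F

def train_step_time_conditioned(unet, optimizer, x_0, alphas_cumprod, num_ts=300):
    """Pick a random timestep per sample, noise the batch with the
    forward process, and regress the injected noise."""
    t = torch.randint(0, num_ts, (x_0.shape[0],), device=x_0.device)
    abar = alphas_cumprod[t].view(-1, 1, 1, 1)
    eps = torch.randn_like(x_0)
    x_t = abar.sqrt() * x_0 + (1 - abar).sqrt() * eps
    eps_pred = unet(x_t, t / num_ts)  # predict the noise
    loss = F.mse_loss(eps_pred, eps)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```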

Here is the loss curve:

Here are the results after training:

Part B3: Class conditioning.

We can now pass the digit class associated with each image so the model learns to generate specific digits. 10% of the time we pass an empty conditioning vector, so the UNet also learns to work unconditionally and so that we can use classifier-free guidance when sampling. A sketch of one training step is below.
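A minimal sketch of one training step with the conditioning dropout, assuming `unet(x_t, t, c)` takes a one-hot class vector c in addition to the normalized timestep:

```python
import torch
import torch.nn.functional as F

def train_step_class_conditioned(unet, optimizer, x_0, labels, alphas_cumprod,
                                 num_ts=300, p_uncond=0.1):
    """One-hot the digit label and, with probability p_uncond, zero it out
    so the model also learns the unconditional case (needed for CFG)."""
    c = F.one_hot(labels, num_classes=10).float()
    drop = (torch.rand(c.shape[0], device=c.device) < p_uncond).float().unsqueeze(1)
    c = c * (1 - drop)  # 10% of the time: empty conditioning vector
    t = torch.randint(0, num_ts, (x_0.shape[0],), device=x_0.device)
    abar = alphas_cumprod[t].view(-1, 1, 1, 1)
    eps = torch.randn_like(x_0)
    x_t = abar.sqrt() * x_0 + (1 - abar).sqrt() * eps
    eps_pred = unet(x_t, t / num_ts, c)
    loss = F.mse_loss(eps_pred, eps)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```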

Here is the loss curve:

Here are the results after training: