CS180 Project 5

Artem Shumay

Project tasks:

A) Generate various images with diffusion models, including visual anagrams and hybrid images.

B) Create our own diffusion model and train it on the MNIST dataset.

Part A0: Generating images with a given prompt

I chose the seed 508312 for this project. These are some example images generated with the given model and num_inference_steps=20:

The same prompts generated with num_inference_steps=50 and with num_inference_steps=100:

As we can see, the higher num_inference_steps is, the better the generated results are. We can also see a rough correspondence between the level of detail in a picture and the length of the prompt: the longer the prompt, the more detailed the image.

Part A1.1: Implementing the Forward Process.

The forward process adds noise to an image up to some noise level t. I implemented it with the formula x_t = sqrt(alpha_bar_t) * x_0 + sqrt(1 - alpha_bar_t) * eps, where eps ~ N(0, I). Below is a small sketch of the implementation, followed by some results of the Campanile at different noise levels:
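A minimal sketch of this step, assuming `alphas_cumprod` is the precomputed tensor of cumulative products alpha_bar (indexed by timestep) and `x_0` is an image tensor:

```python
import torch

def forward_process(x_0, t, alphas_cumprod):
    """Add noise to a clean image x_0 up to timestep t:
    x_t = sqrt(alpha_bar_t) * x_0 + sqrt(1 - alpha_bar_t) * eps, eps ~ N(0, I)."""
    alpha_bar_t = alphas_cumprod[t]
    eps = torch.randn_like(x_0)  # fresh Gaussian noise
    x_t = alpha_bar_t.sqrt() * x_0 + (1 - alpha_bar_t).sqrt() * eps
    return x_t, eps
```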

Part A1.2: Classical denoising.

I tried to denoise the noisy images by applying a Gaussian filter. A sketch is below, followed by some (bad-looking) results:
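A minimal sketch of this baseline using torchvision's Gaussian blur (the kernel size and sigma here are illustrative, not necessarily the exact values I used):

```python
import torchvision.transforms.functional as TF

def gaussian_denoise(x_t, kernel_size=5, sigma=2.0):
    # Blurring suppresses the high-frequency noise, but it also
    # destroys high-frequency image detail, hence the bad results.
    return TF.gaussian_blur(x_t, kernel_size=kernel_size, sigma=sigma)
```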

Part A1.3: One step denoising.

Here we remove the noise predicted by the model from the noisy image to get a decent-looking denoised result. To remove the noise we can rearrange the equation above to solve for x_0: x_0 = (x_t - sqrt(1 - alpha_bar_t) * eps) / sqrt(alpha_bar_t).

Notice how at large noise levels the Campanile starts to look like a completely different tower.
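A minimal sketch of this clean-image estimate, assuming `eps_pred` is the noise predicted by the UNet at timestep t:

```python
def estimate_x0(x_t, eps_pred, t, alphas_cumprod):
    """Invert the forward process:
    x_0 = (x_t - sqrt(1 - alpha_bar_t) * eps) / sqrt(alpha_bar_t)."""
    alpha_bar_t = alphas_cumprod[t]
    return (x_t - (1 - alpha_bar_t).sqrt() * eps_pred) / alpha_bar_t.sqrt()
```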

Part A1.4: Iterative denoising.

By implementing the following equation we get iterative denoising, where x_0 is the current prediction of the clean image:

x_{t'} = (sqrt(alpha_bar_{t'}) * beta_t / (1 - alpha_bar_t)) * x_0 + (sqrt(alpha_t) * (1 - alpha_bar_{t'}) / (1 - alpha_bar_t)) * x_t + v_sigma,

where t' < t is the next (less noisy) timestep, alpha_t = alpha_bar_t / alpha_bar_{t'}, and beta_t = 1 - alpha_t. Now we can go through our strided_timesteps, a list of noise levels t = [990, 960, ..., 30, 0], to progressively denoise the image and get a clean result:

As we can see, iterative denoising produces the cleanest and most detailed result; however, it has also hallucinated some new content that wasn't in the original image.
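A minimal sketch of a single denoising step under the equation above (the added variance term v_sigma is omitted here for brevity):

```python
def iterative_denoise_step(x_t, x_0_pred, t, t_prime, alphas_cumprod):
    """Step from timestep t to the less-noisy timestep t' < t."""
    abar_t, abar_tp = alphas_cumprod[t], alphas_cumprod[t_prime]
    alpha = abar_t / abar_tp  # effective alpha for this stride
    beta = 1 - alpha
    x_tp = (abar_tp.sqrt() * beta / (1 - abar_t)) * x_0_pred \
         + (alpha.sqrt() * (1 - abar_tp) / (1 - abar_t)) * x_t
    return x_tp
```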

Part A1.5: Model sampling.

Now if we denoise starting not from an image with noise applied to it but from pure noise, we can get some entirely new generated images like those below:

Part A1.6: Classifier-free guidance.

The previous results were alright, but we can make them better by reducing image diversity. In the previous part we only used the conditional noise estimate; now we can mix in an unconditional noise estimate to get better-looking results. I used gamma = 7 in the following equation to compute the noise estimate: noise = uncond_noise + gamma * (cond_noise - uncond_noise). A sketch is below, followed by the higher-quality generated images:
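A minimal sketch of the guided noise estimate, assuming `unet(x, t, emb)` returns the predicted noise for a given prompt embedding:

```python
def cfg_noise(unet, x_t, t, cond_emb, uncond_emb, gamma=7.0):
    """Classifier-free guidance: push the conditional prediction
    away from the unconditional one by a factor gamma."""
    cond_noise = unet(x_t, t, cond_emb)
    uncond_noise = unet(x_t, t, uncond_emb)
    return uncond_noise + gamma * (cond_noise - uncond_noise)
```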

Part A1.7.0: Image-to-image Translation.

If we add noise to an original image, we can make edits to that image by iteratively denoising it; the amount of noise we add determines how much new content gets generated, i.e. how far the denoised image ends up from the original. A sketch of the procedure is below.
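A minimal sketch of the procedure, where `denoise_from(x_t, i_start)` stands in for the iterative denoising loop from part A1.4 (a hypothetical helper name, not the exact code I used):

```python
import torch

def img2img(x_orig, i_start, strided_timesteps, denoise_from, alphas_cumprod):
    """SDEdit-style editing: noise the original image to the level of
    strided_timesteps[i_start], then iteratively denoise from there."""
    t = strided_timesteps[i_start]
    abar_t = alphas_cumprod[t]
    x_t = abar_t.sqrt() * x_orig + (1 - abar_t).sqrt() * torch.randn_like(x_orig)
    return denoise_from(x_t, i_start)
```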

Notice how the image at i_start=1 doesn't match the original at all, because at i_start=1 we are denoising what is almost pure noise.

img2img when run on me:

img2img when run on a Lego figurine:

Part A1.7.1: Hand-drawn image to image.

We can provide a sketch of an image and denoise it using CFG; the model will create an image that roughly tries to match the sketch.

img2img when run on an image from the web (from a research paper):

img2img when run on a hand-drawn image:

img2img when run on another hand-drawn image:

Part A1.7.2: Inpainting.

Now if we mask out some area of the image and run the denoising loop so that only that area is regenerated, we can effectively redraw just that part of the image. Below is a sketch of the masked update, followed by some of my results doing that:
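A minimal sketch of the masked update applied after every denoising step, where `mask` is 1 inside the region we want to regenerate:

```python
import torch

def inpaint_step(x_t, x_orig, mask, t, alphas_cumprod):
    """Force the pixels outside the mask back to the original image
    (noised to the current level), so only the masked region is generated."""
    abar_t = alphas_cumprod[t]
    x_orig_t = abar_t.sqrt() * x_orig + (1 - abar_t).sqrt() * torch.randn_like(x_orig)
    return mask * x_t + (1 - mask) * x_orig_t
```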

Test image:

Me (lol):

Lego figurine:

Part A1.7.3: Text conditional image to image.

We can generate images starting from some image but conditioned on a prompt, so that the result looks somewhat like the prompt:

when run on the test image with the rocket ship prompt:

when run on the flower with the skull prompt:

when run on the Lego figurine with the man in a hat prompt:

Part A1.8: Visual anagrams.

We can compute two noise estimates and combine them in a way that produces a visual anagram. One estimate is computed normally for the first prompt; the other is computed on the vertically flipped image for the second prompt, then flipped back, and the two are averaged. One estimate steers the upright image toward the first prompt while the other steers the flipped image toward the second, and we end up with a nice visual anagram. A sketch of the combined estimate is below, followed by my results:
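A minimal sketch of the combined noise estimate, assuming `unet(x, t, emb)` returns the predicted noise for a given prompt embedding:

```python
import torch

def anagram_noise(unet, x_t, t, emb_1, emb_2):
    """Average a normal noise estimate for prompt 1 with a flipped
    noise estimate for prompt 2, so the image works both ways up."""
    eps_1 = unet(x_t, t, emb_1)  # prompt 1, upright image
    # prompt 2: flip the image vertically, predict noise, flip it back
    eps_2 = torch.flip(unet(torch.flip(x_t, dims=[-2]), t, emb_2), dims=[-2])
    return (eps_1 + eps_2) / 2
```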

Anagram of "an oil painting of people around a campfire" and "an oil painting of an old man"

Anagram of "an oil painting of a snowy mountain village" and "an oil painting of an old man"

Anagram of "a photo of a man" and "a photo of a dog"

Part A1.9: Hybrid images.

By doing the same thing as in the previous part, but running one noise estimate through a low-pass filter and the other through a high-pass filter before combining them, we can create a hybrid-image effect. A sketch is below, followed by my results:
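A minimal sketch of the combined noise estimate, reusing torchvision's Gaussian blur as the low-pass filter (the kernel size and sigma are illustrative):

```python
import torchvision.transforms.functional as TF

def hybrid_noise(unet, x_t, t, emb_low, emb_high, kernel_size=33, sigma=2.0):
    """Low-pass one noise estimate and high-pass the other before summing,
    so the image reads as one prompt up close and the other from far away.
    `unet(x, t, emb)` is assumed to return the predicted noise."""
    eps_low = unet(x_t, t, emb_low)
    eps_high = unet(x_t, t, emb_high)
    low = TF.gaussian_blur(eps_low, kernel_size=kernel_size, sigma=sigma)
    high = eps_high - TF.gaussian_blur(eps_high, kernel_size=kernel_size, sigma=sigma)
    return low + high
```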

Hybrid of a lithograph of a skull and a lithograph of a waterfall

Hybrid of a man and a dog

Hybrid of a pencil and a rocket

Hybrid of a dog and a hipster barista

Part B: Creating our own diffusion model.

Part B1: Training a single-step denoiser.

By implementing the UNet given in the spec we can create a one-step denoiser and train it on the MNIST dataset: we grab an image from the dataset, noise it to the sigma=0.5 level, and have the model predict the clean image, using the L2 norm as our loss. A sketch of one training step is below.
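A minimal sketch of one training step, assuming `unet` is the UNet from the spec mapping a noisy image to a denoised one:

```python
import torch
import torch.nn.functional as F

def train_step(unet, optimizer, x, sigma=0.5):
    """Noise a clean MNIST batch x to level sigma and regress
    the clean image with an L2 (MSE) loss."""
    z = x + sigma * torch.randn_like(x)  # noisy input
    x_hat = unet(z)                      # predicted clean image
    loss = F.mse_loss(x_hat, x)          # L2 objective
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```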

Here is what the noise process for dataset images looks like:

Now we train on sigma=0.5:

Here is how it performs on in-distribution tests at the 1st epoch

Here is how it performs on in-distribution tests at the 5th epoch

Here is how it performs on out-of-distribution tests at the 5th epoch

Part B2: Time conditioning.

Now we implement time conditioning: using the equations from part A of the project we train the model to predict the noise, and we then sample the same way we did in part A. A sketch of one training step is below.
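A minimal sketch of one training step, assuming `unet(x_t, t)` takes the normalized timestep as conditioning and `alphas_cumprod` is the precomputed noise schedule (num_ts=300 here is illustrative):

```python
import torch
import torch.nn.functional as F

def train_step_time_conditioned(unet, optimizer, x_0, alphas_cumprod, num_ts=300):
    """Pick a random timestep per sample, noise the batch with the
    forward process, and regress the injected noise."""
    t = torch.randint(0, num_ts, (x_0.shape[0],), device=x_0.device)
    abar = alphas_cumprod[t].view(-1, 1, 1, 1)
    eps = torch.randn_like(x_0)
    x_t = abar.sqrt() * x_0 + (1 - abar).sqrt() * eps
    eps_pred = unet(x_t, t / num_ts)  # predict the noise
    loss = F.mse_loss(eps_pred, eps)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```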

Here is the loss curve:

Here are the results after training:

Part B3: Class conditioning.

We can now pass the digit class associated with each image so the model learns to generate specific digits. 10% of the time we pass an empty conditioning vector, so the UNet also learns to work unconditionally and so that we can use classifier-free guidance when sampling. A sketch of one training step is below.
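A minimal sketch of one training step with the conditioning dropout, assuming `unet(x_t, t, c)` takes a one-hot class vector c in addition to the normalized timestep:

```python
import torch
import torch.nn.functional as F

def train_step_class_conditioned(unet, optimizer, x_0, labels, alphas_cumprod,
                                 num_ts=300, p_uncond=0.1):
    """One-hot the digit label and, with probability p_uncond, zero it out
    so the model also learns the unconditional case (needed for CFG)."""
    c = F.one_hot(labels, num_classes=10).float()
    drop = (torch.rand(c.shape[0], device=c.device) < p_uncond).float().unsqueeze(1)
    c = c * (1 - drop)  # 10% of the time: empty conditioning vector
    t = torch.randint(0, num_ts, (x_0.shape[0],), device=x_0.device)
    abar = alphas_cumprod[t].view(-1, 1, 1, 1)
    eps = torch.randn_like(x_0)
    x_t = abar.sqrt() * x_0 + (1 - abar).sqrt() * eps
    eps_pred = unet(x_t, t / num_ts, c)
    loss = F.mse_loss(eps_pred, eps)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```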

Here is the loss curve:

Here are the results after training: