Stable Diffusion: everything you need to know
read time - 6 mins

Unless you are living under a rock, you have probably seen the cool and weird AI video animations floating around. Some of them make sense, some of them don't, but all of them are enticing to watch. This recent ad for Coca-Cola, which used AI, will blow your mind. Here's to a promising future.
Now, what is this AI that everyone is talking about? What are the core modules of this AI software that make video editing a thing of the past? This post will break down each piece, and at the end, I will also give a quick demo on how to create videos like the one below:
What are diffusion models?
Diffusion models are a class of probabilistic generative neural networks that transform noise into representative data. Based on what it was trained on, a diffusion model can generate images from pure noise. You can also add "prompts" (text or images) to guide the model toward a specific set of images. There is some similarity to GANs.
Unlike GANs, diffusion models require only one network to generate data. The model tries to generate images that mimic the statistical properties of the training data.
Check out my IG post where I explained diffusion models with a very simple example. Even a 5-year-old can understand diffusion models with my post:
There are two key processes in diffusion:
Data corruption: The model takes the input image and gradually adds noise to it at each step until the image becomes pure noise.
Data reconstruction: This is the reverse of data corruption. The model takes pure noise and progressively refines it by removing noise until it generates an image similar to the input image.
The model is trained to learn the reconstruction process, which should be general and yet specific. This multi-step reconstruction makes diffusion models slower than GANs, but in practice they generate much higher-quality results. You can also use GANs to generate images.
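To make the corruption step concrete, here is a minimal sketch in Python (PyTorch) of the closed-form forward noising step used by DDPM-style diffusion models. The noise schedule values are illustrative, not the ones any particular model ships with:

```python
import torch

# A toy linear noise schedule with 1,000 corruption steps (values are illustrative).
betas = torch.linspace(1e-4, 0.02, 1000)
alphas_cumprod = torch.cumprod(1.0 - betas, dim=0)

def add_noise(x0, t):
    """Forward (corruption) step in closed form:
    x_t = sqrt(alpha_bar_t) * x_0 + sqrt(1 - alpha_bar_t) * noise."""
    noise = torch.randn_like(x0)
    alpha_bar = alphas_cumprod[t]
    x_t = alpha_bar.sqrt() * x0 + (1.0 - alpha_bar).sqrt() * noise
    return x_t, noise

x0 = torch.randn(1, 3, 64, 64)      # stand-in for a training image
x_t, noise = add_noise(x0, t=500)   # halfway through the schedule

# Training teaches a network to predict `noise` from (x_t, t); generation then
# runs the reverse: start from pure noise and repeatedly remove the predicted noise.
```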
I could go on about diffusion models, but going further won't make much sense if you don't have experience in machine learning. The simplified description above is enough for you to use diffusion models, or the software built on them, to generate images.
Now, these diffusion models are widely used to generate images and videos, edit existing images or videos, create anime, etc. But one particular diffusion model is more popular than any other: the Stable Diffusion (SD) model.
Stable Diffusion, like any other diffusion model, is trained on huge datasets for months, but what makes it so popular is its ability to run on modest GPUs with 8 GB of VRAM (compared to other models) and still generate high-quality, photo-realistic images.
Websites like Midjourney, DALL-E 2, and Eluna use diffusion models to generate images based on prompts or other images.
There are many ways to use Stable Diffusion, but the most popular are:
The AUTOMATIC1111 web UI - locally: a GPU with at least 8 GB of VRAM (NVIDIA preferred)
A Colab notebook - using a Colab Pro subscription.
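If you prefer to script it yourself instead of using a UI, here is a minimal sketch using Hugging Face's diffusers library. It assumes you have a CUDA GPU and have installed diffusers, transformers, accelerate, and torch; the prompt is just an example:

```python
import torch
from diffusers import StableDiffusionPipeline

# Half precision keeps memory use around the 8 GB VRAM mark mentioned above.
pipe = StableDiffusionPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5", torch_dtype=torch.float16
).to("cuda")

image = pipe(
    "a photo-realistic portrait of an astronaut in a neon-lit city",
    num_inference_steps=30,   # more steps: slower, but usually cleaner
    guidance_scale=7.5,       # how strongly the prompt steers generation
).images[0]
image.save("astronaut.png")
```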
Now, Stable Diffusion is old. It was released on Aug 22, 2022, and by now it has had many updates, additions, and competitors. In the rest of the post, we will take a brief look at the most prominent features, extensions, and competitors of Stable Diffusion.
1. ControlNet
You may or may not have heard about ControlNet, but it made Stable Diffusion even more popular than it already was. ControlNet is pretty self-explanatory: it gives the user more control over the generated images. If you were using any AI image-generation software a few months back, you would have noticed that most of the time the images are not exactly the way we want them to be. Some aspects stick to the user's prompt, but many are generated randomly. Users had to generate a lot of images for the same prompt until one of them was good enough.
ControlNet achieves this by adding extra conditions to the image generation. To use ControlNet, you input an image along with the prompt. This image is used for 1. edge detection and 2. pose detection.
In both cases, ControlNet creates a new image from your reference image that contains the detected edges (Canny filtering) or pose (OpenPose) of the subject in the input image. This is then passed as an additional condition to the Stable Diffusion model. It helps generate images that follow your prompts but are fine-tuned by the added conditions. The cost is a higher run time: every time you generate an image, an additional model (ControlNet) runs alongside Stable Diffusion.
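Here is a hedged sketch of that flow with the diffusers library and the Canny-edge ControlNet. The file names and prompt are placeholders, and it assumes opencv-python and a CUDA GPU:

```python
import cv2
import numpy as np
import torch
from PIL import Image
from diffusers import ControlNetModel, StableDiffusionControlNetPipeline

# 1. Extract an edge map from the reference image with a Canny filter.
reference = np.array(Image.open("reference.png").convert("RGB"))
edges = cv2.Canny(reference, 100, 200)
control_image = Image.fromarray(np.stack([edges] * 3, axis=-1))  # 1 -> 3 channels

# 2. Attach the Canny ControlNet to a Stable Diffusion pipeline.
controlnet = ControlNetModel.from_pretrained(
    "lllyasviel/sd-controlnet-canny", torch_dtype=torch.float16
)
pipe = StableDiffusionControlNetPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5", controlnet=controlnet, torch_dtype=torch.float16
).to("cuda")

# 3. The edge map is passed as an extra condition alongside the text prompt.
result = pipe("a futuristic city at sunset", image=control_image).images[0]
result.save("controlled_output.png")
```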
Recently, there has been a breakthrough where a separate model is no longer needed. This new mode is called "reference-only control": the diffusion model itself creates a control map based on the input image.
Most of the AI videos you see on Instagram and TikTok are made using Stable Diffusion + ControlNet. The best one I've seen so far:
2. Deforum
The infinite-zoom style videos, like the one I posted on Instagram, are made using SD and Deforum. Unlike ordinary image generation, here a series of images is generated based on the input image, and these are played together to form a video. But if each image were something completely random and unrelated to the previous one, it would not be captivating. That is where Deforum comes into the picture: it uses SD to generate images that are similar to the previous generation yet still follow the prompts, so they can be stitched together into a seamless video. You can also add camera movements like translating horizontally or vertically, or zooming in or out. These videos mostly don't make any sense, but they are trippy and fun to watch.
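This is not Deforum's actual code, but the core idea can be sketched as an img2img loop: each new frame starts from a slightly zoomed copy of the previous frame, and a low denoising strength keeps consecutive frames coherent. The model, prompt, and frame count below are placeholders:

```python
import torch
from PIL import Image
from diffusers import StableDiffusionImg2ImgPipeline

pipe = StableDiffusionImg2ImgPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5", torch_dtype=torch.float16
).to("cuda")

frame = Image.open("init_frame.png").convert("RGB").resize((512, 512))

for i in range(60):                      # a few seconds of video at 15-30 fps
    w, h = frame.size
    # Fake a slow zoom-in: crop the previous frame slightly and scale it back up.
    zoomed = frame.crop((8, 8, w - 8, h - 8)).resize((w, h))
    # Low strength keeps the new frame close to the previous one, so the video stays coherent.
    frame = pipe(
        "surreal flowing landscape, vivid colors",
        image=zoomed, strength=0.45, guidance_scale=7.5,
    ).images[0]
    frame.save(f"frame_{i:04d}.png")     # stitch the frames with ffmpeg afterwards
```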
Some of my favorite videos made using Deforum:
3. Disco Diffusion / Warp Diffusion
Disco Diffusion is another diffusion model similar to Stable Diffusion. It is used to generate images from text or other images. Many diffusion models have tried to compete with Stable Diffusion, but only Disco Diffusion was deemed worthy. You can use Deforum with Disco Diffusion too. The most prominent version of Disco Diffusion is Warp Diffusion, which is used to create stylized videos. With plain SD, no matter how similar any two consecutive frames of a video are, they are not stylized in the same way, because their statistical properties differ. This causes a flickering effect that makes the videos look a bit jittery.
In Warp Diffusion, the first frame of a video is diffused as usual, as an image input with fixed skip steps. It is then warped with its flow map into the second frame and blended with the raw second frame of the original video. This way we get the style from the heavily stylized first frame (warped accordingly) and the content from the second frame (which reduces warping artifacts and prevents overexposure).
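As an illustration only (not WarpFusion's actual implementation), the warp-and-blend step can be sketched with OpenCV's dense optical flow; the blend ratio is an arbitrary assumption:

```python
import cv2
import numpy as np

def warp_and_blend(stylized_prev, raw_prev, raw_next, blend=0.3):
    """Carry the style of the previous stylized frame into the next raw frame."""
    prev_gray = cv2.cvtColor(raw_prev, cv2.COLOR_BGR2GRAY)
    next_gray = cv2.cvtColor(raw_next, cv2.COLOR_BGR2GRAY)

    # Backward flow (next -> prev) tells us, for each pixel of the next frame,
    # where it came from in the previous frame.
    flow = cv2.calcOpticalFlowFarneback(next_gray, prev_gray, None,
                                        0.5, 3, 15, 3, 5, 1.2, 0)
    h, w = flow.shape[:2]
    grid_x, grid_y = np.meshgrid(np.arange(w), np.arange(h))
    map_x = (grid_x + flow[..., 0]).astype(np.float32)
    map_y = (grid_y + flow[..., 1]).astype(np.float32)

    # Warp the stylized frame along the flow, then blend in the raw next frame.
    warped = cv2.remap(stylized_prev, map_x, map_y, cv2.INTER_LINEAR)
    return cv2.addWeighted(warped, 1.0 - blend, raw_next, blend, 0)

# The blended result is then used as the init image when diffusing the next frame.
```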
Some of my favorite videos made using Warp Diffusion:
There are many other extensions and features of SD that you can check out on the official website.
SD with Deforum: step-by-step tutorial
Download the .ipynb notebook at this link.
Subscribe to Colab Pro by following these steps.
You can either choose a built-in model or download a custom model from Civitai. The model I use is this one. If you decide to use a custom model, upload it to your Google Drive and paste the path into the notebook, following the steps from here.
Now it's time to change the settings. The first thing you need to do is change the animation mode to 3D. The guide for Deforum v0.6 can be found here; it has a detailed description of each setting. You can find my settings here.
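The exact option names differ between Deforum versions, so treat the snippet below as an illustration of the kind of settings you will be adjusting in the notebook, not as a copy-paste config:

```python
# Illustrative Deforum-style settings (names may differ in your notebook version).
animation_mode = "3D"           # switch from still images to a 3D-animated sequence
max_frames = 300                # total number of frames to generate
zoom = "0:(1.04)"               # keyframed schedules use frame_number:(value)
translation_z = "0:(10)"        # push the camera forward through the scene
rotation_3d_y = "0:(0.5)"       # slow horizontal pan
strength_schedule = "0:(0.65)"  # how much each frame reuses the previous one
```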
You can check out the full tutorial on using SD with Deforum.
If you don't want to spend any money on Colab Pro and have a PC with a GPU with more than 6 GB of VRAM, then check out this tutorial using AUTOMATIC1111.
That's it for this post. I will write more posts on new features and versions of SD. I will also start posting short-form videos on Instagram. Follow my IG to stay updated on AI.