Diffusion is the process of something spreading out—like dropping ink in water and watching it disperse. In image generation, we run this process in reverse. Start with pure noise (random pixels), then iteratively remove noise step by step until a coherent image emerges. It’s like having a block of marble covered in mud: each step wipes away a bit more mud until Michelangelo’s David appears.
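The iterative noise-removal loop can be sketched with a toy example. This is not the real DDPM math; it assumes a hypothetical "perfect denoiser" that already knows the clean target, just to show the shape of the loop: start from pure noise, estimate the noise, and subtract a fraction of it at each step.

```python
import numpy as np

# Toy sketch of reverse diffusion (illustrative, not real DDPM math):
# start from pure noise and repeatedly remove a bit of it, mimicking
# the step-by-step denoising loop described above.

rng = np.random.default_rng(0)
clean = np.array([1.0, -0.5, 0.25, 0.0])  # hypothetical tiny "image" (4 pixels)

x = rng.standard_normal(4)                 # start: pure noise
steps = 50
for t in range(steps):
    predicted_noise = x - clean            # a perfect denoiser's noise estimate
    x = x - predicted_noise / (steps - t)  # wipe away one fraction of the noise

print(np.allclose(x, clean))               # the "statue" has emerged from the noise
```

In the real model, `predicted_noise` comes from a trained neural network rather than the known target, but the loop structure is the same: many small denoising steps, each conditioned on the current noise level.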
Stable Diffusion works in latent space, a compressed mathematical representation of images rather than raw pixels. This makes it fast enough to run on consumer GPUs. During training, the model sees millions of images with captions, learning what “cat” or “sunset” looks like at each noise level. When you type “a cat in a spacesuit,” the model knows which patterns to un-noise toward that concept.
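A quick back-of-the-envelope calculation shows why latent space matters. In Stable Diffusion v1, the VAE downsamples a 512×512 RGB image by a factor of 8 per side into a 64×64 latent with 4 channels, so the denoiser works on far fewer values:

```python
# Why latent space is cheap: compare values per image in pixel space
# vs. Stable Diffusion v1's 64x64x4 latent space (VAE downsamples 8x per side).
pixel_elems = 512 * 512 * 3    # 786,432 values in pixel space
latent_elems = 64 * 64 * 4     # 16,384 values in latent space

print(pixel_elems // latent_elems)  # → 48
```

Every denoising step touches roughly 48× fewer numbers than it would in pixel space, which is the difference between needing a datacenter GPU and running on a consumer card.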
Despite what the name suggests, the “stable” part doesn’t refer to a stabilization technique: it comes from Stability AI, the company that backed the model’s development. What actually keeps the process controllable is the text conditioning and guidance applied at every denoising step; without that steering, diffusion can wander off into incoherent outputs. The result is a generator that can produce anything from photorealistic scenes to anime art, all by intelligently reversing noise, one step at a time.
