Stable Diffusion, in plain English

The model behind most modern open-source AI image tools — what it is, how it works, how it compares to Midjourney and FLUX, and how to use it for free in your browser.

What is Stable Diffusion?

Stable Diffusion is a family of open-source AI models that turn text prompts into images. It was first released in August 2022 by Stability AI in collaboration with researchers at LMU Munich (the CompVis group) and Runway, and it's the model that kicked off the modern wave of consumer AI image generation.

What makes Stable Diffusion different from Midjourney, DALL·E, Imagen and the other big-name image generators is that the model weights are public. Anyone can download them, run them on their own hardware, fine-tune them on their own data, and ship products built on top of them. That single decision — open weights — created an entire ecosystem of fine-tunes (like Juggernaut XL), LoRAs, ControlNets, and front-ends that no closed model has matched.

How does it work?

Stable Diffusion is a latent diffusion model. The short version: it starts with pure random noise and, over a series of denoising steps (usually 20–30), gradually shapes that noise into an image that matches your prompt.

In more detail, the pipeline has three parts. A text encoder (CLIP; SD3 adds a T5 transformer alongside it) turns your prompt into embeddings, numeric vectors that capture its meaning. A U-Net (or in SD3, a Diffusion Transformer) takes random noise plus those embeddings and predicts what noise to remove at each step. Repeat that 20–30 times and the noise resolves into a coherent image, but in a compressed latent space, not at full resolution. Finally, a VAE decoder turns the final latent into the full-resolution image you see.
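To make the control flow concrete, here is a toy sketch of those three stages in plain Python. It is not a diffusion model: the functions encode_text, predict_noise and decode_latent are hypothetical stand-ins that work on four numbers instead of real tensors, and the "noise prediction" is simply the gap between the current latent and the text vector. Only the shape of the loop (encode once, denoise repeatedly, decode once) reflects the real pipeline.

```python
import random


def encode_text(prompt):
    """Stand-in for the text encoder: map a prompt to a small vector of numbers."""
    rng = random.Random(prompt)  # seeded by the prompt, so the result is repeatable
    return [rng.uniform(-1.0, 1.0) for _ in range(4)]


def predict_noise(latent, text_vector):
    """Stand-in for the U-Net: treat the gap to the text vector as the noise."""
    return [x - t for x, t in zip(latent, text_vector)]


def decode_latent(latent):
    """Stand-in for the VAE decoder: map values in [-1, 1] to 0-255 pixel values."""
    return [round((max(-1.0, min(1.0, x)) + 1.0) * 127.5) for x in latent]


def generate(prompt, steps=25, seed=0):
    """Encode the prompt, denoise a random latent step by step, then decode it."""
    text_vector = encode_text(prompt)
    rng = random.Random(seed)
    latent = [rng.gauss(0.0, 1.0) for _ in text_vector]  # start from pure noise
    for step in range(steps):
        noise = predict_noise(latent, text_vector)
        fraction = 1.0 / (steps - step)  # remove a larger share as steps run out
        latent = [x - fraction * n for x, n in zip(latent, noise)]
    return decode_latent(latent)


print(generate("a red fox in snow", steps=25))
```

In a real model the noise predictor is a large neural network and the step sizes come from a noise scheduler, so treat this purely as a map of which component runs when.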

Doing the heavy lifting in latent space is the trick that makes Stable Diffusion fast enough to run on a single consumer GPU. The latents are 8× smaller than the output image along each side, so each denoising step covers about 64× (8 × 8) fewer spatial positions than it would at full resolution.
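The arithmetic is easy to check for yourself. The sketch below computes the latent size for a given output size; the function names are ours, and the default of 4 latent channels is an assumption that holds for SD 1.5 and SDXL (SD3 uses more).

```python
def latent_shape(width, height, downscale=8, latent_channels=4):
    """Return (channels, height, width) of the latent for an output image.

    downscale=8 is the per-side factor from the text; latent_channels=4 is
    an assumption that matches SD 1.5 and SDXL.
    """
    if width % downscale or height % downscale:
        raise ValueError("width and height must be multiples of the downscale factor")
    return (latent_channels, height // downscale, width // downscale)


def spatial_reduction(width, height, downscale=8):
    """How many times fewer spatial positions the latent has than the image."""
    _, lat_h, lat_w = latent_shape(width, height, downscale)
    return (width * height) // (lat_w * lat_h)


print(latent_shape(1024, 1024))       # (4, 128, 128)
print(spatial_reduction(1024, 1024))  # 64
```

So a 1024×1024 SDXL image is denoised as a 128×128 grid, and only the final VAE decode works at full resolution.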

The Stable Diffusion versions

There have been four major generations. They differ in architecture, text encoders, and native resolution, not only in size.

| Version | Released | Native res. | Architecture | Status |
|---|---|---|---|---|
| SD 1.5 | Oct 2022 | 512×512 | U-Net + CLIP | Mature, huge fine-tune ecosystem, still widely used |
| SD 2.x | Nov 2022 | 768×768 | U-Net + OpenCLIP | Effectively skipped: tightened filters broke prompt compatibility |
| SDXL | July 2023 | 1024×1024 | Larger U-Net + dual text encoders | Current commercial standard for open-weight image tools |
| SD3 / 3.5 | 2024–2025 | 1024×1024 | Diffusion Transformer (MMDiT) | Stronger prompt comprehension and text rendering; smaller ecosystem so far |

The version that matters most in practice is SDXL. It's the resolution and architecture the bulk of the open-source ecosystem is built on, including Juggernaut XL. SD 1.5 is still everywhere despite being three years old, because the fine-tune library on Civitai is enormous and it runs on almost any GPU. SD 2.x was effectively skipped by the community. SD3 is impressive on prompt comprehension and text rendering, but the ecosystem is smaller and the original licence terms made some commercial users cautious.

Stable Diffusion vs Midjourney, DALL·E and FLUX

Each of the major image models makes different tradeoffs.

Midjourney is closed and runs on subscription. The output is consistently beautiful out of the box with very little prompt engineering needed; its default aesthetic is the strongest of any model. The cost: you can't run it yourself, you can't fine-tune it, and the content licence depends on your tier.

DALL·E 3 (via ChatGPT) is closed, integrated into ChatGPT's prompt-rewriting layer, and excellent at following long, complex prompts. It is weaker on photorealism than SDXL fine-tunes and FLUX.

FLUX (by Black Forest Labs, founded by ex-Stability researchers) is the newest serious contender. FLUX Pro is closed and API-only; FLUX.1 [schnell] and [dev] are open-weight. FLUX has better text rendering and stronger prompt comprehension than SDXL, at the cost of a larger model: typically 16+ GB of VRAM versus SDXL's 8 GB.

Stable Diffusion (and SDXL fine-tunes specifically) wins on photorealism per dollar, ecosystem maturity, and the fact that you can actually own and modify the model. For photographic-style work at sensible hardware requirements, an SDXL fine-tune like Juggernaut XL is hard to beat.

Fine-tunes: where Juggernaut XL fits

The base SDXL model is competent at everything and excellent at nothing. The interesting work in the SDXL ecosystem happens in fine-tunes: people take the base model and continue training it on a specialised dataset to push its output in a particular direction.

The major SDXL fine-tunes you'll see referenced:

  • Juggernaut XL (by RunDiffusion) — photorealism. The most-downloaded SDXL fine-tune by a large margin.
  • RealVisXL — also photorealism, slightly different aesthetic.
  • DreamShaper XL — semi-realistic, artistic illustration, fantasy.
  • AnimagineXL — anime and stylised character art.
  • Pony Diffusion XL — character work with strong prompt adherence.

We run Juggernaut XL Lightning on the homepage generator because it's the best general-purpose photorealistic option in the SDXL ecosystem and the Lightning variant lets us serve generations in 6–12 seconds without giving up much quality. See the full tool roadmap for what's coming next.

What is Juggernaut XL?

Juggernaut XL is a fine-tune of SDXL by RunDiffusion, trained specifically to fix the things stock SDXL gets wrong in photographic work — plasticky skin, flat lighting, lifeless eyes, broken hands. Among open-source SDXL models, it's widely considered the strongest general-purpose option for photorealism. Current version at time of writing is v10 / Ragnarok.

For the full breakdown and a free playground, see the Juggernaut XL generator.

Lightning, Turbo, LCM: the speed variants

Standard SDXL needs 25–30 diffusion steps to converge on a clean image. That's slow for an interactive web tool — eight to fifteen seconds even on fast GPUs.

Several distillation techniques produce variants that converge in 4–8 steps instead:

  • SDXL Lightning (ByteDance) — distillation using progressive adversarial training. Quality holds up well at 4 steps, nearly identical at 8.
  • SDXL Turbo (Stability AI) — adversarial distillation aimed at single-step inference. Faster but visibly lower quality than Lightning at equivalent step counts.
  • LCM (Latent Consistency Models) — a different distillation method that works as a LoRA you can apply to any SDXL checkpoint, including fine-tunes.

We run the Juggernaut Lightning SDXL variant, which combines the Juggernaut XL fine-tune with Lightning distillation. Same Juggernaut look, generated in roughly a third of the time.

How to write good Stable Diffusion prompts

Stable Diffusion responds well to specific, structured prompts. Vague prompts get vague results; specific prompts get usable ones.

A workable template for photorealistic work:

[subject], [pose or action], [setting], [camera/lens], [lighting], [style references]

For example: “A woman in her 40s laughing in a sunlit kitchen, mid-action, 50mm lens at f/1.8, golden hour through a window, photojournalism style, sharp focus on eyes.”
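If you generate prompts from code, the template maps naturally onto a small helper. This is a minimal sketch: build_prompt and its parameter names are our own invention, not part of any Stable Diffusion tool, and it simply joins whichever fields you fill in, in template order.

```python
def build_prompt(subject, action="", setting="", camera="", lighting="", style=""):
    """Join the non-empty template fields, in template order, with commas."""
    parts = [subject, action, setting, camera, lighting, style]
    return ", ".join(p.strip() for p in parts if p and p.strip())


prompt = build_prompt(
    subject="a woman in her 40s",
    action="laughing, mid-action",
    setting="sunlit kitchen",
    camera="50mm lens at f/1.8",
    lighting="golden hour through a window",
    style="photojournalism style, sharp focus on eyes",
)
print(prompt)
```

Keeping the fields separate makes it easy to vary one thing at a time, for example swapping the lens or the lighting while holding the subject constant.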

Things SDXL-family models respond well to: focal lengths (35mm, 50mm, 85mm), film stocks (Portra 400, Kodak Gold), lighting descriptors (golden hour, soft window light, harsh midday sun), and shot types (wide, medium, close-up). Things to avoid: long strings of vague adjectives (“beautiful, amazing, stunning, gorgeous”), and contradictions (“cartoonish realistic photograph”).

A short negative prompt usually helps for portraits: “blurry, low quality, deformed hands, extra fingers, watermark, text.” Don't stuff it — long negative prompts hurt more than they help. If faces come out slightly blurry or warped even after prompting, our upcoming Face Restoration tool is designed to fix exactly that in one click.

Hardware: do you need a GPU?

To run Stable Diffusion locally: yes, and a reasonable one. SDXL needs around 8 GB of VRAM to run comfortably. SD 1.5 will run on 4 GB. SD3.5 Large wants 16+ GB. FLUX is even heavier.
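As a quick rule of thumb, those figures can be turned into a lookup. The sketch below uses only the approximate numbers quoted above; the names MIN_VRAM_GB and models_that_fit are illustrative, and real memory use also depends on resolution, precision, and the front-end you run.

```python
# Approximate VRAM needs in GB, taken from the figures in the text above.
MIN_VRAM_GB = {
    "SD 1.5": 4,
    "SDXL": 8,
    "SD3.5 Large": 16,
}


def models_that_fit(vram_gb):
    """Return the model names whose stated VRAM need is within vram_gb."""
    return [name for name, need in MIN_VRAM_GB.items() if vram_gb >= need]


print(models_that_fit(6))   # ['SD 1.5']
print(models_that_fit(12))  # ['SD 1.5', 'SDXL']
```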

If you don't have a suitable GPU, you have three options: pay for cloud GPU time (RunPod, Vast.ai), use a free demo on Hugging Face Spaces (limited, often queued), or use a hosted wrapper like this one — free trial, no setup, no waiting for downloads. We also have a step-by-step guide on how to use Stable Diffusion if you're just getting started.

Is Stable Diffusion free? Licensing explained

The model itself is free to download and use. SD 1.5 and SDXL are released under the CreativeML Open RAIL-M licence and its RAIL++-M variant, which allow commercial use, modification, and redistribution. The restrictions in these licences are around harmful use (CSAM, defamation, etc.) rather than commercial terms.

SD3 was initially released under more restrictive terms, which slowed its adoption. The terms were later relaxed to the Stability AI Community Licence, which is free for commercial use below $1M in annual revenue.

Juggernaut XL specifically is under CreativeML Open RAIL++-M and is free for commercial use. You own what you generate.

Try Stable Diffusion in your browser

You don't need a GPU or a Python install to try Stable Diffusion. Open the Text2Pixel playground and generate your first image in your browser. The trial works with no signup. After that, free credits are granted on signup; pay-as-you-go credit packs are available if you need more — no subscription. If you want a deeper dive, our guide to what Stable Diffusion is covers the technical side in more detail.

Try it now — no install, no signup required

Our free Juggernaut XL playground runs in your browser. One free generation with no account; 10 free credits when you sign up.

Open the Juggernaut XL generator →

Frequently asked questions

Is Stable Diffusion free to use?

Yes. The model weights are released under permissive open-source licences (CreativeML Open RAIL-M and variants) that allow commercial use. Running the model costs compute — either your own GPU electricity or a hosted service.

Is Stable Diffusion better than Midjourney?

Different tradeoffs. Midjourney has a stronger out-of-the-box aesthetic and needs less prompt engineering. Stable Diffusion (and SDXL fine-tunes like Juggernaut) win on photorealism, customisation, ownership of the model, and price.

What's the difference between Stable Diffusion and SDXL?

SDXL is the third major generation of Stable Diffusion, released in July 2023. It generates images at 1024×1024 natively (versus 512×512 for SD 1.5) and produces significantly better anatomy, hands, and text. 'Stable Diffusion' is the family; SDXL is one version of it.

Can I use Stable Diffusion images commercially?

Yes. The CreativeML Open RAIL-M licence and its variants permit commercial use. You own what you generate; standard restrictions on illegal content still apply.

Who made Stable Diffusion?

The original model was developed by researchers at LMU Munich (the CompVis group) and Runway, with compute and funding from Stability AI. It was released in August 2022. Subsequent versions have been led by Stability AI.

Why does Stable Diffusion struggle with hands?

Hands are visually complex and underrepresented in many training datasets. SDXL improved on this substantially over SD 1.5, and modern fine-tunes like Juggernaut XL improve it further, but it's still the most common failure mode. Negative prompts that include 'deformed hands, extra fingers' help.

Can Stable Diffusion generate text in images?

SDXL is poor at it; SD3 and FLUX are much better. If you specifically need legible text in an image, FLUX is the current open-weight option to try.

What hardware do I need to run Stable Diffusion locally?

For SDXL: ideally 8 GB of VRAM (RTX 3060, RTX 4060, or an M-series Mac with 16 GB+ unified memory). For SD 1.5: 4 GB is enough. For SD3.5 Large: 16+ GB. If you don't have a suitable GPU, a hosted service removes the requirement entirely.

What is a fine-tune?

A model that started from a base Stable Diffusion checkpoint and was trained further on a more specific dataset to push its output in a particular direction. Juggernaut XL, for example, is SDXL fine-tuned for photorealism.

What is a LoRA?

A LoRA (Low-Rank Adaptation) is a small file (typically 50–200 MB) that modifies a base model's behaviour without requiring a full re-train. LoRAs are how you add specific characters, styles, or concepts to a model without distributing a full new checkpoint.
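The small file size follows from the low-rank trick itself. For one weight matrix, LoRA leaves the original weights frozen and learns two thin matrices whose product is added on top. The sketch below counts parameters for a single layer; the 1280×1280 layer size and rank 16 are illustrative assumptions, not figures from any specific checkpoint.

```python
def lora_param_counts(d_out, d_in, rank):
    """Parameter counts for one weight matrix and its rank-`rank` LoRA update.

    LoRA keeps the original d_out x d_in matrix W frozen and learns two small
    matrices, B (d_out x rank) and A (rank x d_in), so the adapted weight is
    W + B @ A.
    """
    full = d_out * d_in
    lora = rank * (d_out + d_in)
    return full, lora


full, lora = lora_param_counts(d_out=1280, d_in=1280, rank=16)
print(full, lora, full / lora)  # 1638400 40960 40.0
```

In this example the LoRA update stores 40× fewer numbers than the layer it modifies, which is why a LoRA can ship as a small add-on rather than a full checkpoint.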

How many steps should I use?

For standard SDXL: 25–30. For SDXL Lightning: 4–8. For SDXL Turbo: 1–4. Going beyond the recommended range gives diminishing returns and can introduce artefacts.
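If you expose a step slider in your own tool, one simple safeguard is to clamp the requested value to the recommended range for the variant. A minimal sketch, using only the ranges quoted above; the variant keys and the clamp_steps helper are hypothetical names of our own.

```python
# Recommended step ranges per variant, as quoted in the answer above.
STEP_RANGES = {
    "sdxl": (25, 30),
    "sdxl-lightning": (4, 8),
    "sdxl-turbo": (1, 4),
}


def clamp_steps(variant, requested):
    """Pull a requested step count into the recommended range for a variant."""
    low, high = STEP_RANGES[variant]
    return max(low, min(high, requested))


print(clamp_steps("sdxl-lightning", 30))  # 8
print(clamp_steps("sdxl", 10))            # 25
print(clamp_steps("sdxl-turbo", 2))       # 2
```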

Can I run Stable Diffusion on a Mac?

Yes — Apple Silicon Macs run SDXL well, especially with 16+ GB of unified memory. Tools like Draw Things, DiffusionBee and ComfyUI all have Mac builds.