In this tutorial, we design a practical image-generation workflow using the Diffusers library. We begin by stabilizing the environment, then generate high-quality images from text prompts using Stable Diffusion with an optimized scheduler. We accelerate inference with a LoRA-based latent consistency method, guide composition with ControlNet under edge conditioning, and finally perform localized edits via inpainting. Throughout, we focus on real-world techniques that balance image quality, speed, and controllability.
!pip -q uninstall -y pillow Pillow || true
!pip -q install --upgrade --force-reinstall "pillow<12.0"
!pip -q install --upgrade diffusers transformers accelerate safetensors huggingface_hub opencv-python

import os, math, random
import torch
import numpy as np
import cv2
from PIL import Image, ImageDraw, ImageFilter
from diffusers import (
    StableDiffusionPipeline,
    StableDiffusionInpaintPipeline,
    ControlNetModel,
    StableDiffusionControlNetPipeline,
    UniPCMultistepScheduler,
)
We prepare a clean and compatible runtime by resolving dependency conflicts and installing all required libraries. We ensure image processing works reliably by pinning a compatible Pillow version and loading the Diffusers ecosystem. We also import all core modules needed for generation, control, and inpainting workflows.
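As an optional sanity check (not part of the original notebook), we can print the installed library versions and confirm GPU visibility before moving on, which makes dependency problems obvious early:

# Optional sanity check: confirm key library versions and GPU availability.
import PIL, diffusers, transformers

print("Pillow:", PIL.__version__)              # should be < 12.0 after the pin above
print("diffusers:", diffusers.__version__)
print("transformers:", transformers.__version__)
print("CUDA available:", torch.cuda.is_available())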
def seed_everything(seed=42):
    random.seed(seed)
    np.random.seed(seed)
    torch.manual_seed(seed)
    torch.cuda.manual_seed_all(seed)

def to_grid(images, cols=2, bg=255):
    if isinstance(images, Image.Image):
        images = [images]
    w, h = images[0].size
    rows = math.ceil(len(images) / cols)
    grid = Image.new("RGB", (cols * w, rows * h), (bg, bg, bg))
    for i, im in enumerate(images):
        grid.paste(im, ((i % cols) * w, (i // cols) * h))
    return grid

device = "cuda" if torch.cuda.is_available() else "cpu"
dtype = torch.float16 if device == "cuda" else torch.float32
print("device:", device, "| dtype:", dtype)
We define utility functions to ensure reproducibility and to organize visual outputs efficiently. We set global random seeds so our generations remain consistent across runs. We also detect the available hardware and configure precision to optimize performance on the GPU or CPU.
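A tiny illustration (not in the original script) of the grid helper, using two solid-color placeholder images, confirms it behaves as expected before we feed it real generations:

# Quick check of to_grid with placeholder images: 2 images, 2 columns -> 1 row.
demo = to_grid([Image.new("RGB", (64, 64), "red"), Image.new("RGB", (64, 64), "blue")], cols=2)
print(demo.size)  # expected: (128, 64)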
seed_everything(7)

BASE_MODEL = "runwayml/stable-diffusion-v1-5"

pipe = StableDiffusionPipeline.from_pretrained(
    BASE_MODEL,
    torch_dtype=dtype,
    safety_checker=None,
).to(device)

pipe.scheduler = UniPCMultistepScheduler.from_config(pipe.scheduler.config)

if device == "cuda":
    pipe.enable_attention_slicing()
    pipe.enable_vae_slicing()

prompt = "a cinematic photo of a futuristic street market at dusk, ultra-detailed, 35mm, volumetric lighting"
negative_prompt = "blurry, low quality, deformed, watermark, text"

img_text = pipe(
    prompt=prompt,
    negative_prompt=negative_prompt,
    num_inference_steps=25,
    guidance_scale=6.5,
    width=768,
    height=512,
).images[0]
We initialize the base Stable Diffusion pipeline and switch to the more efficient UniPC scheduler. We generate a high-quality image directly from a text prompt using carefully chosen guidance and resolution settings. This establishes a strong baseline for subsequent improvements in speed and control.
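If we want an individual generation to be repeatable regardless of what has run before it, we can also pass a seeded torch.Generator to the call. The short sketch below is an optional addition to the walkthrough; the seed value is arbitrary:

# Optional sketch: a per-call generator pins the noise for this specific
# generation, so rerunning the cell reproduces the same image.
gen = torch.Generator(device=device).manual_seed(1234)
img_repeatable = pipe(
    prompt=prompt,
    negative_prompt=negative_prompt,
    num_inference_steps=25,
    guidance_scale=6.5,
    width=768,
    height=512,
    generator=gen,
).images[0]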
LCM_LORA = "latent-consistency/lcm-lora-sdv1-5"
pipe.load_lora_weights(LCM_LORA)

try:
    pipe.fuse_lora()
    lora_fused = True
except Exception as e:
    lora_fused = False
    print("LoRA fuse skipped:", e)

fast_prompt = "a clean product photo of a minimal smartwatch on a reflective surface, studio lighting"
fast_images = []
for steps in [4, 6, 8]:
    fast_images.append(
        pipe(
            prompt=fast_prompt,
            negative_prompt=negative_prompt,
            num_inference_steps=steps,
            guidance_scale=1.5,
            width=768,
            height=512,
        ).images[0]
    )

grid_fast = to_grid(fast_images, cols=3)
print("LoRA fused:", lora_fused)
W, H = 768, 512
layout = Image.new("RGB", (W, H), "white")
draw = ImageDraw.Draw(layout)
draw.rectangle([40, 80, 340, 460], outline="black", width=6)
draw.ellipse([430, 110, 720, 400], outline="black", width=6)
draw.line([0, 420, W, 420], fill="black", width=5)

edges = cv2.Canny(np.array(layout), 80, 160)
edges = np.stack([edges] * 3, axis=-1)
canny_image = Image.fromarray(edges)

CONTROLNET = "lllyasviel/sd-controlnet-canny"
controlnet = ControlNetModel.from_pretrained(
    CONTROLNET,
    torch_dtype=dtype,
).to(device)

cn_pipe = StableDiffusionControlNetPipeline.from_pretrained(
    BASE_MODEL,
    controlnet=controlnet,
    torch_dtype=dtype,
    safety_checker=None,
).to(device)

cn_pipe.scheduler = UniPCMultistepScheduler.from_config(cn_pipe.scheduler.config)
if device == "cuda":
    cn_pipe.enable_attention_slicing()
    cn_pipe.enable_vae_slicing()

cn_prompt = "a modern cafe interior, architectural render, soft daylight, high detail"

img_controlnet = cn_pipe(
    prompt=cn_prompt,
    negative_prompt=negative_prompt,
    image=canny_image,
    num_inference_steps=25,
    guidance_scale=6.5,
    controlnet_conditioning_scale=1.0,
).images[0]
We accelerate inference by loading and fusing a LoRA adapter and demonstrate fast sampling with only a few diffusion steps. We then construct a structural conditioning image and apply ControlNet to guide the layout of the generated scene. This lets us preserve composition while still benefiting from creative text guidance.
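To see how strongly the edge map constrains the result, we can sweep controlnet_conditioning_scale and compare the outputs with the grid helper from earlier. This is an optional experiment, not part of the original walkthrough, and the scale values are illustrative:

# Optional sketch: lower conditioning scales follow the Canny layout loosely,
# higher scales follow it strictly.
cn_sweep = []
for scale in [0.5, 1.0, 1.5]:
    cn_sweep.append(
        cn_pipe(
            prompt=cn_prompt,
            negative_prompt=negative_prompt,
            image=canny_image,
            num_inference_steps=25,
            guidance_scale=6.5,
            controlnet_conditioning_scale=scale,
        ).images[0]
    )
grid_cn = to_grid(cn_sweep, cols=3)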
mask = Image.new("L", img_controlnet.size, 0)
mask_draw = ImageDraw.Draw(mask)
mask_draw.rectangle([60, 90, 320, 170], fill=255)
mask = mask.filter(ImageFilter.GaussianBlur(2))

inpaint_pipe = StableDiffusionInpaintPipeline.from_pretrained(
    BASE_MODEL,
    torch_dtype=dtype,
    safety_checker=None,
).to(device)

inpaint_pipe.scheduler = UniPCMultistepScheduler.from_config(inpaint_pipe.scheduler.config)
if device == "cuda":
    inpaint_pipe.enable_attention_slicing()
    inpaint_pipe.enable_vae_slicing()

inpaint_prompt = "a glowing neon sign that says 'CAFÉ', cyberpunk style, realistic lighting"

img_inpaint = inpaint_pipe(
    prompt=inpaint_prompt,
    negative_prompt=negative_prompt,
    image=img_controlnet,
    mask_image=mask,
    num_inference_steps=30,
    guidance_scale=7.0,
).images[0]

os.makedirs("outputs", exist_ok=True)
img_text.save("outputs/text2img.png")
grid_fast.save("outputs/lora_fast_grid.png")
layout.save("outputs/layout.png")
canny_image.save("outputs/canny.png")
img_controlnet.save("outputs/controlnet.png")
mask.save("outputs/mask.png")
img_inpaint.save("outputs/inpaint.png")
print("Saved outputs:", sorted(os.listdir("outputs")))
print("Done.")
We create a mask to isolate a specific region and apply inpainting to modify only that part of the image. We refine the selected area using a targeted prompt while keeping the rest intact. Finally, we save all intermediate and final outputs to disk for inspection and reuse.
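As a small optional extra (not in the original script), we can stitch the ControlNet result and the inpainted result side by side with the grid helper to review the edit at a glance:

# Optional sketch: a two-column before/after comparison of the inpainting step.
comparison = to_grid([img_controlnet, img_inpaint], cols=2)
comparison.save("outputs/inpaint_comparison.png")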
In conclusion, we demonstrated how a single Diffusers pipeline can evolve into a flexible, production-ready image generation system. We showed how to move from pure text-to-image generation to fast sampling, structural control, and targeted image editing without changing frameworks or tooling. This tutorial highlights how we can combine schedulers, LoRA adapters, ControlNet, and inpainting to create controllable and efficient generative pipelines that are easy to extend for more advanced creative or applied use cases.

