Stable Diffusion XL Just Leveled Up!
So essentially,
Segmind diffusion models are distilled and performant! 🚀
Paper: Progressive Knowledge Distillation Of Stable Diffusion XL Using Layer Level Loss (9 pages)
Researchers from Segmind and HuggingFace are interested in improving the performance and quality of diffusion models. Diffusion models have revolutionized the field of text-to-image (T2I) synthesis by enabling the generation of high-fidelity images from text descriptions. However, these models are computationally expensive, limiting their accessibility and applicability.
Hmm.. What’s the background?
The architecture of Stable Diffusion is a U-Net that employs iterative sampling to progressively denoise a random latent code. This means that with more sampling steps, we get a “clearer” denoised output.
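To make the idea concrete, here is a toy sketch of that iterative loop (not the real SDXL U-Net — the `predict_noise` stand-in and step size are purely illustrative): each step removes a fraction of the estimated noise, so more steps leave a smaller residual.

```python
import numpy as np

rng = np.random.default_rng(0)

clean_latent = np.zeros((4, 8, 8))                           # hypothetical target latent
latent = clean_latent + rng.normal(size=clean_latent.shape)  # start from pure noise

def predict_noise(latent, clean):
    """Stand-in for the U-Net's noise prediction: here we cheat and
    return the true residual, which a trained model only approximates."""
    return latent - clean

num_steps = 20
for step in range(num_steps):
    noise_estimate = predict_noise(latent, clean_latent)
    latent = latent - 0.3 * noise_estimate   # remove a fraction of the noise each step

residual = float(np.abs(latent - clean_latent).mean())
print(f"mean residual after {num_steps} steps: {residual:.4f}")
```

Each pass shrinks the remaining noise by a constant factor here, which is why more sampling steps yield a “clearer” latent.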
Previously, distillation techniques have been applied to pre-trained diffusion models to curtail the number of denoising steps, resulting in identically structured models with reduced sampling requirements. Additionally, removing architectural elements in large diffusion models has also been investigated for the base U-Net model.
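The step-reduction idea can be sketched in a linear toy model (in the spirit of progressive distillation generally, not this paper's exact recipe — the rates below are chosen for illustration): a student step is fit to reproduce two teacher steps, halving the sampling budget.

```python
import numpy as np

def teacher_step(x, rate=0.3):
    """One teacher denoising step: remove a fixed fraction of the residual."""
    return x * (1.0 - rate)

# Two teacher steps shrink the residual by (1 - 0.3)^2 = 0.49, so a single
# student step with rate 0.51 reproduces them exactly in this linear toy.
def student_step(x, rate=0.51):
    return x * (1.0 - rate)

x0 = np.ones((4, 4))
teacher_out = teacher_step(teacher_step(x0))   # two teacher steps
student_out = student_step(x0)                 # one distilled student step
gap = float(np.abs(teacher_out - student_out).max())
print(f"max teacher/student gap: {gap:.6f}")
```

In a real diffusion model the mapping is nonlinear, so the student only approximates the teacher's two steps, but the structure of the training target is the same.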
Ok, So what is proposed in the research paper?
To address this challenge, the researchers explored various techniques for compressing diffusion models, including architectural pruning, quantization, and knowledge distillation. They then applied knowledge distillation to Stable Diffusion XL (SDXL), the largest and most powerful open-source text-to-image diffusion model, to create two compact and efficient variants: Segmind Stable Diffusion (SSD-1B) and Segmind-Vega.
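A minimal sketch of the layer-level distillation loss named in the paper's title (function and array names are illustrative, not the authors' code): the student is trained to match the teacher's intermediate U-Net feature maps layer by layer, not just the final output.

```python
import numpy as np

def layer_level_loss(teacher_feats, student_feats):
    """Sum of per-layer MSE between teacher and student feature maps."""
    return sum(
        float(np.mean((t - s) ** 2))
        for t, s in zip(teacher_feats, student_feats)
    )

rng = np.random.default_rng(0)
# Fake feature maps standing in for intermediate U-Net activations.
teacher_feats = [rng.normal(size=(8, 16, 16)) for _ in range(3)]
student_feats = [f + 0.1 * rng.normal(size=f.shape) for f in teacher_feats]

loss = layer_level_loss(teacher_feats, student_feats)
print(f"layer-level loss: {loss:.4f}")
```

Supervising intermediate layers gives the smaller student a much denser training signal than matching only the final denoised output.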
These models achieve significant reductions in model size and latency while maintaining competitive generative quality with the original SDXL model.
And what’s next?
The training time for the distilled models is still relatively long, although it is significantly shorter than for the original SDXL model. Also, the distillation process requires a large amount of data, which may not be available to all users. The researchers acknowledge some limitations around generating text, hands, and full-body shots.
For the next steps, it is possible to extend these techniques to other large models such as LLMs and MLMs.