BK-SDM: A Lightweight, Fast, and Cheap Version of Stable Diffusion

Kim, Bo-Kyeong; Song, Hyoung-Kyu; Castells, Thibault; Choi, Shinkook

doi:10.1007/978-3-031-72949-2_22

Part of the book series: Lecture Notes in Computer Science (LNCS, volume 15112)

Included in the following conference series: European Conference on Computer Vision

Abstract

Text-to-image (T2I) generation with Stable Diffusion models (SDMs) involves high computing demands due to billion-scale parameters. To enhance efficiency, recent studies have reduced sampling steps and applied network quantization while retaining the original architectures. The lack of architectural reduction attempts may stem from worries over expensive retraining for such massive models. In this work, we uncover the surprising potential of block pruning and feature distillation for low-cost general-purpose T2I. By removing several residual and attention blocks from the U-Net of SDMs, we achieve a 30% to 50% reduction in model size, MACs, and latency. We show that distillation retraining is effective even under limited resources: using only 13 A100 days and a tiny dataset, our compact models can imitate the original SDMs (v1.4 and v2.1-base with over 6,000 A100 days). Benefiting from the transferred knowledge, our BK-SDMs deliver competitive results on zero-shot MS-COCO against larger multi-billion parameter models. We further demonstrate the applicability of our lightweight backbones in personalized generation and image-to-image translation. Deployment of our models on edge devices attains 4-second inference. Code and models can be found at: https://github.com/Nota-NetsPresso/BK-SDM.
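
To make the idea of distillation retraining concrete, the following is a minimal sketch and not the authors' implementation. The abstract does not spell out the objective, so the sketch assumes a common form: a denoising task loss, an output-level term that matches the student's noise prediction to the teacher's, and a feature-level term summed over matched blocks. The function names, the weights `w_out` and `w_feat`, and the use of flat Python lists in place of tensors are all illustrative assumptions.

```python
from typing import Sequence

Vector = Sequence[float]


def mse(a: Vector, b: Vector) -> float:
    """Mean squared error between two equally sized vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    return sum((x - y) ** 2 for x, y in zip(a, b)) / len(a)


def distillation_loss(
    noise: Vector,                      # ground-truth noise for the denoising task
    student_out: Vector,                # student (pruned U-Net) noise prediction
    teacher_out: Vector,                # teacher (original U-Net) noise prediction
    student_feats: Sequence[Vector],    # student feature maps at matched blocks
    teacher_feats: Sequence[Vector],    # teacher feature maps at the same blocks
    w_out: float = 1.0,                 # assumed weight of the output-level term
    w_feat: float = 1.0,                # assumed weight of the feature-level term
) -> float:
    """Task loss plus output-level and feature-level distillation terms."""
    task = mse(noise, student_out)
    out_kd = mse(teacher_out, student_out)
    feat_kd = sum(mse(t, s) for t, s in zip(teacher_feats, student_feats))
    return task + w_out * out_kd + w_feat * feat_kd


loss = distillation_loss(
    noise=[1.0, 0.0],
    student_out=[0.0, 0.0],
    teacher_out=[1.0, 1.0],
    student_feats=[[0.0, 0.0]],
    teacher_feats=[[2.0, 0.0]],
)
print(loss)
```

In a real training loop the same three terms would be computed on tensors with the teacher frozen, so that gradients flow only through the student.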

Notes

1. Let the $(h_{\text{out}}, h_{\text{in}})$-size weight matrix in the $b$-th block be $\mathbf{W}^{l,b} = \left[ W_{i,j}^{l,b} \right]$, where $l$ denotes the layer type (e.g., convolution (flattened) or attention’s key projection). The scores at the output neuron level [1] are aggregated for the block-level importance criteria, $S_{\text{Magnitude}}^b = \mathbb{E}_{l,i}\left[ \sum_j \left| W_{i,j}^{l,b} \right| \right]$ and $S_{\text{Taylor}}^b = \mathbb{E}_{l,i}\left[ \sum_j \left| \frac{\partial \mathcal{L}(D)}{\partial W_{i,j}^{l,b}} W_{i,j}^{l,b} \right| \right]$. Here, $\mathcal{L}$ and $D$ denote the denoising task loss and a calibration set of 1K samples, respectively. The final scores are then ranked to remove unimportant blocks or, for blocks that cannot be removed, to replace them with interpolation.
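
The two block-level criteria in the note above can be sketched in a few lines of plain Python. This is an illustrative sketch under stated assumptions, not the authors' code: weight matrices are nested lists whose rows are output neurons, the expectation over $(l, i)$ is taken by pooling every row of every layer in a block, gradients are supplied by the caller, and the names `magnitude_score`, `taylor_score`, and `rank_blocks` are hypothetical.

```python
from statistics import fmean
from typing import Dict, List

Matrix = List[List[float]]  # rows: output neurons i, columns: inputs j


def magnitude_score(layers: List[Matrix]) -> float:
    """Mean over all (layer, output neuron) pairs of sum_j |W_ij|."""
    return fmean(sum(abs(w) for w in row) for W in layers for row in W)


def taylor_score(layers: List[Matrix], grads: List[Matrix]) -> float:
    """Mean over all (layer, output neuron) pairs of sum_j |dL/dW_ij * W_ij|."""
    return fmean(
        sum(abs(g * w) for g, w in zip(g_row, w_row))
        for W, G in zip(layers, grads)
        for w_row, g_row in zip(W, G)
    )


def rank_blocks(scores: Dict[str, float]) -> List[str]:
    """Block names ordered from least to most important."""
    return sorted(scores, key=scores.get)


# Toy example with two hypothetical blocks.
block_a = [[[1.0, -2.0], [0.5, 0.5]]]        # one layer with two output neurons
block_b = [[[0.1, 0.1]], [[-0.2, 0.0]]]      # two layers with one output neuron each

scores = {
    "block_a": magnitude_score(block_a),
    "block_b": magnitude_score(block_b),
}
order = rank_blocks(scores)
print(order)

taylor = taylor_score([[[1.0, -2.0]]], [[[0.5, 0.25]]])
print(taylor)
```

Blocks at the front of the ranking are the removal candidates. Pooling rows weights each layer by its number of output neurons; averaging per layer first would be an equally plausible reading of the expectation.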

References

A Simple and Effective Pruning Approach for LLMs. ICLR (2024)
Blattmann, A., et al.: Align your latents: high-resolution video synthesis with latent diffusion models. In: CVPR (2023)
Brooks, T., Holynski, A., Efros, A.A.: InstructPix2Pix: learning to follow image editing instructions. In: CVPR (2023)
Caron, M., et al.: Emerging properties in self-supervised vision transformers. In: ICCV (2021)
Castells, T., et al.: EdgeFusion: on-device text-to-image generation. In: CVPR Workshop (2024)
Chen, Y.H., et al.: Speed is all you need: On-device acceleration of large diffusion models via GPU-aware optimizations. In: CVPR Workshop (2023)
Choi, J., et al.: Squeezing large-scale diffusion models for mobile. In: ICML Workshop (2023)
Devlin, J., Chang, M.W., Lee, K., Toutanova, K.: BERT: pre-training of deep bidirectional transformers for language understanding. In: NAACL (2019)
Dhariwal, P., Nichol, A.: Diffusion models beat GANs on image synthesis. In: NeurIPS (2021)
Ding, M., et al.: CogView: mastering text-to-image generation via transformers. In: NeurIPS (2021)
Ding, M., Zheng, W., Hong, W., Tang, J.: CogView2: faster and better text-to-image generation via hierarchical transformers. In: NeurIPS (2022)
Esser, P., Rombach, R., Ommer, B.: Taming transformers for high-resolution image synthesis. In: CVPR (2021)
Fang, G., Ma, X., Song, M., Mi, M.B., Wang, X.: DepGraph: towards any structural pruning. In: CVPR (2023)
Fang, G., Ma, X., Wang, X.: Structural pruning for diffusion models. In: NeurIPS (2023)
Gafni, O., Polyak, A., Ashual, O., Sheynin, S., Parikh, D., Taigman, Y.: Make-a-scene: scene-based text-to-image generation with human priors. In: Avidan, S., Brostow, G., Cissé, M., Farinella, G.M., Hassner, T. (eds.) Computer Vision – ECCV 2022. Lecture Notes in Computer Science, vol. 13675, pp. 89–106. Springer, Cham (2022)
Hao, Z., et al.: Learning efficient vision transformers via fine-grained manifold distillation. In: NeurIPS (2022)
He, K., Zhang, X., Ren, S., Sun, J.: Deep residual learning for image recognition. In: CVPR (2016)
Heo, B., Kim, J., Yun, S., Park, H., Kwak, N., Choi, J.Y.: A comprehensive overhaul of feature distillation. In: ICCV (2019)
Hessel, J., Holtzman, A., Forbes, M., Le Bras, R., Choi, Y.: CLIPScore: a reference-free evaluation metric for image captioning. In: EMNLP (2021)
Heusel, M., Ramsauer, H., Unterthiner, T., Nessler, B., Hochreiter, S.: GANs trained by a two time-scale update rule converge to a local nash equilibrium. In: NeurIPS (2017)
Hinton, G., Vinyals, O., Dean, J.: Distilling the knowledge in a neural network. In: NeurIPS Workshop (2014)
Ho, J., Jain, A., Abbeel, P.: Denoising diffusion probabilistic models. In: NeurIPS (2020)
Ho, J., Salimans, T.: Classifier-free diffusion guidance. In: NeurIPS Workshop (2021)
Hou, J., Asghar, Z.: World’s first on-device demonstration of stable diffusion on an android phone (2023). https://www.qualcomm.com/news
Jaegle, A., Gimeno, F., Brock, A., Vinyals, O., Zisserman, A., Carreira, J.: Perceiver: general perception with iterative attention. In: ICML (2021)
Jiao, X., et al.: TinyBERT: distilling BERT for natural language understanding. In: Findings of EMNLP (2020)
Kang, M., et al.: Scaling up GANs for text-to-image synthesis. In: CVPR (2023)
Kim, B.K., Choi, S., Park, H.: Cut inner layers: a structured pruning strategy for efficient U-net GANs. In: ICML Workshop (2022)
Kim, B.K., et al.: Shortened LLaMA: a simple depth pruning for large language models. arXiv preprint arXiv:2402.02834 (2024)
LeCun, Y., Denker, J., Solla, S.: Optimal brain damage. In: NeurIPS (1989)
Lee, Y., Park, K., Cho, Y., Lee, Y.J., Hwang, S.J.: KOALA: self-attention matters in knowledge distillation of latent diffusion models for memory-efficient and fast image synthesis. arXiv preprint arXiv:2312.04005v1 (2023)
Li, H., Kadav, A., Durdanovic, I., Samet, H., Graf, H.P.: Pruning filters for efficient convnets. In: ICLR (2017)
Li, M., Lin, J., Ding, Y., Liu, Z., Zhu, J.Y., Han, S.: GAN compression: efficient architectures for interactive conditional GANs. In: CVPR (2020)
Li, X., et al.: Q-diffusion: quantizing diffusion models. In: ICCV (2023)
Li, Y., et al.: SnapFusion: text-to-image diffusion model on mobile devices within two seconds. In: NeurIPS (2023)
Lin, T.-Y., et al.: Microsoft COCO: common objects in context. In: Fleet, D., Pajdla, T., Schiele, B., Tuytelaars, T. (eds.) ECCV 2014. LNCS, vol. 8693, pp. 740–755. Springer, Cham (2014). https://doi.org/10.1007/978-3-319-10602-1_48
Liu, L., Ren, Y., Lin, Z., Zhao, Z.: Pseudo numerical methods for diffusion models on manifolds. In: ICLR (2022)
Lu, C., Zhou, Y., Bao, F., Chen, J., Li, C., Zhu, J.: DPM-solver: a fast ode solver for diffusion probabilistic model sampling in around 10 steps. In: NeurIPS (2022)
Lu, C., Zhou, Y., Bao, F., Chen, J., Li, C., Zhu, J.: DPM-solver++: fast solver for guided sampling of diffusion probabilistic models. arXiv preprint arXiv:2211.01095 (2022)
Luo, S., Tan, Y., Huang, L., Li, J., Zhao, H.: Latent consistency models: synthesizing high-resolution images with few-step inference. arXiv preprint arXiv:2310.04378 (2023)
Mangrulkar, S., Gugger, S., Debut, L., Belkada, Y., Paul, S.: PEFT: state-of-the-art parameter-efficient fine-tuning methods (2022). https://github.com/huggingface/peft
Meng, C., Gao, R., Kingma, D.P., Ermon, S., Ho, J., Salimans, T.: On distillation of guided diffusion models. In: NeurIPS Workshop (2022)
Meng, C., Gao, R., Kingma, D.P., Ermon, S., Ho, J., Salimans, T.: On distillation of guided diffusion models. In: CVPR (2023)
Meng, C., et al.: SDEdit: guided image synthesis and editing with stochastic differential equations. In: ICLR (2022)
Mo, S., Cho, M., Shin, J.: Freeze the discriminator: a simple baseline for fine-tuning GANs. In: CVPR Workshop (2020)
Molchanov, P., Mallya, A., Tyree, S., Frosio, I., Kautz, J.: Importance estimation for neural network pruning. In: CVPR (2019)
Murti, C., Narshana, T., Bhattacharyya, C.: TVSPrune - pruning non-discriminative filters via total variation separability of intermediate representations without fine tuning. In: ICLR (2023)
Nichol, A., et al.: GLIDE: towards photorealistic image generation and editing with text-guided diffusion models. In: ICML (2022)
Park, W., Kim, D., Lu, Y., Cho, M.: Relational knowledge distillation. In: CVPR (2019)
Pernias, P., Rampas, D., Richter, M.L., Pal, C.J., Aubreville, M.: Würstchen: an efficient architecture for large-scale text-to-image diffusion models. In: ICLR (2024)
Pinkney, J.: Small stable diffusion (2023). https://huggingface.co/OFA-Sys/small-stable-diffusion-v0
von Platen, P., et al.: Diffusers: state-of-the-art diffusion models (2022). https://github.com/huggingface/diffusers
Radford, A., et al.: Learning transferable visual models from natural language supervision. In: ICML (2021)
Radford, A., et al.: Language models are unsupervised multitask learners. OpenAI blog (2019)
Ramesh, A., Dhariwal, P., Nichol, A., Chu, C., Chen, M.: Hierarchical text-conditional image generation with CLIP latents. arXiv preprint arXiv:2204.06125 (2022)
Ramesh, A., et al.: Zero-shot text-to-image generation. In: ICML (2021)
Ren, Y., Wu, J., Xiao, X., Yang, J.: Online multi-granularity distillation for GAN compression. In: ICCV (2021)
Rombach, R., Blattmann, A., Lorenz, D., Esser, P., Ommer, B.: High-resolution image synthesis with latent diffusion models. In: CVPR (2022)
Rombach, R., Blattmann, A., Lorenz, D., Esser, P., Ommer, B.: LDM on celeba-hq (2022). https://huggingface.co/CompVis/ldm-celebahq-256
Rombach, R., Esser, P.: Stable diffusion v1-4 (2022). https://huggingface.co/CompVis/stable-diffusion-v1-4
Rombach, R., Esser, P.: Stable diffusion v1-5 (2022). https://huggingface.co/runwayml/stable-diffusion-v1-5
Rombach, R., Esser, P., Ha, D.: Stable diffusion v2-1-base (2022). https://huggingface.co/stabilityai/stable-diffusion-2-1-base
Romero, A., Ballas, N., Kahou, S.E., Chassang, A., Gatta, C., Bengio, Y.: FitNets: hints for thin deep nets. In: ICLR (2015)
Ronneberger, O., Fischer, P., Brox, T.: U-net: convolutional networks for biomedical image segmentation. In: Navab, N., Hornegger, J., Wells, W.M., Frangi, A.F. (eds.) MICCAI 2015. LNCS, vol. 9351, pp. 234–241. Springer, Cham (2015). https://doi.org/10.1007/978-3-319-24574-4_28
Ruiz, N., Li, Y., Jampani, V., Pritch, Y., Rubinstein, M., Aberman, K.: DreamBooth: fine tuning text-to-image diffusion models for subject-driven generation. In: CVPR (2023)
Saharia, C., et al.: Photorealistic text-to-image diffusion models with deep language understanding. In: NeurIPS (2022)
Salimans, T., Goodfellow, I., Zaremba, W., Cheung, V., Radford, A., Chen, X.: Improved techniques for training GANs. In: NeurIPS (2016)
Salimans, T., Ho, J.: Progressive distillation for fast sampling of diffusion models. In: ICLR (2022)
Sanh, V., Debut, L., Chaumond, J., Wolf, T.: DistilBERT, a distilled version of BERT: smaller, faster, cheaper and lighter. In: NeurIPS Workshop (2019)
Schuhmann, C., Beaumont, R.: LAION-aesthetics (2022). https://laion.ai/blog/laion-aesthetics
Schuhmann, C., et al.: LAION-5B: an open large-scale dataset for training next generation image-text models. In: NeurIPS Workshop (2022)
Segmind: Segmind-distill-sd (2023). https://github.com/segmind/distill-sd/tree/c1e97a70d141df09e6fe5cc7dbd66e0cbeae3eeb
Segmind: SSD-1B (2023). https://github.com/segmind/SSD-1B/tree/d2ff723ea8ecf5dbd86f3aac0af1db30e88a2e2d
Shen, H., Cheng, P., Ye, X., Cheng, W., Abidi, H.: Accelerate stable diffusion with intel neural compressor (2022). https://medium.com/intel-analytics-software
Shu, C., Liu, Y., Gao, J., Yan, Z., Shen, C.: Channel-wise knowledge distillation for dense prediction. In: ICCV (2021)
Song, J., Meng, C., Ermon, S.: Denoising diffusion implicit models. In: ICLR (2021)
Sun, Z., Yu, H., Song, X., Liu, R., Yang, Y., Zhou, D.: MobileBERT: a compact task-agnostic BERT for resource-limited devices. In: ACL (2020)
Tang, R., et al.: What the DAAM: interpreting stable diffusion using cross attention. In: ACL (2023)
Tao, M., Bao, B.K., Tang, H., Xu, C.: GALIP: generative adversarial clips for text-to-image synthesis. In: CVPR (2023)
Touvron, H., Cord, M., Douze, M., Massa, F., Sablayrolles, A., Jegou, H.: Training data-efficient image transformers and distillation through attention. In: ICML (2021)
Van Den Oord, A., Vinyals, O., et al.: Neural discrete representation learning. In: NeurIPS (2017)
Vaswani, A., et al.: Attention is all you need. In: NeurIPS (2017)
Wang, H., Du, X., Li, J., Yeh, R.A., Shakhnarovich, G.: Score Jacobian chaining: lifting pretrained 2D diffusion models for 3D generation. In: CVPR (2023)
Yu, L., Xiang, W.: X-pruner: explainable pruning for vision transformers. In: CVPR (2023)
Zagoruyko, S., Komodakis, N.: Paying more attention to attention: improving the performance of convolutional neural networks via attention transfer. In: ICLR (2017)
Zhang, L., Chen, X., Tu, X., Wan, P., Xu, N., Ma, K.: Wavelet knowledge distillation: towards efficient image-to-image translation. In: CVPR (2022)
Zhang, L., Agrawala, M.: Adding conditional control to text-to-image diffusion models. In: ICCV (2023)
Zhang, Q., Chen, Y.: Fast sampling of diffusion models with exponential integrator. In: ICLR (2023)
Zhao, Y., Xu, Y., Xiao, Z., Hou, T.: MobileDiffusion: Subsecond text-to-image generation on mobile devices. arXiv preprint arXiv:2311.16567 (2023)
Zhou, Y., et al.: Towards language-free training for text-to-image generation. In: CVPR (2022)
Zhu, L.: Thop: Pytorch-opcounter (2018). https://github.com/Lyken17/pytorch-OpCounter

Acknowledgments

We thank the Microsoft Startups Founders Hub program and the AI Industrial Convergence Cluster Development project funded by the Ministry of Science and ICT (MSIT, Korea) and Gwangju Metropolitan City for their generous support of GPU resources.

Author information

Authors and Affiliations

Nota Inc., Berlin, Germany
Bo-Kyeong Kim, Thibault Castells & Shinkook Choi
Captions Research, Seoul, South Korea
Hyoung-Kyu Song

Corresponding author

Correspondence to Bo-Kyeong Kim.

Editor information

Editors and Affiliations

University of Birmingham, Birmingham, UK
Aleš Leonardis
University of Trento, Trento, Italy
Elisa Ricci
Technical University of Darmstadt, Darmstadt, Hessen, Germany
Stefan Roth
Princeton University, Princeton, NJ, USA
Olga Russakovsky
Czech Technical University in Prague, Prague, Czech Republic
Torsten Sattler
École des Ponts ParisTech, Marne-la-Vallée, France
Gül Varol

1 Electronic supplementary material

Below is the link to the electronic supplementary material.

Supplementary material 1 (pdf 13486 KB)

Cite this paper

Kim, BK., Song, HK., Castells, T., Choi, S. (2025). BK-SDM: A Lightweight, Fast, and Cheap Version of Stable Diffusion. In: Leonardis, A., Ricci, E., Roth, S., Russakovsky, O., Sattler, T., Varol, G. (eds) Computer Vision – ECCV 2024. ECCV 2024. Lecture Notes in Computer Science, vol 15112. Springer, Cham. https://doi.org/10.1007/978-3-031-72949-2_22

DOI: https://doi.org/10.1007/978-3-031-72949-2_22
Published: 31 October 2024
Publisher Name: Springer, Cham
Print ISBN: 978-3-031-72948-5
Online ISBN: 978-3-031-72949-2
eBook Packages: Computer Science, Computer Science (R0)