Abstract
Text-to-image (T2I) generation with Stable Diffusion models (SDMs) involves high computing demands due to billion-scale parameters. To enhance efficiency, recent studies have reduced sampling steps and applied network quantization while retaining the original architectures. The lack of architectural reduction attempts may stem from worries over expensive retraining for such massive models. In this work, we uncover the surprising potential of block pruning and feature distillation for low-cost general-purpose T2I. By removing several residual and attention blocks from the U-Net of SDMs, we achieve a 30% to 50% reduction in model size, MACs, and latency. We show that distillation retraining is effective even under limited resources: using only 13 A100 days and a tiny dataset, our compact models can imitate the original SDMs (v1.4 and v2.1-base, which required over 6,000 A100 days of training). Benefiting from the transferred knowledge, our BK-SDMs deliver competitive results on zero-shot MS-COCO against larger multi-billion parameter models. We further demonstrate the applicability of our lightweight backbones in personalized generation and image-to-image translation. Deployment of our models on edge devices attains 4-second inference. Code and models can be found at: https://github.com/Nota-NetsPresso/BK-SDM.
Notes
1. Let the \((h_{\text{out}}, h_{\text{in}})\)-size weight matrix in the \(b\)-th block be \(\textbf{W}^{l,b} = \left[ W_{i,j}^{l,b} \right]\), where \(l\) denotes the layer type (e.g., convolution (flattened) or attention’s key projection). The scores at the output neuron level [1] are aggregated for the block-level importance criteria, \(S_{\text{Magnitude}}^b = \mathbb{E}_{l,i}\left[ \sum_j \left| W_{i,j}^{l,b} \right| \right]\) and \(S_{\text{Taylor}}^b = \mathbb{E}_{l,i}\left[ \sum_j \left| \frac{\partial \mathcal{L}(D)}{\partial W_{i,j}^{l,b}} W_{i,j}^{l,b} \right| \right]\). Here, \(\mathcal{L}\) and \(D\) denote the denoising task loss and a calibration set of 1K samples. The final scores are then ranked to remove unimportant blocks, or to replace them with interpolation for unremovable blocks.
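The two block-level criteria in the note above can be illustrated with a minimal pure-Python sketch. It is not the released implementation. The function names, the nested-list weight format, and the toy numbers are hypothetical, and the gradients of \(\mathcal{L}(D)\) are assumed to be precomputed elsewhere. The expectation \(\mathbb{E}_{l,i}\) is also assumed here to be a uniform average over all (layer, output neuron) pairs in a block; a per-layer average followed by an average over layers would be an equally plausible reading.

```python
from statistics import mean


def block_magnitude_score(layers):
    """S_Magnitude^b: mean over (layer, output neuron) of sum_j |W_ij|.

    `layers` is a list of weight matrices for one block; each matrix is a
    list of rows, one row per output neuron (shape h_out x h_in).
    """
    neuron_scores = [sum(abs(w) for w in row) for W in layers for row in W]
    return mean(neuron_scores)


def block_taylor_score(layers, grads):
    """S_Taylor^b: mean over (layer, output neuron) of sum_j |dL/dW_ij * W_ij|.

    `grads` mirrors `layers` and holds precomputed loss gradients
    (assumed to be accumulated over the calibration set).
    """
    neuron_scores = [
        sum(abs(g * w) for w, g in zip(w_row, g_row))
        for W, G in zip(layers, grads)
        for w_row, g_row in zip(W, G)
    ]
    return mean(neuron_scores)


def rank_blocks(scores):
    """Return block names from least to most important (lowest score first)."""
    return sorted(scores, key=scores.get)


# Toy example with two hypothetical blocks.
block_a = [[[1.0, -2.0], [0.5, 0.5]]]               # one layer, two output neurons
block_b = [[[0.1, -0.1]], [[0.2, 0.2], [0.0, 0.4]]]  # two layers
grads_a = [[[0.5, 0.5], [2.0, -2.0]]]

scores = {
    "block_a": block_magnitude_score(block_a),
    "block_b": block_magnitude_score(block_b),
}
removal_order = rank_blocks(scores)  # candidates for removal come first
```

Under this reading, blocks at the front of `removal_order` are the first candidates for removal, while the note's interpolation fallback for unremovable blocks is left out of the sketch.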
References
A Simple and Effective Pruning Approach for LLMs. In: ICLR (2024)
Blattmann, A., et al.: Align your latents: high-resolution video synthesis with latent diffusion models. In: CVPR (2023)
Brooks, T., Holynski, A., Efros, A.A.: InstructPix2Pix: learning to follow image editing instructions. In: CVPR (2023)
Caron, M., et al.: Emerging properties in self-supervised vision transformers. In: ICCV (2021)
Castells, T., et al.: EdgeFusion: on-device text-to-image generation. In: CVPR Workshop (2024)
Chen, Y.H., et al.: Speed is all you need: On-device acceleration of large diffusion models via GPU-aware optimizations. In: CVPR Workshop (2023)
Choi, J., et al.: Squeezing large-scale diffusion models for mobile. In: ICML Workshop (2023)
Devlin, J., Chang, M.W., Lee, K., Toutanova, K.: BERT: pre-training of deep bidirectional transformers for language understanding. In: NAACL (2019)
Dhariwal, P., Nichol, A.: Diffusion models beat GANs on image synthesis. In: NeurIPS (2021)
Ding, M., et al.: CogView: mastering text-to-image generation via transformers. In: NeurIPS (2021)
Ding, M., Zheng, W., Hong, W., Tang, J.: CogView2: faster and better text-to-image generation via hierarchical transformers. In: NeurIPS (2022)
Esser, P., Rombach, R., Ommer, B.: Taming transformers for high-resolution image synthesis. In: CVPR (2021)
Fang, G., Ma, X., Song, M., Mi, M.B., Wang, X.: DepGraph: towards any structural pruning. In: CVPR (2023)
Fang, G., Ma, X., Wang, X.: Structural pruning for diffusion models. In: NeurIPS (2023)
Gafni, O., Polyak, A., Ashual, O., Sheynin, S., Parikh, D., Taigman, Y.: Make-a-scene: scene-based text-to-image generation with human priors. In: Avidan, S., Brostow, G., Cissé, M., Farinella, G.M., Hassner, T. (eds.) Computer Vision – ECCV 2022. Lecture Notes in Computer Science, vol. 13675, pp. 89–106. Springer, Cham (2022)
Hao, Z., et al.: Learning efficient vision transformers via fine-grained manifold distillation. In: NeurIPS (2022)
He, K., Zhang, X., Ren, S., Sun, J.: Deep residual learning for image recognition. In: CVPR (2016)
Heo, B., Kim, J., Yun, S., Park, H., Kwak, N., Choi, J.Y.: A comprehensive overhaul of feature distillation. In: ICCV (2019)
Hessel, J., Holtzman, A., Forbes, M., Le Bras, R., Choi, Y.: CLIPScore: a reference-free evaluation metric for image captioning. In: EMNLP (2021)
Heusel, M., Ramsauer, H., Unterthiner, T., Nessler, B., Hochreiter, S.: GANs trained by a two time-scale update rule converge to a local nash equilibrium. In: NeurIPS (2017)
Hinton, G., Vinyals, O., Dean, J.: Distilling the knowledge in a neural network. In: NeurIPS Workshop (2014)
Ho, J., Jain, A., Abbeel, P.: Denoising diffusion probabilistic models. In: NeurIPS (2020)
Ho, J., Salimans, T.: Classifier-free diffusion guidance. In: NeurIPS Workshop (2021)
Hou, J., Asghar, Z.: World’s first on-device demonstration of stable diffusion on an android phone (2023). https://www.qualcomm.com/news
Jaegle, A., Gimeno, F., Brock, A., Vinyals, O., Zisserman, A., Carreira, J.: Perceiver: general perception with iterative attention. In: ICML (2021)
Jiao, X., et al.: TinyBERT: distilling BERT for natural language understanding. In: Findings of EMNLP (2020)
Kang, M., et al.: Scaling up GANs for text-to-image synthesis. In: CVPR (2023)
Kim, B.K., Choi, S., Park, H.: Cut inner layers: a structured pruning strategy for efficient U-net GANs. In: ICML Workshop (2022)
Kim, B.K., et al.: Shortened LLaMA: a simple depth pruning for large language models. arXiv preprint arXiv:2402.02834 (2024)
LeCun, Y., Denker, J., Solla, S.: Optimal brain damage. In: NeurIPS (1989)
Lee, Y., Park, K., Cho, Y., Lee, Y.J., Hwang, S.J.: KOALA: self-attention matters in knowledge distillation of latent diffusion models for memory-efficient and fast image synthesis. arXiv preprint arXiv:2312.04005v1 (2023)
Li, H., Kadav, A., Durdanovic, I., Samet, H., Graf, H.P.: Pruning filters for efficient convnets. In: ICLR (2017)
Li, M., Lin, J., Ding, Y., Liu, Z., Zhu, J.Y., Han, S.: GAN compression: efficient architectures for interactive conditional GANs. In: CVPR (2020)
Li, X., et al.: Q-diffusion: quantizing diffusion models. In: ICCV (2023)
Li, Y., et al.: SnapFusion: text-to-image diffusion model on mobile devices within two seconds. In: NeurIPS (2023)
Lin, T.-Y., et al.: Microsoft COCO: common objects in context. In: Fleet, D., Pajdla, T., Schiele, B., Tuytelaars, T. (eds.) ECCV 2014. LNCS, vol. 8693, pp. 740–755. Springer, Cham (2014). https://doi.org/10.1007/978-3-319-10602-1_48
Liu, L., Ren, Y., Lin, Z., Zhao, Z.: Pseudo numerical methods for diffusion models on manifolds. In: ICLR (2022)
Lu, C., Zhou, Y., Bao, F., Chen, J., Li, C., Zhu, J.: DPM-solver: a fast ode solver for diffusion probabilistic model sampling in around 10 steps. In: NeurIPS (2022)
Lu, C., Zhou, Y., Bao, F., Chen, J., Li, C., Zhu, J.: DPM-solver++: fast solver for guided sampling of diffusion probabilistic models. arXiv preprint arXiv:2211.01095 (2022)
Luo, S., Tan, Y., Huang, L., Li, J., Zhao, H.: Latent consistency models: synthesizing high-resolution images with few-step inference. arXiv preprint arXiv:2310.04378 (2023)
Mangrulkar, S., Gugger, S., Debut, L., Belkada, Y., Paul, S.: PEFT: state-of-the-art parameter-efficient fine-tuning methods (2022). https://github.com/huggingface/peft
Meng, C., Gao, R., Kingma, D.P., Ermon, S., Ho, J., Salimans, T.: On distillation of guided diffusion models. In: NeurIPS Workshop (2022)
Meng, C., Gao, R., Kingma, D.P., Ermon, S., Ho, J., Salimans, T.: On distillation of guided diffusion models. In: CVPR (2023)
Meng, C., et al.: SDEdit: guided image synthesis and editing with stochastic differential equations. In: ICLR (2022)
Mo, S., Cho, M., Shin, J.: Freeze the discriminator: a simple baseline for fine-tuning GANs. In: CVPR Workshop (2020)
Molchanov, P., Mallya, A., Tyree, S., Frosio, I., Kautz, J.: Importance estimation for neural network pruning. In: CVPR (2019)
Murti, C., Narshana, T., Bhattacharyya, C.: TVSPrune - pruning non-discriminative filters via total variation separability of intermediate representations without fine tuning. In: ICLR (2023)
Nichol, A., et al.: GLIDE: towards photorealistic image generation and editing with text-guided diffusion models. In: ICML (2022)
Park, W., Kim, D., Lu, Y., Cho, M.: Relational knowledge distillation. In: CVPR (2019)
Pernias, P., Rampas, D., Richter, M.L., Pal, C.J., Aubreville, M.: Wüerstchen: an efficient architecture for large-scale text-to-image diffusion models. In: ICLR (2024)
Pinkney, J.: Small stable diffusion (2023). https://huggingface.co/OFA-Sys/small-stable-diffusion-v0
von Platen, P., et al.: Diffusers: state-of-the-art diffusion models (2022). https://github.com/huggingface/diffusers
Radford, A., et al.: Learning transferable visual models from natural language supervision. In: ICML (2021)
Radford, A., et al.: Language models are unsupervised multitask learners. OpenAI blog (2019)
Ramesh, A., Dhariwal, P., Nichol, A., Chu, C., Chen, M.: Hierarchical text-conditional image generation with CLIP latents. arXiv preprint arXiv:2204.06125 (2022)
Ramesh, A., et al.: Zero-shot text-to-image generation. In: ICML (2021)
Ren, Y., Wu, J., Xiao, X., Yang, J.: Online multi-granularity distillation for GAN compression. In: ICCV (2021)
Rombach, R., Blattmann, A., Lorenz, D., Esser, P., Ommer, B.: High-resolution image synthesis with latent diffusion models. In: CVPR (2022)
Rombach, R., Blattmann, A., Lorenz, D., Esser, P., Ommer, B.: LDM on celeba-hq (2022). https://huggingface.co/CompVis/ldm-celebahq-256
Rombach, R., Esser, P.: Stable diffusion v1-4 (2022). https://huggingface.co/CompVis/stable-diffusion-v1-4
Rombach, R., Esser, P.: Stable diffusion v1-5 (2022). https://huggingface.co/runwayml/stable-diffusion-v1-5
Rombach, R., Esser, P., Ha, D.: Stable diffusion v2-1-base (2022). https://huggingface.co/stabilityai/stable-diffusion-2-1-base
Romero, A., Ballas, N., Kahou, S.E., Chassang, A., Gatta, C., Bengio, Y.: FitNets: hints for thin deep nets. In: ICLR (2015)
Ronneberger, O., Fischer, P., Brox, T.: U-net: convolutional networks for biomedical image segmentation. In: Navab, N., Hornegger, J., Wells, W.M., Frangi, A.F. (eds.) MICCAI 2015. LNCS, vol. 9351, pp. 234–241. Springer, Cham (2015). https://doi.org/10.1007/978-3-319-24574-4_28
Ruiz, N., Li, Y., Jampani, V., Pritch, Y., Rubinstein, M., Aberman, K.: DreamBooth: fine tuning text-to-image diffusion models for subject-driven generation. In: CVPR (2023)
Saharia, C., et al.: Photorealistic text-to-image diffusion models with deep language understanding. In: NeurIPS (2022)
Salimans, T., Goodfellow, I., Zaremba, W., Cheung, V., Radford, A., Chen, X.: Improved techniques for training GANs. In: NeurIPS (2016)
Salimans, T., Ho, J.: Progressive distillation for fast sampling of diffusion models. In: ICLR (2022)
Sanh, V., Debut, L., Chaumond, J., Wolf, T.: DistilBERT, a distilled version of BERT: smaller, faster, cheaper and lighter. In: NeurIPS Workshop (2019)
Schuhmann, C., Beaumont, R.: LAION-aesthetics (2022). https://laion.ai/blog/laion-aesthetics
Schuhmann, C., et al.: LAION-5B: an open large-scale dataset for training next generation image-text models. In: NeurIPS Workshop (2022)
Segmind: Segmind-distill-sd (2023). https://github.com/segmind/distill-sd/tree/c1e97a70d141df09e6fe5cc7dbd66e0cbeae3eeb
Segmind: SSD-1B (2023). https://github.com/segmind/SSD-1B/tree/d2ff723ea8ecf5dbd86f3aac0af1db30e88a2e2d
Shen, H., Cheng, P., Ye, X., Cheng, W., Abidi, H.: Accelerate stable diffusion with intel neural compressor (2022). https://medium.com/intel-analytics-software
Shu, C., Liu, Y., Gao, J., Yan, Z., Shen, C.: Channel-wise knowledge distillation for dense prediction. In: ICCV (2021)
Song, J., Meng, C., Ermon, S.: Denoising diffusion implicit models. In: ICLR (2021)
Sun, Z., Yu, H., Song, X., Liu, R., Yang, Y., Zhou, D.: MobileBERT: a compact task-agnostic BERT for resource-limited devices. In: ACL (2020)
Tang, R., et al.: What the DAAM: interpreting stable diffusion using cross attention. In: ACL (2023)
Tao, M., Bao, B.K., Tang, H., Xu, C.: GALIP: generative adversarial clips for text-to-image synthesis. In: CVPR (2023)
Touvron, H., Cord, M., Douze, M., Massa, F., Sablayrolles, A., Jegou, H.: Training data-efficient image transformers and distillation through attention. In: ICML (2021)
Van Den Oord, A., Vinyals, O., et al.: Neural discrete representation learning. In: NeurIPS (2017)
Vaswani, A., et al.: Attention is all you need. In: NeurIPS (2017)
Wang, H., Du, X., Li, J., Yeh, R.A., Shakhnarovich, G.: Score Jacobian chaining: lifting pretrained 2D diffusion models for 3D generation. In: CVPR (2023)
Yu, L., Xiang, W.: X-pruner: explainable pruning for vision transformers. In: CVPR (2023)
Zagoruyko, S., Komodakis, N.: Paying more attention to attention: improving the performance of convolutional neural networks via attention transfer. In: ICLR (2017)
Zhang, L., Chen, X., Tu, X., Wan, P., Xu, N., Ma, K.: Wavelet knowledge distillation: towards efficient image-to-image translation. In: CVPR (2022)
Zhang, L., Agrawala, M.: Adding conditional control to text-to-image diffusion models. In: ICCV (2023)
Zhang, Q., Chen, Y.: Fast sampling of diffusion models with exponential integrator. In: ICLR (2023)
Zhao, Y., Xu, Y., Xiao, Z., Hou, T.: MobileDiffusion: Subsecond text-to-image generation on mobile devices. arXiv preprint arXiv:2311.16567 (2023)
Zhou, Y., et al.: Towards language-free training for text-to-image generation. In: CVPR (2022)
Zhu, L.: Thop: Pytorch-opcounter (2018). https://github.com/Lyken17/pytorch-OpCounter
Acknowledgments
We thank the Microsoft Startups Founders Hub program and the AI Industrial Convergence Cluster Development project funded by the Ministry of Science and ICT (MSIT, Korea) and Gwangju Metropolitan City for their generous support of GPU resources.
Copyright information
© 2025 The Author(s), under exclusive license to Springer Nature Switzerland AG
About this paper
Cite this paper
Kim, BK., Song, HK., Castells, T., Choi, S. (2025). BK-SDM: A Lightweight, Fast, and Cheap Version of Stable Diffusion. In: Leonardis, A., Ricci, E., Roth, S., Russakovsky, O., Sattler, T., Varol, G. (eds) Computer Vision – ECCV 2024. ECCV 2024. Lecture Notes in Computer Science, vol 15112. Springer, Cham. https://doi.org/10.1007/978-3-031-72949-2_22
DOI: https://doi.org/10.1007/978-3-031-72949-2_22
Publisher Name: Springer, Cham
Print ISBN: 978-3-031-72948-5
Online ISBN: 978-3-031-72949-2
eBook Packages: Computer Science, Computer Science (R0)