
Three Things We Need to Know About Transferring Stable Diffusion to Visual Dense Prediction Tasks

  • Conference paper
  • First Online:
Computer Vision – ECCV 2024 (ECCV 2024)

Part of the book series: Lecture Notes in Computer Science ((LNCS,volume 15100))


Abstract

In this paper, we investigate how to conduct transfer learning to adapt Stable Diffusion to downstream visual dense prediction tasks such as semantic segmentation and depth estimation. We focus on fine-tuning the Stable Diffusion model, which has demonstrated impressive abilities in modeling image details and high-level semantics. Our experiments yield three key insights. First, for dense prediction tasks, the denoiser of Stable Diffusion can serve as a stronger feature encoder than vision-language models pre-trained with contrastive objectives (e.g., CLIP). Second, the quality of the extracted features depends on the diffusion sampling step t, the sampling layer, the cross-attention map, the model's generation capacity, and the textual input; features taken from the Stable Diffusion UNet's upsampling layers at earlier denoising steps are more discriminative for transfer to downstream tasks. Third, tuning Stable Diffusion for downstream tasks in a parameter-efficient way is feasible: we extensively evaluate currently popular parameter-efficient tuning methods, then search for the best tuning protocol via reinforcement learning, achieving better results with fewer tunable parameters.
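The second insight above describes a concrete extraction protocol: run one denoising forward pass at a chosen timestep t and read activations off the UNet's upsampling blocks. The sketch below illustrates that protocol with PyTorch forward hooks on a toy stand-in module; `ToyUNet` and `extract_upsampling_features` are hypothetical names, not part of the paper or of Stable Diffusion's actual codebase, and the real model additionally conditions on a timestep embedding and text prompt.

```python
import torch
import torch.nn as nn

class ToyUNet(nn.Module):
    """Hypothetical stand-in for the Stable Diffusion denoising UNet:
    one down-block, one mid-block, and one up-block play the roles of
    the real model's encoder, bottleneck, and decoder stages."""
    def __init__(self, ch=8):
        super().__init__()
        self.down = nn.Conv2d(3, ch, 3, stride=2, padding=1)
        self.mid = nn.Conv2d(ch, ch, 3, padding=1)
        self.up = nn.Sequential(
            nn.Upsample(scale_factor=2, mode="nearest"),
            nn.Conv2d(ch, ch, 3, padding=1),
        )

    def forward(self, x, t):
        # The real UNet injects a timestep embedding at every block;
        # t is accepted but unused in this toy version.
        h = torch.relu(self.down(x))
        h = torch.relu(self.mid(h))
        return self.up(h)

def extract_upsampling_features(model, x, t):
    """Run one denoising forward pass at timestep t and capture the
    up-block activation via a forward hook."""
    feats = {}
    hook = model.up.register_forward_hook(
        lambda mod, inp, out: feats.setdefault("up", out.detach())
    )
    with torch.no_grad():
        model(x, t)
    hook.remove()  # always detach the hook after use
    return feats["up"]

feats = extract_upsampling_features(ToyUNet(), torch.randn(1, 3, 32, 32), t=100)
print(tuple(feats.shape))  # up-block restores the input's spatial resolution
```

Swapping the toy module for a real pretrained denoiser and hooking each of its up-blocks gives a feature pyramid that a lightweight task head (e.g., a segmentation or depth decoder) can consume; choosing a small t corresponds to the paper's finding that earlier denoising steps yield more discriminative features.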



Acknowledgement

This project is funded in part by the National Key R&D Program of China (Project 2022ZD0161100), by the Centre for Perceptual and Interactive Intelligence (CPII) Ltd. under the Innovation and Technology Commission (ITC)'s InnoHK, by the Smart Traffic Fund (PSRI/76/2311/PR), and by RGC General Research Fund Project 14204021. Hongsheng Li is a PI of CPII under InnoHK.

Author information


Corresponding author

Correspondence to Manyuan Zhang.



Copyright information

© 2025 The Author(s), under exclusive license to Springer Nature Switzerland AG

About this paper


Cite this paper

Zhang, M., Song, G., Shi, X., Liu, Y., Li, H. (2025). Three Things We Need to Know About Transferring Stable Diffusion to Visual Dense Prediction Tasks. In: Leonardis, A., Ricci, E., Roth, S., Russakovsky, O., Sattler, T., Varol, G. (eds) Computer Vision – ECCV 2024. ECCV 2024. Lecture Notes in Computer Science, vol 15100. Springer, Cham. https://doi.org/10.1007/978-3-031-72946-1_8

Download citation

  • DOI: https://doi.org/10.1007/978-3-031-72946-1_8

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-031-72945-4

  • Online ISBN: 978-3-031-72946-1

  • eBook Packages: Computer Science, Computer Science (R0)
