External Prompt Features Enhanced Parameter-Efficient Fine-Tuning for Salient Object Detection

  • Conference paper
Pattern Recognition (ICPR 2024)

Part of the book series: Lecture Notes in Computer Science (LNCS, volume 15323)

Abstract

Salient object detection (SOD) aims to find the most salient objects in an image and output pixel-level binary masks. Transformer-based methods achieve promising performance due to their global semantic understanding, which is crucial for identifying salient objects. However, these models tend to be large and require numerous training parameters. To better harness the potential of transformers for SOD, we propose a novel parameter-efficient fine-tuning method aimed at reducing the number of training parameters while enhancing the salient object detection capability. Our model, termed EXternal Prompt features Enhanced adapteR Tuning (ExPert), features an encoder-decoder structure with adapters and injectors interspersed between the layers of a frozen transformer encoder. The adapter modules adapt the pre-trained backbone to SOD, while the injector modules incorporate external prompt features to enhance the awareness of salient objects. Comprehensive experiments demonstrate the superiority of our method. Surpassing former state-of-the-art (SOTA) models across five SOD datasets, ExPert achieves a mean absolute error (MAE) of 0.215 on the ECSSD dataset with 80.2 M trained parameters, 21% better than SelfReformer [31] and 47% better than EGNet [33].
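
As a rough illustration of the arrangement described in the abstract, the PyTorch-style sketch below wraps frozen backbone blocks with trainable block-level adapters and prompt injectors. The module names, bottleneck width, and the exact injection rule are our own illustrative assumptions, not the authors' released implementation.

```python
import torch
import torch.nn as nn

class BottleneckAdapter(nn.Module):
    """Trainable low-rank residual path applied after a frozen block ("block-level")."""
    def __init__(self, dim, bottleneck=64):
        super().__init__()
        self.down = nn.Linear(dim, bottleneck)
        self.up = nn.Linear(bottleneck, dim)
        self.act = nn.GELU()

    def forward(self, x):
        return x + self.up(self.act(self.down(x)))

class PromptInjector(nn.Module):
    """Projects external prompt features and adds them to the token sequence."""
    def __init__(self, prompt_dim, dim):
        super().__init__()
        self.proj = nn.Linear(prompt_dim, dim)

    def forward(self, x, prompt):
        # prompt: (B, N, prompt_dim), assumed spatially aligned with the backbone tokens
        return x + self.proj(prompt)

class FrozenEncoderWithAdapters(nn.Module):
    """Frozen transformer blocks interleaved with trainable adapters and injectors."""
    def __init__(self, blocks: nn.ModuleList, dim, prompt_dim):
        super().__init__()
        self.blocks = blocks
        for p in self.blocks.parameters():
            p.requires_grad = False          # the pre-trained backbone stays frozen
        self.adapters = nn.ModuleList([BottleneckAdapter(dim) for _ in blocks])
        self.injectors = nn.ModuleList([PromptInjector(prompt_dim, dim) for _ in blocks])

    def forward(self, x, prompt):
        feats = []
        for blk, adapter, injector in zip(self.blocks, self.adapters, self.injectors):
            x = injector(x, prompt)          # mix in external prompt features
            x = adapter(blk(x))              # frozen block, then block-level adapter
            feats.append(x)                  # multi-level features for the decoder
        return feats
```

Only the adapters, injectors, and decoder receive gradients; the backbone weights stay fixed, which is what keeps the number of trained parameters small relative to full fine-tuning.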

Notes

  1. In [4], the adapter is a side connection of the feed-forward sublayer inside the transformer block, denoted as "FF-level". In [17] and our ExPert, the adapter is a side connection between transformer blocks, denoted as "block-level".

  2. More details can be found in the supplementary materials.

  3. The \(P \cdot F\) in Eq. 3 denotes the linear projection of the features F with parameters P.

  4. CLIP [20] is a well-known vision-language model trained on millions of image-text pairs. However, CLIP’s texts are simple sentences containing the class name rather than captions of the whole image. Moreover, CLIP’s training images have a resolution of 224×224, so with a patch size of 32 the feature map is only 7×7, which is too small to upsample. ExPert therefore does not use CLIP’s features.

  5. More details can be found in the supplementary file.

  6. We use the public SOD_Evaluation_Metrics code to compute the metrics (an illustrative computation is sketched after these notes).

  7. Since EVP’s official saliency maps are 352×352 rather than the original image size, we resize EVP’s prediction maps to the ground-truth size before computing the metrics (see the sketch after these notes).

  8. Owing to the absence of code or saliency maps for some models, we directly use the metric values reported in the published papers. The results of M3Net [30] are computed from the official saliency maps of the M3Net SwinB version.

  9. The full fine-tuning method trains all the parameters of the backbone and decoder using the new datasets.

  10. The head tuning method trains only the decoder while keeping the backbone frozen.
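
As a concrete reading of footnotes 9 and 10, the helper below (our own illustrative code, not part of the paper) switches between the two baseline regimes by freezing or unfreezing the backbone:

```python
import torch
import torch.nn as nn

def configure_baseline(backbone: nn.Module, decoder: nn.Module, mode: str):
    """Set up the baselines of footnotes 9 and 10.

    mode == "full": full fine-tuning, every backbone and decoder parameter is trained.
    mode == "head": head tuning, the backbone is frozen and only the decoder is trained.
    """
    for p in backbone.parameters():
        p.requires_grad = (mode == "full")
    for p in decoder.parameters():
        p.requires_grad = True
    trainable = [p for m in (backbone, decoder) for p in m.parameters() if p.requires_grad]
    return torch.optim.AdamW(trainable)  # the optimiser only sees the trainable parameters
```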

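The metric computation of footnotes 6 and 7 can be sketched as follows, assuming a NumPy/Pillow implementation that is independent of the SOD_Evaluation_Metrics toolbox; the prediction map is resized to the ground-truth resolution before the absolute differences are averaged:

```python
import numpy as np
from PIL import Image

def mae_score(pred_path: str, gt_path: str) -> float:
    """Mean absolute error between a saliency map and its binary ground-truth mask."""
    # Load the ground truth and normalise it to [0, 1].
    gt = np.asarray(Image.open(gt_path).convert("L"), dtype=np.float64) / 255.0
    # Resize the prediction to the ground-truth resolution (cf. footnote 7), then normalise.
    pred = Image.open(pred_path).convert("L").resize((gt.shape[1], gt.shape[0]), Image.BILINEAR)
    pred = np.asarray(pred, dtype=np.float64) / 255.0
    # MAE is the average pixel-wise absolute difference.
    return float(np.abs(pred - gt).mean())
```
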
References

  1. Achanta, R., Hemami, S., Estrada, F., Susstrunk, S.: Frequency-tuned salient region detection. In: 2009 IEEE Conference on Computer Vision and Pattern Recognition, pp. 1597–1604. IEEE (2009)

  2. Bahng, H., Jahanian, A., Sankaranarayanan, S., Isola, P.: Exploring visual prompts for adapting large-scale models. arXiv preprint arXiv:2203.17274 (2022)

  3. Caron, M., et al.: Emerging properties in self-supervised vision transformers. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 9650–9660 (2021)

  4. Chen, S., et al.: Adaptformer: adapting vision transformers for scalable visual recognition. In: Advances in Neural Information Processing Systems, vol. 35, pp. 16664–16678 (2022)

  5. Chen, Z., et al.: Vision transformer adapter for dense predictions. arXiv preprint arXiv:2205.08534 (2022)

  6. Dosovitskiy, A., et al.: An image is worth \(16\times 16\) words: transformers for image recognition at scale. arXiv preprint arXiv:2010.11929 (2020)

  7. Fan, D.P., Cheng, M.M., Liu, Y., Li, T., Borji, A.: Structure-measure: a new way to evaluate foreground maps. In: Proceedings of the IEEE International Conference on Computer Vision, pp. 4548–4557 (2017)

  8. Fan, D.P., Gong, C., Cao, Y., Ren, B., Cheng, M.M., Borji, A.: Enhanced-alignment measure for binary foreground map evaluation. arXiv preprint arXiv:1805.10421 (2018)

  9. Houlsby, N., et al.: Parameter-efficient transfer learning for NLP. In: International Conference on Machine Learning, pp. 2790–2799. PMLR (2019)

  10. Jia, M., et al.: Visual prompt tuning. In: Avidan, S., Brostow, G., Cissé, M., Farinella, G.M., Hassner, T. (eds.) European Conference on Computer Vision, pp. 709–727. Springer, Cham (2022). https://doi.org/10.1007/978-3-031-19827-4_41

  11. Li, G., Yu, Y.: Visual saliency based on multiscale deep features. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 5455–5463 (2015)

  12. Li, J., Qiao, S., Zhao, Z., Xie, C., Chen, X., Xia, C.: Rethinking lightweight salient object detection via network depth-width tradeoff. IEEE Trans. Image Process. 32, 5664–5677 (2023)

  13. Li, J., Li, D., Xiong, C., Hoi, S.: Blip: bootstrapping language-image pre-training for unified vision-language understanding and generation. In: International Conference on Machine Learning, pp. 12888–12900. PMLR (2022)

  14. Li, Y., Hou, X., Koch, C., Rehg, J.M., Yuille, A.L.: The secrets of salient object segmentation. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 280–287 (2014)

  15. Liu, N., Zhang, N., Wan, K., Shao, L., Han, J.: Visual saliency transformer. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 4722–4732 (2021)

  16. Liu, P., Yuan, W., Fu, J., Jiang, Z., Hayashi, H., Neubig, G.: Pre-train, prompt, and predict: a systematic survey of prompting methods in natural language processing. ACM Comput. Surv. 55(9), 1–35 (2023)

  17. Liu, W., Shen, X., Pun, C.M., Cun, X.: Explicit visual prompting for universal foreground segmentations. arXiv preprint arXiv:2305.18476 (2023)

  18. Ma, M., Xia, C., Xie, C., Chen, X., Li, J.: Boosting broader receptive fields for salient object detection. IEEE Trans. Image Process. 32, 1026–1038 (2023)

  19. Qin, X., Zhang, Z., Huang, C., Dehghan, M., Zaiane, O.R., Jagersand, M.: U2-Net: going deeper with nested U-structure for salient object detection. Pattern Recogn. 106, 107404 (2020)

  20. Radford, A., et al.: Learning transferable visual models from natural language supervision. In: International Conference on Machine Learning, pp. 8748–8763. PMLR (2021)

  21. Rebuffi, S.A., Bilen, H., Vedaldi, A.: Learning multiple visual domains with residual adapters. In: Advances in Neural Information Processing Systems, vol. 30 (2017)

  22. Ren, S., Wen, Q., Zhao, N., Han, G., He, S.: Unifying global-local representations in salient object detection with transformer. arXiv preprint arXiv:2108.02759 (2021)

  23. Song, X., Guo, F., Zhang, L., Lu, X., Hei, X.: Salient object detection with dual-branch stepwise feature fusion and edge refinement. IEEE Trans. Circuits Syst. Video Technol. 34, 2832–2844 (2023)

  24. Wang, L., et al.: Learning to detect salient objects with image-level supervision. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 136–145 (2017)

  25. Xia, C., Sun, Y., Fang, X., Ge, B., Gao, X., Li, K.C.: IMSFNet: integrated multi-source feature network for salient object detection. Appl. Intell. 53(19), 22228–22248 (2023)

  26. Xie, E., Wang, W., Yu, Z., Anandkumar, A., Alvarez, J.M., Luo, P.: SegFormer: simple and efficient design for semantic segmentation with transformers. In: Advances in Neural Information Processing Systems, vol. 34, pp. 12077–12090 (2021)

  27. Yan, Q., Xu, L., Shi, J., Jia, J.: Hierarchical saliency detection. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 1155–1162 (2013)

  28. Yang, C., Zhang, L., Lu, H., Ruan, X., Yang, M.H.: Saliency detection via graph-based manifold ranking. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 3166–3173 (2013)

  29. Yu, J., Jiang, Y., Wang, Z., Cao, Z., Huang, T.: Unitbox: an advanced object detection network. In: Proceedings of the 24th ACM International Conference on Multimedia, pp. 516–520 (2016)

  30. Yuan, Y., Gao, P., Tan, X.: M3Net: multilevel, mixed and multistage attention network for salient object detection. arXiv preprint arXiv:2309.08365 (2023)

  31. Yun, Y.K., Lin, W.: SelfReformer: self-refined network with transformer for salient object detection. arXiv preprint arXiv:2205.11283 (2022)

  32. Zhang, Q., Zhao, R., Zhang, L.: TCRNet: a trifurcated cascaded refinement network for salient object detection. IEEE Trans. Circuits Syst. Video Technol. 33(1), 298–311 (2022)

  33. Zhao, J.X., Liu, J.J., Fan, D.P., Cao, Y., Yang, J., Cheng, M.M.: EGNet: edge guidance network for salient object detection. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 8779–8788 (2019)

  34. Zhao, T., Wu, X.: Pyramid feature attention network for saliency detection. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 3085–3094 (2019)

  35. Zheng, S., et al.: Rethinking semantic segmentation from a sequence-to-sequence perspective with transformers. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 6881–6890 (2021)

Author information

Corresponding author

Correspondence to Peiwu Qin.

Electronic supplementary material

Below is the link to the electronic supplementary material.

Supplementary material 1 (pdf 811 KB)

Copyright information

© 2025 The Author(s), under exclusive license to Springer Nature Switzerland AG

About this paper

Cite this paper

Liang, W. et al. (2025). External Prompt Features Enhanced Parameter-Efficient Fine-Tuning for Salient Object Detection. In: Antonacopoulos, A., Chaudhuri, S., Chellappa, R., Liu, CL., Bhattacharya, S., Pal, U. (eds) Pattern Recognition. ICPR 2024. Lecture Notes in Computer Science, vol 15323. Springer, Cham. https://doi.org/10.1007/978-3-031-78347-0_6

  • DOI: https://doi.org/10.1007/978-3-031-78347-0_6

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-031-78346-3

  • Online ISBN: 978-3-031-78347-0

  • eBook Packages: Computer Science, Computer Science (R0)
