Abstract
Salient object detection (SOD) aims to find the most salient objects in an image and output pixel-level binary masks. Transformer-based methods achieve promising performance thanks to their global semantic understanding, which is crucial for identifying salient objects. However, these models tend to be large and require many training parameters. To better harness the potential of transformers for SOD, we propose a novel parameter-efficient fine-tuning method that reduces the number of trainable parameters while enhancing salient object detection capability. Our model, termed EXternal Prompt features Enhanced adapteR Tuning (ExPert), adopts an encoder-decoder structure with adapters and injectors interspersed between the layers of a frozen transformer encoder. The adapter modules adapt the pre-trained backbone to SOD, while the injector modules inject external prompt features to enhance the awareness of salient objects. Comprehensive experiments demonstrate the superiority of our method. Surpassing former state-of-the-art (SOTA) models across five SOD datasets, ExPert achieves a mean absolute error (MAE) of 0.0215 on the ECSSD dataset with 80.2 M trained parameters, 21% better than SelfReformer [31] and 47% better than EGNet [33].
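To make the described architecture concrete, the following is a minimal PyTorch sketch of the adapter/injector idea from the abstract: bottleneck adapters and prompt-feature injectors interleaved with the blocks of a frozen transformer encoder. The bottleneck size, the cross-attention injector, and all module names are illustrative assumptions, not the authors' exact design.

```python
# Minimal sketch of frozen-backbone adapter tuning with prompt-feature injection.
# Assumes each backbone block maps a token sequence (B, N, C) to a token sequence;
# dimensions and module design are illustrative, not the paper's implementation.
import torch
import torch.nn as nn

class Adapter(nn.Module):
    """Residual bottleneck adapter: the trainable part attached to each frozen block."""
    def __init__(self, dim, bottleneck=64):
        super().__init__()
        self.down = nn.Linear(dim, bottleneck)
        self.act = nn.GELU()
        self.up = nn.Linear(bottleneck, dim)

    def forward(self, x):
        return x + self.up(self.act(self.down(x)))

class Injector(nn.Module):
    """Injects external prompt features into the backbone tokens via cross-attention."""
    def __init__(self, dim, heads=8):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.gamma = nn.Parameter(torch.zeros(1))  # start with no perturbation

    def forward(self, tokens, prompt):
        injected, _ = self.attn(query=tokens, key=prompt, value=prompt)
        return tokens + self.gamma * injected

class PromptEnhancedEncoder(nn.Module):
    def __init__(self, backbone_blocks, dim):
        super().__init__()
        self.blocks = nn.ModuleList(backbone_blocks)   # pre-trained transformer blocks
        for p in self.blocks.parameters():             # backbone stays frozen
            p.requires_grad = False
        self.adapters = nn.ModuleList(Adapter(dim) for _ in self.blocks)
        self.injectors = nn.ModuleList(Injector(dim) for _ in self.blocks)

    def forward(self, tokens, prompt):
        feats = []
        for block, adapter, injector in zip(self.blocks, self.adapters, self.injectors):
            tokens = injector(adapter(block(tokens)), prompt)
            feats.append(tokens)                       # multi-level features for the decoder
        return feats
```

Only the adapters, injectors, and decoder would be trained under this scheme, which is what keeps the number of trainable parameters small relative to full fine-tuning.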
Notes
- 1.
- 2. More details can be found in the supplementary materials.
- 3. The \(P \cdot F\) in Eq. 3 denotes a linear projection of the features F with parameters P.
- 4. CLIP [20] is a well-known vision-language model (VLM) trained on millions of image-text pairs. However, CLIP's text is a simple sentence containing the class name rather than a caption of the whole image. Moreover, CLIP's training images have a resolution of 224×224, so with a patch size of 32 the feature map is only 7×7, which is too small to upsample. Therefore, ExPert does not use CLIP's features.
- 5. More details can be found in the supplementary file.
- 6. We use the public SOD_Evaluation_Metrics code to compute the metrics.
- 7. Since EVP's official saliency maps are 352×352 rather than the original image size, we resize EVP's prediction maps to the ground-truth size before computing the metrics.
- 8. For models whose code or saliency maps are unavailable, we directly use the metric values reported in their published papers. The results of M3Net [30] are computed from the official saliency maps of its SwinB version.
- 9. The full fine-tuning baseline trains all parameters of the backbone and the decoder on the new datasets.
- 10. The head tuning baseline trains only the decoder while keeping the backbone frozen (both baselines are sketched after this list).
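Notes 9 and 10 describe the two baseline tuning protocols. The following minimal PyTorch sketch, assuming a generic model exposing `backbone` and `decoder` submodules (names and optimizer settings are illustrative, not taken from the paper), shows that they differ only in which parameters remain trainable.

```python
# Sketch of the two baseline tuning protocols from notes 9 and 10.
# `model.backbone` and `model.decoder` are assumed submodule names.
import torch

def full_finetune(model, lr=1e-4):
    # Full fine-tuning: every parameter of backbone and decoder is updated.
    for p in model.parameters():
        p.requires_grad = True
    return torch.optim.AdamW(model.parameters(), lr=lr)

def head_tune(model, lr=1e-4):
    # Head tuning: freeze the backbone, train only the decoder.
    for p in model.backbone.parameters():
        p.requires_grad = False
    return torch.optim.AdamW(model.decoder.parameters(), lr=lr)
```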
References
Achanta, R., Hemami, S., Estrada, F., Susstrunk, S.: Frequency-tuned salient region detection. In: 2009 IEEE Conference on Computer Vision and Pattern Recognition, pp. 1597–1604. IEEE (2009)
Bahng, H., Jahanian, A., Sankaranarayanan, S., Isola, P.: Exploring visual prompts for adapting large-scale models. arXiv preprint arXiv:2203.17274 (2022)
Caron, M., et al.: Emerging properties in self-supervised vision transformers. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 9650–9660 (2021)
Chen, S., et al.: AdaptFormer: adapting vision transformers for scalable visual recognition. In: Advances in Neural Information Processing Systems, vol. 35, pp. 16664–16678 (2022)
Chen, Z., et al.: Vision transformer adapter for dense predictions. arXiv preprint arXiv:2205.08534 (2022)
Dosovitskiy, A., et al.: An image is worth \(16\times 16\) words: transformers for image recognition at scale. arXiv preprint arXiv:2010.11929 (2020)
Fan, D.P., Cheng, M.M., Liu, Y., Li, T., Borji, A.: Structure-measure: a new way to evaluate foreground maps. In: Proceedings of the IEEE International Conference on Computer Vision, pp. 4548–4557 (2017)
Fan, D.P., Gong, C., Cao, Y., Ren, B., Cheng, M.M., Borji, A.: Enhanced-alignment measure for binary foreground map evaluation. arXiv preprint arXiv:1805.10421 (2018)
Houlsby, N., et al.: Parameter-efficient transfer learning for NLP. In: International Conference on Machine Learning, pp. 2790–2799. PMLR (2019)
Jia, M., et al.: Visual prompt tuning. In: Avidan, S., Brostow, G., Cissé, M., Farinella, G.M., Hassner, T. (eds.) European Conference on Computer Vision, pp. 709–727. Springer, Cham (2022). https://doi.org/10.1007/978-3-031-19827-4_41
Li, G., Yu, Y.: Visual saliency based on multiscale deep features. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 5455–5463 (2015)
Li, J., Qiao, S., Zhao, Z., Xie, C., Chen, X., Xia, C.: Rethinking lightweight salient object detection via network depth-width tradeoff. IEEE Trans. Image Process. 32, 5664–5677 (2023)
Li, J., Li, D., Xiong, C., Hoi, S.: BLIP: bootstrapping language-image pre-training for unified vision-language understanding and generation. In: International Conference on Machine Learning, pp. 12888–12900. PMLR (2022)
Li, Y., Hou, X., Koch, C., Rehg, J.M., Yuille, A.L.: The secrets of salient object segmentation. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 280–287 (2014)
Liu, N., Zhang, N., Wan, K., Shao, L., Han, J.: Visual saliency transformer. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 4722–4732 (2021)
Liu, P., Yuan, W., Fu, J., Jiang, Z., Hayashi, H., Neubig, G.: Pre-train, prompt, and predict: a systematic survey of prompting methods in natural language processing. ACM Comput. Surv. 55(9), 1–35 (2023)
Liu, W., Shen, X., Pun, C.M., Cun, X.: Explicit visual prompting for universal foreground segmentations. arXiv preprint arXiv:2305.18476 (2023)
Ma, M., Xia, C., Xie, C., Chen, X., Li, J.: Boosting broader receptive fields for salient object detection. IEEE Trans. Image Process. 32, 1026–1038 (2023)
Qin, X., Zhang, Z., Huang, C., Dehghan, M., Zaiane, O.R., Jagersand, M.: U\(^2\)-Net: going deeper with nested U-structure for salient object detection. Pattern Recogn. 106, 107404 (2020)
Radford, A., et al.: Learning transferable visual models from natural language supervision. In: International Conference on Machine Learning, pp. 8748–8763. PMLR (2021)
Rebuffi, S.A., Bilen, H., Vedaldi, A.: Learning multiple visual domains with residual adapters. In: Advances in Neural Information Processing Systems, vol. 30 (2017)
Ren, S., Wen, Q., Zhao, N., Han, G., He, S.: Unifying global-local representations in salient object detection with transformer. arXiv preprint arXiv:2108.02759 (2021)
Song, X., Guo, F., Zhang, L., Lu, X., Hei, X.: Salient object detection with dual-branch stepwise feature fusion and edge refinement. IEEE Trans. Circuits Syst. Video Technol. 34, 2832–2844 (2023)
Wang, L., et al.: Learning to detect salient objects with image-level supervision. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 136–145 (2017)
Xia, C., Sun, Y., Fang, X., Ge, B., Gao, X., Li, K.C.: IMSFNet: integrated multi-source feature network for salient object detection. Appl. Intell. 53(19), 22228–22248 (2023)
Xie, E., Wang, W., Yu, Z., Anandkumar, A., Alvarez, J.M., Luo, P.: SegFormer: simple and efficient design for semantic segmentation with transformers. In: Advances in Neural Information Processing Systems, vol. 34, pp. 12077–12090 (2021)
Yan, Q., Xu, L., Shi, J., Jia, J.: Hierarchical saliency detection. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 1155–1162 (2013)
Yang, C., Zhang, L., Lu, H., Ruan, X., Yang, M.H.: Saliency detection via graph-based manifold ranking. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 3166–3173 (2013)
Yu, J., Jiang, Y., Wang, Z., Cao, Z., Huang, T.: UnitBox: an advanced object detection network. In: Proceedings of the 24th ACM International Conference on Multimedia, pp. 516–520 (2016)
Yuan, Y., Gao, P., Tan, X.: M3Net: multilevel, mixed and multistage attention network for salient object detection. arXiv preprint arXiv:2309.08365 (2023)
Yun, Y.K., Lin, W.: SelfReformer: self-refined network with transformer for salient object detection. arXiv preprint arXiv:2205.11283 (2022)
Zhang, Q., Zhao, R., Zhang, L.: TCRNet: a trifurcated cascaded refinement network for salient object detection. IEEE Trans. Circuits Syst. Video Technol. 33(1), 298–311 (2022)
Zhao, J.X., Liu, J.J., Fan, D.P., Cao, Y., Yang, J., Cheng, M.M.: EGNet: edge guidance network for salient object detection. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 8779–8788 (2019)
Zhao, T., Wu, X.: Pyramid feature attention network for saliency detection. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 3085–3094 (2019)
Zheng, S., et al.: Rethinking semantic segmentation from a sequence-to-sequence perspective with transformers. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 6881–6890 (2021)
Copyright information
© 2025 The Author(s), under exclusive license to Springer Nature Switzerland AG
About this paper
Cite this paper
Liang, W. et al. (2025). External Prompt Features Enhanced Parameter-Efficient Fine-Tuning for Salient Object Detection. In: Antonacopoulos, A., Chaudhuri, S., Chellappa, R., Liu, CL., Bhattacharya, S., Pal, U. (eds) Pattern Recognition. ICPR 2024. Lecture Notes in Computer Science, vol 15323. Springer, Cham. https://doi.org/10.1007/978-3-031-78347-0_6
Publisher Name: Springer, Cham
Print ISBN: 978-3-031-78346-3
Online ISBN: 978-3-031-78347-0