Abstract
Zero-shot multi-label image recognition is the task of recognizing multiple labels in an image for classes whose visual examples were never seen during training. Recently, with the emergence of large pre-trained vision-language models trained on billions of image-text pairs collected from the internet, visual and semantic features can be well aligned. In this paper, building on the pre-trained CLIP model, we propose a dual-branch task residual enhancement method with a parameter-free attention module that strengthens the interaction of inter-modal information for multi-label image recognition. The method adopts a dual-branch structure consisting of a global branch and a local branch; the local branch mitigates the dominance of global features and improves the understanding of image content in local regions. Our method surpasses state-of-the-art zero-shot multi-label learning methods on the VOC2007, MS-COCO, and NUS-WIDE datasets, and also performs well under partial-label settings. Code is available in the supplementary materials.
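The abstract's dual-branch idea can be illustrated with a minimal sketch. This is not the authors' implementation: the combination below assumes a task-residual classifier (frozen class text embeddings plus a learnable per-class residual, in the spirit of TaskRes), a global branch that scores the image-level feature, and a local branch that pools patch-class similarities with a parameter-free softmax attention. All function names, the mixing weights `alpha` and `beta`, and the pooling scheme are illustrative assumptions.

```python
import math

def l2norm(v):
    """Normalize a feature vector to unit length (guarding against zero norm)."""
    n = math.sqrt(sum(x * x for x in v)) or 1.0
    return [x / n for x in v]

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def dual_branch_logits(text_emb, residual, g, patches, alpha=0.5, beta=0.5):
    """Hypothetical dual-branch task-residual scoring.

    text_emb: frozen text embeddings, one per class (e.g. from CLIP's text encoder)
    residual: learnable per-class residual added to the frozen embeddings
    g:        global image feature; patches: local patch features
    alpha:    residual scale; beta: global/local mixing weight
    """
    # Task-residual classifier: frozen text embedding + scaled learnable residual.
    W = [l2norm([t + alpha * r for t, r in zip(te, re)])
         for te, re in zip(text_emb, residual)]
    g = l2norm(g)
    global_logits = [dot(w, g) for w in W]
    # Local branch: patch-class cosine similarities, pooled by a parameter-free
    # attention (softmax over patches, no learnable weights).
    sims = [[dot(w, l2norm(p)) for p in patches] for w in W]
    local_logits = []
    for s in sims:
        m = max(s)
        e = [math.exp(x - m) for x in s]
        z = sum(e)
        local_logits.append(sum((ei / z) * si for ei, si in zip(e, s)))
    # Fuse the two branches so local evidence can offset global-feature dominance.
    return [beta * gl + (1 - beta) * ll
            for gl, ll in zip(global_logits, local_logits)]
```

The fused score per class can then be thresholded (or passed through a sigmoid) to produce the multi-label prediction; the local term lets a class supported only by a few patches still receive a high score.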
S. Zhang and K. Dang contributed equally to this work.
Acknowledgements
This work was supported in part by the National Natural Science Foundation of China (NSFC) under Grants 62101453 and 62201467; in part by the China Postdoctoral Science Foundation under Grants 2022TQ0260 and 2023M742842; in part by the Young Talent Fund of the Xi'an Association for Science and Technology under Grant 959202313088; in part by the Innovation Capability Support Program of Shaanxi (Program No. 2024ZC-KJXX-043); in part by the Fundamental Research Funds for the Central Universities under Grant HYGJZN202331; and in part by the Natural Science Basic Research Program of Shaanxi Province (No. 2022JC-DW-08).
Copyright information
© 2025 The Author(s), under exclusive license to Springer Nature Switzerland AG
Cite this paper
Zhang, S. et al. (2025). Dual-Branch Task Residual Enhancement with Parameter-Free Attention for Zero-Shot Multi-label Image Recognition. In: Antonacopoulos, A., Chaudhuri, S., Chellappa, R., Liu, CL., Bhattacharya, S., Pal, U. (eds) Pattern Recognition. ICPR 2024. Lecture Notes in Computer Science, vol 15322. Springer, Cham. https://doi.org/10.1007/978-3-031-78312-8_11
Print ISBN: 978-3-031-78311-1
Online ISBN: 978-3-031-78312-8