Abstract
Zero-shot multi-label image recognition is the task of recognizing multiple labels in an image for classes whose visual examples were never seen during training. Recently, with the emergence of large pre-trained vision-language models trained on billions of image-text pairs collected from the internet, visual and semantic features can be well aligned. In this paper, building on the pre-trained CLIP model, we propose a dual-branch task residual enhancement method with a parameter-free attention module that strengthens the interaction of inter-modal information for multi-label image recognition. The method adopts a dual-branch structure consisting of a global branch and a local branch; the local branch mitigates the dominance of global features and improves the understanding of image content in local regions. Our method surpasses state-of-the-art zero-shot multi-label learning methods on the VOC2007, MS-COCO, and NUS-WIDE datasets, and also performs well under partial-label settings. Code is available in the supplementary materials.
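The abstract's dual-branch idea can be illustrated with a minimal sketch. This is not the authors' implementation: the combination below assumes a task-residual classifier (frozen class text embeddings plus a learnable per-class residual, in the spirit of TaskRes), a global branch that scores the image-level feature, and a local branch that pools patch-class similarities with a parameter-free softmax attention. All function names, the mixing weights `alpha` and `beta`, and the pooling scheme are illustrative assumptions.

```python
import math

def l2norm(v):
    """Normalize a feature vector to unit length (guarding against zero norm)."""
    n = math.sqrt(sum(x * x for x in v)) or 1.0
    return [x / n for x in v]

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def dual_branch_logits(text_emb, residual, g, patches, alpha=0.5, beta=0.5):
    """Hypothetical dual-branch task-residual scoring.

    text_emb: frozen text embeddings, one per class (e.g. from CLIP's text encoder)
    residual: learnable per-class residual added to the frozen embeddings
    g:        global image feature; patches: local patch features
    alpha:    residual scale; beta: global/local mixing weight
    """
    # Task-residual classifier: frozen text embedding + scaled learnable residual.
    W = [l2norm([t + alpha * r for t, r in zip(te, re)])
         for te, re in zip(text_emb, residual)]
    g = l2norm(g)
    global_logits = [dot(w, g) for w in W]
    # Local branch: patch-class cosine similarities, pooled by a parameter-free
    # attention (softmax over patches, no learnable weights).
    sims = [[dot(w, l2norm(p)) for p in patches] for w in W]
    local_logits = []
    for s in sims:
        m = max(s)
        e = [math.exp(x - m) for x in s]
        z = sum(e)
        local_logits.append(sum((ei / z) * si for ei, si in zip(e, s)))
    # Fuse the two branches so local evidence can offset global-feature dominance.
    return [beta * gl + (1 - beta) * ll
            for gl, ll in zip(global_logits, local_logits)]
```

The fused score per class can then be thresholded (or passed through a sigmoid) to produce the multi-label prediction; the local term lets a class supported only by a few patches still receive a high score.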
S. Zhang and K. Dang contributed equally to this work.
Acknowledgements
This work was supported in part by the National Natural Science Foundation of China (NSFC) under Grants 62101453 and 62201467; in part by the China Postdoctoral Science Foundation under Grants 2022TQ0260 and 2023M742842; in part by the Young Talent Fund of the Xi'an Association for Science and Technology under Grant 959202313088; in part by the Innovation Capability Support Program of Shaanxi (Program No. 2024ZC-KJXX-043); in part by the Fundamental Research Funds for the Central Universities under Grant HYGJZN202331; and in part by the Natural Science Basic Research Program of Shaanxi Province (No. 2022JC-DW-08).
Copyright information
© 2025 The Author(s), under exclusive license to Springer Nature Switzerland AG
Cite this paper
Zhang, S. et al. (2025). Dual-Branch Task Residual Enhancement with Parameter-Free Attention for Zero-Shot Multi-label Image Recognition. In: Antonacopoulos, A., Chaudhuri, S., Chellappa, R., Liu, CL., Bhattacharya, S., Pal, U. (eds) Pattern Recognition. ICPR 2024. Lecture Notes in Computer Science, vol 15322. Springer, Cham. https://doi.org/10.1007/978-3-031-78312-8_11
Print ISBN: 978-3-031-78311-1
Online ISBN: 978-3-031-78312-8