
Dual-Branch Task Residual Enhancement with Parameter-Free Attention for Zero-Shot Multi-label Image Recognition

  • Conference paper
Pattern Recognition (ICPR 2024)

Part of the book series: Lecture Notes in Computer Science ((LNCS,volume 15322))


Abstract

Zero-shot multi-label image recognition is the task of recognizing multi-label images when no visual information about the target labels has been provided to the model during training. Recently, with the emergence of large pre-trained vision-language models trained on billions of image-text pairs collected from the internet, visual and semantic features can be well aligned. In this paper, building on the pre-trained CLIP model, we propose a dual-branch task residual enhancement method with a parameter-free attention module that strengthens the interaction of inter-modal information for multi-label image recognition. The method employs a dual-branch structure consisting of a global branch and a local branch; the local branch mitigates the dominance of global features and improves the model's understanding of local image regions. Our method surpasses state-of-the-art zero-shot multi-label learning approaches on the VOC2007, MS-COCO, and NUS-WIDE datasets, and also performs well in the partial-label setting. Code is available in the supplementary materials.
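To make the abstract's architecture concrete, here is a minimal sketch of the kind of pipeline it describes: a frozen set of CLIP text embeddings enhanced by a learnable task residual serves as the classifier, a parameter-free attention module (a SimAM-style energy gate is used here as a hypothetical stand-in; the paper's exact module may differ) reweights local patch features, and global and local logits are fused. All function names, shapes, and the fusion weights `alpha` and `beta` are illustrative assumptions, not the authors' implementation.

```python
import numpy as np

def parameter_free_attention(feats, eps=1e-6):
    """SimAM-style parameter-free attention over patch features.

    feats: (num_patches, dim) local features; no learnable weights involved.
    Each element is gated by a sigmoid of its energy (squared deviation
    from the mean, normalized by variance) -- an illustrative stand-in.
    """
    mu = feats.mean(axis=0, keepdims=True)
    var = feats.var(axis=0, keepdims=True)
    energy = (feats - mu) ** 2 / (4 * (var + eps)) + 0.5
    return feats * (1.0 / (1.0 + np.exp(-energy)))  # sigmoid gating

def dual_branch_logits(global_feat, patch_feats, text_emb, residual,
                       alpha=0.5, beta=0.5):
    """Fuse global and local branches against a task-residual classifier.

    global_feat: (dim,) CLIP image [CLS] feature.
    patch_feats: (num_patches, dim) CLIP local patch features.
    text_emb:    (num_classes, dim) frozen CLIP text embeddings.
    residual:    (num_classes, dim) learnable task residual (zeros at init).
    """
    # Task residual: frozen text embeddings plus a scaled learnable offset.
    W = text_emb + alpha * residual
    W = W / np.linalg.norm(W, axis=1, keepdims=True)

    # Global branch: cosine similarity of the global feature to each class.
    g = global_feat / np.linalg.norm(global_feat)
    global_logits = g @ W.T

    # Local branch: attend over patches, then take the per-class max,
    # so small objects can dominate their own class score.
    attended = parameter_free_attention(patch_feats)
    attended = attended / np.linalg.norm(attended, axis=1, keepdims=True)
    local_logits = (attended @ W.T).max(axis=0)

    return beta * global_logits + (1 - beta) * local_logits
```

In this sketch only `residual` would be trained (on text-only supervision in the zero-shot setting), while the CLIP encoders and `text_emb` stay frozen; the attention module adds no parameters at all.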

S. Zhang and K. Dang contributed equally to this work.



Acknowledgements

This work was supported in part by the National Natural Science Foundation of China (NSFC) under Grants 62101453 and 62201467; in part by the China Postdoctoral Science Foundation under Grants 2022TQ0260 and 2023M742842; in part by the Young Talent Fund of the Xi'an Association for Science and Technology under Grant 959202313088; in part by the Innovation Capability Support Program of Shaanxi (Program No. 2024ZC-KJXX-043); in part by the Fundamental Research Funds for the Central Universities under Grant HYGJZN202331; and in part by the Natural Science Basic Research Program of Shaanxi Province (No. 2022JC-DW-08).

Author information


Corresponding author

Correspondence to Yinghui Xing.


Electronic Supplementary Material

Below is the link to the electronic supplementary material.

Supplementary material 1 (zip 3326 KB)


Copyright information

© 2025 The Author(s), under exclusive license to Springer Nature Switzerland AG

About this paper


Cite this paper

Zhang, S. et al. (2025). Dual-Branch Task Residual Enhancement with Parameter-Free Attention for Zero-Shot Multi-label Image Recognition. In: Antonacopoulos, A., Chaudhuri, S., Chellappa, R., Liu, CL., Bhattacharya, S., Pal, U. (eds) Pattern Recognition. ICPR 2024. Lecture Notes in Computer Science, vol 15322. Springer, Cham. https://doi.org/10.1007/978-3-031-78312-8_11


  • DOI: https://doi.org/10.1007/978-3-031-78312-8_11

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-031-78311-1

  • Online ISBN: 978-3-031-78312-8

  • eBook Packages: Computer Science, Computer Science (R0)
