Research Article
DOI: 10.1145/3664647.3681224

Rethinking the Effect of Uninformative Class Name in Prompt Learning

Published: 28 October 2024

Abstract

Large pre-trained vision-language models such as CLIP have shown impressive zero-shot recognition performance. To adapt these models to downstream tasks, recent studies have focused on the learnable context + class name paradigm, which learns continuous prompt contexts on downstream datasets. In practice, the learned prompt context tends to overfit the base categories and generalizes poorly to novel categories outside the training data. Recent works have noticed this problem and proposed several improvements. In this work, we draw a new insight from empirical analysis: uninformative class names lead to degraded base-to-novel generalization in prompt learning, an effect usually overlooked by existing works. Motivated by this, we advocate improving the base-to-novel generalization of prompt learning by enhancing the semantic richness of class names. We coin our approach Information Disengagement based Associative Prompt Learning (IDAPL), a mechanism for the associative yet decoupled learning of the prompt context and the class name embedding. IDAPL effectively alleviates the learnable context's overfitting to base classes while learning a more informative semantic representation of the base classes by fine-tuning the class name embedding, leading to improved performance on both base and novel classes. Experimental results on eleven widely used few-shot learning benchmarks clearly validate the effectiveness of our proposed approach. Code is available at https://github.com/tiggers23/IDAPL.
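To make the paradigm concrete, below is a minimal PyTorch sketch of the "learnable context + class name" prompt construction the abstract describes. The module name, tensor shapes, and the tune_class_names switch are illustrative assumptions: the flag only mirrors, in spirit, IDAPL's idea of additionally fine-tuning class name embeddings, and is not the authors' implementation (see the linked repository for that).

```python
import torch
import torch.nn as nn

class PromptLearner(nn.Module):
    """Sketch of the "learnable context + class name" paradigm (CoOp-style).

    Each class prompt is [ctx_1, ..., ctx_M, name_token_1, ..., name_token_K],
    where the M context vectors are learned on the downstream dataset.
    """

    def __init__(self, n_ctx, ctx_dim, class_name_embeds, tune_class_names=False):
        super().__init__()
        # Shared, learnable context vectors: the part that tends to overfit
        # the base classes when trained on few-shot data.
        self.ctx = nn.Parameter(torch.randn(n_ctx, ctx_dim) * 0.02)
        # Class-name token embeddings, shape (n_cls, n_name_tokens, ctx_dim).
        # Conventional prompt learning keeps these frozen; making them
        # trainable loosely reflects the paper's point that enriching
        # class-name semantics helps base-to-novel generalization.
        self.name_embeds = nn.Parameter(
            class_name_embeds.clone(), requires_grad=tune_class_names
        )

    def forward(self):
        # Prepend the shared context to every class's name embedding.
        n_cls = self.name_embeds.shape[0]
        ctx = self.ctx.unsqueeze(0).expand(n_cls, -1, -1)
        return torch.cat([ctx, self.name_embeds], dim=1)

# Usage sketch: 3 classes, each name embedded as 4 tokens of width 512.
name_embeds = torch.randn(3, 4, 512)
learner = PromptLearner(n_ctx=16, ctx_dim=512, class_name_embeds=name_embeds)
prompts = learner()  # (3, 20, 512); fed to a (frozen) CLIP text encoder
```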



    Information

    Published In

    MM '24: Proceedings of the 32nd ACM International Conference on Multimedia
    October 2024
    11719 pages
    ISBN: 9798400706868
    DOI: 10.1145/3664647

    Publisher

    Association for Computing Machinery, New York, NY, United States


    Author Tags

    1. cross-category generalization
    2. embedding disengagement
    3. prompt learning
    4. vision-language model


    Conference

    MM '24: The 32nd ACM International Conference on Multimedia
    October 28 - November 1, 2024, Melbourne VIC, Australia

    Acceptance Rates

    MM '24 Paper Acceptance Rate: 1,150 of 4,385 submissions (26%)
    Overall Acceptance Rate: 2,145 of 8,556 submissions (25%)
