
Uncover the balanced geometry in long-tailed contrastive language-image pretraining


Abstract

While Contrastive Language-Image Pretraining (CLIP) has become the de facto standard for vision-language pretraining, the inherently long-tailed distribution of pretraining data remains underexplored. From a neural collapse perspective, we show in principle that vanilla CLIP training is vulnerable to long-tailed distributions, which can distort the learned representations, reducing inter-class separation and discriminative ability. To combat this issue, we propose an improved method, termed Geometry-Balanced CLIP (GeoCLIP), which automatically constructs pseudo clusters and aligns them with a predefined equiangular geometric structure, thereby enjoying the theoretical merit of better maintaining uniformity at the semantic level. Furthermore, we enhance GeoCLIP's generality on complex real-world distributions by incorporating harmonized clusters that integrate both the empirically observed data structure and the theoretically optimal geometry. Extensive experiments across various benchmarks demonstrate the consistent superiority of GeoCLIP in learning robust and transferable representations under long-tailed distributions. The source code will be made publicly available.
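
To make the geometric idea in the abstract concrete: in the neural collapse literature, the optimal equiangular configuration for K classes is a simplex equiangular tight frame (ETF), whose K unit vectors all share pairwise cosine similarity -1/(K-1). The sketch below is a minimal illustration of aligning features with such a fixed structure, not the paper's implementation; the function names and the cosine alignment loss are assumptions of ours.

```python
import torch
import torch.nn.functional as F

def simplex_etf(num_clusters: int, dim: int) -> torch.Tensor:
    """Build a (num_clusters, dim) simplex equiangular tight frame (ETF).

    Every row is unit-norm, and the cosine similarity between any two
    distinct rows is -1/(num_clusters - 1): the maximally and equally
    separated configuration identified by neural collapse analyses.
    """
    assert dim >= num_clusters, "this simple construction needs dim >= K"
    # Orthonormal basis U in R^{dim x K} via reduced QR of a Gaussian matrix.
    u, _ = torch.linalg.qr(torch.randn(dim, num_clusters))
    # Center the K columns (remove the all-ones direction), then rescale
    # so that each column is unit-norm.
    k = num_clusters
    center = torch.eye(k) - torch.ones(k, k) / k
    m = (k / (k - 1)) ** 0.5 * (u @ center)
    return m.t()  # (K, dim)

def geometry_alignment_loss(features: torch.Tensor,
                            cluster_ids: torch.Tensor,
                            etf: torch.Tensor) -> torch.Tensor:
    """Pull each feature toward the fixed ETF anchor of its pseudo cluster."""
    anchors = etf[cluster_ids]                         # (B, dim)
    feats = F.normalize(features, dim=-1)
    return 1.0 - (feats * anchors).sum(dim=-1).mean()  # 1 - mean cosine
```

Because the anchors are fixed and equally separated, head and tail pseudo clusters receive the same geometric treatment regardless of their sizes, which is the uniformity property the abstract appeals to.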


Notes

  1. Equiangular means that the vectors (or points) are arranged so that the angle between any pair of them is equal; a small numerical check follows these notes.

  2. Due to limited computational resources and the amount of labeled image-text data that larger pretraining corpora would require, we use linear-probing benchmarks to evaluate transfer classification performance instead of zero-shot evaluation. This benchmark allows a clearer and more stable comparison between the baseline methods and ours. Notably, our method also achieves a significant zero-shot gain, with a Top-1 accuracy of 14.09% compared to CLIP's 10.33% on the same downstream classification datasets in Table 4. This trend is consistent with the other transfer evaluations detailed in Tables 4 and 5. A minimal sketch of the linear-probing protocol also follows these notes.
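
To make footnote 1 concrete, here is a small numerical check (illustrative only, not from the paper): three unit vectors spaced 120° apart in the plane form a two-dimensional simplex, and every distinct pair meets at the same angle.

```python
import numpy as np

# Three unit vectors in the plane, 120 degrees apart (a 2-D simplex).
angles = np.array([0.0, 2 * np.pi / 3, 4 * np.pi / 3])
v = np.stack([np.cos(angles), np.sin(angles)], axis=1)   # shape (3, 2)

# Gram matrix of pairwise cosines: every off-diagonal entry is -0.5,
# i.e. the angle between any pair of vectors is identical (120 degrees).
print(np.round(v @ v.T, 6))
```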
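
As background for footnote 2, linear probing freezes the pretrained encoder and fits only a linear classifier on its features. The sketch below illustrates the protocol under stated assumptions: the `.npy` files are hypothetical placeholders for features precomputed with the frozen image encoder, and the probe hyperparameters are illustrative rather than those used in the paper.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Hypothetical precomputed features: encode each split once with the frozen
# pretrained image encoder and save the arrays. File names are placeholders.
train_feats = np.load("train_features.npy")    # shape (N_train, d)
train_labels = np.load("train_labels.npy")     # shape (N_train,)
test_feats = np.load("test_features.npy")
test_labels = np.load("test_labels.npy")

# The linear probe: a single logistic-regression layer trained on frozen
# features; the encoder itself is never updated.
probe = LogisticRegression(max_iter=1000, C=1.0)
probe.fit(train_feats, train_labels)
print(f"Linear-probe Top-1 accuracy: {probe.score(test_feats, test_labels):.4f}")
```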


Funding

This work was supported by the National Natural Science Foundation of China (No. 62306178), STCSM (Nos. 22511106101, 22DZ2229005), and the 111 plan (No. BP0719010).

Author information


Contributions

Idea: Z. Z.; Methodology (including literature review): Z. Z.; Experiment: Z. Z., Y. Y.; Writing - original draft: Z. Z.; Writing - comments/edits: all; Supervision: P. Z., J. Y., Y. Z., Q. T., and Y. W.

Corresponding authors

Correspondence to Jiangchao Yao or Yanfeng Wang.

Ethics declarations

Conflict of interest

The authors have no financial or non-financial interests to disclose that are relevant to the content of this article.

Ethics approval and consent to participate

Not applicable.

Consent for publication

Not applicable.

Data availability

All datasets used in this work are available online and clearly cited. The data splits for the long-tailed sampling subsets will be released along with the code.

Materials availability

Not applicable.

Code availability

The code for this work will be made available after publication.

Additional information

Editor: Mingming Gong.

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Rights and permissions

Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.


About this article


Cite this article

Zhou, Z., Ye, Y., Hong, F. et al. Uncover the balanced geometry in long-tailed contrastive language-image pretraining. Mach Learn 114, 106 (2025). https://doi.org/10.1007/s10994-025-06745-w

