Abstract
While Contrastive Language-Image Pretraining (CLIP) has become the de facto standard for vision-language pretraining, the inherent long-tailed distribution of pretraining data remains underexplored. From a neural collapse perspective, we show in principle that vanilla CLIP training can be vulnerable to long-tailed distributions, which may distort the learned representations, reducing inter-class separation and discriminative ability. To combat this issue, we propose an improved method, termed Geometry-Balanced CLIP (GeoCLIP), which automatically constructs pseudo clusters and aligns them with a predefined equiangular geometric structure, thereby enjoying the theoretical merit of better maintaining uniformity at the semantic level. Furthermore, we enhance GeoCLIP's generality to complex real-world distributions by incorporating harmonized clusters that integrate both empirically observed data structure and theoretically optimal geometry. Extensive experiments across various benchmarks demonstrate the consistent superiority of GeoCLIP in learning robust and transferable representations under long-tailed distributions. The source code will be made publicly available.





Notes
Equiangular means that vectors (or points) are arranged such that the angle between any pair is equal.
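As an illustrative sketch of this definition (not the paper's implementation), the snippet below constructs a simplex equiangular tight frame, a canonical arrangement in which every pair of unit vectors meets at the same angle, with pairwise cosine similarity exactly -1/(k-1):

```python
import numpy as np

def simplex_etf(k, d, seed=0):
    """Build k unit vectors in R^d (d >= k - 1) forming a simplex
    equiangular tight frame: every pair of distinct vectors has the
    same cosine similarity, exactly -1/(k-1)."""
    assert d >= k - 1, "need at least k - 1 dimensions"
    rng = np.random.default_rng(seed)
    # Orthonormal columns via QR of a random Gaussian matrix.
    u, _ = np.linalg.qr(rng.normal(size=(d, k)))
    # Center the columns, then rescale so each has unit norm.
    return np.sqrt(k / (k - 1)) * (u @ (np.eye(k) - np.ones((k, k)) / k))

vecs = simplex_etf(k=4, d=8)          # columns are the frame vectors
cos = vecs.T @ vecs                   # pairwise cosine similarities
off_diag = cos[~np.eye(4, dtype=bool)]
print(np.allclose(off_diag, -1 / 3))  # True: all pairwise angles equal
```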
Due to limited computational resources and the size of the labeled image-text data required for larger pretraining corpora, we use linear-probing benchmarks rather than zero-shot evaluation to assess transfer classification performance; this allows a clearer and more stable comparison between the baseline methods and ours. Notably, our method also achieves a significant zero-shot gain, reaching 14.09% Top-1 accuracy versus CLIP's 10.33% on the same downstream classification datasets in Table 4. This trend is consistent with the other transfer evaluations detailed in Tables 4 and 5.
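Linear probing, as used above, trains only a linear classifier on frozen encoder features. The following is a minimal numpy-only stand-in (ridge regression to one-hot targets; the paper's exact protocol, e.g. solver and regularization, may differ), demonstrated on synthetic Gaussian clusters in place of real CLIP features:

```python
import numpy as np

def linear_probe_acc(train_x, train_y, test_x, test_y, num_classes, lam=1e-3):
    """Fit a ridge-regression linear classifier on frozen features and
    report top-1 accuracy. Illustrative stand-in for linear probing."""
    def with_bias(x):
        # Append a constant column so the probe learns a bias term.
        return np.hstack([x, np.ones((x.shape[0], 1))])
    xb = with_bias(train_x)
    targets = np.eye(num_classes)[train_y]  # one-hot labels
    # Closed-form ridge solution: W = (X^T X + lam I)^-1 X^T Y
    w = np.linalg.solve(xb.T @ xb + lam * np.eye(xb.shape[1]), xb.T @ targets)
    preds = (with_bias(test_x) @ w).argmax(axis=1)
    return float((preds == test_y).mean())

# Toy demo: three well-separated Gaussian clusters as "frozen features".
rng = np.random.default_rng(0)
centers = rng.normal(size=(3, 16)) * 5.0
feats = np.vstack([c + rng.normal(size=(50, 16)) for c in centers])
labels = np.repeat(np.arange(3), 50)
acc = linear_probe_acc(feats, labels, feats, labels, num_classes=3)
```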
Funding
This work is supported by the National Natural Science Foundation of China (No. 62306178), STCSM (Nos. 22511106101, 22DZ2229005), and the 111 Plan (No. BP0719010).
Author information
Authors and Affiliations
Contributions
Idea: Z. Z.; Methodology (including literature review): Z. Z.; Experiments: Z. Z., Y. Y.; Writing - original draft: Z. Z.; Writing - comments/edits: all; Supervision: P. Z., J. Y., Y. Z., Q. T., and Y. W.
Corresponding authors
Ethics declarations
Conflict of interest
The authors have no financial or non-financial interests to disclose that are relevant to the content of this article.
Ethics approval and consent to participate
Not applicable.
Consent for publication
Not applicable.
Data availability
All datasets used in this work are available online and clearly cited. The data split for long-tailed sampling subsets will be available along with the code.
Materials availability
Not applicable.
Code availability
The code for this work will be made available after publication.
Additional information
Editor: Mingming Gong.
Publisher's Note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Rights and permissions
Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.
About this article
Cite this article
Zhou, Z., Ye, Y., Hong, F. et al. Uncover the balanced geometry in long-tailed contrastive language-image pretraining. Mach Learn 114, 106 (2025). https://doi.org/10.1007/s10994-025-06745-w