Abstract
While Contrastive Language-Image Pretraining (CLIP) has become the de facto standard for vision-language pretraining, the inherent long-tailed distribution of pretraining data remains underexplored. From a neural collapse perspective, we show in principle that vanilla CLIP training can be vulnerable to long-tailed distributions, which may distort the learned representations, reducing inter-class separation and discriminative ability. To combat this issue, we propose an improved method, termed Geometry-Balanced CLIP (GeoCLIP), which automatically constructs pseudo clusters and aligns them with a predefined equiangular geometric structure, thereby enjoying the theoretical merit of better maintaining uniformity at the semantic level. Furthermore, we enhance GeoCLIP's generality to complex real-world distributions by incorporating harmonized clusters that integrate both empirically observed data structure and theoretically optimal geometry. Extensive experiments across various benchmarks demonstrate the consistent superiority of GeoCLIP in learning robust and transferable representations under long-tailed distributions. The source code will be made publicly available.





Notes
Equiangular means that vectors (or points) are arranged such that the angle between any pair is equal.
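As an illustrative sketch of this definition (not the paper's implementation), the snippet below constructs a simplex equiangular tight frame, a canonical arrangement in which every pair of unit vectors meets at the same angle, with pairwise cosine similarity exactly -1/(k-1):

```python
import numpy as np

def simplex_etf(k, d, seed=0):
    """Build k unit vectors in R^d (d >= k - 1) forming a simplex
    equiangular tight frame: every pair of distinct vectors has the
    same cosine similarity, exactly -1/(k-1)."""
    assert d >= k - 1, "need at least k - 1 dimensions"
    rng = np.random.default_rng(seed)
    # Orthonormal columns via QR of a random Gaussian matrix.
    u, _ = np.linalg.qr(rng.normal(size=(d, k)))
    # Center the columns, then rescale so each has unit norm.
    return np.sqrt(k / (k - 1)) * (u @ (np.eye(k) - np.ones((k, k)) / k))

vecs = simplex_etf(k=4, d=8)          # columns are the frame vectors
cos = vecs.T @ vecs                   # pairwise cosine similarities
off_diag = cos[~np.eye(4, dtype=bool)]
print(np.allclose(off_diag, -1 / 3))  # True: all pairwise angles equal
```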
Due to limited computational resources and the size of the labeled image-text data required for larger pretraining corpora, we use linear-probing benchmarks rather than zero-shot evaluation to assess transfer classification performance; this allows a clearer and more stable comparison between the baseline methods and ours. Notably, our method also achieves a significant zero-shot gain, reaching 14.09% Top-1 accuracy versus CLIP's 10.33% on the same downstream classification datasets in Table 4. This trend is consistent with the other transfer evaluations detailed in Tables 4 and 5.
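Linear probing, as used above, trains only a linear classifier on frozen encoder features. The following is a minimal numpy-only stand-in (ridge regression to one-hot targets; the paper's exact protocol, e.g. solver and regularization, may differ), demonstrated on synthetic Gaussian clusters in place of real CLIP features:

```python
import numpy as np

def linear_probe_acc(train_x, train_y, test_x, test_y, num_classes, lam=1e-3):
    """Fit a ridge-regression linear classifier on frozen features and
    report top-1 accuracy. Illustrative stand-in for linear probing."""
    def with_bias(x):
        # Append a constant column so the probe learns a bias term.
        return np.hstack([x, np.ones((x.shape[0], 1))])
    xb = with_bias(train_x)
    targets = np.eye(num_classes)[train_y]  # one-hot labels
    # Closed-form ridge solution: W = (X^T X + lam I)^-1 X^T Y
    w = np.linalg.solve(xb.T @ xb + lam * np.eye(xb.shape[1]), xb.T @ targets)
    preds = (with_bias(test_x) @ w).argmax(axis=1)
    return float((preds == test_y).mean())

# Toy demo: three well-separated Gaussian clusters as "frozen features".
rng = np.random.default_rng(0)
centers = rng.normal(size=(3, 16)) * 5.0
feats = np.vstack([c + rng.normal(size=(50, 16)) for c in centers])
labels = np.repeat(np.arange(3), 50)
acc = linear_probe_acc(feats, labels, feats, labels, num_classes=3)
```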
Funding
This work is supported by the National Natural Science Foundation of China (No. 62306178), STCSM (Nos. 22511106101, 22DZ2229005), and the 111 Plan (No. BP0719010).
Author information
Authors and Affiliations
Contributions
Idea: Z. Z.; Methodology (including literature review): Z. Z.; Experiments: Z. Z., Y. Y.; Writing - original draft: Z. Z.; Writing - comments/edits: all; Supervision: P. Z., J. Y., Y. Z., Q. T., and Y. W.
Corresponding authors
Ethics declarations
Conflict of interest
The authors have no financial or non-financial interests to disclose that are relevant to the content of this article.
Ethics approval and consent to participate
Not applicable.
Consent for publication
Not applicable.
Data availability
All datasets used in this work are available online and clearly cited. The data split for long-tailed sampling subsets will be available along with the code.
Materials availability
Not applicable.
Code availability
The code for this work will be made available after publication.
Additional information
Editor: Mingming Gong.
Publisher's Note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Rights and permissions
Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.
About this article
Cite this article
Zhou, Z., Ye, Y., Hong, F. et al. Uncover the balanced geometry in long-tailed contrastive language-image pretraining. Mach Learn 114, 106 (2025). https://doi.org/10.1007/s10994-025-06745-w