Equiangular Basis Vectors: A Novel Paradigm for Classification Tasks

Published in: International Journal of Computer Vision

Abstract

In this paper, we propose Equiangular Basis Vectors (EBVs) as a novel training paradigm of deep learning for image classification tasks. Differing from prominent training paradigms, e.g., k-way classification layers (mapping the learned representations to the label space) and deep metric learning (quantifying sample similarity), our method generates normalized vector embeddings as "predefined classifiers", which act as the fixed learning targets corresponding to different categories. By minimizing the spherical distance between the embedding of an input and its categorical EBV during training, predictions can be obtained at inference by identifying the categorical EBV with the smallest distance. More importantly, by directly adding EBVs of equal status for newly added categories on the basis of existing EBVs, our method exhibits strong scalability for handling large increases in the number of training categories in open-environment machine learning. In experiments, we evaluate EBVs on diverse computer vision tasks with large-scale real-world datasets, including classification on ImageNet-1K, object detection on COCO, and semantic segmentation on ADE20K. We further collected a dataset consisting of 100,000 categories to validate the superior performance of EBVs when handling a large number of categories. Comprehensive experiments validate both the effectiveness and scalability of our EBVs. Our method won first place in the 2022 DIGIX Global AI Challenge; the code and all associated logs are open-source and available at https://github.com/aassxun/Equiangular-Basis-Vectors.
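The prediction rule described above can be sketched in a few lines (a minimal NumPy illustration under our own assumptions: `W` below is a random unit-norm matrix standing in for trained EBVs, and `predict` is a hypothetical helper, not the authors' implementation):

```python
import numpy as np

rng = np.random.default_rng(0)
d, N = 64, 10  # feature dimension, number of categories

# Fixed "predefined classifiers": one unit vector per category.
# (Random placeholders here; the paper constrains them to be near-equiangular.)
W = rng.normal(size=(d, N))
W /= np.linalg.norm(W, axis=0, keepdims=True)

def predict(z: np.ndarray, W: np.ndarray) -> int:
    """Pick the categorical EBV with the smallest spherical distance,
    i.e., the largest cosine similarity to the embedding z."""
    z = z / np.linalg.norm(z)
    return int(np.argmax(W.T @ z))

# An embedding lying close to category 3's EBV is assigned to category 3.
z = W[:, 3] + 0.01 * rng.normal(size=d)
label = predict(z, W)
```

Adding a new category amounts to appending one more unit column to `W`, which is the scalability property emphasized in the abstract.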


Data Availability

Image data in Tables 1, 3 and 4 were extracted from ImageNet-1K (Deng et al., 2009). Image data in Table 2 were extracted from ImageNet-1K (Deng et al., 2009), CUB200-2011 (Wah et al., 2011) and Aircraft (Maji et al., 2013). Image data in Tables 5 and 6 were extracted from COCO 2017 (Lin et al., 2014). Image data in Tables 7 and 8 were extracted from ADE20K (Zhou et al., 2019). Image data in Table 9 were extracted from the citizen science website iNaturalist (www.inaturalist.org). Image data in Table 10 were extracted from CIFAR-10 and CIFAR-100 (Cao et al., 2019). Image data in Table 11 were extracted from iNaturalist 2018 (Van Horn et al., 2018). Image data in Tables 12 and 13 were extracted from ISUN (Xu et al., 2015), Place365 (Zhou et al., 2017), Texture (Cimpoi et al., 2014), SVHN (Netzer et al., 2011), LSUN-Crop (Yu et al., 2015) and LSUN-Resize (Yu et al., 2015).

Notes

  1. www.inaturalist.org.

  2. www.inaturalist.org.

References

  • Bao, H., Dong, L., Piao, S., & Wei, F. (2021). Beit: Bert pre-training of image transformers. arXiv preprint arXiv:2106.08254

  • Bellet, A., Habrard, A., & Sebban, M. (2013). A survey on metric learning for feature vectors and structured data. arXiv preprint arXiv:1306.6709

  • Cao, K., Wei, C., Gaidon, A., Arechiga, N., & Ma, T. (2019). Learning imbalanced datasets with label-distribution-aware margin loss. Advances in Neural Information Processing Systems, pp. 1567–1578

  • Caron, M., Touvron, H., Misra, I., Jégou, H., Mairal, J., Bojanowski, P., & Joulin, A. (2021). Emerging properties in self-supervised vision transformers. In Proceedings of the IEEE/CVF international conference on computer vision, pp. 9650–9660

  • Chapelle, O., Haffner, P., & Vapnik, V. N. (1999). Support vector machines for histogram-based image classification. IEEE Transactions on Neural Networks, 10(5), 1055–1064.


  • Chen, T., Kornblith, S., Norouzi, M., & Hinton, G. (2020). A simple framework for contrastive learning of visual representations. In International Conference on Machine Learning, pp. 1597–1607

  • Chen, K., Wang, J., Pang, J., Cao, Y., Xiong, Y., Li, X., Sun, S., Feng, W., Liu, Z., Xu, J., Zhang, Z., Cheng, D., Zhu, C., Cheng, T., Zhao, Q., Li, B., Lu, X., Zhu, R., Wu, Y., Dai, J., Wang, J., Shi, J., Ouyang, W., Loy, C.C., & Lin, D. (2019). MMDetection: Open MMLab detection toolbox and benchmark. arXiv preprint arXiv:1906.07155

  • Cimpoi, M., Maji, S., Kokkinos, I., Mohamed, S., & Vedaldi, A. (2014). Describing textures in the wild. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 3606–3613

  • Contributors, M. (2020). MMSegmentation: Open MMLab semantic segmentation toolbox and benchmark. Available online: https://github.com/open-mmlab/mmsegmentation (Retrieved on 18 May 2022)

  • Cortes, C., Mohri, M., & Rostamizadeh, A. (2012). Algorithms for learning kernels based on centered alignment. Journal of Machine Learning Research, 13, 795–828.


  • De Boer, P. T., Kroese, D. P., Mannor, S., & Rubinstein, R. Y. (2005). A tutorial on the cross-entropy method. Annals of Operations Research, 134(1), 19–67.


  • Deng, J., Dong, W., Socher, R., Li, L.J., Li, K., & Fei-Fei, L. (2009). ImageNet: A large-scale hierarchical image database. In IEEE conference on computer vision and pattern recognition, pp. 248–255

  • Deng, J., Guo, J., Xue, N., & Zafeiriou, S. (2019). Arcface: Additive angular margin loss for deep face recognition. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pp. 4690–4699

  • Dosovitskiy, A., Beyer, L., Kolesnikov, A., Weissenborn, D., Zhai, X., Unterthiner, T., Dehghani, M., Minderer, M., Heigold, G., Gelly, S., Uszkoreit, J., & Houlsby, N. (2020). An image is worth 16x16 words: Transformers for image recognition at scale. arXiv preprint arXiv:2010.11929

  • Elad, M. (2010). Sparse and redundant representations: from theory to applications in signal and image processing, vol. 2. Springer

  • Ericson, T., & Zinoviev, V. (2001). Codes on Euclidean spheres. Elsevier

  • Glazyrin, A., & Yu, W. H. (2018). Upper bounds for s-distance sets and equiangular lines. Advances in Mathematics, 330, 810–833.


  • Gretton, A., Fukumizu, K., Teo, C., Song, L., Schölkopf, B., & Smola, A. (2007). A kernel statistical test of independence. Advances in neural information processing systems, pp. 585–592

  • Guo, Y., Wang, X., Chen, Y., & Yu, S.X. (2022). Clipped hyperbolic classifiers are super-hyperbolic classifiers. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 11–20

  • He, K., Chen, X., Xie, S., Li, Y., Dollár, P., & Girshick, R. (2022). Masked autoencoders are scalable vision learners. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pp. 16000–16009

  • He, K., Gkioxari, G., Dollár, P., & Girshick, R. (2017). Mask R-CNN. In IEEE international conference on computer vision, pp. 2961–2969

  • He, K., Zhang, X., Ren, S., & Sun, J. (2016). Deep residual learning for image recognition. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 770–778

  • Hinton, G., Vinyals, O., & Dean, J. (2015). Distilling the knowledge in a neural network. arXiv preprint arXiv:1503.02531

  • Hu, G., Xu, Z., Wang, G., Zeng, B., Liu, Y., & Lei, Y. (2021). Forecasting energy consumption of long-distance oil products pipeline based on improved fruit fly optimization algorithm and support vector regression. Energy, 224, 120153.

  • Jiang, Z., Tidor, J., Yao, Y., Zhang, S., & Zhao, Y. (2021). Equiangular lines with a fixed angle. Annals of Mathematics, 194(3), 729–743.


  • Johannes, H. (1948). Equilateral point-sets in elliptic two- and three-dimensional spaces. Nieuw Arch. Wiskunde, 22(2), 355–362.


  • Kaya, M., & Bilge, H. Ş. (2019). Deep metric learning: A survey. Symmetry, 11(9), 1066.


  • Kingma, D.P., & Ba, J. (2014). Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980

  • Kirillov, A., Girshick, R., He, K., & Dollár, P. (2019). Panoptic feature pyramid networks. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pp. 6399–6408

  • Kolesnikov, A., Beyer, L., Zhai, X., Puigcerver, J., Yung, J., Gelly, S., & Houlsby, N. (2020). Big Transfer (BiT): General visual representation learning. In European Conference Computer Vision pp. 491–507. Springer

  • Kornblith, S., Norouzi, M., Lee, H., & Hinton, G. (2019). Similarity of neural network representations revisited. In International conference on machine learning, pp. 3519–3529

  • Krizhevsky, A., Sutskever, I., & Hinton, G. E. (2017). ImageNet classification with deep convolutional neural networks. Communications of the ACM, 60(6), 84–90.


  • LeCun, Y., Bottou, L., Bengio, Y., & Haffner, P. (1998). Gradient-based learning applied to document recognition. Proceedings of the IEEE, 86(11), 2278–2324.


  • Lin, T.Y., Maire, M., Belongie, S., Hays, J., Perona, P., Ramanan, D., Dollár, P., & Zitnick, C.L. (2014). Microsoft COCO: Common objects in context. In European Conference Computer Vision., pp. 740–755

  • Liu, Z., Hu, H., Lin, Y., Yao, Z., Xie, Z., Wei, Y., Ning, J., Cao, Y., Zhang, Z., Dong, L., et al. (2022). Swin transformer v2: Scaling up capacity and resolution. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pp. 12009–12019

  • Liu, Z., Lin, Y., Cao, Y., Hu, H., Wei, Y., Zhang, Z., Lin, S., & Guo, B. (2021). Swin transformer: Hierarchical vision transformer using shifted windows. In: International Conference on Computer Vision, pp. 10012–10022

  • Liu, Z., Mao, H., Wu, C.Y., Feichtenhofer, C., Darrell, T., & Xie, S. (2022). A ConvNet for the 2020s. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pp. 11976–11986

  • Liu, W., Wang, X., Owens, J., & Li, Y. (2020). Energy-based out-of-distribution detection. Advances in Neural Information Processing Systems, 33, 21464–21475.


  • Loshchilov, I., & Hutter, F. (2017). Decoupled weight decay regularization. arXiv preprint arXiv:1711.05101

  • Lu, D., & Weng, Q. (2007). A survey of image classification methods and techniques for improving classification performance. International Journal of Remote Sensing, 28(5), 823–870.


  • Maji, S., Rahtu, E., Kannala, J., Blaschko, M., & Vedaldi, A. (2013). Fine-grained visual classification of aircraft. arXiv preprint arXiv:1306.5151

  • Maxwell, A. E., Warner, T. A., & Fang, F. (2018). Implementation of machine-learning classification in remote sensing: An applied review. International Journal of Remote Sensing, 39(9), 2784–2817.


  • McCallum, A., Freitag, D., & Pereira, F.C. (2000). Maximum entropy markov models for information extraction and segmentation. In: International Conference on Machine Learning., pp. 591–598

  • Mensink, T., Verbeek, J., Perronnin, F., & Csurka, G. (2013). Distance-based image classification: Generalizing to new classes at near-zero cost. IEEE Transactions on Pattern Analysis and Machine Intelligence, 35(11), 2624–2637.


  • Mettes, P., Van der Pol, E., & Snoek, C. (2019). Hyperspherical prototype networks. Advances in neural information processing systems

  • Müller, S.G., & Hutter, F. (2021). Trivialaugment: Tuning-free yet state-of-the-art data augmentation. In Proceedings of the IEEE/CVF international conference on computer vision, pp. 774–782

  • Netzer, Y., Wang, T., Coates, A., Bissacco, A., Wu, B., & Ng, A.Y. (2011). Reading digits in natural images with unsupervised feature learning. In Advances in Neural Information Processing Systems Workshops, pp. 1–9

  • Pernici, F., Bruni, M., Baecchi, C., & Del Bimbo, A. (2021). Regular polytope networks. Advances in Neural Information Processing Systems, pp. 4373–4387

  • Ranjan, R., Castillo, C.D., & Chellappa, R. (2017). L2-constrained softmax loss for discriminative face verification. arXiv preprint arXiv:1703.09507

  • Rao, H., Leung, C., & Miao, C. (2023). Hierarchical skeleton meta-prototype contrastive learning with hard skeleton mining for unsupervised person re-identification. International Journal of Computer Vision, pp. 1–23

  • Rawat, W., & Wang, Z. (2017). Deep convolutional neural networks for image classification: A comprehensive review. Neural Computation, 29(9), 2352–2449.


  • Renes, J. M., Blume-Kohout, R., Scott, A. J., & Caves, C. M. (2004). Symmetric informationally complete quantum measurements. Journal of Mathematical Physics, 45(6), 2171–2180.


  • Robbins, H., & Monro, S. (1951). A stochastic approximation method. The Annals of Mathematical Statistics pp. 400–407

  • Rudin, W. (1953). Principles of mathematical analysis

  • Russakovsky, O., Deng, J., Su, H., Krause, J., Satheesh, S., Ma, S., Huang, Z., Karpathy, A., Khosla, A., Bernstein, M., Berg, C. A., & Fei-Fei, L. (2015). ImageNet large scale visual recognition challenge. International Journal of Computer Vision, 115(3), 211–252.


  • Selvaraju, R.R., Cogswell, M., Das, A., Vedantam, R., Parikh, D., & Batra, D. (2017). Grad-CAM: Visual explanations from deep networks via gradient-based localization. In International Conference on Computer Vision, pp. 618–626

  • Shen, Y., Sun, X., & Wei, X.S. (2023). Equiangular basis vectors. arXiv preprint arXiv:2303.11637

  • Snell, J., Swersky, K., & Zemel, R. (2017). Prototypical networks for few-shot learning. Advances in Neural Information Processing Systems, pp. 4080–4090

  • Strohmer, T., & Heath, R. W., Jr. (2003). Grassmannian frames with applications to coding and communication. Applied and Computational Harmonic Analysis, 14(3), 257–275.


  • Szegedy, C., Toshev, A., & Erhan, D. (2013). Deep neural networks for object detection. Advances in Neural Information Processing Systems., pp. 2553–2561

  • Szegedy, C., Vanhoucke, V., Ioffe, S., Shlens, J., & Wojna, Z. (2016). Rethinking the inception architecture for computer vision. In The IEEE / CVF Computer Vision and Pattern Recognition Conference., pp. 2818–2826

  • Tammes, P. M. L. (1930). On the origin of number and arrangement of the places of exit on the surface of pollen-grains. Recueil Des Travaux Botaniques Néerlandais, 27(1), 1–84.


  • Tulyakov, S., Jaeger, S., Govindaraju, V., & Doermann, D. (2008). Review of classifier combination methods. In Machine Learning in Document Analysis and Recognition, pp. 361–386

  • van Lint, J. H., & Seidel, J. J. (1966). Equilateral point sets in elliptic geometry. Indagationes Mathematicae, 28(3), 335–348.


  • Van der Maaten, L., & Hinton, G. (2008). Visualizing data using t-SNE. Journal of Machine Learning Research, 9(11)

  • Van Horn, G., Mac Aodha, O., Song, Y., Cui, Y., Sun, C., Shepard, A., Adam, H., Perona, P., & Belongie, S. (2018). The iNaturalist species classification and detection dataset. In: The IEEE / CVF Computer Vision and Pattern Recognition Conference, pp. 8769–8778

  • Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A.N., Kaiser, Ł., & Polosukhin, I. (2017). Attention is all you need. Advances in Neural Information Processing Systems, pp. 6000–6010

  • Vryniotis, V. (2021). How to train State-of-The-Art models using TorchVision’s latest primitives. https://pytorch.org/blog/how-to-train-state-of-the-art-models-using-torchvision-latest-primitives/

  • Wah, C., Branson, S., Welinder, P., Perona, P., & Belongie, S. (2011). The Caltech-UCSD birds-200-2011 dataset. Tech. Report CNS-TR-2011-001

  • Wang, J., Song, Y., Leung, T., Rosenberg, C., Wang, J., Philbin, J., Chen, B., & Wu, Y. (2014). Learning fine-grained image similarity with deep ranking. In The IEEE / CVF Computer Vision and Pattern Recognition Conference, pp. 1386–1393

  • Wang, H., Wang, Y., Zhou, Z., Ji, X., Gong, D., Zhou, J., Li, Z., & Liu, W. (2018). Cosface: Large margin cosine loss for deep face recognition. In: The IEEE / CVF Computer Vision and Pattern Recognition Conference, pp. 5265–5274

  • Wang, F., Xiang, X., Cheng, J., & Yuille, A.L. (2017). Normface: L2 hypersphere embedding for face verification. In: ACM International Conference Multimedia, pp. 1041–1049

  • Weinberger, K. Q., & Saul, L. K. (2009). Distance metric learning for large margin nearest neighbor classification. Journal of Machine Learning Research, 10(2), 207–244.


  • Wei, X. S., Song, Y. Z., Mac Aodha, O., Wu, J., Peng, Y., Tang, J., Yang, J., & Belongie, S. (2022). Fine-grained image analysis with deep learning: A survey. IEEE Transactions on Pattern Analysis and Machine Intelligence, 44(12), 8927–8948.


  • Wen, Y., Zhang, K., Li, Z., & Qiao, Y. (2016). A discriminative feature learning approach for deep face recognition. In European Conference Computer Vision, pp. 499–515

  • Wightman, R., Touvron, H., & Jégou, H. (2021). ResNet strikes back: An improved training procedure in timm. arXiv preprint arXiv:2110.00476

  • Wu, Z., Xiong, Y., Yu, S.X., & Lin, D. (2018). Unsupervised feature learning via non-parametric instance discrimination. In The IEEE / CVF Computer Vision and Pattern Recognition Conference, pp. 3733–3742

  • Xiao, T., Liu, Y., Zhou, B., Jiang, Y., & Sun, J. (2018). Unified perceptual parsing for scene understanding. In European Conference Computer Vision., pp. 418–434

  • Xu, P., Ehinger, K.A., Zhang, Y., Finkelstein, A., Kulkarni, S.R., & Xiao, J. (2015). Turkergaze: Crowdsourcing saliency with webcam based eye tracking. arXiv preprint arXiv:1504.06755

  • Xu, W., Xian, Y., Wang, J., Schiele, B., & Akata, Z. (2022). Attribute prototype network for any-shot learning. International Journal of Computer Vision, 130(7), 1735–1753.


  • Yang, Y., Xie, L., Chen, S., Li, X., Lin, Z., & Tao, D. (2022). Do we really need a learnable classifier at the end of deep neural network? arXiv preprint arXiv:2203.09081

  • Ye, H. J., Zhan, D. C., Jiang, Y., & Zhou, Z. H. (2022). Heterogeneous few-shot model rectification with semantic mapping. IEEE Transactions on Pattern Analysis and Machine Intelligence, 43(11), 3878–3891.


  • Yu, F., Seff, A., Zhang, Y., Song, S., Funkhouser, T., & Xiao, J. (2015). Lsun: Construction of a large-scale image dataset using deep learning with humans in the loop. arXiv preprint arXiv:1506.03365

  • Yun, S., Han, D., Oh, S.J., Chun, S., Choe, J., & Yoo, Y. (2019). CutMix: Regularization strategy to train strong classifiers with localizable features. In International Conference on Computer Vision, pp. 6023–6032

  • Zagoruyko, S., & Komodakis, N. (2016). Wide residual networks. arXiv preprint arXiv:1605.07146

  • Zhai, X., Kolesnikov, A., Houlsby, N., & Beyer, L. (2022). Scaling vision transformers. In The IEEE / CVF Computer Vision and Pattern Recognition Conference pp. 12104–12113

  • Zhang, H., Cisse, M., Dauphin, Y.N., & Lopez-Paz, D. (2017). mixup: Beyond empirical risk minimization. arXiv preprint arXiv:1710.09412

  • Zhong, Z., Zheng, L., Kang, G., Li, S., & Yang, Y. (2020). Random erasing data augmentation. In: AAAI, pp. 13001–13008

  • Zhou, Z. H. (2016). Learnware: On the future of machine learning. Frontiers of Computer Science,10(4), 589–590.

  • Zhou, B., Cui, Q., Wei, X.S., & Chen, Z.M. (2020). BBN: Bilateral-branch network with cumulative learning for long-tailed visual recognition. In The IEEE / CVF Computer Vision and Pattern Recognition Conference, pp. 9719–9728

  • Zhou, B., Lapedriza, A., Khosla, A., Oliva, A., & Torralba, A. (2017). Places: A 10 million image database for scene recognition. IEEE Transactions on Pattern Analysis and Machine Intelligence,40(6), 1452–1464.

  • Zhou, H.Y., Lu, C., Chen, C., Yang, S., & Yu, Y. (2023). A unified visual information preservation framework for self-supervised pre-training in medical image analysis. IEEE Transactions on Pattern Analysis and Machine Intelligence, 45(7), 8020–8035.

  • Zhou, B., Zhao, H., Puig, X., Xiao, T., Fidler, S., Barriuso, A., & Torralba, A. (2019). Semantic understanding of scenes through the ADE20k dataset. International Journal of Computer Vision, 127(3), 302–321.


Acknowledgements

The authors would like to thank the editor and the anonymous reviewers for their critical and constructive comments and suggestions. This work was supported by the National Key R&D Program of China (2021YFA1001100), the National Natural Science Foundation of China under Grant 62272231, and the Fundamental Research Funds for the Central Universities (4009002401).

Author information


Corresponding author

Correspondence to Xiu-Shen Wei.

Ethics declarations

Conflict of interest

The authors declare that they have no conflict of interest.

Additional information

Communicated by Zhouchen Lin.

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Appendix A


Proof of Why a Unit Hypersphere is Effective: Following Sects. 3.3 and 3.4, we have M sample-label pairs in N classes. \(\varvec{W}\in \mathbb {R}^{d\times N}\) denotes the EBV matrix, and \(\hat{\varvec{w}}_{i}\in \mathbb {R}^{d}\) represents each categorical EBV, where \(i \in \{1,2,\ldots ,N\}\) and \(\Vert \hat{\varvec{w}}_{i}\Vert =1\). Since we have already assumed that all samples are well-separated, we directly use \(\hat{\varvec{w}}_{i}\) to represent the i-th class' feature. We emphasize once again that the categorical EBV for each class remains unchanged throughout the training process, which distinguishes our method from prototype-based methods.
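For intuition, a set of nearly equiangular unit vectors can be obtained numerically. The sketch below is a toy illustration under our own assumptions (a simple penalty on pairwise inner products, with hypothetical helpers `pairwise_loss` and `make_ebvs`); it is not the paper's actual generation algorithm:

```python
import numpy as np

def pairwise_loss(W: np.ndarray) -> float:
    """Sum of squared off-diagonal cosines between the unit columns of W."""
    G = W.T @ W
    np.fill_diagonal(G, 0.0)
    return float((G ** 2).sum())

def make_ebvs(d: int, N: int, steps: int = 2000, lr: float = 0.1, seed: int = 0):
    """Push N unit vectors in R^d toward a nearly equiangular configuration
    by projected gradient descent on pairwise_loss."""
    rng = np.random.default_rng(seed)
    W = rng.normal(size=(d, N))
    W /= np.linalg.norm(W, axis=0, keepdims=True)
    for _ in range(steps):
        G = W.T @ W
        np.fill_diagonal(G, 0.0)
        W = W - lr * (W @ G) / N                       # gradient of 0.5 * loss
        W /= np.linalg.norm(W, axis=0, keepdims=True)  # project back to the sphere
    return W

W = make_ebvs(16, 32)  # N = 2d: more vectors than an orthonormal basis allows
```

The columns stay unit-norm throughout, matching the requirement \(\Vert \hat{\varvec{w}}_{i}\Vert =1\) above, and the optimization drives the off-diagonal cosines toward a small common magnitude.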

Assuming that every class has the same number of samples, the softmax loss is defined as:

$$\begin{aligned} \mathcal {L}_{\texttt {softmax}} = -\frac{1}{N}\sum _{i=1}^N{\log \frac{e^{\hat{\varvec{w}}_{i}^{\top } \hat{\varvec{w}}_{i}}}{\sum _{j=1}^N{e^{\hat{\varvec{w}}_{i}^{\top } \hat{\varvec{w}}_{j}}}}}\,. \end{aligned}$$
(15)

Additionally, since we adopt a temperature hyper-parameter \(\tau \) within the softmax loss, its formulation becomes:

$$\begin{aligned} \mathcal {L}_{\mathcal {S}} = -\frac{1}{N}\sum _{i=1}^N{\log \frac{e^{\frac{\hat{\varvec{w}}_{i}^{\top } \hat{\varvec{w}}_{i}}{\tau }}}{\sum _{j=1}^N {e^{\frac{\hat{\varvec{w}}_{i}^{\top } \hat{\varvec{w}}_{j}}{\tau }}}}}\,. \end{aligned}$$
(16)

Dividing both the numerator and denominator by \(e^{\frac{\hat{\varvec{w}}_{i}^{\top } \hat{\varvec{w}}_{i}}{\tau }} = e^{\frac{1}{\tau }}\), we obtain:

$$\begin{aligned} \begin{aligned} \mathcal {L}_\mathcal {S}&= -\frac{1}{N}\sum _{i=1}^N {\log \frac{1}{1 + \sum _{j=1,j\ne i}^N {e^{\frac{\hat{\varvec{w}}_{i}^{\top } \hat{\varvec{w}}_{j}}{\tau } - \frac{1}{\tau }}}}}\\&= \frac{1}{N}\sum _{i=1}^N {\log \left( {1 + \sum _{j=1,j\ne i}^N {e^{\frac{\hat{\varvec{w}}_{i}^{\top } \hat{\varvec{w}}_{j} - 1 }{\tau }}}} \right) }\,. \end{aligned} \end{aligned}$$
(17)

Evidently, \(f(x) = e^x\) is a convex function, so by Jensen's inequality \(\frac{1}{N}\sum _{i=1}^N {e^{x_i}} \ge e^{\frac{1}{N}\sum _{i=1}^{N}{x_i}}\) for any \(x_1,\ldots ,x_N\). Applying this to the inner sum over the \(N-1\) terms with \(j\ne i\), we have:

$$\begin{aligned} \begin{aligned} \mathcal {L}_\mathcal {S} \ge \frac{1}{N}\sum _{i=1}^N {\log \left( {1 + (N-1) e^{\frac{1}{N-1} \sum \limits _{j=1,j\ne i}^N{( \frac{\hat{\varvec{w}}_{i}^{\top } \hat{\varvec{w}}_{j} - 1 }{\tau } )}}}\right) }. \end{aligned} \end{aligned}$$
(18)
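The Jensen step used in the inequality above can be checked numerically (a quick sanity check with random values, not part of the proof):

```python
import numpy as np

rng = np.random.default_rng(1)
x = rng.normal(size=1000)

# Convexity of f(x) = e^x: the mean of exponentials dominates
# the exponential of the mean (Jensen's inequality).
lhs = np.mean(np.exp(x))
rhs = np.exp(np.mean(x))
```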

The equality holds if and only if all \(\frac{\hat{\varvec{w}}_{i}^{\top } \hat{\varvec{w}}_{j}}{\tau }\), \(1\le i < j \le N\), have the same value, i.e., features from different classes are pairwise equidistant. However, it has been proved that, for a fixed common angle, the maximum number of equiangular lines grows only linearly with the dimension d as \(d \rightarrow \infty \) (Jiang et al., 2021). Since we consider a number of categories much larger than the feature dimension d (e.g., 1000 classes with feature dimension 100 in Table 3, and 100,000 classes with feature dimension 5000 in Table 9; note that 'EBVs Dimension' equals the feature dimension), this equality cannot hold in practice. Following Wang et al. (2017), we therefore take the feature dimension into consideration and tighten the previous inequality.

Similar to \(f(x) = e^x\), the function \(s(x) = \log (1 + C e^x)\) is also convex when \(C>0\), so that \(\frac{1}{N}\sum _{i=1}^N{\log (1 + C e^{x_i})} \ge \log (1 + C e^{\frac{1}{N}\sum _{i=1}^{N}{x_i}})\). Then we have:

$$\begin{aligned} \begin{aligned}&\mathcal {L}_\mathcal {S} \ge \log \left( 1+ \left( N-1 \right) e^{\frac{1}{N(N-1)}\sum \limits _{i=1}^N\sum \limits _{j=1,j\ne i}^N{(\frac{\hat{\varvec{w}}_{i}^{\top } \hat{\varvec{w}}_{j} - 1}{\tau })}} \right) \\&= \log \left( 1+ \left( N-1 \right) e^{\left( \frac{1}{N(N-1)}\sum \limits _{i=1}^N\sum \limits _{j=1,j\ne i}^N{\frac{\hat{\varvec{w}}_{i}^{\top } \hat{\varvec{w}}_{j}}{\tau }}\right) - \frac{1}{\tau }} \right) \,. \end{aligned} \end{aligned}$$
(19)

This equality holds if and only if, for every \(\hat{\varvec{w}}_i\), the sum of inner products with the other classes' weights, \(\sum _{j=1,j\ne i}^N{\hat{\varvec{w}}_{i}^{\top } \hat{\varvec{w}}_{j}}\), takes the same value.

Note that

$$\begin{aligned} \left\| \sum _{i=1}^N{\hat{\varvec{w}}_i}\right\| _2^2 = N + \sum _{i=1}^N\sum _{j=1,j\ne i}^N{\hat{\varvec{w}}_i^{\top } \hat{\varvec{w}}_j}, \end{aligned}$$
(20)

so

$$\begin{aligned} \sum _{i=1}^N\sum _{j=1,j\ne i}^N{\frac{\hat{\varvec{w}}_{i}^{\top } \hat{\varvec{w}}_{j}}{\tau }} \ge -\frac{N}{\tau }. \end{aligned}$$
(21)

The equality holds if and only if \(\sum _{i=1}^N{\hat{\varvec{w}}_i}=\textbf{0}\). Thus,

$$\begin{aligned} \begin{aligned} \mathcal {L}_\mathcal {S}&\ge \log \left( 1+ \left( N-1 \right) e^{-\frac{N}{\tau N(N-1)}-\frac{1}{\tau }}\right) \\&=\log \left( 1+ \left( N-1 \right) e^{-\frac{N}{\tau (N-1)}}\right) . \end{aligned} \end{aligned}$$
(22)

Taking \(N=1000\) as an example, with \(\tau \) set to 0.07 as mentioned in Sect. 4.1.1, the lower bound is around 0.00062. That is, the temperature hyper-parameter already resolves the problem that the softmax loss would otherwise be trapped at a very high value on the training set when the features and weights are normalized to 1, so it is safe to keep the predefined EBVs on the surface of a unit hypersphere.
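The identity (20), the bound (21), and the value of the lower bound (22) can be verified numerically (a sanity check with random unit vectors standing in for the \(\hat{\varvec{w}}_i\); not part of the proof):

```python
import numpy as np

rng = np.random.default_rng(0)
d, N, tau = 64, 1000, 0.07

# Random unit columns standing in for the w_i.
W = rng.normal(size=(d, N))
W /= np.linalg.norm(W, axis=0, keepdims=True)

# Identity (20): ||sum_i w_i||^2 = N + sum_{i != j} w_i^T w_j.
G = W.T @ W
off_sum = G.sum() - np.trace(G)
assert np.isclose(np.linalg.norm(W.sum(axis=1)) ** 2, N + off_sum)

# Bound (21) follows because a squared norm is non-negative.
assert off_sum / tau >= -N / tau

# Lower bound (22) for N = 1000, tau = 0.07 is about 0.00062.
lower = np.log1p((N - 1) * np.exp(-N / (tau * (N - 1))))
```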

Rights and permissions

Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.


About this article


Cite this article

Shen, Y., Sun, X., Wei, XS. et al. Equiangular Basis Vectors: A Novel Paradigm for Classification Tasks. Int J Comput Vis 133, 372–397 (2025). https://doi.org/10.1007/s11263-024-02189-2
