Pattern Recognition
Volume 122, February 2022, 108237

Zero-shot learning via a specific rank-controlled semantic autoencoder

https://doi.org/10.1016/j.patcog.2021.108237

Highlights

  • The proposed LSA model solves the domain shift problem and considers the low-rank structure of the reconstruction data.

  • The proposed SRSA model avoids minimizing the variance of the reconstruction data while minimizing the projection's rank.

  • Comprehensive experiments on six benchmark datasets demonstrate the effectiveness of the proposed approaches.

Abstract

Existing embedding zero-shot learning (ZSL) models usually learn a projection function from the visual feature space to the semantic embedding space, e.g., an attribute space or a word vector space. However, a projection learned from seen samples may not generalize well to unseen classes, which is known as the projection domain shift problem in ZSL. To address this issue, we propose a method named Low-rank Semantic Autoencoder (LSA) that exploits the low-rank structure of seen samples and maintains the sparsity of the reconstruction error, which further improves zero-shot learning capability. Moreover, to obtain a more robust projection for unseen classes, we propose a Specific Rank-controlled Semantic Autoencoder (SRSA) to accurately control the projection's rank. Extensive experiments on six benchmarks demonstrate the superiority of the proposed models over most existing embedding ZSL models under both the standard zero-shot setting and the more realistic generalized zero-shot setting.

Introduction

Humans can identify a large number of categories, about 30,000 basic classes and even more sub-classes. At the same time, humans are also very good at recognizing objects even if they have never seen any examples of them, which is known as the problem of zero-shot learning (ZSL) in machine learning. For example, a child who has seen cattle before can easily recognize a cow and learn that a cow looks like cattle with black-and-white coloring. Therefore, in machine learning, ZSL tries to recognize classes whose samples have not been available during training [1], [2], [3]. Recently, the number of new ZSL models has been increasing rapidly.

Zero-shot recognition mainly makes use of a training set of labeled seen classes and the semantic relationship between seen and unseen classes. The semantic embedding space is a high-dimensional vector space in which seen and unseen classes are related; it can be a semantic attribute space [4] or a semantic word vector space [5]. In this space, both seen and unseen classes are embedded as vectors, called class prototypes, and the relationship between two classes is usually measured by a distance between their prototypes. For example, the prototypes of cows and bison should be closer than those of cows and desks. To bridge the visual feature space and the semantic embedding space, most existing ZSL methods learn a projection function that maps data from the visual feature space to the semantic embedding space using a labeled training set containing seen classes only. When classifying unseen objects, this function projects the visual representations of unseen-class samples into the same semantic embedding space, which contains both unseen and seen class prototypes. A nearest neighbor (NN) search is then used to recognize unseen-class samples; this constitutes the testing and recognition process, sketched below.
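As a minimal illustration (not the paper's exact procedure), assuming a learned linear projection W, unit-normalized class prototypes, and cosine similarity as the distance, the recognition step might look like:

```python
import numpy as np

def zsl_predict(X, W, prototypes, class_ids):
    """Project visual features into the semantic space and assign each
    sample the label of its nearest class prototype.

    X          : (n, d) visual features (e.g., CNN activations)
    W          : (k, d) learned visual-to-semantic projection
    prototypes : (c, k) class prototypes (attribute or word vectors)
    class_ids  : (c,)   label array, one entry per prototype row
    """
    S = X @ W.T                                        # (n, k) semantic embeddings
    S = S / np.linalg.norm(S, axis=1, keepdims=True)   # unit-normalize embeddings
    P = prototypes / np.linalg.norm(prototypes, axis=1, keepdims=True)
    sims = S @ P.T                                     # (n, c) cosine similarities
    return class_ids[np.argmax(sims, axis=1)]          # nearest-prototype labels
```

At test time, `prototypes` contains the unseen (or, in the generalized setting, both seen and unseen) class prototypes.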

In early works of ZSL [6], attributes within a two-stage approach were widely used to predict the label of a sample belonging to one of the unseen classes. For example, Direct Attribute Prediction (DAP) [4] learns probabilistic attribute classifiers to estimate the posterior of each attribute for a sample; it then computes class posteriors and infers class labels with a maximum a posteriori (MAP) estimate. In contrast, Indirect Attribute Prediction (IAP) [4] first infers the posteriors of the seen classes with a multiclass classifier, uses these class probabilities to predict the attribute posteriors of each sample, and then derives the posteriors of the unseen classes from the predicted attributes. Moreover, the two-stage approach is also used when attributes are not available. For instance, CONSE [7] first infers seen-class posteriors and then projects the image feature into the word2vec space [8].
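For concreteness, DAP's second stage can be written as a MAP rule, following the standard formulation of Lampert et al. [4], where M is the number of attributes and a^z denotes the attribute signature of unseen class z:

```latex
% DAP second stage: pick the unseen class whose attribute signature
% best explains the predicted attribute posteriors (uniform class priors).
f(x) \;=\; \arg\max_{z} \prod_{m=1}^{M}
      \frac{p\left(a_m = a_m^{z} \mid x\right)}{p\left(a_m = a_m^{z}\right)}
```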

Recently, most ZSL models directly learn a mapping from the feature space to the semantic space. For example, ALE [9] learns a bilinear compatibility function between the feature space and the semantic space with a ranking loss. Likewise, SJE [10] learns a bilinear compatibility function but optimizes a structured SVM loss. Alternatively, DeViSE [11] learns a linear mapping using an efficient ranking loss formulation and is evaluated on the large-scale ImageNet dataset. ESZSL [12] learns the bilinear compatibility using the square loss and regularizes the objective with the Frobenius norm.
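These methods share a common bilinear form: with θ(x) the visual embedding of image x, φ(y) the semantic embedding of class y, and W the matrix to be learned, compatibility and prediction are

```latex
F(x, y; W) \;=\; \theta(x)^{\top} W \,\varphi(y),
\qquad
\hat{y} \;=\; \arg\max_{y \in \mathcal{Y}} \; F(x, y; W)
```

The methods differ mainly in the loss used to fit W (ranking, structured SVM, or square loss) and in how the objective is regularized.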

However, ZSL suffers from the domain shift problem [13]: the visual representations of the same attributes may look completely different across unseen classes. To solve this problem, Kodirov et al. [14] proposed a novel approach called the Semantic Autoencoder (SAE) for ZSL. First, the visual feature representation of an image is projected into a semantic space by an encoder, as in conventional ZSL models. Second, the projection in the semantic space is mapped back to the original space by a decoder, yielding a reconstructed visual feature representation. Although SAE works well on many datasets, it does not take the low-rank structure of the data into account, so its robustness suffers when it encounters images in the wild. For example, sharks, dolphins, and blue whales all share similar features, e.g., "a tail", so we aim to jointly optimize low-rank embedding and reconstruction to capture shared discriminative features across seen and unseen classes (see Fig. 1). In this paper, we propose a novel approach called the Low-rank Semantic Autoencoder (LSA) for zero-shot learning. The LSA model not only addresses the domain shift problem, but also considers the low-rank structure of the reconstruction data and the sparsity of the reconstruction error. In addition, we propose a Specific Rank-controlled Semantic Autoencoder (SRSA) to achieve explicit control over the rank of the projection. This model avoids minimizing the variance of the reconstruction data while minimizing the objective's rank. Experimental results on six benchmark datasets demonstrate that our proposed LSA and SRSA significantly outperform existing state-of-the-art ZSL models.
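For reference, SAE's tied linear encoder-decoder admits a closed-form solution via a Sylvester equation, as derived in Kodirov et al. [14]. Below is a minimal sketch of that baseline (variable names and the default λ are ours):

```python
import numpy as np
from scipy.linalg import solve_sylvester

def fit_sae(X, S, lam=0.2):
    """Solve min_W ||X - W^T S||_F^2 + lam * ||W X - S||_F^2.

    X : (d, n) visual features (columns are samples)
    S : (k, n) semantic embeddings of the same samples
    Setting the gradient to zero yields the Sylvester equation
        (S S^T) W + W (lam * X X^T) = (1 + lam) * S X^T,
    solved here with the Bartels-Stewart algorithm.
    """
    A = S @ S.T                      # (k, k)
    B = lam * (X @ X.T)              # (d, d)
    C = (1.0 + lam) * (S @ X.T)      # (k, d)
    return solve_sylvester(A, B, C)  # W: (k, d) encoder; W^T is the decoder
```

The tied-weights constraint (the decoder is the transpose of the encoder) is what forces the semantic projection to remain reconstructive, which is SAE's defense against domain shift.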


Related work

In this section, considering the key procedures in ZSL, we mainly review related work on embedding ZSL models and the projection domain shift problem in ZSL.

Proposed approach

In this section, we first present the model formulation and optimization algorithm of the first proposed model, the Low-rank Semantic Autoencoder (LSA), in detail. Second, the Specific Rank-controlled Semantic Autoencoder (SRSA) is proposed. Finally, we briefly introduce the zero-shot recognition method.
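As background for the low-rank constraint (a standard fact, not specific to this paper): rank(W) is non-convex and hard to optimize directly, so low-rank models commonly minimize its convex surrogate, the nuclear norm,

```latex
% Nuclear norm: the standard convex surrogate for rank(W);
% sigma_i(W) are the singular values of W.
\|W\|_{*} \;=\; \sum_{i} \sigma_{i}(W)
```

i.e., the sum of singular values of W. SRSA, by contrast, is described as controlling the projection's rank to a specific value τ (varied from 10 to 30 in the experiments below) rather than merely penalizing it.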

Experiments

In this section, we implement both LSA and SRSA on five small-scale benchmark datasets (CUB, SUN, APY, AWA1 and AWA2) and one large-scale benchmark dataset (ImageNet). The LSA and SRSA models have two free parameters, λ and μ, which are set by class-wise cross-validation using the training data. Specifically, λ and μ vary from 10⁻⁶ to 10⁶. For SRSA, τ varies from 10 to 30.
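A hedged sketch of what such a class-wise cross-validation search might look like (the pseudo-seen/pseudo-unseen splitting and the `train_fn`/`score_fn` interface are our assumptions; the paper's exact protocol is in the full text):

```python
import numpy as np

# Hypothetical grids: lam and mu on a log scale from 1e-6 to 1e6,
# tau on an integer grid from 10 to 30 (SRSA only).
lam_grid = np.logspace(-6, 6, num=13)
mu_grid  = np.logspace(-6, 6, num=13)
tau_grid = range(10, 31, 5)

def class_wise_cv(train_fn, score_fn, X, S, y, n_folds=3, seed=0):
    """Hold out whole classes per fold, mimicking the zero-shot setting,
    and return the mean accuracy on the held-out (pseudo-unseen) classes.

    train_fn(X_seen, S_seen) -> model; score_fn(model, X_un, y_un) -> float
    """
    rng = np.random.default_rng(seed)
    classes = rng.permutation(np.unique(y))
    accs = []
    for held_out in np.array_split(classes, n_folds):
        seen = ~np.isin(y, held_out)
        model = train_fn(X[seen], S[seen])        # fit on pseudo-seen classes
        accs.append(score_fn(model, X[~seen], y[~seen]))  # score pseudo-unseen
    return float(np.mean(accs))
```

Holding out whole classes, rather than samples, is what makes the validation signal predictive of performance on genuinely unseen classes.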

Conclusion

In this paper, we first propose a novel Low-rank Semantic Autoencoder (LSA) model for zero-shot learning. We take the low-rank structure of the data into account and add the corresponding constraint to our LSA model. Meanwhile, we measure the reconstruction error with the L1-norm metric because it promotes sparsity. In addition, we propose a Specific Rank-controlled Semantic Autoencoder (SRSA), which avoids minimizing the variance of the reconstruction data while minimizing the projection's rank.

Declaration of Competing Interest

The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.

Acknowledgments

This work is supported by National Natural Science Foundation of China under Grant 61906141, 62050175 and 62036007, the National Natural Science Foundation of Shaanxi Province under Grant no. 2020JQ-317, China Postdoctoral Science Foundation (Grant no. 2019M653564), the Open Project Program of State Key Laboratory of Virtual Reality Technology and Systems, Beihang University (no. VRLAB2021B02), CCF-Tencent Open Fund, the Open Project Program of Key Laboratory of Computer Network and Information


References (55)

  • M. Norouzi, T. Mikolov, S. Bengio, Y. Singer, J. Shlens, A. Frome, G.S. Corrado, J. Dean, Zero-shot learning by convex combination of semantic embeddings, International Conference on Learning Representations (2014)
  • T. Mikolov et al., Distributed representations of words and phrases and their compositionality, Advances in Neural Information Processing Systems (2013)
  • Z. Akata et al., Label-embedding for image classification, IEEE Trans. Pattern Anal. Mach. Intell. (2016)
  • Z. Akata et al., Evaluation of output embeddings for fine-grained image classification, Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (2015)
  • A. Frome et al., DeViSE: a deep visual-semantic embedding model, Advances in Neural Information Processing Systems (2013)
  • B. Romera-Paredes et al., An embarrassingly simple approach to zero-shot learning, International Conference on Machine Learning (2015)
  • S. Changpinyo et al., Synthesized classifiers for zero-shot learning, Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (2016)
  • E. Kodirov et al., Semantic autoencoder for zero-shot learning, Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (2017)
  • H. Zhang et al., Zero-shot kernel learning, Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (2018)
  • Y. Annadani et al., Preserving semantic relations for zero-shot learning, Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (2018)
  • L. Zhang et al., Learning a deep embedding model for zero-shot learning, IEEE Conference on Computer Vision and Pattern Recognition (2017)
  • F. Sung et al., Learning to compare: relation network for few-shot learning, IEEE/CVF Conference on Computer Vision and Pattern Recognition (2018)
  • Z. Zhang et al., Zero-shot learning via semantic similarity embedding, Proceedings of the IEEE International Conference on Computer Vision (2015)
  • Z. Ding et al., Low-rank embedded ensemble semantic dictionary for zero-shot learning, Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (2017)
  • J. Lei Ba et al., Predicting deep zero-shot convolutional neural networks using textual descriptions, Proceedings of the IEEE International Conference on Computer Vision (2015)

Yang Liu received the B.Eng., M.Sc., and Ph.D. degrees in signal and information processing from Xidian University, Xi’an, China, in 2013, 2015 and 2018, respectively. He is currently a Post-Doctoral Researcher at Xidian University, Xi’an, China. He has authored nearly 20 technical articles in refereed journals and proceedings, including IEEE Transactions on Image Processing, IEEE Transactions on Cybernetics, Pattern Recognition, CVPR, AAAI, and IJCAI. His research interests include dimensionality reduction, pattern recognition, and deep learning.

Xinbo Gao received the B.Eng., M.Sc., and Ph.D. degrees in signal and information processing from Xidian University, Xi’an, China, in 1994, 1997, and 1999, respectively. He was a Research Fellow with the Department of Computer Science, Shizuoka University, Shizuoka, Japan, from 1997 to 1998. From 2000 to 2001, he was a Post-Doctoral Research Fellow with the Department of Information Engineering, Chinese University of Hong Kong, Hong Kong. Since 2001, he has been with the School of Electronic Engineering, Xidian University. He is currently a Professor of pattern recognition and intelligent systems, and the Director of the State Key Laboratory of Integrated Services Networks, Xidian University. He has authored five books and around 150 technical articles in refereed journals and proceedings, including IEEE Transactions on Image Processing, IEEE Transactions on Circuits and Systems for Video Technology, IEEE Transactions on Neural Networks and Learning Systems, IEEE Transactions on Systems, Man and Cybernetics, and Pattern Recognition in his areas of expertise. His current research interests include computational intelligence, machine learning, computer vision, pattern recognition and wireless communications.

    Jungong Han is a tenured faculty member with the School of Computing and Communications at Lancaster University, Lancaster, UK. His current research interests include multimedia content identification, multisensor data fusion, computer vision, and multimedia security. Dr. Han is an Associate Editor of Neurocomputing (Elsevier), and an Editorial Board Member of Multimedia Tools and Applications (Springer).

    Li Liu received the B.Eng. degree in electronic information engineering from Xi’an Jiaotong University, Xi’an, China, in 2011, and the Ph.D. degree from the Department of Electronic and Electrical Engineering, University of Sheffield, Sheffield, U.K., in 2014. He is currently with the Inception Institute of Artificial Intelligence, Abu Dhabi, United Arab Emirates. His current research interests include computer vision, machine learning, and data mining.

    Ling Shao (M’09-SM’10) was a Professor with the School of Computing Sciences, University of East Anglia, Norwich, U.K. He is currently the CEO and Chief Scientist of the Inception Institute of Artificial Intelligence, Abu Dhabi, United Arab Emirates. He is an Associate Editor of the IEEE Transactions on Image Processing, the IEEE Transactions on Neural Networks and Learning Systems, and several other journals.
