Pattern Recognition
Volume 122, February 2022, 108237

Zero-shot learning via a specific rank-controlled semantic autoencoder

https://doi.org/10.1016/j.patcog.2021.108237

Highlights

  • The proposed LSA model solves the domain shift problem and considers the low-rank structure of the reconstruction data.

  • The proposed SRSA model avoids minimizing the variance of the reconstruction data while minimizing the projection's rank.

  • Comprehensive experiments on six benchmark datasets demonstrate the effectiveness of the proposed approaches.

Abstract

Existing embedding zero-shot learning (ZSL) models usually learn a projection function from the visual feature space to the semantic embedding space, e.g., an attribute space or a word vector space. However, a projection learned from seen samples may not generalize well to unseen classes, which is known as the projection domain shift problem in ZSL. To address this issue, we propose a method named Low-rank Semantic Autoencoder (LSA) that exploits the low-rank structure of seen samples and maintains the sparsity of the reconstruction error, which further improves zero-shot learning capability. Moreover, to obtain a more robust projection for unseen classes, we propose a Specific Rank-controlled Semantic Autoencoder (SRSA) to accurately control the projection's rank. Extensive experiments on six benchmarks demonstrate the superiority of the proposed models over most existing embedding ZSL models under both the standard zero-shot setting and the more realistic generalized zero-shot setting.

Introduction

Humans can identify a large number of categories, about 30,000 basic classes and even more sub-classes. At the same time, humans are also very good at recognizing objects even if they have never seen any examples of them, which is known as the problem of zero-shot learning (ZSL) in machine learning. For example, a child who has seen cattle before can easily recognize a cow and learn that a cow looks like cattle with black-and-white coloring. Therefore, in machine learning, ZSL tries to recognize classes whose samples have not been available during training [1], [2], [3]. Recently, the number of new ZSL models has been increasing rapidly.

Zero-shot recognition mainly makes use of a training set of labeled seen classes and the semantic relationship between seen and unseen classes. The semantic embedding space is a high-dimensional vector space in which seen and unseen classes are related; it can be a semantic attribute space [4] or a semantic word vector space [5]. In this space, both seen and unseen classes are embedded as vectors, called class prototypes, and the relationship between two classes is usually measured by a distance between their prototypes. For example, the prototypes of cows and bison should be closer than those of cows and desks. To bridge the visual feature space and the semantic embedding space, most existing ZSL methods learn a projection function that maps data from the visual feature space to the semantic embedding space using a labeled training set containing seen classes only. When classifying unseen objects, this function projects the visual representations of unseen-class samples into the same semantic embedding space, which contains both unseen and seen class prototypes. A nearest neighbor (NN) search is then used to recognize unseen-class samples; this constitutes the testing and recognition process, sketched below.
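As a minimal illustration (not the paper's exact procedure), assuming a learned linear projection W, unit-normalized class prototypes, and cosine similarity as the distance, the recognition step might look like:

```python
import numpy as np

def zsl_predict(X, W, prototypes, class_ids):
    """Project visual features into the semantic space and assign each
    sample the label of its nearest class prototype.

    X          : (n, d) visual features (e.g., CNN activations)
    W          : (k, d) learned visual-to-semantic projection
    prototypes : (c, k) class prototypes (attribute or word vectors)
    class_ids  : (c,)   label array, one entry per prototype row
    """
    S = X @ W.T                                        # (n, k) semantic embeddings
    S = S / np.linalg.norm(S, axis=1, keepdims=True)   # unit-normalize embeddings
    P = prototypes / np.linalg.norm(prototypes, axis=1, keepdims=True)
    sims = S @ P.T                                     # (n, c) cosine similarities
    return class_ids[np.argmax(sims, axis=1)]          # nearest-prototype labels
```

At test time, `prototypes` contains the unseen (or, in the generalized setting, both seen and unseen) class prototypes.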

In early works of ZSL [6], attributes within a two-stage approach were widely used to predict the label of a sample belonging to one of the unseen classes. For example, Direct Attribute Prediction (DAP) [4] learns probabilistic attribute classifiers to estimate the posterior of each attribute for a sample; it then computes class posteriors and infers class labels with a maximum a posteriori (MAP) estimate. In contrast, Indirect Attribute Prediction (IAP) [4] first infers the posteriors of the seen classes with a multiclass classifier, uses these class probabilities to predict the attribute posteriors of each sample, and then derives the posteriors of the unseen classes from the predicted attributes. Moreover, the two-stage approach is also used when attributes are not available. For instance, CONSE [7] first infers seen-class posteriors and then projects the image feature into the word2vec space [8].
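For concreteness, DAP's second stage can be written as a MAP rule, following the standard formulation of Lampert et al. [4], where M is the number of attributes and a^z denotes the attribute signature of unseen class z:

```latex
% DAP second stage: pick the unseen class whose attribute signature
% best explains the predicted attribute posteriors (uniform class priors).
f(x) \;=\; \arg\max_{z} \prod_{m=1}^{M}
      \frac{p\left(a_m = a_m^{z} \mid x\right)}{p\left(a_m = a_m^{z}\right)}
```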

Recently, most ZSL models directly learn a mapping from the feature space to the semantic space. For example, ALE [9] learns a bilinear compatibility function between the feature space and the semantic space with a ranking loss. Likewise, SJE [10] learns a bilinear compatibility function but optimizes a structured SVM loss. Alternatively, DeViSE [11] learns a linear mapping using an efficient ranking loss formulation and is evaluated on the large-scale ImageNet dataset. ESZSL [12] learns the bilinear compatibility using the square loss and regularizes the objective with the Frobenius norm.
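These methods share a common bilinear form: with θ(x) the visual embedding of image x, φ(y) the semantic embedding of class y, and W the matrix to be learned, compatibility and prediction are

```latex
F(x, y; W) \;=\; \theta(x)^{\top} W \,\varphi(y),
\qquad
\hat{y} \;=\; \arg\max_{y \in \mathcal{Y}} \; F(x, y; W)
```

The methods differ mainly in the loss used to fit W (ranking, structured SVM, or square loss) and in how the objective is regularized.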

However, ZSL suffers from the domain shift problem [13]: the visual representations of the same attributes may look completely different across unseen classes. To solve this problem, Kodirov et al. [14] proposed a novel approach called the Semantic Autoencoder (SAE) for ZSL. First, the visual feature representation of an image is projected into a semantic space by an encoder, as in conventional ZSL models. Second, the projection in the semantic space is mapped back to the original space by a decoder, yielding a reconstructed visual feature representation. Although SAE works well on many datasets, it does not take the low-rank structure of the data into account, so its robustness suffers when it encounters images in the wild. For example, sharks, dolphins, and blue whales all share similar features, e.g., "a tail", so we aim to jointly optimize low-rank embedding and reconstruction to capture shared discriminative features across seen and unseen classes (see Fig. 1). In this paper, we propose a novel approach called the Low-rank Semantic Autoencoder (LSA) for zero-shot learning. The LSA model not only addresses the domain shift problem, but also considers the low-rank structure of the reconstruction data and the sparsity of the reconstruction error. In addition, we propose a Specific Rank-controlled Semantic Autoencoder (SRSA) to achieve explicit control over the rank of the projection. This model avoids minimizing the variance of the reconstruction data while minimizing the objective's rank. Experimental results on six benchmark datasets demonstrate that our proposed LSA and SRSA significantly outperform existing state-of-the-art ZSL models.
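For reference, SAE's tied linear encoder-decoder admits a closed-form solution via a Sylvester equation, as derived in Kodirov et al. [14]. Below is a minimal sketch of that baseline (variable names and the default λ are ours):

```python
import numpy as np
from scipy.linalg import solve_sylvester

def fit_sae(X, S, lam=0.2):
    """Solve min_W ||X - W^T S||_F^2 + lam * ||W X - S||_F^2.

    X : (d, n) visual features (columns are samples)
    S : (k, n) semantic embeddings of the same samples
    Setting the gradient to zero yields the Sylvester equation
        (S S^T) W + W (lam * X X^T) = (1 + lam) * S X^T,
    solved here with the Bartels-Stewart algorithm.
    """
    A = S @ S.T                      # (k, k)
    B = lam * (X @ X.T)              # (d, d)
    C = (1.0 + lam) * (S @ X.T)      # (k, d)
    return solve_sylvester(A, B, C)  # W: (k, d) encoder; W^T is the decoder
```

The tied-weights constraint (the decoder is the transpose of the encoder) is what forces the semantic projection to remain reconstructive, which is SAE's defense against domain shift.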


Related work

In this section, considering the key procedures in ZSL, we mainly review related work on embedding ZSL models and the projection domain shift problem in ZSL.

Proposed approach

In this section, we first present the model formulation and optimization algorithm of the first proposed model, the Low-rank Semantic Autoencoder (LSA), in detail. Second, the Specific Rank-controlled Semantic Autoencoder (SRSA) is proposed. Finally, we briefly introduce the zero-shot recognition method.
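As background for the low-rank constraint (a standard fact, not specific to this paper): rank(W) is non-convex and hard to optimize directly, so low-rank models commonly minimize its convex surrogate, the nuclear norm,

```latex
% Nuclear norm: the standard convex surrogate for rank(W);
% sigma_i(W) are the singular values of W.
\|W\|_{*} \;=\; \sum_{i} \sigma_{i}(W)
```

i.e., the sum of singular values of W. SRSA, by contrast, is described as controlling the projection's rank to a specific value τ (varied from 10 to 30 in the experiments below) rather than merely penalizing it.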

Experiments

In this section, we implement both LSA and SRSA on five small-scale benchmark datasets (CUB, SUN, APY, AWA1 and AWA2) and one large-scale benchmark dataset (ImageNet). The LSA and SRSA models have two free parameters, λ and μ, which are set by class-wise cross-validation using the training data. Specifically, λ and μ vary from 10⁻⁶ to 10⁶. For SRSA, τ varies from 10 to 30.
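A hedged sketch of what such a class-wise cross-validation search might look like (the pseudo-seen/pseudo-unseen splitting and the `train_fn`/`score_fn` interface are our assumptions; the paper's exact protocol is in the full text):

```python
import numpy as np

# Hypothetical grids: lam and mu on a log scale from 1e-6 to 1e6,
# tau on an integer grid from 10 to 30 (SRSA only).
lam_grid = np.logspace(-6, 6, num=13)
mu_grid  = np.logspace(-6, 6, num=13)
tau_grid = range(10, 31, 5)

def class_wise_cv(train_fn, score_fn, X, S, y, n_folds=3, seed=0):
    """Hold out whole classes per fold, mimicking the zero-shot setting,
    and return the mean accuracy on the held-out (pseudo-unseen) classes.

    train_fn(X_seen, S_seen) -> model; score_fn(model, X_un, y_un) -> float
    """
    rng = np.random.default_rng(seed)
    classes = rng.permutation(np.unique(y))
    accs = []
    for held_out in np.array_split(classes, n_folds):
        seen = ~np.isin(y, held_out)
        model = train_fn(X[seen], S[seen])        # fit on pseudo-seen classes
        accs.append(score_fn(model, X[~seen], y[~seen]))  # score pseudo-unseen
    return float(np.mean(accs))
```

Holding out whole classes, rather than samples, is what makes the validation signal predictive of performance on genuinely unseen classes.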

Conclusion

In this paper, we first propose a novel Low-rank Semantic Autoencoder (LSA) model for zero-shot learning. We take the low-rank structure of the data into account and add the corresponding constraint to our LSA model. Meanwhile, we measure the reconstruction error with the L1-norm metric because it promotes sparsity. In addition, we propose a Specific Rank-controlled Semantic Autoencoder (SRSA), which avoids minimizing the variance of the reconstruction data while minimizing the projection's rank.

Declaration of Competing Interest

The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.

Acknowledgments

This work is supported by National Natural Science Foundation of China under Grant 61906141, 62050175 and 62036007, the National Natural Science Foundation of Shaanxi Province under Grant no. 2020JQ-317, China Postdoctoral Science Foundation (Grant no. 2019M653564), the Open Project Program of State Key Laboratory of Virtual Reality Technology and Systems, Beihang University (no. VRLAB2021B02), CCF-Tencent Open Fund, the Open Project Program of Key Laboratory of Computer Network and Information


References (55)

  • M. Norouzi, T. Mikolov, S. Bengio, Y. Singer, J. Shlens, A. Frome, G.S. Corrado, J. Dean, Zero-shot learning by convex combination of semantic embeddings, International Conference on Learning Representations (2014)
  • T. Mikolov et al., Distributed representations of words and phrases and their compositionality, Advances in Neural Information Processing Systems (2013)
  • Z. Akata et al., Label-embedding for image classification, IEEE Trans. Pattern Anal. Mach. Intell. (2016)
  • Z. Akata et al., Evaluation of output embeddings for fine-grained image classification, Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (2015)
  • A. Frome et al., DeViSE: a deep visual-semantic embedding model, Advances in Neural Information Processing Systems (2013)
  • B. Romera-Paredes et al., An embarrassingly simple approach to zero-shot learning, International Conference on Machine Learning (2015)
  • S. Changpinyo et al., Synthesized classifiers for zero-shot learning, Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (2016)
  • E. Kodirov et al., Semantic autoencoder for zero-shot learning, Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (2017)
  • H. Zhang et al., Zero-shot kernel learning, Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (2018)
  • Y. Annadani et al., Preserving semantic relations for zero-shot learning, Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (2018)
  • L. Zhang et al., Learning a deep embedding model for zero-shot learning, IEEE Conference on Computer Vision and Pattern Recognition (2017)
  • F. Sung et al., Learning to compare: relation network for few-shot learning, IEEE/CVF Conference on Computer Vision and Pattern Recognition (2018)
  • Z. Zhang et al., Zero-shot learning via semantic similarity embedding, Proceedings of the IEEE International Conference on Computer Vision (2015)
  • Z. Ding et al., Low-rank embedded ensemble semantic dictionary for zero-shot learning, Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (2017)
  • J. Lei Ba et al., Predicting deep zero-shot convolutional neural networks using textual descriptions, Proceedings of the IEEE International Conference on Computer Vision (2015)

Yang Liu received the B.Eng., M.Sc., and Ph.D. degrees in signal and information processing from Xidian University, Xi’an, China, in 2013, 2015 and 2018, respectively. He is currently a Post-Doctoral Researcher at Xidian University, Xi’an, China. He has authored nearly 20 technical articles in refereed journals and proceedings, including IEEE Transactions on Image Processing, IEEE Transactions on Cybernetics, Pattern Recognition, CVPR, AAAI, and IJCAI. His research interests include dimensionality reduction, pattern recognition, and deep learning.

Xinbo Gao received the B.Eng., M.Sc., and Ph.D. degrees in signal and information processing from Xidian University, Xi’an, China, in 1994, 1997, and 1999, respectively. He was a Research Fellow with the Department of Computer Science, Shizuoka University, Shizuoka, Japan, from 1997 to 1998. From 2000 to 2001, he was a Post-Doctoral Research Fellow with the Department of Information Engineering, Chinese University of Hong Kong, Hong Kong. Since 2001, he has been with the School of Electronic Engineering, Xidian University. He is currently a Professor of pattern recognition and intelligent systems, and the Director of the State Key Laboratory of Integrated Services Networks, Xidian University. He has authored five books and around 150 technical articles in refereed journals and proceedings, including IEEE Transactions on Image Processing, IEEE Transactions on Circuits and Systems for Video Technology, IEEE Transactions on Neural Networks and Learning Systems, IEEE Transactions on Systems, Man and Cybernetics, and Pattern Recognition in his areas of expertise. His current research interests include computational intelligence, machine learning, computer vision, pattern recognition and wireless communications.

    Jungong Han is a tenured faculty member with the School of Computing and Communications at Lancaster University, Lancaster, UK. His current research interests include multimedia content identification, multisensor data fusion, computer vision, and multimedia security. Dr. Han is an Associate Editor of Neurocomputing (Elsevier), and an Editorial Board Member of Multimedia Tools and Applications (Springer).

    Li Liu received the B.Eng. degree in electronic information engineering from Xi’an Jiaotong University, Xi’an, China, in 2011, and the Ph.D. degree from the Department of Electronic and Electrical Engineering, University of Sheffield, Sheffield, U.K., in 2014. He is currently with the Inception Institute of Artificial Intelligence, Abu Dhabi, United Arab Emirates. His current research interests include computer vision, machine learning, and data mining.

    Ling Shao (M’09-SM’10) was a Professor with the School of Computing Sciences, University of East Anglia, Norwich, U.K. He is currently the CEO and Chief Scientist of the Inception Institute of Artificial Intelligence, Abu Dhabi, United Arab Emirates. He is an Associate Editor of the IEEE Transactions on Image Processing, the IEEE Transactions on Neural Networks and Learning Systems, and several other journals.
