Zero shot learning based on class visual prototypes and semantic consistency
Introduction
The zero-shot classification problem is to recognize unseen classes using seen classes and the information they share with unseen classes. Unseen classes are classes without labeled training images; seen classes are those with labeled training images. The problem arises because collecting large amounts of labeled images for every class is difficult, and existing classification models cannot recognize classes they have never seen. Zero-shot learning tries to recognize unseen classes from seen samples: the two sets of classes share a common semantic space in which unseen objects are related to seen ones. Each class prototype is defined as the semantic representation of the class label in the semantic space. The semantic representations are attributes [18], word vectors [30], or textual descriptions [3].
Zero-shot learning methods are surveyed in [35]. Some methods center on learning a visual-to-semantic mapping function [19], so that test samples can be mapped into the semantic space. These methods suffer from the hubness problem, that is, a few unseen class prototypes tend to become the nearest neighbors of many mapped samples in the semantic space. Other methods center on learning a common subspace of the visual and semantic spaces [10]. Still others learn a semantic-to-visual mapping function [23] to alleviate the hubness problem. Our method belongs to the last category: we learn seen and unseen class visual prototypes in the visual space by projecting the seen and unseen class prototypes from the semantic space into the visual space, thus mitigating the hubness problem.
Some methods [21], [32] focus on learning class visual prototypes by projecting class prototypes into the visual space. However, they do not consider the semantic inconsistency problem: the similarities between classes can differ greatly across the two spaces, although a class should keep similar relationships with the other classes in both spaces. Different from existing methods, we take semantic consistency into account when learning the class visual prototypes, aligning the visual and semantic spaces with a sparse graph shared by the two spaces. There is also a domain shift problem, because the seen and unseen classes are disjoint [10]. We alleviate this problem by preserving the semantic structure of the unseen class prototypes in the semantic space when learning the unseen class visual prototypes.
Motivated by these observations, we propose a zero-shot classification method based on class visual prototypes and semantic consistency (CVPSC), which consists of four steps. First, it learns the seen class visual prototypes while taking the semantic consistency across the two spaces into consideration. Second, the semantic-to-visual mapping function is obtained from the seen class visual prototypes and the seen class prototypes. Directly applying this mapping function to unseen classes would cause the domain shift problem, so, third, it synthesizes unseen class visual prototypes with the mapping function while preserving their semantic structure in the semantic space to alleviate domain shift; it also preserves the semantic consistency of the unseen classes across the visual and semantic spaces to avoid semantic inconsistency. Finally, given the class visual prototypes, the labels of test instances are predicted by nearest neighbor classifiers. Experiments show that CVPSC achieves promising results on zero-shot classification tasks. The contributions are given as follows.
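The final prediction step can be sketched concretely. The snippet below is a minimal illustration (not the paper's implementation) of nearest-prototype classification in the visual space, assuming the unseen class visual prototypes have already been learned; all variable names here are illustrative.

```python
import numpy as np

def nearest_prototype_labels(X_test, visual_prototypes):
    """Assign each test sample the label of its nearest class visual prototype.

    X_test:            (n, d) test image features
    visual_prototypes: (u, d) one learned visual prototype per unseen class
    Returns an (n,) array of predicted class indices.
    """
    # Squared Euclidean distance between every sample and every prototype.
    dists = ((X_test[:, None, :] - visual_prototypes[None, :, :]) ** 2).sum(-1)
    return dists.argmin(axis=1)

# Toy example: 2 unseen classes in a 3-dimensional visual space.
protos = np.array([[0.0, 0.0, 0.0],
                   [1.0, 1.0, 1.0]])
X = np.array([[0.1, 0.0, 0.1],    # close to prototype 0
              [0.9, 1.1, 1.0]])   # close to prototype 1
print(nearest_prototype_labels(X, protos))  # → [0 1]
```

Because classification happens in the visual space rather than the semantic space, no single prototype can become a "hub" for most mapped test samples, which is the motivation for the semantic-to-visual direction.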
- (1)
We learn the seen class visual prototypes while preserving semantic consistency, avoiding the semantic inconsistency problem between the visual and semantic spaces and yielding more accurate seen class visual prototypes.
- (2)
We obtain the unseen class visual prototypes via the semantic-to-visual mapping function while preserving their semantic structure in the semantic space, to alleviate the domain shift problem, and preserving semantic consistency across the two spaces, to avoid the semantic inconsistency problem.
- (3)
Semantic consistency is represented by a sparse graph shared by the class visual prototypes and the class prototypes, whose sparse coefficients capture the main relationships between different classes. This lets us capture the common structure of the class prototypes in the two spaces.
- (4)
We verify the effectiveness of the proposed CVPSC method with extensive experiments on four real-world datasets, obtaining state-of-the-art results.
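Contribution (3) relies on a sparse self-expression graph over class prototypes. The sketch below illustrates one standard way such a graph could be built, reconstructing each prototype as a sparse (L1-regularized) combination of the others via ISTA; it is an assumed formulation for illustration, not the paper's exact objective, and all names (`sparse_graph`, `lasso_ista`, `alpha`) are ours.

```python
import numpy as np

def soft_threshold(z, t):
    """Elementwise soft-thresholding operator, the proximal map of the L1 norm."""
    return np.sign(z) * np.maximum(np.abs(z) - t, 0.0)

def lasso_ista(A, y, alpha, n_iter=500):
    """ISTA for min_w 0.5*||y - A w||^2 + alpha*||w||_1."""
    L = np.linalg.norm(A, ord=2) ** 2  # Lipschitz constant of the smooth part
    w = np.zeros(A.shape[1])
    for _ in range(n_iter):
        w = soft_threshold(w + A.T @ (y - A @ w) / L, alpha / L)
    return w

def sparse_graph(P, alpha=0.1):
    """Sparse self-expression graph over class prototypes P (c x a).

    Each prototype P[i] is reconstructed as a sparse combination of the
    other prototypes; W[i, j] is the coefficient of class j in the
    reconstruction of class i, so only the dominant inter-class
    relationships get nonzero weights. The diagonal is kept at zero.
    """
    c = P.shape[0]
    W = np.zeros((c, c))
    for i in range(c):
        others = [j for j in range(c) if j != i]
        W[i, others] = lasso_ista(P[others].T, P[i], alpha)
    return W
```

Sharing one coefficient matrix W between the semantic prototypes and the visual prototypes is what enforces the same inter-class structure in both spaces.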
This paper is organized as follows. A brief review of related work is presented in Section 2. The proposed zero-shot classification method based on class visual prototypes and semantic consistency (CVPSC) is given in Section 3. The optimization process is presented in Section 4. Extensive experiments are conducted in Section 5. Conclusions are given in Section 6.
Related work
Many zero-shot learning methods have been proposed recently. Projecting-to-semantic-space methods learn mapping functions from the visual space to the semantic space [8], [16], [19], [24]. DAP [19] tries to predict the attributes of unseen images. UDAZS [16] learns sparse attribute representations for unseen images. MCME [14] projects visual representations into the semantic space while considering the manifold structure of seen images. DeVise [8] maps unseen images into the semantic space. SAE
The proposed method
Xs ∈ Rd × ns denotes the seen data, where d is the feature dimension; Xt ∈ Rd × nt denotes the unseen data, where ns and nt are the numbers of seen and unseen samples. Ys ∈ Rns × c is the seen one-hot label matrix, where c is the number of seen classes; Yt ∈ Rnt × u is the unseen label matrix, where u is the number of unseen classes. Ps ∈ Rc × a and Pt ∈ Ru × a are the seen and unseen class prototypes, Pi ∈ Ra is the class prototype of the i-th class, and a is the semantic representation dimension.
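Step 2 of CVPSC fits a semantic-to-visual mapping from the seen class prototypes to the seen class visual prototypes. A minimal sketch with a ridge-regularized least-squares mapping and toy dimensions is given below; the regression form, the name `Rs` for the seen class visual prototypes, and the regularizer `lam` are our assumptions for illustration, not necessarily the paper's exact objective.

```python
import numpy as np

# Toy sizes following the notation above (assumed for illustration):
d, a, c, u = 5, 3, 4, 2          # visual dim, semantic dim, #seen, #unseen classes
rng = np.random.default_rng(0)
Rs = rng.normal(size=(c, d))     # seen class visual prototypes (learned in step 1)
Ps = rng.normal(size=(c, a))     # seen class prototypes in semantic space
Pt = rng.normal(size=(u, a))     # unseen class prototypes in semantic space

# Ridge-regularized least squares for the semantic-to-visual mapping M (a x d):
#   min_M ||Ps M - Rs||_F^2 + lam * ||M||_F^2,  with closed-form solution:
lam = 1e-2
M = np.linalg.solve(Ps.T @ Ps + lam * np.eye(a), Ps.T @ Rs)

# Applying M to the unseen prototypes gives initial unseen class visual
# prototypes; the method then refines them against domain shift and
# semantic inconsistency (steps 3 and 4).
Rt_init = Pt @ M
print(Rt_init.shape)  # → (2, 5)
```

Without the refinement step, using `Rt_init` directly would exhibit exactly the domain shift problem the introduction describes, since M is fitted only on seen classes.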
Optimization
We give the optimization method for learning the seen and unseen class visual prototypes Rs and Rt.
Experiments
We perform extensive experiments on the AwA [19], aPY [7], CUB [33], and SUN [29] datasets. The experimental settings are the same as in [36]. The image features are ResNet-101 features [13]. The compared methods are DAP [19], SAE [17], ESZSL [31], LATEM [34], GAZSL [42], and CDL [15].
Conclusion
In this paper, we proposed a zero-shot learning method based on class visual prototypes and semantic consistency. We learn the seen and unseen class visual prototypes while enforcing the semantic consistency represented by a sparse graph shared by the visual and semantic spaces. The sparse graph is built on L1-norm minimization to capture the main relationships between different classes. Instead of using fixed seen class visual prototypes, the seen class visual prototypes are learnt by
Declaration of Competing Interest
The authors declare that they have no conflict of interest regarding this work.
Acknowledgments
This work is supported by China Postdoctoral Science Foundation funded project under Grant 2018M631125, National Natural Science Foundation of China (Grant No. 61806155, 61472305, 61070143), Fundamental Research Funds for the Central Universities under Grant XJS18037, Science and technology project of Shaanxi province, China (Grant No. 2015GY027), Aeronautical Science Foundation of China (Grant No. 20151981009), Key Science and Technology Program of Shaanxi Province, China (No. 2016GY-112),
References (42)
- et al., Zero-shot learning on semantic class prototype graph, IEEE Trans. Pattern Anal. Mach. Intell. (2017)
- et al., Manifold regularized cross-modal embedding for zero-shot learning, Inf. Sci. (2017)
- et al., Learning unseen visual prototypes for zero-shot classification, Knowl.-Based Syst. (2018)
- et al., Zero-shot classification by transferring knowledge and preserving data structure, Neurocomputing (2017)
- et al., Zero-shot learning via discriminative representation extraction, Pattern Recognit. Lett. (2018)
- et al., Adversarial unseen visual feature synthesis for zero-shot learning, Neurocomputing (2019)
- et al., Zero-shot hashing with orthogonal projection for image retrieval, Pattern Recognit. Lett. (2019)
- et al., Label-embedding for image classification, IEEE Trans. Pattern Anal. Mach. Intell. (2016)
- et al., Zero-shot visual recognition using semantics-preserving adversarial embedding network, Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (2018)
- et al., Attributes2classname: a discriminative model for attribute-based unsupervised zero-shot learning, Proceedings of the IEEE International Conference on Computer Vision (2017)
- Zero shot learning via multi-scale manifold regularization, Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition
- Generative zero-shot learning via low-rank embedded semantic dictionary, IEEE Trans. Pattern Anal. Mach. Intell.
- Improving zero-shot learning by mitigating the hubness problem, Workshop at ICLR
- Describing objects by their attributes, IEEE Conference on Computer Vision and Pattern Recognition (2009)
- Devise: a deep visual-semantic embedding model, Advances in Neural Information Processing Systems
- Learning multimodal latent attributes, IEEE Trans. Pattern Anal. Mach. Intell.
- Transductive multi-view zero-shot learning, IEEE Trans. Pattern Anal. Mach. Intell.
- Learning attributes equals multi-source domain generalization, Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition
- Deep residual learning for image recognition, Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition
- Learning class prototypes via structure alignment for zero-shot recognition, Proceedings of the European Conference on Computer Vision (ECCV)
- Unsupervised domain adaptation for zero-shot learning, Proceedings of the IEEE International Conference on Computer Vision