Zero shot learning based on class visual prototypes and semantic consistency
Introduction
The zero-shot classification problem is to recognize unseen classes using seen classes and the information they share with unseen classes. Unseen classes are classes without labeled training images; seen classes are those with labeled training images. The problem arises because collecting large amounts of labeled images for every class is difficult, and existing classification models cannot recognize classes they have never seen. Zero-shot learning tries to recognize unseen classes from seen samples: the two sets of classes share a common semantic space in which unseen objects are related to seen ones. Each class prototype is defined as the semantic representation of the class label in the semantic space. The semantic representations are attributes [18], word vectors [30], or textual descriptions [3].
Zero-shot learning methods are surveyed in [35]. Some methods center on learning a visual-to-semantic mapping function [19], so that test samples can be mapped into the semantic space. These methods suffer from the hubness problem, that is, a few unseen class prototypes tend to become the nearest neighbors of many mapped samples in the semantic space. Other methods center on learning a common subspace of the visual and semantic spaces [10]. Still others learn a semantic-to-visual mapping function [23] to alleviate the hubness problem. Our method belongs to the last category: we learn seen and unseen class visual prototypes in the visual space by projecting the seen and unseen class prototypes from the semantic space into the visual space, thus mitigating the hubness problem.
Some methods [21], [32] focus on learning class visual prototypes by projecting class prototypes into the visual space. However, they do not consider the semantic inconsistency problem: the similarities between classes can differ greatly across the two spaces, although a class should keep similar relationships with the other classes in both spaces. Different from existing methods, we take semantic consistency into account when learning the class visual prototypes, aligning the visual and semantic spaces with a sparse graph shared by the two spaces. There is also a domain shift problem, because the seen and unseen classes are disjoint [10]. We alleviate this problem by preserving the semantic structure of the unseen class prototypes in the semantic space when learning the unseen class visual prototypes.
Motivated by these observations, we propose a zero-shot classification method based on class visual prototypes and semantic consistency (CVPSC), which consists of four steps. First, it learns the seen class visual prototypes while taking the semantic consistency across the two spaces into consideration. Second, the semantic-to-visual mapping function is obtained from the seen class visual prototypes and the seen class prototypes. Directly applying this mapping function to unseen classes would cause the domain shift problem, so, third, it synthesizes unseen class visual prototypes with the mapping function while preserving their semantic structure in the semantic space to alleviate domain shift; it also preserves the semantic consistency of the unseen classes across the visual and semantic spaces to avoid semantic inconsistency. Finally, given the class visual prototypes, the labels of test instances are predicted by nearest neighbor classifiers. Experiments show that CVPSC achieves promising results on zero-shot classification tasks. The contributions are given as follows.
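The final prediction step can be sketched concretely. The snippet below is a minimal illustration (not the paper's implementation) of nearest-prototype classification in the visual space, assuming the unseen class visual prototypes have already been learned; all variable names here are illustrative.

```python
import numpy as np

def nearest_prototype_labels(X_test, visual_prototypes):
    """Assign each test sample the label of its nearest class visual prototype.

    X_test:            (n, d) test image features
    visual_prototypes: (u, d) one learned visual prototype per unseen class
    Returns an (n,) array of predicted class indices.
    """
    # Squared Euclidean distance between every sample and every prototype.
    dists = ((X_test[:, None, :] - visual_prototypes[None, :, :]) ** 2).sum(-1)
    return dists.argmin(axis=1)

# Toy example: 2 unseen classes in a 3-dimensional visual space.
protos = np.array([[0.0, 0.0, 0.0],
                   [1.0, 1.0, 1.0]])
X = np.array([[0.1, 0.0, 0.1],    # close to prototype 0
              [0.9, 1.1, 1.0]])   # close to prototype 1
print(nearest_prototype_labels(X, protos))  # → [0 1]
```

Because classification happens in the visual space rather than the semantic space, no single prototype can become a "hub" for most mapped test samples, which is the motivation for the semantic-to-visual direction.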
- (1)
We learn the seen class visual prototypes while preserving semantic consistency, avoiding the semantic inconsistency problem between the visual and semantic spaces and yielding more accurate seen class visual prototypes.
- (2)
We obtain the unseen class visual prototypes via the semantic-to-visual mapping function while preserving their semantic structure in the semantic space, to alleviate the domain shift problem, and preserving semantic consistency across the two spaces, to avoid the semantic inconsistency problem.
- (3)
Semantic consistency is represented by a sparse graph shared by the class visual prototypes and the class prototypes, whose sparse coefficients capture the main relationships between different classes. This lets us capture the common structure of the class prototypes in the two spaces.
- (4)
We verify the effectiveness of the proposed CVPSC method with extensive experiments on four real-world datasets, obtaining state-of-the-art results.
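Contribution (3) relies on a sparse self-expression graph over class prototypes. The sketch below illustrates one standard way such a graph could be built, reconstructing each prototype as a sparse (L1-regularized) combination of the others via ISTA; it is an assumed formulation for illustration, not the paper's exact objective, and all names (`sparse_graph`, `lasso_ista`, `alpha`) are ours.

```python
import numpy as np

def soft_threshold(z, t):
    """Elementwise soft-thresholding operator, the proximal map of the L1 norm."""
    return np.sign(z) * np.maximum(np.abs(z) - t, 0.0)

def lasso_ista(A, y, alpha, n_iter=500):
    """ISTA for min_w 0.5*||y - A w||^2 + alpha*||w||_1."""
    L = np.linalg.norm(A, ord=2) ** 2  # Lipschitz constant of the smooth part
    w = np.zeros(A.shape[1])
    for _ in range(n_iter):
        w = soft_threshold(w + A.T @ (y - A @ w) / L, alpha / L)
    return w

def sparse_graph(P, alpha=0.1):
    """Sparse self-expression graph over class prototypes P (c x a).

    Each prototype P[i] is reconstructed as a sparse combination of the
    other prototypes; W[i, j] is the coefficient of class j in the
    reconstruction of class i, so only the dominant inter-class
    relationships get nonzero weights. The diagonal is kept at zero.
    """
    c = P.shape[0]
    W = np.zeros((c, c))
    for i in range(c):
        others = [j for j in range(c) if j != i]
        W[i, others] = lasso_ista(P[others].T, P[i], alpha)
    return W
```

Sharing one coefficient matrix W between the semantic prototypes and the visual prototypes is what enforces the same inter-class structure in both spaces.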
This paper is organized as follows. A brief review of related work is presented in Section 2. The proposed zero-shot classification method based on class visual prototypes and semantic consistency (CVPSC) is given in Section 3. The optimization process is presented in Section 4. Extensive experiments are conducted in Section 5. Conclusions are given in Section 6.
Related work
Many zero-shot learning methods have been proposed recently. Projecting-to-semantic-space methods learn mapping functions from the visual space to the semantic space [8], [16], [19], [24]. DAP [19] tries to predict the attributes of unseen images. UDAZS [16] learns sparse attribute representations for unseen images. MCME [14] projects visual representations into the semantic space while considering the manifold structure of seen images. DeVise [8] maps unseen images into the semantic space. SAE
The proposed method
Xs ∈ Rd × ns denotes the seen data, where d is the feature dimension; Xt ∈ Rd × nt denotes the unseen data, where ns and nt are the numbers of seen and unseen samples. Ys ∈ Rns × c is the seen one-hot label matrix, where c is the number of seen classes; Yt ∈ Rnt × u is the unseen label matrix, where u is the number of unseen classes. Ps ∈ Rc × a and Pt ∈ Ru × a are the seen and unseen class prototypes, Pi ∈ Ra is the class prototype of the i-th class, and a is the semantic representation dimension.
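Step 2 of CVPSC fits a semantic-to-visual mapping from the seen class prototypes to the seen class visual prototypes. A minimal sketch with a ridge-regularized least-squares mapping and toy dimensions is given below; the regression form, the name `Rs` for the seen class visual prototypes, and the regularizer `lam` are our assumptions for illustration, not necessarily the paper's exact objective.

```python
import numpy as np

# Toy sizes following the notation above (assumed for illustration):
d, a, c, u = 5, 3, 4, 2          # visual dim, semantic dim, #seen, #unseen classes
rng = np.random.default_rng(0)
Rs = rng.normal(size=(c, d))     # seen class visual prototypes (learned in step 1)
Ps = rng.normal(size=(c, a))     # seen class prototypes in semantic space
Pt = rng.normal(size=(u, a))     # unseen class prototypes in semantic space

# Ridge-regularized least squares for the semantic-to-visual mapping M (a x d):
#   min_M ||Ps M - Rs||_F^2 + lam * ||M||_F^2,  with closed-form solution:
lam = 1e-2
M = np.linalg.solve(Ps.T @ Ps + lam * np.eye(a), Ps.T @ Rs)

# Applying M to the unseen prototypes gives initial unseen class visual
# prototypes; the method then refines them against domain shift and
# semantic inconsistency (steps 3 and 4).
Rt_init = Pt @ M
print(Rt_init.shape)  # → (2, 5)
```

Without the refinement step, using `Rt_init` directly would exhibit exactly the domain shift problem the introduction describes, since M is fitted only on seen classes.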
Optimization
We give the optimization method for learning the seen and unseen class visual prototypes Rs and Rt.
Experiments
We perform extensive experiments on the AwA [19], aPY [7], CUB [33], and SUN [29] datasets. The experimental settings are the same as in [36]. The image features are ResNet-101 features [13]. The compared methods are DAP [19], SAE [17], ESZSL [31], LATEM [34], GAZSL [42], and CDL [15].
Conclusion
In this paper, we proposed a zero-shot learning method based on class visual prototypes and semantic consistency. We learn the seen and unseen class visual prototypes while enforcing the semantic consistency represented by a sparse graph shared by the visual and semantic spaces. The sparse graph is built on L1-norm minimization to capture the main relationships between different classes. Instead of using fixed seen class visual prototypes, the seen class visual prototypes are learnt by
Declaration of Competing Interest
The authors declare that they have no conflict of interest regarding this work.
Acknowledgments
This work is supported by China Postdoctoral Science Foundation funded project under Grant 2018M631125, National Natural Science Foundation of China (Grant No. 61806155, 61472305, 61070143), Fundamental Research Funds for the Central Universities under Grant XJS18037, Science and technology project of Shaanxi province, China (Grant No. 2015GY027), Aeronautical Science Foundation of China (Grant No. 20151981009), Key Science and Technology Program of Shaanxi Province, China (No. 2016GY-112),
References (42)
- et al., Zero-shot learning on semantic class prototype graph, IEEE Trans. Pattern Anal. Mach. Intell. (2017)
- et al., Manifold regularized cross-modal embedding for zero-shot learning, Inf. Sci. (2017)
- et al., Learning unseen visual prototypes for zero-shot classification, Knowl.-Based Syst. (2018)
- et al., Zero-shot classification by transferring knowledge and preserving data structure, Neurocomputing (2017)
- et al., Zero-shot learning via discriminative representation extraction, Pattern Recognit. Lett. (2018)
- et al., Adversarial unseen visual feature synthesis for zero-shot learning, Neurocomputing (2019)
- et al., Zero-shot hashing with orthogonal projection for image retrieval, Pattern Recognit. Lett. (2019)
- et al., Label-embedding for image classification, IEEE Trans. Pattern Anal. Mach. Intell. (2016)
- et al., Zero-shot visual recognition using semantics-preserving adversarial embedding network, Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (2018)
- et al., Attributes2classname: a discriminative model for attribute-based unsupervised zero-shot learning, Proceedings of the IEEE International Conference on Computer Vision (2017)
- Zero shot learning via multi-scale manifold regularization, Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition
- Generative zero-shot learning via low-rank embedded semantic dictionary, IEEE Trans. Pattern Anal. Mach. Intell.
- Improving zero-shot learning by mitigating the hubness problem, Workshop at ICLR
- Describing objects by their attributes, IEEE Conference on Computer Vision and Pattern Recognition (2009)
- Devise: a deep visual-semantic embedding model, Advances in Neural Information Processing Systems
- Learning multimodal latent attributes, IEEE Trans. Pattern Anal. Mach. Intell.
- Transductive multi-view zero-shot learning, IEEE Trans. Pattern Anal. Mach. Intell.
- Learning attributes equals multi-source domain generalization, Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition
- Deep residual learning for image recognition, Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition
- Learning class prototypes via structure alignment for zero-shot recognition, Proceedings of the European Conference on Computer Vision (ECCV)
- Unsupervised domain adaptation for zero-shot learning, Proceedings of the IEEE International Conference on Computer Vision