
Pattern Recognition Letters

Volume 135, July 2020, Pages 368-374

Zero shot learning based on class visual prototypes and semantic consistency

https://doi.org/10.1016/j.patrec.2020.04.029

Highlights

  • We learn class visual prototypes while preserving semantic consistency.

  • The shared sparse graph is used to represent semantic consistency.

  • We learn the class visual prototypes and the shared sparse graph simultaneously.

  • Experimental results indicate that our method outperforms many state-of-the-art methods.

Abstract

Zero shot classification aims to recognize images of unseen classes, i.e., classes with no samples in the training set, which is a difficult task. Traditional zero shot classification methods ignore the semantic inconsistency between the visual and semantic spaces, which makes them less effective. Projecting semantic representations into the visual space can alleviate the hubness problem. However, directly applying a semantic-to-visual mapping function learnt on seen classes to unseen classes leads to the domain shift problem. We propose a zero shot learning method that simultaneously learns class visual prototypes and preserves semantic consistency across the visual and semantic spaces, thereby handling both the semantic inconsistency and the domain shift problems. Semantic consistency is represented by a sparse graph shared by the visual and semantic spaces. Our key insight is that visual prototype learning and sparse graph learning can be unified into a single process. Extensive experiments demonstrate that the proposed method significantly improves zero shot classification performance.

Introduction

The zero shot classification problem is to recognize unseen classes through the seen classes and the information they share with the unseen classes. Unseen classes are classes that have no labeled images for training; seen classes are those that do. The problem arises because collecting large amounts of labeled images for every class is difficult and existing classification models cannot recognize classes absent from training. Zero shot learning therefore tries to recognize unseen classes from seen samples; seen and unseen classes share a common semantic space in which unseen objects are related to seen ones. Each class prototype is defined as the semantic representation associated with the class label in the semantic space; semantic representations can be attributes [18], word vectors [30], or textual descriptions [3].

Zero shot learning methods are surveyed in [35]. Some methods center on learning a visual-to-semantic mapping function [19] so that test samples can be mapped into the semantic space. These methods suffer from the hubness problem, that is, a few unseen class prototypes tend to become the nearest neighbors of many mapped samples in the semantic space. Other methods learn a common subspace of the visual and semantic spaces [10], or learn a semantic-to-visual mapping function [23] to alleviate the hubness problem. Our method belongs to the last category: we learn seen and unseen class visual prototypes by projecting the class prototypes from the semantic space into the visual space, thus mitigating the hubness problem.
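To make the hubness effect concrete, the following toy sketch (our own illustration, not code or data from the paper; all sizes are made up) counts how often each prototype ends up as the nearest neighbor of randomly generated high-dimensional samples; the distribution of these counts is typically skewed, with a few prototypes attracting a disproportionate share of samples.

```python
import numpy as np

rng = np.random.default_rng(0)
d, n_classes, n_samples = 300, 50, 2000           # made-up sizes for illustration

prototypes = rng.standard_normal((n_classes, d))  # stand-ins for class prototypes
samples = rng.standard_normal((n_samples, d))     # stand-ins for mapped test samples

# Squared Euclidean distances via ||x - p||^2 = ||x||^2 - 2 x.p + ||p||^2.
dists = ((samples**2).sum(1)[:, None]
         - 2 * samples @ prototypes.T
         + (prototypes**2).sum(1)[None, :])
nearest = dists.argmin(axis=1)

# Hubness shows up as a skewed distribution of "nearest prototype" counts.
counts = np.bincount(nearest, minlength=n_classes)
print("most-hit prototype receives", counts.max(), "of", n_samples, "samples")
print("prototypes that are never the nearest:", int((counts == 0).sum()))
```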

Some methods [21], [32] focus on learning the class visual prototypes by projecting class prototypes into the visual space. However, they do not consider the semantic inconsistency problem: the similarities between classes can differ greatly across the two spaces, whereas a class should have similar relationships with other classes in both spaces. Different from existing methods, we take semantic consistency into account when learning the class visual prototypes, aligning the visual and semantic spaces through a sparse graph shared by the two spaces. A domain shift problem also exists because the seen and unseen classes are disjoint [10]; we alleviate it by preserving the semantic structure of the unseen class prototypes in the semantic space when learning the unseen class visual prototypes.

Motivated by these observations, we put forward a zero shot classification method based on class visual prototypes and semantic consistency (CVPSC), which consists of four steps. First, it learns the seen class visual prototypes, taking the semantic consistency across the two spaces into consideration. Second, the semantic-to-visual mapping function is obtained from the seen class visual prototypes and the seen class prototypes. Since directly applying this mapping function to unseen classes would cause the domain shift problem, third, it synthesizes unseen class visual prototypes using the mapping function while preserving their semantic structure in the semantic space, which alleviates the domain shift problem; it also preserves the semantic consistency of the unseen classes across the visual and semantic spaces to avoid semantic inconsistency. Finally, given the class visual prototypes, the labels of the test instances are predicted by nearest neighbor classifiers. Experiments show that CVPSC achieves promising results on zero shot classification tasks. The contributions are given as follows.

  • (1)

    We learn the seen class visual prototypes while preserving semantic consistency, avoiding the semantic inconsistency problem between the visual and semantic spaces and yielding more accurate seen class visual prototypes.

  • (2)

    We obtain the unseen class visual prototypes via the semantic-to-visual mapping function while preserving their semantic structure in the semantic space to alleviate the domain shift problem, and preserving semantic consistency across the two spaces to avoid the semantic inconsistency problem.

  • (3)

    Semantic consistency is represented by a sparse graph shared by the class visual prototypes and the class semantic prototypes, whose coefficients are sparse so as to capture the main relationships between different classes. This lets us capture the common structure of the class prototypes in the two spaces (see the sketch after this list).

  • (4)

    We verify the effectiveness of the proposed CVPSC method with extensive experiments on four real world datasets, obtaining state-of-the-art results.
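As a rough illustration of contribution (3), the snippet below builds a sparse reconstruction graph over class prototypes with an L1-penalized regression (scikit-learn's Lasso): each prototype is reconstructed from the remaining ones, and the L1 penalty zeroes out weak relationships. This is a generic sparse-graph construction in the spirit of the paper, not the paper's exact formulation; the sizes and the regularization weight alpha are made up.

```python
import numpy as np
from sklearn.linear_model import Lasso

def sparse_graph(P, alpha=0.01):
    """Reconstruct each row of P from the other rows with sparse coefficients.

    P: (c, a) matrix of class prototypes. Returns W of shape (c, c) with zero
    diagonal such that P[i] is approximately W[i] @ P.
    """
    c = P.shape[0]
    W = np.zeros((c, c))
    for i in range(c):
        others = np.delete(P, i, axis=0)                 # (c-1, a)
        model = Lasso(alpha=alpha, fit_intercept=False, max_iter=10000)
        model.fit(others.T, P[i])                        # solve P[i] ≈ others.T @ coef
        W[i, np.arange(c) != i] = model.coef_
    return W

# Toy usage with random stand-in prototypes.
rng = np.random.default_rng(0)
P = rng.standard_normal((20, 85))
W = sparse_graph(P)
print("average nonzero coefficients per class:", (W != 0).sum(axis=1).mean())
```

In the paper, one such graph is shared by both spaces, so the same coefficients must relate a class to its neighbors in the visual and the semantic space alike.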

This paper is organized as follows. A brief review of related work is presented in Section 2. The novel zero shot classification method based on class visual prototypes and semantic consistency (CVPSC) is given in Section 3. The optimization process is presented in Section 4. In Section 5, extensive experiments are conducted. The conclusions are given in Section 6.


Related work

Many zero shot learning methods have been proposed recently. Projection-to-semantic-space methods learn mapping functions from the visual space to the semantic space [8], [16], [19], [24]. DAP [19] predicts the attributes of unseen images. UDAZS [16] learns sparse attribute representations for unseen images. MCME [14] projects visual representations to the semantic space while considering the manifold structure of the seen images. DeVise [8] maps unseen images to the semantic space. SAE [17] learns a semantic autoencoder that maps visual features to the semantic space under a reconstruction constraint.

The proposed method

$X_s \in \mathbb{R}^{n_s \times d}$ denotes the seen data, where $d$ is the feature dimension, and $X_t \in \mathbb{R}^{n_t \times d}$ denotes the unseen data; $n_s$ and $n_t$ are the numbers of seen and unseen samples. $Y_s \in \mathbb{R}^{n_s \times c}$ is the one-hot label matrix of the seen data, where $c$ is the number of seen classes, and $Y_t \in \mathbb{R}^{n_t \times u}$ is the unseen label matrix, where $u$ is the number of unseen classes. $P_s \in \mathbb{R}^{c \times a}$ and $P_t \in \mathbb{R}^{u \times a}$ are the seen and unseen class prototypes; $P_i \in \mathbb{R}^{a}$ is the class prototype of the $i$-th class, and $a$ is the dimension of the semantic representation.
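To fix ideas about how these quantities interact, here is a minimal end-to-end sketch under simplifying assumptions of our own: per-class feature means stand in for the learnt seen visual prototypes, the semantic-to-visual mapping is plain ridge regression, and prediction is nearest neighbor against the synthesized unseen prototypes. The paper's actual method additionally couples prototype learning with the shared sparse graph; all dimensions below are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)
d, a, c, u = 2048, 85, 40, 10            # illustrative dimensions, not the paper's
ns, nt = 4000, 1000

Xs = rng.standard_normal((ns, d))        # seen features (e.g., ResNet-101)
ys = rng.integers(0, c, ns)              # seen labels (assumes every class occurs)
Xt = rng.standard_normal((nt, d))        # unseen (test) features
Ps = rng.standard_normal((c, a))         # seen class prototypes (semantic space)
Pt = rng.standard_normal((u, a))         # unseen class prototypes (semantic space)

# Step 1 (simplified): seen class visual prototypes as per-class feature means.
Rs = np.stack([Xs[ys == k].mean(axis=0) for k in range(c)])   # (c, d)

# Step 2: semantic-to-visual mapping via ridge regression, Ps @ M ≈ Rs.
lam = 1.0
M = np.linalg.solve(Ps.T @ Ps + lam * np.eye(a), Ps.T @ Rs)   # (a, d)

# Step 3 (simplified): synthesize unseen class visual prototypes.
Rt = Pt @ M                                                   # (u, d)

# Step 4: nearest-neighbor prediction in the visual space.
dists = ((Xt**2).sum(1)[:, None] - 2 * Xt @ Rt.T + (Rt**2).sum(1)[None, :])
preds = dists.argmin(axis=1)             # indices into the unseen classes
```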

Optimization

We give the optimization method for learning the class visual prototypes $R_s$ (seen) and $R_t$ (unseen).
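The full text gives the actual updates; as a hedged stand-in, the sketch below alternates between refitting the shared sparse graph on the current prototypes and a closed-form prototype update for a simplified objective $\|R - M\|_F^2 + \lambda \|R - WR\|_F^2$, which pulls the prototypes toward the class feature means $M$ while respecting the graph $W$. Both the objective and the updates are our simplification, not the paper's derivation; it reuses the sparse_graph() helper sketched earlier.

```python
import numpy as np
# Relies on sparse_graph() from the earlier sketch (L1-penalized reconstruction).

def update_prototypes(M, W, lam=0.5):
    """Closed-form minimizer of ||R - M||_F^2 + lam * ||R - W @ R||_F^2.

    Setting the gradient to zero gives (I + lam * (I - W).T @ (I - W)) @ R = M.
    """
    c = M.shape[0]
    A = np.eye(c) - W
    return np.linalg.solve(np.eye(c) + lam * A.T @ A, M)

def alternate(M, P, n_iters=5, lam=0.5):
    """Alternate between the shared sparse graph and the visual prototypes.

    M: (c, d) class feature means; P: (c, a) semantic class prototypes.
    """
    R = M.copy()
    for _ in range(n_iters):
        # Fit one graph on visual and semantic prototypes jointly, so the same
        # sparse coefficients must reconstruct a class in both spaces.
        joint = np.hstack([R / np.linalg.norm(R), P / np.linalg.norm(P)])
        W = sparse_graph(joint)
        R = update_prototypes(M, W, lam)
    return R, W
```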

Experiments

We perform extensive experiments on the AwA [19], aPY [7], CUB [33], and SUN [29] datasets. The experimental settings are the same as in [36], and the image features are ResNet-101 features [13]. The compared methods are DAP [19], SAE [17], ESZSL [31], LATEM [34], GAZSL [42], and CDL [15].
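Under such settings, zero shot performance is commonly reported as the average of per-class top-1 accuracies rather than overall accuracy, so that small classes are not swamped by large ones. A small helper for that metric (our own utility, not code from the paper) might look as follows.

```python
import numpy as np

def per_class_accuracy(y_true, y_pred):
    """Average of per-class top-1 accuracies, the usual ZSL evaluation metric."""
    classes = np.unique(y_true)
    accs = [(y_pred[y_true == k] == k).mean() for k in classes]
    return float(np.mean(accs))

# Toy usage.
y_true = np.array([0, 0, 0, 1, 1, 2])
y_pred = np.array([0, 0, 1, 1, 1, 2])
print(per_class_accuracy(y_true, y_pred))  # (2/3 + 1 + 1) / 3 ≈ 0.889
```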

Conclusion

In this paper, we proposed a zero shot learning method based on class visual prototypes and semantic consistency. We learn the seen and unseen class visual prototypes while enforcing the semantic consistency represented by a sparse graph shared by the visual and semantic spaces. The sparse graph is founded on L1 norm minimization to capture the main relationships between different classes. Instead of using fixed seen class visual prototypes, the seen class visual prototypes are learnt jointly with the shared sparse graph in a single process.

Declaration of Competing Interest

The authors declare that they have no conflict of interest regarding this work.

Acknowledgments

This work is supported by China Postdoctoral Science Foundation funded project under Grant 2018M631125, National Natural Science Foundation of China (Grant No. 61806155, 61472305, 61070143), Fundamental Research Funds for the Central Universities under Grant XJS18037, Science and technology project of Shaanxi province, China (Grant No. 2015GY027), Aeronautical Science Foundation of China (Grant No. 20151981009), Key Science and Technology Program of Shaanxi Province, China (No. 2016GY-112),

References (42)

  • S. Deutsch et al., Zero shot learning via multi-scale manifold regularization, Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2017.

  • Z. Ding et al., Generative zero-shot learning via low-rank embedded semantic dictionary, IEEE Trans. Pattern Anal. Mach. Intell., 2018.

  • G. Dinu et al., Improving zero-shot learning by mitigating the hubness problem, Workshop at ICLR, 2015.

  • A. Farhadi et al., Describing objects by their attributes, Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2009.

  • A. Frome et al., DeViSE: a deep visual-semantic embedding model, Advances in Neural Information Processing Systems, 2013.

  • Y. Fu et al., Learning multimodal latent attributes, IEEE Trans. Pattern Anal. Mach. Intell., 2014.

  • Y. Fu et al., Transductive multi-view zero-shot learning, IEEE Trans. Pattern Anal. Mach. Intell., 2015.

  • C. Gan et al., Learning attributes equals multi-source domain generalization, Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2016.

  • K. He et al., Deep residual learning for image recognition, Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2016.

  • H. Jiang et al., Learning class prototypes via structure alignment for zero-shot recognition, Proceedings of the European Conference on Computer Vision (ECCV), 2018.

  • E. Kodirov et al., Unsupervised domain adaptation for zero-shot learning, Proceedings of the IEEE International Conference on Computer Vision, 2015.