Elsevier

Neural Networks

Volume 121, January 2020, Pages 1-9

Label-activating framework for zero-shot learning

https://doi.org/10.1016/j.neunet.2019.08.023

Abstract

Existing zero-shot learning (ZSL) models usually learn mappings between the visual space and the semantic space. However, few of them take label information into account. Indirect Attribute Prediction (IAP) learns the posterior probability of each attribute through the label space, but the labels of seen and unseen classes are defined in different spaces, which makes it unsuitable for Generalized ZSL (GZSL). We propose a Label-Activating Framework (LAF) for semantic-based classification. The purpose of the proposed framework is to activate the label space by learning mappings from vision and semantics to labels. In the training phase, the original label space made up of one-hot vectors is used as the common space, onto which visual features and semantic information are embedded. After the label space is activated, the labels of unseen classes can be regarded as linear combinations of the labels of seen classes. In this way, seen and unseen labels are defined in the same space, and the labels carry specific meanings rather than being mere signs of each class. This makes the activated label space highly discriminative, especially for GZSL, which is more challenging and more realistic for real-world tasks. In addition, we develop a specific model based on the framework, which effectively mitigates the projection domain shift problem. Extensive experiments show that our framework outperforms state-of-the-art methods and demonstrate its suitability for GZSL.

Introduction

While humans can distinguish 30,000 basic object categories (Biederman, 1987) and many subordinate ones (e.g., breeds of cats), and can even recognize new classes dynamically from a few examples with little effort, this is not easy for computer-based machine learning models, which typically require hundreds of labeled samples for each object category. Due to the high cost of collecting and annotating training samples, and motivated by the human ability to recognize without seeing examples, the research area of transfer learning (Chen et al., 2014, Peng et al., 2018) has received increasing interest; it aims to make good use of previously learned knowledge to recognize new classes. In particular, zero-shot learning (ZSL) tries to learn a model capable of recognizing unseen categories without labeled training data.

An effective solution to zero-shot learning is to introduce the semantic (attribute) space as an intermediate layer (Farhadi et al., 2009, Lampert et al., 2009). Semantics are high-level descriptions of a class or an instance, so they can be shared by multiple categories. The semantic description of a class bridges the gap between low-level features and high-level class concepts (Palatucci, Pomerleau, Hinton, & Mitchell, 2009). Zero-shot learning exploits the attribute relationship between seen and unseen classes to accomplish the recognition task. Generally, at training time, probability models or mappings are built from samples of seen categories to establish the connection between the visual and semantic spaces. At test time, features embedded in the semantic space are matched to the correct class prototype by some search method. In addition, test samples may come from both seen and unseen categories in the testing phase, which is called Generalized Zero-Shot Learning (GZSL) (Chao, Changpinyo, Gong, & Sha, 2016). In real-world applications, seen categories are usually more common than unseen ones, so GZSL is more realistic and challenging than ZSL for practical recognition tasks.
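The embed-then-search pipeline described above can be sketched in a few lines. In this toy example, the mapping `W`, the class prototypes, and all dimensions are made-up placeholders (not the paper's model); a test feature is projected into the semantic space and matched to the nearest prototype by cosine similarity:

```python
import numpy as np

# Hypothetical dimensions: d visual features, a attributes, C unseen classes.
rng = np.random.default_rng(0)
d, a, C = 6, 4, 3

W = rng.standard_normal((d, a))           # assumed visual -> semantic mapping
prototypes = rng.standard_normal((C, a))  # one semantic prototype per unseen class

def classify(x, W, prototypes):
    """Embed a visual feature into semantic space, then match the
    nearest class prototype by cosine similarity."""
    s = x @ W                                             # projected semantics
    sims = prototypes @ s / (np.linalg.norm(prototypes, axis=1)
                             * np.linalg.norm(s) + 1e-12)
    return int(np.argmax(sims))                           # predicted class index

x = rng.standard_normal(d)
pred = classify(x, W, prototypes)
```

In practice the search method and similarity measure vary between models; cosine similarity is just one common choice.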

In early works on ZSL, the semantics within a two-stage approach are used to predict the label of an image that belongs to one of the unseen classes. In most cases, the semantics of an input image are inferred in the first stage, then its class label is predicted by searching for the class with the most similar set of attributes. For example, Direct Attribute Prediction (DAP) (Lampert, Nickisch, & Harmeling, 2014) first estimates the posterior of each attribute for an image by learning probabilistic attribute classifiers. It then calculates the class posteriors and infers the class label with a maximum a posteriori (MAP) estimate. With the aid of semantic descriptions, many ZSL methods learn a model by embedding visual features and attributes into a common space. To preserve the geometric structure of features in the common space, Deutsch, Kolouri, Kim, Owechko, and Soatto (2017) and Xu et al. (2017) take graph information into account, but this increases the computational complexity. Moreover, many methods directly establish the connection between the visual and attribute spaces; however, due to the strong correlation among attributes, the results are not satisfactory (Jayaraman, Sha, & Grauman, 2014). The reason is that the learned mapping from feature space to attribute space can be seen as a multi-label classifier (Elisseeff & Weston, 2001), which is more complicated than a single-label classifier.
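As a concrete illustration of DAP's two stages, the sketch below takes hypothetical per-attribute posteriors (as if produced by already-trained attribute classifiers) and a made-up class-attribute table, then performs the MAP step under a uniform class prior:

```python
import numpy as np

# Hypothetical inputs: M = 3 binary attributes, C = 3 unseen classes.
attr_post = np.array([0.9, 0.2, 0.7])   # p(a_m = 1 | x) from attribute classifiers
signatures = np.array([[1, 0, 1],       # class-attribute table: one row
                       [0, 1, 1],       # per unseen class
                       [1, 1, 0]])

def dap_map(attr_post, signatures):
    """MAP class estimate: p(y|x) is proportional to the product over attributes
    of p(a_m = signature_m | x), assuming independent attributes and a uniform prior."""
    probs = np.where(signatures == 1, attr_post, 1.0 - attr_post)
    class_scores = probs.prod(axis=1)
    return int(np.argmax(class_scores)), class_scores

pred, scores = dap_map(attr_post, signatures)  # class 0 scores 0.9*0.8*0.7 = 0.504
```

The independence assumption across attributes is exactly what the correlation problem noted above undermines.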

In contrast, if the label space is introduced between the visual feature and attribute spaces, the aforementioned problem can be alleviated to some extent. As far as we know, few methods utilize the label space to bridge the visual feature and attribute spaces. Indirect Attribute Prediction (IAP) (Lampert et al., 2014) is an effective method that indirectly obtains the posterior probability of each attribute through the label space, then obtains the probability of each unseen category and accomplishes the ZSL task. However, IAP has two main disadvantages that may make it unsuitable for the GZSL task. First, the labels of seen and unseen categories are defined in different spaces. Second, these labels have no specific meanings and are merely signs of classes, resulting in less discriminative power.
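IAP's indirect route can be illustrated in a few lines. The seen-class posteriors and attribute signatures below are invented numbers, assuming deterministic class-attribute associations; the attribute posterior is obtained by marginalizing over the seen labels:

```python
import numpy as np

# Hypothetical inputs: 3 seen classes, 2 binary attributes.
seen_post = np.array([0.7, 0.2, 0.1])   # p(y_k | x) from a standard classifier
seen_sign = np.array([[1, 0],           # p(a_m = 1 | y_k): deterministic
                      [1, 1],           # attribute signature of each
                      [0, 1]], float)   # seen class

# IAP: attribute posteriors obtained indirectly through the label space,
# p(a_m = 1 | x) = sum_k p(a_m = 1 | y_k) * p(y_k | x).
attr_post = seen_post @ seen_sign
```

The resulting `attr_post` would then feed the same MAP step over unseen classes as in DAP.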

To handle the above problems, a Label-Activating Framework (LAF) is proposed in this paper. The framework treats the label space as a common space and learns two mappings to activate the original label information. The labels of seen and unseen categories are then defined in the activated space. An example illustrating our framework is shown in Fig. 1.

Assume magpie and perch are seen categories while penguin and striped killifish are two unseen categories. At training time, given the visual features, labels and semantic descriptions of the seen categories, we learn two mappings that embed the visual space and the semantics into the label space, respectively. In addition, labels and attributes are required to reconstruct each other. At test time, given the visual features and semantic descriptions of unseen samples, the projected attributes of unseen classes in the label space are used as their corresponding labels. For example, penguin refers to both magpie and perch through the two attributes "wings" and "aquatic". Similarly, the semantic description of striped killifish contains "aquatic", so its projection refers to perch in the label space. In the activated label space, the labels of penguin and striped killifish are no longer one-hot vectors and can be regarded as linear combinations of seen categories, which makes the label space meaningful and discriminative, especially in GZSL. We develop a specific model under our framework and compare it with existing state-of-the-art ZSL models on four datasets. Experimental results show that our framework performs better in most cases, especially for the GZSL task. Our main contributions are summarized as follows:

  • We propose a novel ZSL framework that activates the label space by learning two mappings. Labels of unseen classes can then be regarded as linear combinations of the labels of seen classes, which is more suitable for the GZSL task.

  • After activation, the label space is continuous rather than discrete. This is more natural for generating new categories and yields a stronger discriminative property.

  • We develop a specific model based on the proposed framework, which establishes new state-of-the-art performance on ZSL and GZSL on all conventional benchmark datasets.
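The magpie/perch/penguin example above can be made concrete with a tiny least-squares sketch. The attribute signatures are toy numbers, and the reconstruction rule is an illustrative reading of the framework rather than the paper's exact objective: the activated label of an unseen class is the linear combination of seen one-hot labels whose mixed attributes best reconstruct its semantics.

```python
import numpy as np

# Toy version of the example: seen attribute signatures over
# ("wings", "aquatic"), and penguin as an unseen class with both.
S_seen = np.array([[1.0, 0.0],    # magpie: wings
                   [0.0, 1.0]])   # perch: aquatic
s_penguin = np.array([1.0, 1.0])  # penguin: wings and aquatic

# Solve S_seen^T c = s_penguin in the least-squares sense; since the seen
# labels are one-hot, c itself is the activated label of the unseen class.
coef, *_ = np.linalg.lstsq(S_seen.T, s_penguin, rcond=None)
activated_label = coef
```

Here penguin's activated label puts equal weight on magpie and perch, matching the intuition in the figure description.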

Section snippets

Related works

In this section, we briefly review some works closely related to ours.

Our framework and method

In this section, we first present the label-activating framework. Then a specific model and the corresponding algorithms are developed. Finally, the classification method is introduced.

Experiments

In this section, we validate the proposed method on four datasets (SUN, CUB, AwA and aPY) and compare it with a number of state-of-the-art methods. All experiments are performed on a Windows 7 system (Intel Core i7-990x CPU @ 3.46 GHz, 64 GB RAM).

Conclusions

In this paper, a Label-Activating Framework (LAF) is proposed to solve the semantic-based classification problem. The proposed framework consists of two mappings, i.e., visual-to-label activating and semantic-to-label activating. For the mapping from the visual feature space to the label space, various methods can be used, such as the linear regression with Frobenius norm regularization adopted in this paper. For the mapping from the semantic space to the label space, we mainly aim to learn a projection from semantic space to
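The visual-to-label mapping mentioned above, linear regression with a Frobenius norm regularizer, admits a closed-form solution W = (XᵀX + λI)⁻¹XᵀY. A minimal sketch with made-up dimensions and random data (the actual features and labels come from the datasets used in the paper):

```python
import numpy as np

# Hypothetical sizes: n seen samples, d-dim visual features, C seen classes.
rng = np.random.default_rng(0)
n, d, C = 50, 8, 5
X = rng.standard_normal((n, d))            # visual features of seen samples
Y = np.eye(C)[rng.integers(0, C, size=n)]  # one-hot label matrix
lam = 0.1                                  # Frobenius-norm regularization weight

# Closed-form ridge solution of min_W ||XW - Y||_F^2 + lam * ||W||_F^2.
W = np.linalg.solve(X.T @ X + lam * np.eye(d), X.T @ Y)

# A test feature's coordinates in the (activated) label space.
x_new = rng.standard_normal(d)
label_scores = x_new @ W
```

The choice of λ trades fit against the norm of the mapping; the paper's specific model adds further structure on top of this basic regression.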

Acknowledgments

This work is supported in part by the National Natural Science Foundation of China under Grants 61906141, 61773302, 61432014 and 61772402, in part by the China Postdoctoral Science Foundation (Grant 2019M653564), and in part by the Fundamental Research Funds for the Central Universities.

References (53)

  • Akata, Z., et al. Label-embedding for image classification. IEEE Transactions on Pattern Analysis and Machine Intelligence (2016)
  • Akata, Z., et al. Evaluation of output embeddings for fine-grained image classification
  • Annadani, Y., et al. Preserving semantic relations for zero-shot learning
  • Bartels, R. H., et al. Solution of the matrix equation AX + XB = C [F4]. Communications of the ACM (1972)
  • Biederman, I. Recognition-by-components: A theory of human image understanding. Psychological Review (1987)
  • Boyd, S., et al. Distributed optimization and statistical learning via the alternating direction method of multipliers. Foundations and Trends in Machine Learning (2011)
  • Cai, J.-F., et al. A singular value thresholding algorithm for matrix completion. SIAM Journal on Optimization (2010)
  • Candès, E. J., et al. Robust principal component analysis? Journal of the ACM (2011)
  • Changpinyo, S., et al. Synthesized classifiers for zero-shot learning
  • Chao, W. L., et al. An empirical study and analysis of generalized zero-shot learning for object recognition in the wild. Frontiers of Information Technology and Electronic Engineering (2016)
  • Chen, X., et al. NEIL: Extracting visual knowledge from web data
  • Chen, L., et al. Zero-shot visual recognition using semantics-preserving adversarial embedding network
  • Deutsch, S., et al. Zero shot learning via multi-scale manifold regularization
  • Elisseeff, A., et al. A kernel method for multi-labelled classification
  • Farhadi, A., et al. Describing objects by their attributes
  • Frome, A., et al. DeViSE: A deep visual-semantic embedding model
  • Gong, C. Exploring commonality and individuality for multi-modal curriculum learning
  • Gong, C., et al. Learning with inadequate and incorrect supervision
  • He, K., et al. Deep residual learning for image recognition. Computer Vision and Pattern Recognition (2016)
  • Jayaraman, D., et al. Decorrelating semantic visual attributes by resisting the urge to share
  • Jiang, H., et al. Learning class prototypes via structure alignment for zero-shot recognition
  • Kodirov, E., et al. Unsupervised domain adaptation for zero-shot learning
  • Kodirov, E., et al. Semantic autoencoder for zero-shot learning. Computer Vision and Pattern Recognition (2017)
  • Lampert, C. H., et al. Learning to detect unseen object classes by between-class attribute transfer (2009)
  • Lampert, C. H., et al. Attribute-based classification for zero-shot visual object categorization. IEEE Transactions on Pattern Analysis and Machine Intelligence (2014)
  • Lin, Z., Chen, M., & Ma, Y. (2010). The augmented Lagrange multiplier method for exact recovery of corrupted low-rank...