Label-activating framework for zero-shot learning
Introduction
While humans can distinguish roughly 30,000 basic object categories (Biederman, 1987) and many more subordinate ones (e.g., breeds of cats), and can even recognize new classes from a few examples with little effort, this remains difficult for machine-learning models, which typically require hundreds of labeled samples per object category. Due to the high cost of collecting and annotating training samples, and motivated by the human ability to recognize without seeing examples, the research area of transfer learning (Chen et al., 2014, Peng et al., 2018), which aims to reuse previously learned knowledge to recognize new classes, has received increasing interest. In particular, zero-shot learning (ZSL) tries to learn a model capable of recognizing unseen categories without labeled training data.
An effective solution to zero-shot learning is to introduce the semantic (attribute) space as an intermediate layer (Farhadi et al., 2009, Lampert et al., 2009). Semantics are high-level descriptions of a class or an instance, so they can be shared across multiple categories. The semantic description of a class bridges the gap between low-level features and high-level class concepts (Palatucci, Pomerleau, Hinton, & Mitchell, 2009). Zero-shot learning exploits the attribute relationship between seen and unseen classes to accomplish recognition. Generally, at training time, probability models or mappings are built from samples of seen categories to establish a connection between the visual and semantic spaces. At test time, features embedded in the semantic space are matched to the correct class prototype by some search method. In addition, test samples may come from both seen and unseen categories in the testing phase, which is called Generalized Zero-Shot Learning (GZSL) (Chao, Changpinyo, Gong, & Sha, 2016). In real-world applications, seen categories are usually more common than unseen ones, so GZSL is more realistic and challenging than ZSL for practical recognition tasks.
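The train-then-search pipeline described above can be sketched in a few lines. The following is a minimal NumPy illustration with synthetic data; the ridge-regularized linear map, the dimensions, and all variable names are illustrative assumptions, not the method of any particular paper.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic setup: 5 seen classes, 3 unseen classes,
# 10-dim attributes, 50-dim visual features.
n_attr, n_feat = 10, 50
seen_attrs = rng.random((5, n_attr))    # attribute prototypes of seen classes
unseen_attrs = rng.random((3, n_attr))  # attribute prototypes of unseen classes

# Training data: visual features of seen-class samples,
# with their class attribute vectors as regression targets.
y = rng.integers(0, 5, size=200)
X = rng.normal(size=(200, n_feat))
A = seen_attrs[y]

# Learn a ridge-regularized linear map W: visual space -> attribute space.
lam = 0.1
W = np.linalg.solve(X.T @ X + lam * np.eye(n_feat), X.T @ A)

# Test time: embed a sample, then match the nearest unseen-class prototype.
x_test = rng.normal(size=(1, n_feat))
a_pred = x_test @ W
dists = np.linalg.norm(unseen_attrs - a_pred, axis=1)
pred_class = int(np.argmin(dists))      # index into the unseen classes
```

The nearest-prototype search at the end is the simplest instance of the "search methods" mentioned above; real systems often use cosine similarity or learned compatibility scores instead.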
Early ZSL works use semantics within a two-stage approach to predict the label of an image belonging to one of the unseen classes. In most cases, the semantics of an input image are inferred in the first stage, and its class label is then predicted by searching for the class with the most similar set of attributes. For example, Direct Attribute Prediction (DAP) (Lampert, Nickisch, & Harmeling, 2014) first estimates the posterior of each attribute for an image by learning probabilistic attribute classifiers. It then calculates the class posteriors and infers the class label with a maximum a posteriori (MAP) estimate. With the aid of semantic descriptions, many ZSL methods learn a model by embedding the visual features and attributes into a common space. To preserve the geometric structure of features in the common space, Deutsch, Kolouri, Kim, Owechko, and Soatto (2017) and Xu et al. (2017) take graph information into account, but this increases the computational complexity. Moreover, many methods directly establish a connection between the visual and attribute spaces; however, due to the strong correlation among attributes, the results are not satisfactory (Jayaraman, Sha, & Grauman, 2014). The reason is that the learned mapping from feature space to attribute space can be seen as a multi-label classifier (Elisseeff & Weston, 2001), which is more complicated than a single-label classifier.
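The DAP inference step can be illustrated concretely. Below is a minimal sketch assuming the per-attribute posteriors p(a_m = 1 | x) have already been produced by pretrained attribute classifiers (here they are just fixed numbers); the attribute signatures and probabilities are invented for illustration.

```python
import numpy as np

# Hypothetical setting: M = 6 binary attributes and 3 unseen classes,
# each row giving one class's known attribute signature.
signatures = np.array([
    [1, 0, 1, 1, 0, 0],   # class 0
    [0, 1, 1, 0, 1, 0],   # class 1
    [1, 1, 0, 0, 0, 1],   # class 2
])
# Per-attribute posteriors p(a_m = 1 | x) for one test image.
p_attr = np.array([0.9, 0.2, 0.8, 0.7, 0.1, 0.3])

# DAP: p(z | x) is proportional to the product over attributes of
# p(a_m = s_zm | x), assuming uniform class and attribute priors.
per_attr = np.where(signatures == 1, p_attr, 1.0 - p_attr)
class_scores = per_attr.prod(axis=1)
pred = int(np.argmax(class_scores))   # MAP estimate of the class
```

With these numbers, class 0 matches the high-probability attributes best and wins the MAP decision.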
In contrast, if the label space is introduced between the visual feature and attribute spaces, the aforementioned problem can be alleviated to some extent. As far as we know, few methods utilize the label space to bridge the visual feature and attribute spaces. Indirect Attribute Prediction (IAP) (Lampert et al., 2014) is an effective method that indirectly obtains each attribute's posterior probability through the label space, then obtains the probability of each unseen category to accomplish the ZSL task. However, IAP has two main disadvantages that may make it unsuitable for the GZSL task. First, the labels of seen and unseen categories are defined in different spaces. Second, these labels have no specific meaning and are only indices of classes, resulting in less discriminative representations.
To handle the above problems, a Label-Activating Framework (LAF) is proposed in this paper. The framework treats the label space as a common space and learns two mappings to activate the original label information. The labels of seen and unseen categories are then defined in the activated space. An example illustrating our framework is shown in Fig. 1.
Assume magpie and perch are seen categories, while penguin and striped killifish are unseen. At training time, given the visual features, labels, and semantic descriptions of seen categories, we learn two mappings that embed the visual and semantic spaces into the label space, respectively. In addition, labels and attributes are required to reconstruct each other. At test time, given the visual features and semantic descriptions of unseen samples, the projected attributes of unseen classes in the label space serve as their labels. For example, penguin relates to both magpie and perch through the attributes “wings” and “aquatic”. Similarly, the semantic description of striped killifish contains “aquatic”, so its projection relates to perch in the label space. In the activated label space, the labels of penguin and striped killifish are no longer one-hot vectors and can be regarded as linear combinations of seen categories, which makes the label space meaningful and discriminative, especially for GZSL. We develop a specific model under our framework and compare it with existing state-of-the-art ZSL models on four datasets. Experimental results show that our framework performs better in most cases, especially on the GZSL task. Our main contributions are summarized as follows:
- We propose a novel ZSL framework that activates the label space by learning two mappings. Labels of unseen classes can then be regarded as linear combinations of labels of seen classes, which is more suitable for the GZSL task.
- After activation, the label space is continuous rather than discrete, which is more natural for the generation of new categories and has stronger discriminative power.
- We develop a specific model based on the proposed framework, which establishes new state-of-the-art performance on ZSL and GZSL across all conventional benchmark datasets.
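The label-activation idea above can be illustrated with a toy computation. The sketch below fits a mapping from seen-class attributes to one-hot labels by ridge regression; this solver and all dimensions are assumptions for illustration, not the paper's exact optimization.

```python
import numpy as np

rng = np.random.default_rng(1)

# Toy setting: 4 seen classes, each with an 8-dim attribute vector
# and a one-hot label in the original (discrete) label space.
n_attr = 8
seen_attrs = rng.random((4, n_attr))
seen_labels = np.eye(4)

# Fit V: attribute space -> label space by ridge regression
# (one possible instantiation of the semantic-to-label mapping).
lam = 0.05
V = np.linalg.solve(seen_attrs.T @ seen_attrs + lam * np.eye(n_attr),
                    seen_attrs.T @ seen_labels)

# An unseen class gets its "activated" label by projecting its attributes:
# the result is a dense vector, i.e. a linear combination of the
# seen-class labels rather than a new one-hot code.
unseen_attr = rng.random(n_attr)
activated_label = unseen_attr @ V
```

Because the activated label lives in the span of seen-class labels, seen and unseen categories are represented in the same space, which is the property the framework targets for GZSL.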
Section snippets
Related works
In this section, we briefly review some works closely related to ours.
Our framework and method
In this section, we firstly present a label-activating framework. Then one specific model and corresponding algorithms are developed. Finally, the classification method is introduced.
Experiments
In this section, we validate the proposed method on four datasets (SUN, CUB, AwA, and aPY) and compare it with a number of state-of-the-art methods. All experiments are performed on a Windows 7 system (Intel Core i7-990X CPU @ 3.46 GHz, 64 GB RAM).
Conclusions
In this paper, a Label-Activating Framework (LAF) is proposed to solve the semantic-based classification problem. The proposed framework consists of two mappings, i.e., visual-to-label activating and semantic-to-label activating. For the mapping from visual feature space to label space, various methods can be used, such as the linear regression with Frobenius norm regularization adopted in this paper. For the mapping from semantic space to label space, we mainly aim to learn a projection from the semantic space to the label space.
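For the Frobenius-regularized linear regression mentioned above, a standard closed-form solution exists; the following sketch uses assumed notation (X for stacked visual features, Y for the label matrix, λ for the regularization weight), not symbols taken from the paper:

```latex
\min_{W}\ \|XW - Y\|_F^2 + \lambda \|W\|_F^2
\quad\Longrightarrow\quad
W^{*} = (X^{\top} X + \lambda I)^{-1} X^{\top} Y
```

Setting the gradient \(2X^{\top}(XW - Y) + 2\lambda W\) to zero yields the normal equations \((X^{\top}X + \lambda I)W = X^{\top}Y\), whose solution is the expression above; the regularizer guarantees the matrix is invertible.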
Acknowledgments
This work is supported in part by the National Natural Science Foundation of China under Grants 61906141, 61773302, 61432014 and 61772402, in part by the China Postdoctoral Science Foundation (Grant 2019M653564), and in part by the Fundamental Research Funds for the Central Universities.
References (53)
- et al. Label-embedding for image classification. IEEE Transactions on Pattern Analysis and Machine Intelligence (2016)
- et al. Evaluation of output embeddings for fine-grained image classification
- et al. Preserving semantic relations for zero-shot learning
- et al. Solution of the matrix equation AX + XB = C [F4]. Communications of the ACM (1972)
- Recognition-by-components: A theory of human image understanding. Psychological Review (1987)
- et al. Distributed optimization and statistical learning via the alternating direction method of multipliers. Foundations and Trends in Machine Learning (2011)
- et al. A singular value thresholding algorithm for matrix completion. SIAM Journal on Optimization (2010)
- et al. Robust principal component analysis? Journal of the ACM (2011)
- et al. Synthesized classifiers for zero-shot learning
- et al. An empirical study and analysis of generalized zero-shot learning for object recognition in the wild. Frontiers of Information Technology and Electronic Engineering (2016)
- NEIL: Extracting visual knowledge from web data
- Zero-shot visual recognition using semantics-preserving adversarial embedding network
- Zero-shot learning via multi-scale manifold regularization
- A kernel method for multi-labelled classification
- Describing objects by their attributes
- DeViSE: A deep visual-semantic embedding model
- Exploring commonality and individuality for multi-modal curriculum learning
- Learning with inadequate and incorrect supervision
- Deep residual learning for image recognition. Computer Vision and Pattern Recognition
- Decorrelating semantic visual attributes by resisting the urge to share
- Learning class prototypes via structure alignment for zero-shot recognition
- Unsupervised domain adaptation for zero-shot learning
- Semantic autoencoder for zero-shot learning. Computer Vision and Pattern Recognition
- Learning to detect unseen object classes by between-class attribute transfer
- Attribute-based classification for zero-shot visual object categorization. IEEE Transactions on Pattern Analysis and Machine Intelligence