Label-activating framework for zero-shot learning
Introduction
While humans can distinguish roughly 30,000 basic object categories (Biederman, 1987) and many more subordinate ones (e.g., breeds of cats), and can even recognize new classes from a few examples with little effort, this remains difficult for machine-learning models, which typically require hundreds of labeled samples per object category. Due to the high cost of collecting and annotating training samples, and motivated by the human ability to recognize without seeing examples, the research area of transfer learning (Chen et al., 2014, Peng et al., 2018), which aims to reuse previously learned knowledge to recognize new classes, has received increasing interest. In particular, zero-shot learning (ZSL) tries to learn a model capable of recognizing unseen categories without labeled training data.
An effective solution to zero-shot learning is to introduce the semantic (attribute) space as an intermediate layer (Farhadi et al., 2009, Lampert et al., 2009). Semantics are high-level descriptions of a class or an instance, so they can be shared across multiple categories. The semantic description of a class bridges the gap between low-level features and high-level class concepts (Palatucci, Pomerleau, Hinton, & Mitchell, 2009). Zero-shot learning exploits the attribute relationship between seen and unseen classes to accomplish recognition. Generally, at training time, probability models or mappings are built from samples of seen categories to establish a connection between the visual and semantic spaces. At test time, features embedded in the semantic space are matched to the correct class prototype by some search method. In addition, test samples may come from both seen and unseen categories in the testing phase, which is called Generalized Zero-Shot Learning (GZSL) (Chao, Changpinyo, Gong, & Sha, 2016). In real-world applications, seen categories are usually more common than unseen ones, so GZSL is more realistic and challenging than ZSL for practical recognition tasks.
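The train-then-search pipeline described above can be sketched in a few lines. The following is a minimal NumPy illustration with synthetic data; the ridge-regularized linear map, the dimensions, and all variable names are illustrative assumptions, not the method of any particular paper.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic setup: 5 seen classes, 3 unseen classes,
# 10-dim attributes, 50-dim visual features.
n_attr, n_feat = 10, 50
seen_attrs = rng.random((5, n_attr))    # attribute prototypes of seen classes
unseen_attrs = rng.random((3, n_attr))  # attribute prototypes of unseen classes

# Training data: visual features of seen-class samples,
# with their class attribute vectors as regression targets.
y = rng.integers(0, 5, size=200)
X = rng.normal(size=(200, n_feat))
A = seen_attrs[y]

# Learn a ridge-regularized linear map W: visual space -> attribute space.
lam = 0.1
W = np.linalg.solve(X.T @ X + lam * np.eye(n_feat), X.T @ A)

# Test time: embed a sample, then match the nearest unseen-class prototype.
x_test = rng.normal(size=(1, n_feat))
a_pred = x_test @ W
dists = np.linalg.norm(unseen_attrs - a_pred, axis=1)
pred_class = int(np.argmin(dists))      # index into the unseen classes
```

The nearest-prototype search at the end is the simplest instance of the "search methods" mentioned above; real systems often use cosine similarity or learned compatibility scores instead.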
Early ZSL works use semantics within a two-stage approach to predict the label of an image belonging to one of the unseen classes. In most cases, the semantics of an input image are inferred in the first stage, and its class label is then predicted by searching for the class with the most similar set of attributes. For example, Direct Attribute Prediction (DAP) (Lampert, Nickisch, & Harmeling, 2014) first estimates the posterior of each attribute for an image by learning probabilistic attribute classifiers. It then calculates the class posteriors and infers the class label with a maximum a posteriori (MAP) estimate. With the aid of semantic descriptions, many ZSL methods learn a model by embedding the visual features and attributes into a common space. To preserve the geometric structure of features in the common space, Deutsch, Kolouri, Kim, Owechko, and Soatto (2017) and Xu et al. (2017) take graph information into account, but this increases the computational complexity. Moreover, many methods directly establish a connection between the visual and attribute spaces; however, due to the strong correlation among attributes, the results are not satisfactory (Jayaraman, Sha, & Grauman, 2014). The reason is that the learned mapping from feature space to attribute space can be seen as a multi-label classifier (Elisseeff & Weston, 2001), which is more complicated than a single-label classifier.
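The DAP inference step can be illustrated concretely. Below is a minimal sketch assuming the per-attribute posteriors p(a_m = 1 | x) have already been produced by pretrained attribute classifiers (here they are just fixed numbers); the attribute signatures and probabilities are invented for illustration.

```python
import numpy as np

# Hypothetical setting: M = 6 binary attributes and 3 unseen classes,
# each row giving one class's known attribute signature.
signatures = np.array([
    [1, 0, 1, 1, 0, 0],   # class 0
    [0, 1, 1, 0, 1, 0],   # class 1
    [1, 1, 0, 0, 0, 1],   # class 2
])
# Per-attribute posteriors p(a_m = 1 | x) for one test image.
p_attr = np.array([0.9, 0.2, 0.8, 0.7, 0.1, 0.3])

# DAP: p(z | x) is proportional to the product over attributes of
# p(a_m = s_zm | x), assuming uniform class and attribute priors.
per_attr = np.where(signatures == 1, p_attr, 1.0 - p_attr)
class_scores = per_attr.prod(axis=1)
pred = int(np.argmax(class_scores))   # MAP estimate of the class
```

With these numbers, class 0 matches the high-probability attributes best and wins the MAP decision.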
In contrast, if the label space is introduced between the visual feature and attribute spaces, the aforementioned problem can be alleviated to some extent. As far as we know, few methods utilize the label space to bridge the visual feature and attribute spaces. Indirect Attribute Prediction (IAP) (Lampert et al., 2014) is an effective method that indirectly obtains each attribute's posterior probability through the label space, then obtains the probability of each unseen category to accomplish the ZSL task. However, IAP has two main disadvantages that may make it unsuitable for the GZSL task. First, the labels of seen and unseen categories are defined in different spaces. Second, these labels have no specific meaning and are only indices of classes, resulting in less discriminative representations.
To handle the above problems, a Label-Activating Framework (LAF) is proposed in this paper. The framework treats the label space as a common space and learns two mappings to activate the original label information. The labels of seen and unseen categories are then defined in the activated space. An example illustrating our framework is shown in Fig. 1.
Assume magpie and perch are seen categories, while penguin and striped killifish are unseen. At training time, given the visual features, labels, and semantic descriptions of seen categories, we learn two mappings that embed the visual and semantic spaces into the label space, respectively. In addition, labels and attributes are required to reconstruct each other. At test time, given the visual features and semantic descriptions of unseen samples, the projected attributes of unseen classes in the label space serve as their labels. For example, penguin relates to both magpie and perch through the attributes “wings” and “aquatic”. Similarly, the semantic description of striped killifish contains “aquatic”, so its projection relates to perch in the label space. In the activated label space, the labels of penguin and striped killifish are no longer one-hot vectors and can be regarded as linear combinations of seen categories, which makes the label space meaningful and discriminative, especially for GZSL. We develop a specific model under our framework and compare it with existing state-of-the-art ZSL models on four datasets. Experimental results show that our framework performs better in most cases, especially on the GZSL task. Our main contributions are summarized as follows:
- We propose a novel ZSL framework that activates the label space by learning two mappings. Labels of unseen classes can then be regarded as linear combinations of labels of seen classes, which is more suitable for the GZSL task.
- After activation, the label space is continuous rather than discrete, which is more natural for the generation of new categories and has stronger discriminative power.
- We develop a specific model based on the proposed framework, which establishes new state-of-the-art performance on ZSL and GZSL across all conventional benchmark datasets.
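The label-activation idea above can be illustrated with a toy computation. The sketch below fits a mapping from seen-class attributes to one-hot labels by ridge regression; this solver and all dimensions are assumptions for illustration, not the paper's exact optimization.

```python
import numpy as np

rng = np.random.default_rng(1)

# Toy setting: 4 seen classes, each with an 8-dim attribute vector
# and a one-hot label in the original (discrete) label space.
n_attr = 8
seen_attrs = rng.random((4, n_attr))
seen_labels = np.eye(4)

# Fit V: attribute space -> label space by ridge regression
# (one possible instantiation of the semantic-to-label mapping).
lam = 0.05
V = np.linalg.solve(seen_attrs.T @ seen_attrs + lam * np.eye(n_attr),
                    seen_attrs.T @ seen_labels)

# An unseen class gets its "activated" label by projecting its attributes:
# the result is a dense vector, i.e. a linear combination of the
# seen-class labels rather than a new one-hot code.
unseen_attr = rng.random(n_attr)
activated_label = unseen_attr @ V
```

Because the activated label lives in the span of seen-class labels, seen and unseen categories are represented in the same space, which is the property the framework targets for GZSL.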
Section snippets
Related works
In this section, we briefly review some works closely related to ours.
Our framework and method
In this section, we firstly present a label-activating framework. Then one specific model and corresponding algorithms are developed. Finally, the classification method is introduced.
Experiments
In this section, we validate the proposed method on four datasets (SUN, CUB, AwA, and aPY) and compare it with a number of state-of-the-art methods. All experiments are performed on a Windows 7 system (Intel Core i7-990X CPU @ 3.46 GHz, 64 GB RAM).
Conclusions
In this paper, a Label-Activating Framework (LAF) is proposed to solve the semantic-based classification problem. The proposed framework consists of two mappings, i.e., visual-to-label activating and semantic-to-label activating. For the mapping from visual feature space to label space, various methods can be used, such as the linear regression with Frobenius norm regularization adopted in this paper. For the mapping from semantic space to label space, we mainly aim to learn a projection from the semantic space to the label space.
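For the Frobenius-regularized linear regression mentioned above, a standard closed-form solution exists; the following sketch uses assumed notation (X for stacked visual features, Y for the label matrix, λ for the regularization weight), not symbols taken from the paper:

```latex
\min_{W}\ \|XW - Y\|_F^2 + \lambda \|W\|_F^2
\quad\Longrightarrow\quad
W^{*} = (X^{\top} X + \lambda I)^{-1} X^{\top} Y
```

Setting the gradient \(2X^{\top}(XW - Y) + 2\lambda W\) to zero yields the normal equations \((X^{\top}X + \lambda I)W = X^{\top}Y\), whose solution is the expression above; the regularizer guarantees the matrix is invertible.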
Acknowledgments
This work is supported in part by the National Natural Science Foundation of China under Grants 61906141, 61773302, 61432014 and 61772402, in part by the China Postdoctoral Science Foundation (Grant 2019M653564), and in part by the Fundamental Research Funds for the Central Universities.
References (53)
- et al. Label-embedding for image classification. IEEE Transactions on Pattern Analysis and Machine Intelligence (2016)
- et al. Evaluation of output embeddings for fine-grained image classification
- et al. Preserving semantic relations for zero-shot learning
- et al. Solution of the matrix equation AX + XB = C [F4]. Communications of the ACM (1972)
- Recognition-by-components: A theory of human image understanding. Psychological Review (1987)
- et al. Distributed optimization and statistical learning via the alternating direction method of multipliers. Foundations and Trends in Machine Learning (2011)
- et al. A singular value thresholding algorithm for matrix completion. SIAM Journal on Optimization (2010)
- et al. Robust principal component analysis? Journal of the ACM (2011)
- et al. Synthesized classifiers for zero-shot learning
- et al. An empirical study and analysis of generalized zero-shot learning for object recognition in the wild. Frontiers of Information Technology and Electronic Engineering (2016)
- NEIL: Extracting visual knowledge from web data
- Zero-shot visual recognition using semantics-preserving adversarial embedding network
- Zero-shot learning via multi-scale manifold regularization
- A kernel method for multi-labelled classification
- Describing objects by their attributes
- DeViSE: A deep visual-semantic embedding model
- Exploring commonality and individuality for multi-modal curriculum learning
- Learning with inadequate and incorrect supervision
- Deep residual learning for image recognition. Computer Vision and Pattern Recognition
- Decorrelating semantic visual attributes by resisting the urge to share
- Learning class prototypes via structure alignment for zero-shot recognition
- Unsupervised domain adaptation for zero-shot learning
- Semantic autoencoder for zero-shot learning. Computer Vision and Pattern Recognition
- Learning to detect unseen object classes by between-class attribute transfer
- Attribute-based classification for zero-shot visual object categorization. IEEE Transactions on Pattern Analysis and Machine Intelligence