Multi-label classification using hierarchical embedding
Introduction
The objective of multi-label classification is to build a classifier that automatically tags an example with the most relevant subset of labels. The problem can be seen as a generalization of single-label classification, where an instance is associated with a unique class label from a set of disjoint labels. Formally, given n training examples in the form of a feature matrix X ∈ R^(n×d) and a label matrix Y ∈ {0,1}^(n×L), where each example is a row of X and its associated labels are the corresponding row of Y, the task of multi-label classification is to learn a parameterization that maps each instance to a set of labels. Recent years have witnessed extensive applications of multi-label classification in machine learning (Read, Pfahringer, Holmes, & Frank, 2009; Zhang & Zhou, 2006), in computer vision (Boutell, Luo, Shen, & Brown, 2004; Cabral, Torre, Costeira, & Bernardino, 2011), and in data mining (Schapire & Singer, 2000; Tsoumakas & Vlahavas, 2007).
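To make the setting concrete, the following NumPy sketch (with hypothetical toy numbers, not data from the paper) shows the layout described above: a feature matrix whose rows are instances and a binary label matrix whose rows encode the associated label subsets.

```python
import numpy as np

# Hypothetical toy problem: n = 4 examples, d = 3 features, L = 2 labels.
n, d, L = 4, 3, 2
X = np.array([[1.0, 0.0, 2.0],
              [0.5, 1.5, 0.0],
              [2.0, 0.5, 1.0],
              [0.0, 2.0, 0.5]])   # feature matrix: one example per row
Y = np.array([[1, 0],
              [0, 1],
              [1, 1],
              [0, 1]])            # label matrix: row i is the label subset of example i

def label_subset(i):
    """Return the set of label indices assigned to example i."""
    return set(np.flatnonzero(Y[i]))
```

Single-label classification is the special case in which every row of Y contains exactly one 1; here example 2 carries both labels at once.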
Existing methods of multi-label classification can be broadly divided into two categories (Sorower, 2010; Zhang & Zhou, 2014): methods based on problem transformation and methods based on algorithm adaptation. The former transforms the multi-label classification problem into one or more single-label classification problems so that existing single-label classification algorithms can be applied. During the last decade, a number of problem transformation techniques have been proposed in the literature, such as Binary Relevance (BR) (Boutell et al., 2004), Calibrated Label Ranking (Fürnkranz, Hüllermeier, Mencía, & Brinker, 2008), Classifier Chains (Read et al., 2009), and Random k-labelsets (Tsoumakas & Vlahavas, 2007). Methods based on algorithm adaptation, on the other hand, extend or adapt learning techniques to deal with multi-label data directly. Representative algorithms include AdaBoost.MH and AdaBoost.MR (Schapire & Singer, 2000), two simple extensions of AdaBoost; ML-DT (Clare & King, 2001), which adapts decision tree techniques; and lazy learning techniques such as ML-kNN (Zhang & Zhou, 2007) and BR-kNN (Spyromitros, Tsoumakas, & Vlahavas, 2008), to name a few.
To cope with the challenge of an exponential-sized output space, modeling inter-label correlations has been the major thrust of research in multi-label classification in recent years (Bi & Kwok, 2014; Huang, Zhou, & Zhou, 2012; Li et al., 2016), and for this, parametrization and embedding have been the prime focus (Cabral et al., 2011; Huang, Li, Huang, & Wu, 2015; Huang et al., 2012; Li et al., 2016; Yu, Jain, Kar, & Dhillon, 2014). There are two embedding strategies for exploiting inter-label correlation. The first is to learn label-specific features for each class label. In Huang et al. (2015, 2016), a parametrized approach is suggested to transform data from the original feature space to a label-specific feature space, under the assumption that each class label is associated with a sparse set of label-specific features. The second approach models inter-label correlation implicitly using a low-rank parametrization (Cabral et al., 2011; Yu et al., 2014). It remains an open question whether the low-rank embedding or the label-specific sparse transformation models label correlation more accurately when the label set is sufficiently large. Both approaches are essentially parametrizations introduced to overcome the complexity of multi-label classification, and most often a linear parametrization is adopted. Some researchers (Kimura, Kudo, & Sun, 2016; Li & Guo, 2015) suggest a natural extension of their linear parametrization to nonlinear cases, but no detailed study has been undertaken in this direction. Moreover, none of these approaches yields accuracy beyond a particular level on problems with large data and a large number of labels.
Our experimental and theoretical study of recent approaches to multi-label classification reveals several important aspects of the problem. It is clear that a single linear embedding h may not take us very far toward accurate multi-label classification. There are several reasons for this: the diversity of the training set, the correlation among labels, the feature-label relationship, and, most importantly, the learning algorithm used to determine the mapping h. Normally, h is determined by a process of nonlinear optimization. We conclude, from our experience with all the major algorithms proposed so far, that using the entire training set to train a single h is not appropriate, and that a single embedding h for all instances is not suitable when inter-label correlation exists. A research question that naturally arises is whether there can be a piecewise-linear parametrization. In this paper, we investigate this question and propose a novel method that generates optimal embeddings for subsets of training examples. Our method is novel in that it judiciously selects a subset of training examples for training and then assigns a suitable subset of the training set to each embedding. Using multiple embeddings and their assigned training sets, a new instance is classified, and we show that the proposed method outperforms all major algorithms on all major benchmark datasets.
The rest of the paper is organized as follows. Section 2 briefly reviews earlier research on multi-label learning. The outline of the proposed method is described in Section 3. We introduce our proposed method, termed MLC-HMF, in Section 4. Experimental analysis of the proposed method is reported in Section 5. Finally, Section 6 concludes and indicates several issues for future work.
Related work
Given a feature matrix X and a label matrix Y, the goal of linear parametrization is to learn the parameter W, and a common formulation is the following optimization problem with regularized loss:

    min_W  ℓ(XW, Y) + λ R(W)

where ℓ(·) is a loss function that measures how well XW approximates Y, R(·) is a regularization function that promotes various desired properties in W (low-rank, sparsity, group-sparsity, etc.), and the constant λ ≥ 0 is the regularization parameter which controls the trade-off between the loss and the regularizer.
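As an illustration of this formulation, the sketch below instantiates ℓ as the squared Frobenius error and R as the squared Frobenius norm (one specific, convenient choice, not necessarily the loss and regularizer used in the paper), which admits a closed-form ridge solution:

```python
import numpy as np

def learn_linear_W(X, Y, lam=0.1):
    """Solve min_W ||XW - Y||_F^2 + lam * ||W||_F^2.

    With squared loss and Frobenius regularizer, the minimizer has the
    closed form W = (X^T X + lam * I)^{-1} X^T Y.
    """
    d = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(d), X.T @ Y)

def predict(X, W, threshold=0.5):
    """Threshold the real-valued scores XW to obtain a binary label matrix."""
    return (X @ W >= threshold).astype(int)
```

Other choices of R(·) (e.g. a nuclear-norm penalty promoting low rank) lose this closed form and are typically handled by iterative solvers.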
Outline of the proposed approach
In this section we introduce the underlying principle of the proposed method. We start with the formulation given in Eq. (2). One way to exploit correlations among the labels is to factor the parameter matrix as W = UV, where U can be interpreted as an embedding of the features into a k-dimensional latent space and V is a linear classifier on this space. Regularization is provided by constraining the dimensionality k of the latent space. The minimization over U and V is, unfortunately, non-convex…
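A minimal sketch of the factorization W = UV is shown below, here obtained by truncated SVD of a given W (one standard way to produce a rank-k factor pair; the method in the paper instead learns U and V jointly within the training objective):

```python
import numpy as np

def lowrank_factor(W, k):
    """Rank-k factorization W ≈ U V via truncated SVD.

    U (d x k) embeds the features into a k-dimensional latent space;
    V (k x L) acts as a linear classifier on that space.
    """
    P, s, Qt = np.linalg.svd(W, full_matrices=False)
    U = P[:, :k] * s[:k]   # fold the singular values into U
    V = Qt[:k, :]
    return U, V
```

When rank(W) ≤ k, the product UV recovers W exactly; otherwise it is the best rank-k approximation in Frobenius norm.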
MLC-HMF: the proposed method
In this section, a novel method for multi-label classification is proposed, based on a tree structure constructed using k-means clustering. Algorithm 1 outlines the main flow of the proposed method. For each node in the tree, the joint learning framework given in Eq. (4), with a low-rank constraint on the parametrization (embedding), and multi-label classification are performed simultaneously. At every node, we maintain the mapping U and label feature matrix V along with the training examples whose…
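The hierarchical part can be sketched as a recursive k-means partition of the training indices. The function names, the plain Lloyd's k-means, and the stopping parameters below are illustrative assumptions, and the per-node learning of the embedding pair (U, V) is omitted:

```python
import numpy as np

def kmeans(X, k, iters=20, seed=0):
    """Plain Lloyd's k-means; returns a cluster index for each row of X."""
    rng = np.random.default_rng(seed)
    centers = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(iters):
        dists = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(-1)
        assign = dists.argmin(1)
        for j in range(k):
            if (assign == j).any():
                centers[j] = X[assign == j].mean(0)
    return assign

def build_tree(idx, X, k=2, min_size=2, depth=0, max_depth=3):
    """Recursively split the training indices with k-means.

    Each node records the examples assigned to it; in the full method,
    a low-rank embedding (U, V) would be trained per node on exactly
    those examples.
    """
    node = {"idx": idx, "children": []}
    if len(idx) <= min_size or depth >= max_depth:
        return node
    assign = kmeans(X[idx], k)
    for j in range(k):
        child = idx[assign == j]
        if 0 < len(child) < len(idx):  # skip degenerate splits
            node["children"].append(
                build_tree(child, X, k, min_size, depth + 1, max_depth))
    return node
```

At prediction time, a new instance would be routed down this tree to a node and classified with that node's embedding.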
Experimental analysis
This section discusses the experimental setup. We use twelve multi-label benchmark datasets for experiments, and the detailed characteristics of these datasets are summarized in Table 1. All of these datasets can be downloaded from the LABIC, MEKA, and MULAN repositories.
To measure the performance of the different algorithms, we employ six evaluation metrics popularly used in multi-label learning…
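The six metrics are not enumerated in this excerpt; as an illustration, two metrics commonly used in multi-label evaluation can be computed as follows (a sketch of standard definitions, not necessarily the paper's exact metric set):

```python
import numpy as np

def hamming_loss(Y_true, Y_pred):
    """Fraction of example-label pairs that are misclassified."""
    return float(np.mean(Y_true != Y_pred))

def subset_accuracy(Y_true, Y_pred):
    """Fraction of examples whose predicted label set matches exactly."""
    return float(np.mean((Y_true == Y_pred).all(axis=1)))
```

Hamming loss credits partially correct label sets, while subset accuracy is all-or-nothing per example, so the two can rank methods differently.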
Conclusions and discussion
This paper presented a new multi-label classification method, called MLC-HMF, which learns piecewise-linear embedding with a low-rank constraint on parametrization to capture nonlinear intrinsic relationships that exist in the original feature and label space. Extensive comparative studies validate the effectiveness of MLC-HMF against the state-of-the-art multi-label learning approaches.
In the multi-label classification problem, infrequently occurring (tail) labels are associated with few training examples…
References (29)
- Boutell, Luo, Shen, & Brown. Learning multi-label scene classification. Pattern Recognition (2004).
- et al. Collaborative filtering using multiple binary maximum margin matrix factorizations. Information Sciences (2017).
- Zhang & Zhou. ML-kNN: A lazy learning approach to multi-label learning. Pattern Recognition (2007).
- et al. Multilabel classification with label correlations and missing labels. AAAI (2014).
- Cabral, Torre, Costeira, & Bernardino. Matrix completion for multi-label image classification. Advances in Neural Information Processing Systems (2011).
- Chang & Lin. LIBSVM: A library for support vector machines. ACM Transactions on Intelligent Systems and Technology (2011).
- Clare & King. Knowledge discovery in multi-label phenotype data. European Conference on Principles of Data Mining and Knowledge Discovery (2001).
- Demšar. Statistical comparisons of classifiers over multiple data sets. Journal of Machine Learning Research (2006).
- Fazel, Hindi, & Boyd. A rank minimization heuristic with application to minimum order system approximation. Proceedings of the American Control Conference (2001).
- Fürnkranz, Hüllermeier, Mencía, & Brinker. Multilabel classification via calibrated label ranking. Machine Learning (2008).
- Huang, Li, Huang, & Wu. Learning label specific features for multi-label classification. IEEE International Conference on Data Mining (ICDM) (2015).
- Huang et al. Learning label-specific features and class-dependent labels for multi-label classification. IEEE Transactions on Knowledge and Data Engineering (2016).
- Huang, Zhou, & Zhou. Multi-label learning by exploiting label correlations locally. Twenty-Sixth AAAI Conference on Artificial Intelligence (2012).
- Jain, Murty, & Flynn. Data clustering: A review. ACM Computing Surveys (1999).