Expert Systems with Applications

Volume 91, January 2018, Pages 263-269
Multi-label classification using hierarchical embedding

https://doi.org/10.1016/j.eswa.2017.09.020

Highlights

  • Multi-label learning deals with the classification of data with multiple labels.

  • An output space with many labels is tackled by modeling inter-label correlations.

  • The use of parametrization and embedding has been the prime focus.

  • A piecewise-linear embedding using maximum margin matrix factorization is proposed.

  • Our experimental analysis demonstrates the superiority of the proposed method.

Abstract

Multi-label learning is concerned with the classification of data with multiple class labels, in contrast to the traditional classification problem where every data instance has a single label. Multi-label classification (MLC) is a major research area in the machine learning community and finds application in several domains such as computer vision, data mining and text classification. Due to the exponential size of the output space, exploiting intrinsic information in the feature and label spaces has been the major thrust of research in recent years, and the use of parametrization and embedding has been the prime focus in MLC. Most of the existing methods learn a single linear parametrization using the entire training set and hence fail to capture nonlinear intrinsic information in the feature and label spaces. To overcome this, we propose a piecewise-linear embedding which uses maximum margin matrix factorization to model the linear parametrization. We hypothesize that feature vectors which conform to a similar embedding are similar in some sense. Combining these concepts, we propose a novel hierarchical matrix factorization method for multi-label classification. Practical multi-label classification problems such as image annotation, text categorization and sentiment analysis can be directly solved by the proposed method. We compare our method with six well-known algorithms on twelve benchmark datasets. Our experimental analysis demonstrates the superiority of the proposed method over state-of-the-art algorithms for multi-label learning.

Introduction

The objective of multi-label classification is to build a classifier that can automatically tag an example with the most relevant subset of labels. This problem can be seen as a generalization of single-label classification, where an instance is associated with a unique class label from a set of disjoint labels $L$. Formally, given $n$ training examples in the form of a feature matrix $X$ and a label matrix $Y$, where each example $x_i \in \mathbb{R}^d$, $1 \le i \le n$, is a row of $X$ and its associated label vector $Y_i \in \{\pm 1\}^L$ is the corresponding row of $Y$, the task of multi-label classification is to learn a parametrization $h: \mathbb{R}^d \rightarrow \{\pm 1\}^L$ that maps each instance to a set of labels. Recent years have witnessed extensive applications of multi-label classification in machine learning (Read, Pfahringer, Holmes, & Frank, 2009; Zhang & Zhou, 2006), in computer vision (Boutell, Luo, Shen, & Brown, 2004; Cabral, Torre, Costeira, & Bernardino, 2011), and in data mining (Schapire & Singer, 2000; Tsoumakas & Vlahavas, 2007).
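For concreteness, the following minimal sketch shows the data layout just described, with illustrative sizes and random values (none of which come from the paper):

```python
import numpy as np

n, d, L = 4, 3, 5  # illustrative sizes: 4 examples, 3 features, 5 labels

# Feature matrix X: each row x_i in R^d is one training example.
X = np.random.randn(n, d)

# Label matrix Y: each row Y_i in {-1, +1}^L marks which labels apply to x_i.
Y = np.where(np.random.rand(n, L) > 0.5, 1, -1)

# A multi-label classifier is a mapping h: R^d -> {-1, +1}^L; a linear
# parametrization scores each label through a matrix W in R^{d x L}.
W = np.random.randn(d, L)
predict = lambda x: np.sign(x @ W)
print(predict(X[0]))  # predicted label vector for the first example
```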

Existing methods for multi-label classification can be broadly divided into two categories (Sorower, 2010; Zhang & Zhou, 2014): methods based on problem transformation and methods based on algorithm adaptation. The former approach transforms the multi-label classification problem into single-label classification problems so that existing single-label classification algorithms can be applied. During the last decade, a number of problem transformation techniques have been proposed in the literature, such as Binary Relevance (BR) (Boutell et al., 2004), Calibrated Label Ranking (Fürnkranz, Hüllermeier, Mencía, & Brinker, 2008), Classifier Chains (Read et al., 2009) and Random k-labelsets (Tsoumakas & Vlahavas, 2007). On the other hand, methods based on algorithm adaptation extend or adapt learning techniques to deal with multi-label data directly. Representative algorithms include AdaBoost.MH and AdaBoost.MR (Schapire & Singer, 2000), two simple extensions of AdaBoost; ML-DT (Clare & King, 2001), which adapts decision tree techniques; and lazy learning techniques such as ML-kNN (Zhang & Zhou, 2007) and BR-kNN (Spyromitros, Tsoumakas, & Vlahavas, 2008), to name a few.
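To make the problem-transformation idea concrete, Binary Relevance reduces multi-label classification to one independent binary problem per label. The sketch below is our illustration rather than anything specified in the paper, with scikit-learn logistic regression as an arbitrary base learner:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def binary_relevance_fit(X, Y):
    """Train one binary classifier per label column of Y (labels in {-1, +1})."""
    return [LogisticRegression(max_iter=1000).fit(X, Y[:, j])
            for j in range(Y.shape[1])]

def binary_relevance_predict(models, X):
    """Stack the per-label predictions back into a label matrix."""
    return np.column_stack([m.predict(X) for m in models])

# Illustrative usage on random data.
rng = np.random.default_rng(0)
X = rng.standard_normal((50, 10))
Y = np.where(rng.random((50, 4)) > 0.5, 1, -1)
Y_hat = binary_relevance_predict(binary_relevance_fit(X, Y), X)
```

Note that Binary Relevance ignores inter-label correlations entirely, which is precisely the limitation the correlation-modeling methods discussed next address.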

To cope with the challenge of an exponential-sized output space, modeling inter-label correlations has been the major thrust of research in multi-label classification in recent years (Bi & Kwok, 2014; Huang, Zhou, & Zhou, 2012; Li et al., 2016), and for this, the use of parametrization and embedding has been the prime focus (Cabral, Torre, Costeira, & Bernardino, 2011; Huang, Li, Huang, & Wu, 2015; Huang et al., 2012; Li et al., 2016; Yu, Jain, Kar, & Dhillon, 2014). There are two embedding strategies for exploiting inter-label correlation. The first is to learn label-specific features for each class label. In Huang et al. (2015, 2016), a parametrized approach is suggested to transform data from the original feature space to a label-specific feature space, under the assumption that each class label is associated with a sparse set of label-specific features. The second approach models inter-label correlation implicitly using a low-rank parametrization (Cabral et al., 2011; Yu et al., 2014). There is an ongoing debate as to whether it is the low-rank embedding or the label-specific sparse transformation that models label correlation accurately when the label set is sufficiently large. Both approaches are essentially parametrizations designed to overcome the complexity of multi-label classification, and most often a linear parametrization is adopted. Some researchers (Kimura, Kudo, & Sun, 2016; Li & Guo, 2015) suggest a natural extension of their linear parametrization to nonlinear cases, but no detailed study has been undertaken in this direction. Moreover, none of these approaches yields results beyond a particular level of accuracy on problems with large datasets and large numbers of labels.

Our experimental and theoretical study of recent approaches to multi-label classification reveals many important aspects of the problem. It is clear that a single linear embedding h may not take us very far toward accurate multi-label classification. There are several reasons for this: the diversity of the training set, the correlation among labels, the feature-label relationship, and, most importantly, the learning algorithm used to determine the mapping h. Normally, h is determined by a process of nonlinear optimization. We conclude, from our experience with all the major algorithms proposed so far, that training a single h on the entire training set is not appropriate, and that a single embedding h for all instances is not suitable when inter-label correlations exist. Thus, a research question that naturally arises is whether there can be a parametrization which is piecewise-linear. In this paper, we investigate this question and propose a novel method that generates optimal embeddings for subsets of training examples. Our method is novel in the sense that it judiciously selects a subset of training examples for training and assigns a suitable subset of the training set to each embedding. Using multiple embeddings and their assigned training sets, a new instance is classified, and we show that the proposed method outperforms all major algorithms on all major benchmark datasets.

The rest of the paper is organized as follows. Section 2 briefly reviews earlier research on multi-label learning. The outline of the proposed method is described in Section 3. We introduce our proposed method, termed MLC-HMF, in Section 4. An experimental analysis of the proposed method is reported in Section 5. Finally, Section 6 concludes and indicates several issues for future work.

Section snippets

Related work

Given a feature matrix X and a label matrix Y, the goal of linear parametrization is to learn the parameter W, and a common formulation is the following optimization problem with a regularized loss:

$$\min_{W} \; \ell(Y, XW) + \lambda R(W)$$

where $W \in \mathbb{R}^{d \times L}$, $\ell(\cdot)$ is a loss function that measures how well $XW$ approximates $Y$, $R(\cdot)$ is a regularization function that promotes various desired properties in $W$ (low-rank, sparsity, group-sparsity, etc.), and the constant $\lambda \geq 0$ is the regularization parameter which controls the
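As one concrete instance of this formulation, choosing a squared loss for $\ell$ and a Frobenius-norm regularizer for $R$ (our choice for illustration; the snippet above fixes neither) gives a ridge-style problem with a closed-form solution:

```python
import numpy as np

def linear_parametrization(X, Y, lam=0.1):
    """Solve min_W ||Y - XW||_F^2 + lam * ||W||_F^2 in closed form.

    X: (n, d) feature matrix; Y: (n, L) label matrix; returns W: (d, L).
    """
    d = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(d), X.T @ Y)

# Illustrative usage: predictions are the signs of the linear scores.
rng = np.random.default_rng(1)
X = rng.standard_normal((100, 20))
Y = np.where(rng.random((100, 6)) > 0.5, 1.0, -1.0)
W = linear_parametrization(X, Y)
Y_pred = np.sign(X @ W)
```

Other choices of $R$, such as the nuclear norm for low rank or the $\ell_1$ norm for sparsity, lead to the low-rank and label-specific-feature methods surveyed above.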

Outline of the proposed approach

In this section we introduce the underlying principle of the proposed method. We start with the formulation given in Eq. (2). One way to exploit correlations among the labels is to factor the matrix $W = UV$, where $U \in \mathbb{R}^{d \times k}$ can be interpreted as an embedding of the features $X$ into a $k$-dimensional latent space and $V \in \mathbb{R}^{k \times L}$ is a linear classifier on this space. Regularization is provided by constraining the dimensionality $k$ of the latent space. The minimization in $U$ and $V$ is unfortunately non-convex,
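To illustrate the structure of this factored objective, the sketch below minimizes a squared-loss version of $\min_{U,V} \|Y - XUV\|_F^2$ by plain gradient descent. The paper itself employs maximum margin matrix factorization, so this shows only the shape of the computation, not the authors' algorithm, and the step size and iteration count are arbitrary:

```python
import numpy as np

def factored_embedding(X, Y, k=5, lam=0.1, lr=1e-3, iters=500, seed=0):
    """Gradient descent on 0.5*||X U V - Y||_F^2 + 0.5*lam*(||U||^2 + ||V||^2).

    U: (d, k) embeds the features into a k-dimensional latent space;
    V: (k, L) is a linear classifier on that space.
    """
    rng = np.random.default_rng(seed)
    d, L = X.shape[1], Y.shape[1]
    U = 0.01 * rng.standard_normal((d, k))
    V = 0.01 * rng.standard_normal((k, L))
    for _ in range(iters):
        R = X @ U @ V - Y                 # residual (n, L)
        gU = X.T @ (R @ V.T) + lam * U    # gradient w.r.t. U
        gV = (X @ U).T @ R + lam * V      # gradient w.r.t. V
        U -= lr * gU
        V -= lr * gV
    return U, V
```

Constraining $k < \min(d, L)$ enforces the low-rank structure directly, at the cost of the non-convexity noted above.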

MLC-HMF: the proposed method

In this section, a novel method for multi-label classification is proposed, based on a tree structure constructed using k-means clustering. Algorithm 1 outlines the main flow of the proposed method. At each node in the tree, the joint learning framework given in Eq. (4) is applied, performing low-rank-constrained parametrization (embedding) and multi-label classification simultaneously. At every node, we maintain the mapping U and label feature matrix V along with the training examples whose
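Since Algorithm 1 and Eq. (4) are only referenced in this snippet, the following is a structural illustration of the hierarchy rather than the authors' exact procedure: k-means recursively partitions the training examples, and each node fits a local low-rank embedding on the examples routed to it. A hypothetical SVD-truncated ridge solution stands in for the maximum margin matrix factorization used in the paper:

```python
import numpy as np
from sklearn.cluster import KMeans

def fit_node_embedding(X, Y, k=5, lam=0.1):
    """Stand-in local embedding: ridge solution truncated to rank k via SVD,
    giving W ~ U V with U: (d, k) and V: (k, L)."""
    d = X.shape[1]
    W = np.linalg.solve(X.T @ X + lam * np.eye(d), X.T @ Y)
    P, s, Qt = np.linalg.svd(W, full_matrices=False)
    r = min(k, len(s))
    return P[:, :r] * s[:r], Qt[:r]

def build_tree(X, Y, branches=2, min_size=20, depth=0, max_depth=3):
    """Recursively cluster the examples with k-means; fit a local (U, V) per node."""
    node = {"UV": fit_node_embedding(X, Y)}
    if depth < max_depth and X.shape[0] >= branches * min_size:
        km = KMeans(n_clusters=branches, n_init=10).fit(X)
        node["kmeans"], node["children"] = km, [
            build_tree(X[km.labels_ == c], Y[km.labels_ == c],
                       branches, min_size, depth + 1, max_depth)
            for c in range(branches)
        ]
    return node

def classify(node, x):
    """Route x down the tree, then classify with the leaf's local embedding."""
    while "children" in node:
        node = node["children"][node["kmeans"].predict(x.reshape(1, -1))[0]]
    U, V = node["UV"]
    return np.sign(x @ U @ V)
```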

Experimental analysis

This section discusses the experimental setup. We use twelve multi-label benchmark datasets for the experiments, and the detailed characteristics of these datasets are summarized in Table 1. All of these datasets can be downloaded from the LABIC, MEKA and MULAN repositories.

To measure the performance of the different algorithms, we employed six evaluation metrics popularly used in
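The six metrics themselves fall outside this snippet, so as a hedged illustration, two metrics commonly reported in the multi-label literature are sketched below (we do not claim these are the exact six used in the paper):

```python
import numpy as np

def hamming_loss(Y_true, Y_pred):
    """Fraction of instance-label pairs predicted incorrectly (labels in {-1, +1})."""
    return float(np.mean(Y_true != Y_pred))

def subset_accuracy(Y_true, Y_pred):
    """Fraction of instances whose entire label vector is predicted exactly."""
    return float(np.mean(np.all(Y_true == Y_pred, axis=1)))
```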

Conclusions and discussion

This paper presented a new multi-label classification method, called MLC-HMF, which learns piecewise-linear embedding with a low-rank constraint on parametrization to capture nonlinear intrinsic relationships that exist in the original feature and label space. Extensive comparative studies validate the effectiveness of MLC-HMF against the state-of-the-art multi-label learning approaches.

In multi-label classification problem, infrequently occurring (tail) labels are associated with few training

References (29)

  • M.R. Boutell et al.

    Learning multi-label scene classification

    Pattern Recognition

    (2004)
  • V. Kumar et al.

    Collaborative filtering using multiple binary maximum margin matrix factorizations

    Information Sciences

    (2017)
  • M.-L. Zhang et al.

ML-kNN: A lazy learning approach to multi-label learning

    Pattern Recognition

    (2007)
  • W. Bi et al.

Multilabel classification with label correlations and missing labels

    AAAI

    (2014)
  • R.S. Cabral et al.

    Matrix completion for multi-label image classification

    Advances in neural information processing systems

    (2011)
  • C.-C. Chang et al.

    LIBSVM: A library for support vector machines

    ACM Transactions on Intelligent Systems and Technology

    (2011)
  • A. Clare et al.

    Knowledge discovery in multi-label phenotype data

    European conference on principles of data mining and knowledge discovery

    (2001)
  • J. Demšar

    Statistical comparisons of classifiers over multiple data sets

    Journal of Machine Learning Research

    (2006)
  • M. Fazel et al.

    A rank minimization heuristic with application to minimum order system approximation

Proceedings of the 2001 American Control Conference

    (2001)
  • J. Fürnkranz et al.

    Multilabel classification via calibrated label ranking

    Machine Learning

    (2008)
  • J. Huang et al.

    Learning label specific features for multi-label classification

2015 IEEE International Conference on Data Mining (ICDM)

    (2015)
  • J. Huang et al.

    Learning label-specific features and class-dependent labels for multi-label classification

    IEEE Transactions on Knowledge and Data Engineering

    (2016)
  • S.-J. Huang et al.

    Multi-label learning by exploiting label correlations locally

    Twenty-sixth AAAI conference on artificial intelligence

    (2012)
  • A.K. Jain et al.

    Data clustering: A review

    ACM Computing Surveys (CSUR)

    (1999)