Neurocomputing

Volume 460, 14 October 2021, Pages 385-398

Multi-label text classification via joint learning from label embedding and label correlation

https://doi.org/10.1016/j.neucom.2021.07.031

Abstract

For multi-label text classification problems with many classes, many existing multi-label classification algorithms become infeasible or suffer an unaffordable cost. Some studies therefore perform Label Space Dimension Reduction (LSDR) to address this problem, but a number of these methods ignore the sequence information of texts and the label correlation in the original label space, and treat each label as a meaningless multi-hot vector. In this paper, we put forward a multi-label text classification algorithm, LELC (joint learning from Label Embedding and Label Correlation), based on multi-layer attention and label correlation, to solve the issue of multi-label text classification with a large number of class labels. Specifically, we first extract features through a Bidirectional Gated Recurrent Unit network (Bi-GRU), multi-layer attention and linear layers. The Bi-GRU captures the content information and the sequence information of the text at the same time, and the attention mechanism helps select the valid features related to the labels. Then, we use matrix factorization to perform LSDR and consider the label correlation of the original label space in this process, which allows us to implicitly encode the latent space and simplify model learning. Finally, deep Canonical Correlation Analysis (DeepCCA) is exploited to couple the features and the latent space in an end-to-end pattern, so that the two can influence each other while learning the mapping from the feature space to the latent space. Experiments on 11 real-world datasets show that our proposed model is competitive with state-of-the-art methods.

Introduction

Multi-label classification is an extension of traditional multi-class classification. Different from multi-class classification, where only one label can be allocated to an instance, multi-label classification uses multiple labels to describe an instance in more detail. Multi-label classification has extremely wide applications in the real world, such as text classification [1] and information retrieval [2] in natural language processing, and image classification [3] and semantic image annotation [4] in image processing. In this paper, we mainly aim at applications related to text classification in natural language processing.

We can decompose multi-label text classification into two phases: first, learn a feature representation of the text documents, and then run multi-label classification algorithms on the extracted feature representation. The representation of text features includes the representation of words and that of documents. For word representation, we can use a one-hot vector or a word vector trained by an off-the-shelf network, such as neural network language models (NNLM) [5], word2vec [6], GloVe [7] and others. For the representation of documents, we can use bag-of-words (BoW) or a neural network representation, such as convolutional neural networks, recurrent neural networks, etc. Multi-label classification algorithms can be roughly divided into two groups. One is the problem transformation method, which converts multi-label problems into other off-the-shelf learning scenarios, such as multi-class problems, ranking problems and so on. The core idea is to fit the data to the algorithm; Binary Relevance (BR) is the representative algorithm [3]. The other is the algorithm adaptation method, which adapts a popular algorithm to the multi-label setting. The core idea is to fit the algorithm to the data; representative algorithms include Rank-SVM, which adapts the SVM algorithm [8], and ML-kNN, which adapts the k-nearest neighbor algorithm [9].
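As an illustration of the problem transformation idea, the following is a minimal sketch of BR using scikit-learn. The names X_train, Y_train and X_test are hypothetical placeholders, and the sketch assumes every label column contains both positive and negative examples.

# Minimal Binary Relevance (BR) sketch: one independent binary classifier
# per label. X_train is an (N, d_x) feature matrix and Y_train an (N, d_y)
# binary label matrix; both names are illustrative placeholders.
import numpy as np
from sklearn.linear_model import LogisticRegression

def binary_relevance_fit(X_train, Y_train):
    # Train one binary classifier for each of the d_y labels; assumes each
    # label occurs at least once (and not everywhere) in the training set.
    classifiers = []
    for j in range(Y_train.shape[1]):
        clf = LogisticRegression(max_iter=1000)
        clf.fit(X_train, Y_train[:, j])
        classifiers.append(clf)
    return classifiers

def binary_relevance_predict(classifiers, X_test):
    # Stack the per-label 0/1 predictions into an (N, d_y) matrix.
    preds = [clf.predict(X_test) for clf in classifiers]
    return np.stack(preds, axis=1)

The one-classifier-per-label design is exactly what makes BR scale poorly when the number of class labels is large, which motivates the dimension reduction techniques discussed below.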

However, multi-label text classification suffers from some challenges. First, the text feature dimensionality is very high, which makes text feature extraction much more difficult [10]. Moreover, multi-label classification is far harder than single-label classification, due to the overwhelming size of the output space. That is to say, the number of possible label sets grows exponentially with the number of class labels, especially for multi-label classification problems with a large number of class labels [11]. Traditional multi-label algorithms like BR [3] and Classifier Chains (CC) [12] will therefore suffer extremely high or unaffordable computational costs.

To tackle multi-label text classification problems with high feature dimensions or with a large number of class labels, dimension reduction techniques are widely used, such as PCA, LDA, CCA, NMF [13], [14], [15], [16] and so on. Multi-label linear discriminant analysis (MLDA) [17] is an extension of LDA for multi-label classification. Principal label space transformation [18] applies CCA technology to multi-label classification. Generally speaking, dimension reduction methods can be divided into two categories: selection-based and extraction-based approaches. Selection-based approaches remove irrelevant or redundant dimensions, while extraction-based methods transform the original data into a smaller space [19]. In short, dimension reduction is applied in multi-label classification mainly to reduce the dimension of the feature space or the label space, so as to reduce the computational cost while maintaining an acceptable classification performance. In detail, as shown in Fig. 1, for LSDR, the high-dimensional label vector of each training instance is transformed into a low-dimensional latent space. Then a prediction model is trained to map the feature space to the low-dimensional latent space. Since the dimension of the latent space is much smaller than that of the original label space, the training cost can be significantly reduced. For the prediction of an unseen instance, the learned prediction model is first used to obtain a low-dimensional vector from its features, and then the vector is decoded to recover its label vector [20].
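To make the LSDR pipeline of Fig. 1 concrete, the following is a minimal sketch in the spirit of PLST [18], assuming a truncated-SVD encoder and a ridge regressor; it is an illustrative baseline, not the LELC method proposed in this paper.

# Generic LSDR sketch in the spirit of PLST [18]: encode the label matrix
# into a k-dimensional latent space via truncated SVD, learn a regressor
# from features to the latent space, then decode and round.
import numpy as np
from sklearn.linear_model import Ridge

def lsdr_train(X, Y, k):
    # SVD of the label matrix: Y ~= U S Vt; keep the top-k right singular
    # vectors as the linear encoding/decoding matrix, with k << d_y.
    _, _, Vt = np.linalg.svd(Y.astype(float), full_matrices=False)
    V_k = Vt[:k].T                    # (d_y, k)
    Z = Y @ V_k                       # latent codes of the training labels
    reg = Ridge(alpha=1.0).fit(X, Z)  # map feature space -> latent space
    return reg, V_k

def lsdr_predict(reg, V_k, X_new, threshold=0.5):
    Z_hat = reg.predict(X_new)        # latent prediction for unseen instances
    Y_hat = Z_hat @ V_k.T             # decode back to the original label space
    return (Y_hat >= threshold).astype(int)

Training now costs one k-dimensional regression instead of d_y independent classifiers, which is where the cost reduction described above comes from.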

However, most LSDR-based methods have certain flaws. For instance, principal label space transformation [18] is a feature-unaware algorithm that only considers dimension reduction in the label space. As a feature-aware LSDR algorithm, conditional principal label space transformation (CPLST) [21] builds on principal label space transformation (PLST) and further considers the predictability of the latent space, but it requires a linear encoding function, which has many shortcomings as described in reference [20]. In general, these LSDR-based methods either are feature-unaware or consider only a simple relationship between the feature space and the latent space, so they cannot capture the context relations in the texts. In addition, they assume that the latent space dimensions are uncorrelated with one another, and they do not consider, or only indirectly consider, label correlation in the original label space. However, effective exploitation of label correlation can better determine the labels of an instance, and it is critical for improving the performance of multi-label classification. Based on the above viewpoints, we propose a novel model, LELC (Joint Learning from Label Embedding and Label Correlation), based on multi-layer attention and label correlation for multi-label text classification, which performs feature extraction and label space dimension reduction simultaneously in an end-to-end manner. Specifically, we use Bi-GRU [22] for feature extraction, so we can capture the context information in the texts. Inspired by the user attention in reference [23], we add an attention-based label embedding layer to make better use of potential label information. Then we feed the label embedding generated by the label embedding layer and the features extracted by Bi-GRU into a soft attention module (a code sketch of this encoder design follows the contribution list below). For LSDR, we use matrix factorization to implicitly encode the latent space, and we preserve the label correlation of the original labels through the decoding matrix. The main contributions of this paper are highlighted as follows:

  • (1) We use a multi-layer attention framework during feature extraction, so as to capture the context relations and content information in the texts and to select valid features related to the labels.

  • (2) The label correlation matrix is taken into account when performing LSDR, which preserves the label correlation of the original labels in the latent space and simplifies the process of model learning.

  • (3) We map the extracted features and the latent space to the same dimension in an end-to-end manner, and use DeepCCA technology for coupling, which effectively reduces the cost of model learning. Experiments on 11 real-world datasets show that our proposed model is competitive with state-of-the-art multi-label classification algorithms.
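To make the encoder side of this design concrete, below is an illustrative PyTorch sketch of a Bi-GRU followed by label-embedding attention. All layer sizes, names and choices here are assumptions for illustration and do not reproduce the authors' exact LELC architecture, which is detailed in Section 4.

# Illustrative sketch: Bi-GRU encoder plus label-embedding attention.
# Every label owns a trainable embedding that attends over the Bi-GRU
# states, yielding one label-specific document representation per label.
import torch
import torch.nn as nn
import torch.nn.functional as F

class BiGRULabelAttention(nn.Module):
    def __init__(self, vocab_size, emb_dim, hidden_dim, num_labels):
        super().__init__()
        self.word_emb = nn.Embedding(vocab_size, emb_dim)
        self.bigru = nn.GRU(emb_dim, hidden_dim, batch_first=True,
                            bidirectional=True)
        # One trainable query vector per label (matches the 2h state size).
        self.label_emb = nn.Parameter(torch.randn(num_labels, 2 * hidden_dim))

    def forward(self, token_ids):
        # token_ids: (batch, seq) -> contextual states: (batch, seq, 2h),
        # carrying both content and sequence information.
        h, _ = self.bigru(self.word_emb(token_ids))
        # Attention scores of every label over every token position.
        scores = torch.einsum('bsh,lh->bls', h, self.label_emb)
        attn = F.softmax(scores, dim=-1)            # (batch, labels, seq)
        # Label-specific document vectors: (batch, labels, 2h).
        return torch.einsum('bls,bsh->blh', attn, h)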

The rest of this paper is organized as follows. Section 2 introduces the related work. Section 3 describes notations and basic concepts used in this paper. Then we describe our model in detail in Section 4. Section 5 reports experimental results of our model and the comparing methods on real-world multi-label text datasets. Finally, we conclude the paper in Section 6.

Section snippets

Related work

With the popularity of Internet technology, the amount of information has grown explosively, making data high-dimensional and redundant. For multi-label learning, high-dimensional means that both the number of labels and the feature dimensions are extremely large. For example, in applications such as Twitter, a tweet may need to be labeled with multiple labels drawn from millions of candidate labels. This makes traditional multi-label classification algorithms [3], [8], [9], [12], [24] infeasible, because

Preliminaries

Let $D_{tr}=\{(x_i,y_i)\}_{i=1}^{N} \subset \{X,Y\}$ denote a set of $N$ training instances and their corresponding labels, where $x_i$ represents the $d_x$-dimensional feature vector in the feature space $X$, i.e., $x_i \in X \subseteq \mathbb{R}^{d_x}$, and $y_i$ is the $d_y$-dimensional label vector in the label space $Y$, i.e., $y_i \in Y \subseteq \{0,1\}^{d_y}$. Further, $X=[x_1,x_2,\ldots,x_N]^{T} \in \mathbb{R}^{N \times d_x}$ and $Y \in \{0,1\}^{N \times d_y}$, where $N$, $d_x$ and $d_y$ are the number of training instances, the dimension of the feature vectors and the dimension of the label vectors, respectively. Multi-label classification algorithms will utilize training
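Using this notation, the generic LSDR setup described in the introduction can be written compactly as follows (a standard formulation consistent with the definitions above; the latent dimension $k$ is introduced here for illustration):

\[ Y \approx Z D, \qquad Z \in \mathbb{R}^{N \times k}, \quad D \in \mathbb{R}^{k \times d_y}, \quad k \ll d_y, \]

where $Z$ holds the latent codes and $D$ is the decoding matrix. A prediction model $h: \mathbb{R}^{d_x} \rightarrow \mathbb{R}^{k}$ is trained so that $h(x_i) \approx z_i$, and the label vector of an unseen instance $x$ is recovered as $\hat{y} = \operatorname{round}(h(x) D)$.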

The LELC approach

In this section, we will describe our model in detail. Most existing LSDR-based multi-label text classification methods extract text features considering only the content information, not the sequence information, of the text. Moreover, the label is regarded as a meaningless multi-hot vector, which causes the loss of the potential label information needed for selecting valid features. Additionally, they do not directly consider the label correlation in the original label space when performing LSDR,
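The snippet above refers to label correlation in the original label space. One common way to build such a correlation matrix from the training labels is the cosine similarity of label co-occurrence, sketched below; this construction is an illustrative assumption, as the paper's own definition is given later in this section.

# Illustrative label correlation matrix: cosine similarity between the
# label columns of the binary label matrix Y (not necessarily the exact
# definition used by LELC).
import numpy as np

def label_correlation(Y, eps=1e-8):
    Y = Y.astype(float)            # (N, d_y) binary label matrix
    co = Y.T @ Y                   # (d_y, d_y) label co-occurrence counts
    norms = np.sqrt(np.diag(co))   # per-label occurrence magnitudes
    C = co / (np.outer(norms, norms) + eps)
    return C                       # C[i, j] in [0, 1], C[i, i] ~= 1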

Datasets

To verify the effectiveness of our model LELC, we select 11 multi-label datasets, i.e., Enron, Medical, Ohsumed, Movielens, Slashdot, IMDB, AAPD, AAPD-3K, RCV1-V2, RCV1-V2-6K and TJ. These datasets all belong to the text domain. AAPD and RCV1-V2 are obtained from the public website provided by Yang et al. [38]. We randomly extract 3000 examples from AAPD to form the AAPD-3K dataset, and 6000 examples from RCV1-V2 to form the RCV1-V2-6K dataset. The distribution of

Conclusion

In this paper, we put forward a multi-label classification model, LELC, based on multi-layer attention and label correlation, to solve the problem of multi-label text classification with many classes. In LELC, both the label co-occurrence matrix and the label correlation matrix are considered to explore the potential label information and label correlation. Specifically, we first use Bi-GRU to extract basic features, and then apply the multi-layer attention framework to further select

CRediT authorship contribution statement

Huiting Liu: Conceptualization, Methodology, Writing - review & editing. Geng Chen: Software, Validation, Formal analysis, Data curation, Writing - original draft, Visualization. Peipei Li: Investigation. Peng Zhao: Resources. Xindong Wu: Supervision.

Declaration of competing interest

The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.

Acknowledgments

This research has been supported by the National Key Research and Development Program of China No. 2016YFB1000901, the National Natural Science Foundation of China Nos. 61202227 and 61602004, Natural Science Foundation of Anhui Province, No. 2008085MF219 and Provincial Natural Science Foundation of Anhui Higher Education Institution of China, No. KJ2018A0013.

References (45)

  • M.R. Boutell et al.

    Learning multi-label scene classification

    Pattern Recognit.

    (2004)
  • M.L. Zhang et al.

    ML-KNN: A lazy learning approach to multi-label learning

    Pattern Recognit.

    (2007)
  • S. Wold et al.

    Principal component analysis

    Chemometrics and intelligent laboratory systems

    (1987)
  • K.S. Jones

    Index term weighting

    Inf. Storage Retr.

    (1973)
  • J. Maruthupandi et al.

    Multi-label text classification using optimised feature sets

    Int. J. Data Min. Model. Manag.

    (2017)
  • K. Sarinnapakorn et al.

    Induction from multi-label examples in information retrieval systems: A case study

    Appl. Artif. Intell.

    (2008)
  • X. Cao et al.

    Saliency-aware nonparametric foreground annotation based on weakly labeled data

    IEEE Trans. Neural Networks Learn. Syst.

    (2015)
  • Y. Bengio et al.

    A neural probabilistic language model

    J. Mach. Learn. Res.

    (2003)
  • T. Mikolov et al.

    Efficient estimation of word representations in vector space

  • J. Pennington et al.

GloVe: Global vectors for word representation

  • J. Wu

    Leveraging Label Information in Representation Learning for Multi-label Text Classification

    (2019)
  • M.L. Zhang et al.

    A review on multi-label learning algorithms

    IEEE Trans. Knowl. Data Eng.

    (2014)
  • J. Read et al.

    Classifier chains for multi-label classification

    Mach. Learn.

    (2011)
  • R.A. Fisher, The use of multiple measurements in taxonomic problems, Ann. Eugen. 7 (2) (1936), pp....
  • H. Hotelling

    Relations between two sets of variates

  • S. Tsuge, M. Shishibori, S. Kuroiwa, et al., Dimensionality reduction using non-negative matrix factorization for...
  • H. Wang et al.

    Multi-label linear discriminant analysis
  • F. Tai et al.

    Multilabel classification with principal label space transformation

    Neural Comput.

    (2012)
  • J. Mańdziuk, A. Żychowski, Dimensionality Reduction in Multilabel Classification with Neural Networks, in: Proceedings...
  • Z. Lin et al.

    End-to-end feature-aware label space encoding for multilabel classification with many classes

    IEEE Trans. Neural Networks Learn. Syst.

    (2018)
  • Y.N. Chen, H.T. Lin, Feature-aware label space dimension reduction for multi-label classification, in: Advances in...

    Dr. Huiting Liu is an associate professor in School of Computer Science and Technology, Anhui University, Hefei, P. R. China. She received her Ph.D. degree from Hefei University of Technology, Hefei, P. R. China in 2008. Her current research focuses on personalized recommendation, text information processing and data mining.

    Geng Chen is currently a M.S. candidate of Anhui University, Hefei, P.R. China. His main research interests include multi-label classification and text processing.

    Peipei Li is currently an associate professor at Hefei University of Technology, China. She received her B.S., M.S. and Ph.D. degrees from Hefei University of Technology in 2005, 2008, 2013, respectively. She was a research fellow at Singapore Management University from 2008 to 2009. She was a student intern at Microsoft Research Asia between Aug. 2011 and Dec. 2012. Her research interests are in data mining and knowledge engineering.

    Dr. Peng Zhao is an associate professor in the Department of Software Engineering, School of Computer Science and Technology, Anhui University, Hefei, P.R. China. She received her Ph.D. degree in Department of Computer Science and Technology, University of Science and Technology of China (USTC), Hefei, P.R. China in 2006. Her current research focuses on intelligent information processing, image annotation, and machine learning.

    Xindong Wu received the Ph.D. degree in artificial intelligence from The University of Edinburgh, Edinburgh, U.K. His current research interests include data mining, knowledge-based systems, and Web information exploration. He is the Steering Committee chair of IEEE International Conference on Data Mining(ICDM). He is the editor-in-chief of Knowledge and Information Systems (KAIS) and ACM Transactions on Knowledge Discovery from Data (TKDD). He is a fellow of IEEE and the AAAS.
