Deep neural network for hierarchical extreme multi-label text classification☆
Introduction
The classification of natural language texts is a key aspect of many tasks in different domains. This kind of classification problem consists in applying one or more labels to each document of a text collection. In the literature, this task has been approached with several different techniques, ranging from ontology-based methods to Machine Learning (ML) systems, or through hybrid approaches integrating ontological knowledge and ML [1], [2]. The increase in computational power and the availability of huge amounts of data, along with active research and development in the field of Deep Neural Networks (DNN), have recently led to the definition of DNN models able to outperform the previous state-of-the-art systems.
In accordance with the literature, it is possible to identify a taxonomy of Natural Language text classification problems, composed of the following four classes:
- Binary Classification, where the labels belong to a binary set (Positive and Negative, True and False, etc.);
- Multi-Class Classification, where the single classification label belongs to a set with more than two elements;
- Multi-Label Classification, where the labels belong to a multi-class domain but, unlike the previous case, each document can be tagged with a variable number of labels, ranging from one to the total number of classes; and
- Extreme Multi-Label Text Classification (XMTC) [3], which refers to the automatic assignment of the most relevant subset of labels to a text document. Unlike the classic multi-label problem, where the label set size is usually in the order of ten, here the labels belong to an extremely large set, in the order of thousands or tens of thousands of elements. If the label set is hierarchically organized, a hierarchical XMTC problem is defined.
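To make the difference between these four problem classes concrete, the following minimal sketch shows how the target of a single document could be encoded in each case. The label names and the `multi_hot` helper are hypothetical and chosen only for illustration; in the extreme case the same multi-hot vector would have thousands of positions, almost all of them zero.

```python
from typing import Iterable, List, Sequence


def multi_hot(labels: Iterable[str], label_space: Sequence[str]) -> List[int]:
    """Encode a set of labels as a 0/1 vector over a fixed label space."""
    index = {name: i for i, name in enumerate(label_space)}
    vector = [0] * len(label_space)
    for name in labels:
        vector[index[name]] = 1
    return vector


# Hypothetical toy label space; a real XMTC space has thousands of entries.
LABEL_SPACE = ["label_a", "label_b", "label_c", "label_d"]

# Binary: the target is one of two values.
binary_target = 1

# Multi-class: the target is exactly one index in the label space.
multi_class_target = LABEL_SPACE.index("label_b")

# Multi-label (and XMTC): the target is a variable-size subset of the labels.
multi_label_target = multi_hot({"label_b", "label_c"}, LABEL_SPACE)
```

The multi-hot form is what makes data sparsity visible: with a very large label space, each document activates only a handful of positions.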
The huge XMTC label space raises many research challenges, such as data sparsity and scalability. The availability of Big Data and the application of XMTC to real-world problems have attracted growing attention from researchers in the ML and Deep Learning (DL) fields. Significant advances in multi-label classification methodologies have been made in recent years, thanks to the development of specific ML methods, although DL methods have not yet been widely explored for this particular problem.
In this paper we analyze a DL approach based on a Convolutional Neural Network (CNN), devoted to the hierarchical XMTC problem. We define a methodology that expands the label set of each document by integrating all the missing labels along the label hierarchy. This operation is necessary because human experts who manually label the documents usually consider only the leaves of the tree and a few labels along the hierarchy for indexing purposes. The lack of all the classes along the hierarchy can lead to incorrect training of the DNN, due to label inconsistencies.
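The idea of expanding a label set along the hierarchy can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: it assumes the hierarchy is available as a child-to-parents mapping (the hypothetical `PARENTS` dictionary, with made-up label names) and adds every ancestor of each assigned label.

```python
from typing import Dict, Iterable, List, Set


def expand_label_set(labels: Iterable[str],
                     parents: Dict[str, List[str]]) -> Set[str]:
    """Return the labels together with all of their ancestors."""
    expanded: Set[str] = set()
    stack = list(labels)
    while stack:
        label = stack.pop()
        if label in expanded:
            continue  # already visited; also guards against repeated paths
        expanded.add(label)
        # Labels without an entry are treated as roots of the hierarchy.
        stack.extend(parents.get(label, []))
    return expanded


# Hypothetical toy hierarchy: each label maps to its direct parent(s).
PARENTS = {
    "A.1.a": ["A.1"],
    "A.1": ["A"],
    "B.2": ["B"],
}

# An annotator tagged only a leaf and one intermediate label.
expanded = expand_label_set({"A.1.a", "B.2"}, PARENTS)
```

Here `expanded` contains the two original labels plus `A.1`, `A` and `B`, so every label on the path to the root is present in the training target.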
We also analyze the impact of the use and combination of different types of embeddings for the representation of the input training text. In more detail, we evaluate the impact of semi-supervised embedding models, which explicitly incorporate grammatical and syntactic information in the resulting word vectors and can provide a performance boost in other tasks, such as Word Analogy/Similarity Querying, Named Entity Recognition (NER), Relation Extraction and Sentence Classification [4], [5], [6], [7].
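This passage does not state how different embeddings are combined. One common option, shown here purely as an assumption, is to concatenate the vectors that each model assigns to the same token. Both lookup tables below are tiny hypothetical stand-ins for trained embedding models, and `combine_embeddings` is an illustrative helper rather than part of the described system.

```python
from typing import Dict, List, Sequence

Vector = List[float]


def combine_embeddings(tokens: Sequence[str],
                       tables: Sequence[Dict[str, Vector]],
                       dims: Sequence[int]) -> List[Vector]:
    """Concatenate, token by token, the vectors from several embedding tables.

    Tokens missing from a table are mapped to a zero vector of that
    table's dimension (an assumption made for this sketch).
    """
    combined: List[Vector] = []
    for token in tokens:
        vector: Vector = []
        for table, dim in zip(tables, dims):
            vector.extend(table.get(token, [0.0] * dim))
        combined.append(vector)
    return combined


# Hypothetical 2-dimensional and 1-dimensional embedding tables.
WORD_TABLE = {"lung": [0.1, 0.2], "cancer": [0.3, 0.4]}
SYNTAX_TABLE = {"lung": [0.5], "cancer": [0.6]}

encoded = combine_embeddings(["lung", "cancer", "unseen"],
                             [WORD_TABLE, SYNTAX_TABLE],
                             [2, 1])
```

Each token ends up with a vector whose length is the sum of the individual embedding dimensions, so the downstream network sees both representations side by side.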
All the results have been evaluated using the PubMed scientific articles collection as a test case. PubMed is a search engine maintained by the US National Library of Medicine (NLM), specifically devoted to medical and biological scientific papers. We have considered only the text of the title and the abstract of each paper, along with the corresponding labels, due to their free availability in PubMed. Each paper has been manually tagged by domain experts with a variable number of classes from the Medical Subject Headings (MeSH) set, a hierarchical label set with a total of 27,775 labels. For these reasons, the automatic classification of PubMed papers with MeSH belongs to the hierarchical XMTC case.
The automatic classification of PubMed papers is also a task required by the NLM in order to help the domain experts in their tedious and time-consuming work. To achieve this objective, the NLM supports BioASQ, a distributed challenge for the research community; one of the aims of BioASQ is to advance the state-of-the-art systems devoted to the automatic application of MeSH to PubMed indexed articles [8].
The XMTC problem arises in many real-world applications, such as the one described above, which confirms the utility of efforts focused on searching for new solutions. The results of our experiments show the usefulness of the proposed Hierarchical Label Set Expansion (HLSE) method and provide many interesting findings resulting from the analysis of the different performances of the neural network in relation to the embedding models used. This analysis could also be considered as a starting point for an emerging problem, which may be addressed by what is called explainable-AI [9], [10], namely that of correlating the input data representation and the label structure with their impact on the DL model performance.
The paper is organized as follows: in the next section an overview of the current state of the art is presented; then, all the details of the DNN used and the proposed methodologies are explained, followed by the experimental results, where the dataset details, a description of the evaluation measures and the instruments used to implement the whole architecture are also included; and finally, the obtained results are discussed and analyzed, in comparison with the state of the art.
Section snippets
Related works
The text classification problem has been addressed in the literature with many different approaches [1]. Some of them are based on ML methods with manual feature engineering, such as Latent Dirichlet Allocation (LDA) or K-Nearest Neighbors (K-NN) [11], [12]. More recently, various DNN approaches have been proposed, obtaining very promising results. In [13] a simple Neural Network (NN) approach for large-scale multi-label text classification has been presented, evidencing the usefulness of
Methodology
In this Section we first provide a brief overview of the DNN architecture that, for the sake of clarity, we have divided into three main modules: an Embeddings module for text encoding, a Feature Extraction module implemented through CNNs and a Classification module composed of fully connected neural networks. We also highlight the details of the loss function and the hyper-parameters used to train the network. Next, we describe the details of the proposed Hierarchical Label Set Expansion
Experimental results
In this Section we first describe all the characteristics of the datasets used for the experimental assessment. Next, we report the details of the systems used to implement the proposed architecture, listing and explaining all the corresponding parameter settings. We also describe the methods used to obtain the unsupervised and semi-supervised embedding models introduced in Section 3. Next, we provide a complete overview of the evaluation metrics for the
Conclusions
In this paper we have presented an analysis of a Deep Learning architecture devoted to text classification, considering the extreme multi-class and multi-label text classification problem, when a hierarchical label set is defined. We have described a methodology named Hierarchical Label Set Expansion (HLSE) used to regularize the data labels and have reported an analysis of the impact of the use and combination of different Word Embedding (WE) models that explicitly incorporate grammatical and
References (67)
- et al. A recent overview of the state-of-the-art elements of text classification. Expert Syst. Appl. (2018)
- et al. Biomedical event trigger detection by dependency-based word embedding. BMC Med. Genomics (2016)
- et al. Text classification method based on self-training and LDA topic models. Expert Syst. Appl. (2017)
- et al. PubMed related articles: a probabilistic topic-based model for content similarity. BMC Bioinformatics (2007)
- Some methods of speeding up the convergence of iteration methods. USSR Comput. Math. Math. Phys. (1964)
- et al. A clustering based methodology to support the translation of medical specifications to software models. Appl. Soft Comput. (2018)
- et al. A systematic analysis of performance measures for classification tasks. Inf. Process. Manage. (2009)
- et al. Multiple hierarchical classification of free-text clinical guidelines. Artif. Intell. Med. (2006)
- et al. A distributed architecture to integrate ontological knowledge into information extraction. Int. J. Grid Utility Comput. (2016)
- et al. Deep learning for extreme multi-label text classification
- Dependency based embeddings for sentence classification tasks
- Dependency-based word embeddings
- Results of the fifth edition of the BioASQ Challenge
- Current advances, trends and challenges of machine learning and knowledge extraction: from machine learning to explainable AI
- Explainable AI: the new 42?
- LF-LDA: a topic model for multi-label classification
- Large-scale multi-label text classification - revisiting neural networks
- Medical text classification using convolutional neural networks. CoRR
- Very deep convolutional networks for text classification
- LSTM²: multi-label ranking for document classification. Neural Process. Lett.
- Recurrent residual learning for sequence classification
- Ensemble application of convolutional and recurrent neural networks for multi-label text categorization
- Multi-label classification of patient notes: case study on ICD code assignment
- Applying deep learning to ICD-9 multi-label classification from medical records
- Large-scale hierarchical text classification with recursively regularized deep graph-CNN
- Recent enhancements to the NLM medical text indexer
- Using learning-to-rank to enhance NLM medical text indexer results
- Effective mapping of biomedical text to the UMLS metathesaurus: the MetaMap program
- DeepMeSH: deep semantic representation for improving large-scale MeSH indexing. Bioinformatics
- MeSHLabeler and DeepMeSH: recent progress in large-scale MeSH indexing
- Distributed representations of sentences and documents
☆ This paper is an extended and improved version of the paper: Francesco Gargiulo, Stefano Silvestri and Mario Ciampi, Deep Convolution Neural Network for Extreme Multi-label Text Classification, presented at the AI4Health 2018 workshop and published in: BIOSTEC 2018, Proceedings of the International Joint Conference on Biomedical Engineering Systems and Technologies, Volume 5: HEALTHINF, Funchal, Madeira, Portugal, 19-21 January, 2018, pp. 641-650, ISBN: 978-989-758-281-3, INSTICC, 2018.