Expert Systems with Applications

Volume 57, 15 September 2016, Pages 232-247

Ensemble of keyword extraction methods and classifiers in text classification

https://doi.org/10.1016/j.eswa.2016.03.045

Abstract

Automatic keyword extraction is an important research direction in text mining, natural language processing and information retrieval. Keyword extraction enables us to represent text documents in a condensed way. The compact representation of documents can be helpful in several applications, such as automatic indexing, automatic summarization, automatic classification, clustering and filtering. For instance, text classification is a domain that faces the challenge of a high-dimensional feature space. Hence, extracting the most important/relevant words about the content of a document and using these keywords as features can be extremely useful. In this regard, this study examines the predictive performance of five statistical keyword extraction methods (most-frequent-based keyword extraction, term frequency-inverse sentence frequency based keyword extraction, co-occurrence statistical information based keyword extraction, eccentricity-based keyword extraction and the TextRank algorithm) in conjunction with classification algorithms and ensemble methods for scientific text document classification (categorization). A comprehensive comparison of base learning algorithms (Naïve Bayes, support vector machines, logistic regression and Random Forest) with five widely utilized ensemble methods (AdaBoost, Bagging, Dagging, Random Subspace and Majority Voting) is conducted. To the best of our knowledge, this is the first empirical analysis that evaluates the effectiveness of statistical keyword extraction methods in conjunction with ensemble learning algorithms. The classification schemes are compared in terms of classification accuracy, F-measure and area under the curve (AUC) values. To validate the empirical analysis, a two-way ANOVA test is employed. The experimental analysis indicates that the Bagging ensemble of Random Forest with the most-frequent-based keyword extraction method yields promising results for text classification. For the ACM document collection, the highest average predictive performance (93.80%) is obtained with the most-frequent-based keyword extraction method and the Bagging ensemble of the Random Forest algorithm. In general, Bagging and Random Subspace ensembles of Random Forest yield promising results. The empirical analysis indicates that the utilization of keyword-based representation of text documents in conjunction with ensemble learning can enhance the predictive performance and scalability of text classification schemes, which is of practical importance in the application fields of text classification.

Introduction

Automatic keyword extraction is the process of identifying the key terms, key phrases, key segments or keywords of a document that can appropriately represent its subject (Beliga, Mestrovic, & Martincic-Ipsic, 2015). The Web is a very rich and progressively expanding source of information. As the number of available digital documents grows, manual keyword extraction becomes an infeasible task. Keyword extraction is an important research direction in text mining, natural language processing and information retrieval. Since keyword extraction provides a compact representation of a document, many applications, such as automatic indexing, automatic summarization, automatic classification, automatic clustering and automatic filtering, can benefit from the keyword extraction process (Zhang et al., 2008).

The automatic keyword generation process can be broadly divided into two categories: keyword assignment and keyword extraction (Siddiqi & Sharan, 2015). In keyword assignment, a set of possible keywords is selected from a controlled vocabulary of words, whereas keyword extraction identifies the most relevant words available in the examined document (Beliga et al., 2015). Keyword extraction methods can be broadly grouped into four categories: statistical approaches, linguistic approaches, machine learning approaches and other approaches (Han & Kamber, 2006).

Text classification is an important subfield of text mining which assigns a text document to one or more predefined classes or categories. Several forms of text collections, such as news articles, digital libraries and Web pages, are important sources of information (Han & Kamber, 2006). Hence, text classification is an important research direction in library science, information science and computer science (Jain, Raghuvanshi, & Shrivastava, 2012). Many applications of text mining can be modelled as text classification problems, including news filtering and organization, document organization and retrieval, opinion mining (sentiment analysis), and spam filtering (Aggarwal & Zhai, 2012).

A high-dimensional feature space is a typical challenge of text classification applications (Joachims, 2002). When all the words of the training documents are used as features, the text classification process becomes a computationally intensive task (Onan & Korukoğlu, 2015). Hence, the keywords of a text collection, which are the most important/relevant words about the content of the documents, are good candidates to select as features in classification model construction (Liu and Wang, 2007, Rossi et al., 2014). Machine learning algorithms, such as Naïve Bayes, the k-nearest neighbour algorithm, support vector machines and artificial neural networks, have been successfully applied to classifying text documents (Sebastiani, 2002). Ensemble methods combine the decisions of a set of such learning algorithms so that a more robust classification model with higher predictive performance can be built (Dietterich, 2000).
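To make this idea concrete, the following minimal sketch (not the authors' exact framework) reduces each document to its k most frequent content words and feeds the resulting bag-of-words vectors to a classifier. scikit-learn is assumed to be available, and the example documents, labels and the value of k are purely illustrative.

# Minimal sketch: keyword-based document representation for classification.
# Assumption: scikit-learn is installed; the corpus and labels are illustrative.
from collections import Counter
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.ensemble import RandomForestClassifier

def top_k_keywords(text, k=10):
    # Most-frequent-based extraction: keep only the k most frequent word tokens.
    tokens = [t for t in text.lower().split() if t.isalpha() and len(t) > 2]
    return " ".join(word for word, _ in Counter(tokens).most_common(k))

docs = [
    "support vector machines learn maximum margin hyperplanes from labelled documents",
    "random forests build many decision trees on bootstrap samples of the training data",
]
labels = [0, 1]

# Reducing documents to keywords shrinks the feature space compared
# with using the full vocabulary of the training collection.
keyword_docs = [top_k_keywords(d, k=10) for d in docs]
X = CountVectorizer().fit_transform(keyword_docs)
model = RandomForestClassifier(n_estimators=100).fit(X, labels)

In a realistic setting the keyword extractor would be applied to every training and test document before vectorization, so that the classifier only ever sees the reduced representation.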

Considering these issues, this paper examines the effectiveness of statistical keyword extraction methods, base learning algorithms and ensemble methods in scientific text document classification. To the best of our knowledge, this is the first attempt that empirically evaluates the effectiveness of statistical keyword extraction methods in conjunction with ensemble learning algorithms. In the comparative evaluation, five popular ensemble methods (Boosting, Bagging, Dagging, Random Subspace and Voting) are utilized. The Naïve Bayes algorithm, support vector machines, logistic regression and the Random Forest algorithm are utilized as the base learning algorithms. In the experimental analysis, the domain-independent statistical keyword extraction framework proposed in Rossi et al. (2014) is utilized. In summary, the experimental study aims to answer the following research questions:

  • (1) Which configuration of statistical keyword extraction, classification and ensemble learning algorithms yields the highest performance in scientific text document classification?

  • (2) Is there an optimal number of keywords to represent the text documents, and which number of keywords yields promising results? (An illustrative sketch of such a sweep is given below.)
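As a minimal illustration of how the second research question could be probed, the sketch below sweeps several candidate keyword counts and reports the mean cross-validated accuracy for each; the candidate values of k, the Naïve Bayes learner and the 10-fold protocol are illustrative assumptions rather than the paper's exact settings.

# Sketch: compare several keyword counts k by mean cross-validated accuracy.
from collections import Counter
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.model_selection import cross_val_score

def most_frequent_keywords(text, k):
    tokens = [t for t in text.lower().split() if t.isalpha()]
    return " ".join(w for w, _ in Counter(tokens).most_common(k))

def sweep_keyword_counts(docs, labels, candidate_ks=(5, 10, 15, 20), folds=10):
    # Return the mean accuracy obtained with each candidate number of keywords.
    results = {}
    for k in candidate_ks:
        reduced = [most_frequent_keywords(d, k) for d in docs]
        X = CountVectorizer().fit_transform(reduced)
        results[k] = cross_val_score(MultinomialNB(), X, labels, cv=folds).mean()
    return results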

To the best of our knowledge, this is the first extensive empirical analysis that examines the predictive performance of statistical keyword extraction methods in conjunction with ensemble learning algorithms. The presented classification scheme, which integrates the Bagging ensemble of Random Forest with the most-frequent-based keyword extraction method, yields very promising results for scientific text classification. The rest of this paper is organized as follows. Section 2 briefly reviews the literature on keyword extraction and ensemble methods. Section 3 presents the statistical keyword extraction methods utilized in the experimental evaluations. Section 4 briefly describes the classification algorithms and Section 5 describes the ensemble learning methods. Section 6 presents the experimental results, discussion and statistical analysis of the empirical results on the ACM document collection. Section 7 presents the results of the ensemble classification schemes on a larger text document collection. Finally, Section 8 presents the concluding remarks.

Section snippets

Literature review

This section briefly reviews the literature on keyword extraction methods and ensemble methods.

Keyword extraction methods

Keyword extraction methods can be broadly divided into two categories: domain-dependent and domain-independent keyword extraction methods. Domain-dependent keyword extraction methods require keeping track of all the words within the text collection, whereas domain-independent keyword extraction methods do not require analysis of the entire text collection (Rossi et al., 2014). Domain-independent keyword extraction methods can achieve comparably high performance and do not require using…
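As one concrete example of a domain-independent statistical method, the sketch below scores the terms of a single document with a term frequency-inverse sentence frequency (TF-ISF) weight and returns the highest-scoring terms; the tokenization, sentence splitting and exact weighting are simplifying assumptions, not necessarily the formulation used in the paper's framework.

# Sketch of TF-ISF (term frequency - inverse sentence frequency) keyword scoring.
import math
import re
from collections import Counter

def tf_isf_keywords(document, top_k=10):
    # Split the document into sentences, then into word tokens.
    sentences = [s.split() for s in re.split(r"[.!?]+", document.lower()) if s.strip()]
    tokens = [t for sentence in sentences for t in sentence if t.isalpha()]
    tf = Counter(tokens)                                        # term frequency in the document
    n_sentences = len(sentences)
    sf = {t: sum(1 for s in sentences if t in s) for t in tf}   # sentence frequency of each term
    # TF-ISF weight: frequent terms concentrated in few sentences score highest.
    scores = {t: tf[t] * math.log(n_sentences / sf[t]) for t in tf}
    return [t for t, _ in sorted(scores.items(), key=lambda kv: -kv[1])[:top_k]]

Because the scores are computed from a single document, no corpus-level statistics are needed, which is what makes such a method domain independent.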

Classification algorithms

Machine learning algorithms have been successfully utilized in text classification. Machine learning classifiers can be broadly classified as decision trees (such as C4.5, ID3 and Random Forest), rule-based methods (such as RIPPER, PART and genetic algorithms), perceptron-based methods (such as artificial neural networks and radial basis function networks), statistical learning methods (such as Bayesian Networks and the Naïve Bayes classifier), instance-based classifiers (such as k-nearest neighbour…
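For concreteness, the following sketch cross-validates the four base learners used in this study on a keyword-based document representation; the scikit-learn implementations, TF-IDF weighting and 10-fold protocol are assumptions made for illustration, not the paper's exact configuration.

# Sketch: cross-validated comparison of the four base learners on a
# keyword-based document representation (scikit-learn assumed).
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.svm import LinearSVC
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

def compare_base_learners(keyword_docs, labels, folds=10):
    # keyword_docs: documents already reduced to their extracted keywords.
    X = TfidfVectorizer().fit_transform(keyword_docs)
    learners = {
        "Naive Bayes": MultinomialNB(),
        "Support vector machine": LinearSVC(),
        "Logistic regression": LogisticRegression(max_iter=1000),
        "Random Forest": RandomForestClassifier(n_estimators=100),
    }
    return {name: cross_val_score(clf, X, labels, cv=folds).mean()
            for name, clf in learners.items()}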

Ensemble methods

Ensemble methods are a popular research direction in machine learning and pattern recognition (Onan, 2016, Ranawana and Palade, 2006). Ensemble methods aim to combine the decisions of a set of weak learning algorithms (base learners) so that the accuracy and robustness of the resulting classification model can be enhanced. The generalization ability of ensemble methods is better than that of the single base learners. There are statistical, computational and representational reasons to build multiple…
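As a minimal sketch of the ensemble schemes discussed here, the snippet below builds Bagging and Random Subspace ensembles around Random Forest and a heterogeneous majority-voting combination; scikit-learn is assumed (in versions prior to 1.2 the BaggingClassifier argument is named base_estimator rather than estimator), and the hyperparameters are illustrative rather than the settings used in the paper.

# Sketch: Bagging and Random Subspace ensembles of Random Forest, plus Majority Voting.
from sklearn.ensemble import BaggingClassifier, RandomForestClassifier, VotingClassifier
from sklearn.naive_bayes import MultinomialNB
from sklearn.linear_model import LogisticRegression

# Bagging: each base Random Forest is trained on a bootstrap sample of the instances.
bagging_rf = BaggingClassifier(estimator=RandomForestClassifier(n_estimators=100),
                               n_estimators=10)

# Random Subspace: each base Random Forest sees all instances but only a
# random subset of the features (bootstrap=False, max_features < 1.0).
random_subspace_rf = BaggingClassifier(estimator=RandomForestClassifier(n_estimators=100),
                                       n_estimators=10, bootstrap=False, max_features=0.5)

# Majority Voting over heterogeneous base learners.
majority_vote = VotingClassifier(
    estimators=[("nb", MultinomialNB()),
                ("lr", LogisticRegression(max_iter=1000)),
                ("rf", RandomForestClassifier(n_estimators=100))],
    voting="hard")

# Each ensemble exposes the usual fit/predict interface, e.g.
# bagging_rf.fit(X_train, y_train); predictions = bagging_rf.predict(X_test)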

ACM document collection

To conduct a comprehensive experimental evaluation of the performance of statistical keyword extraction methods on document collections in scientific text classification (categorization), eight collections from the ACM Digital Library are used. In the empirical analysis, the statistical keyword extraction framework presented in Rossi et al. (2014) is adopted. All eight datasets contain documents in five classes. In Table 1, the basic descriptive information (the number of classes, the…

Experimental results on Reuters-21578 document collection

To better understand the performance of the ensemble learning methods in keyword-based text classification, we divide the experimental analysis into two sections. In the first section (Section 6), the predictive performance of five statistical keyword extraction methods, classification algorithms and ensemble learning methods is extensively analysed on the ACM document collection. Text classification is characterized by high dimensionality of the feature space. In the second section (Section 7),…

Conclusion

This paper presents an empirical analysis of five statistical keyword extraction methods (most-frequent-based keyword extraction, term frequency-inverse sentence frequency based keyword extraction, co-occurrence statistical information based keyword extraction, eccentricity-based keyword extraction and the TextRank algorithm) in conjunction with classification algorithms and ensemble learning methods.

The main contributions of this study can be summarized as follows.

References (65)

  • Xia, R., et al. (2011). Ensemble of feature sets and classification algorithms for sentiment classification. Information Sciences.
  • Yang, B., et al. (2011). Classifying text streams by keywords using classifier ensemble. Data & Knowledge Engineering.
  • Aggarwal, C. C., et al. A survey of text classification algorithms.
  • Amancio, D. R., et al. (2014). A systematic comparison of supervised classifiers. PLoS One.
  • Asuncion, A., et al. (2007). UCI machine learning repository.
  • Beliga, S., et al. (2015). An overview of graph-based keyword extraction methods and approaches. Journal of Information and Organizational Sciences.
  • Breiman, L. (1996). Bagging predictors. Machine Learning.
  • Breiman, L. (2001). Random forests. Machine Learning.
  • De Silva, N. F. F., et al. (2014). Tweet sentiment analysis with classifier ensembles. Decision Support Systems.
  • Dietterich, T. G. Ensemble methods in machine learning.
  • Fiori, A. (2014). Innovative document summarization techniques: Revolutionizing knowledge understanding.
  • Grineva, M., et al. Extracting key terms from noisy and multi-theme documents.
  • HaCohen-Kerner, Y. (2003). Automatic extraction of keywords from abstracts. Lecture Notes in Computer Science.
  • HaCohen-Kerner, Y., et al. (2005). Automatic extraction and learning of keyphrases from scientific articles. Lecture Notes in Computer Science.
  • Han, J., et al. (2006). Data mining: Concepts and techniques.
  • Ho, T. K. (1998). The random subspace method for constructing decision forests. IEEE Transactions on Pattern Analysis and Machine Intelligence.
  • Huan, C., et al. Keyphrase extraction using semantic network structure analysis.
  • Hulth, A. Improved automatic keyword extraction given more linguistic knowledge.
  • Ikonomakis, M., et al. (2005). Text classification using machine learning techniques. WSEAS Transactions on Computers.
  • Jain, A., et al. (2012). Analysis of query based text classification approach. International Journal of Advanced Research in Computer Science and Software Engineering.
  • Joachims, T. Text categorization with support vector machines: Learning with many relevant features.
  • Joachims, T. (2002). Learning to classify text using support vector machines.