Expert Systems with Applications

Volume 57, 15 September 2016, Pages 232-247

Ensemble of keyword extraction methods and classifiers in text classification

https://doi.org/10.1016/j.eswa.2016.03.045

Abstract

Automatic keyword extraction is an important research direction in text mining, natural language processing and information retrieval. Keyword extraction enables us to represent text documents in a condensed way. The compact representation of documents can be helpful in several applications, such as automatic indexing, automatic summarization, automatic classification, clustering and filtering. For instance, text classification is a domain that faces the challenge of a high-dimensional feature space. Hence, extracting the most important/relevant words about the content of a document and using these keywords as features can be extremely useful. In this regard, this study examines the predictive performance of five statistical keyword extraction methods (most-frequent-based keyword extraction, term frequency-inverse sentence frequency based keyword extraction, co-occurrence statistical information based keyword extraction, eccentricity-based keyword extraction and the TextRank algorithm) in conjunction with classification algorithms and ensemble methods for scientific text document classification (categorization). A comprehensive comparison of base learning algorithms (Naïve Bayes, support vector machines, logistic regression and Random Forest) with five widely utilized ensemble methods (AdaBoost, Bagging, Dagging, Random Subspace and Majority Voting) is conducted. To the best of our knowledge, this is the first empirical analysis that evaluates the effectiveness of statistical keyword extraction methods in conjunction with ensemble learning algorithms. The classification schemes are compared in terms of classification accuracy, F-measure and area under the curve (AUC) values. To validate the empirical analysis, a two-way ANOVA test is employed. The experimental analysis indicates that the Bagging ensemble of Random Forest with the most-frequent-based keyword extraction method yields promising results for text classification. For the ACM document collection, the highest average predictive performance (93.80%) is obtained with the most-frequent-based keyword extraction method and the Bagging ensemble of the Random Forest algorithm. In general, Bagging and Random Subspace ensembles of Random Forest yield promising results. The empirical analysis indicates that the utilization of keyword-based representation of text documents in conjunction with ensemble learning can enhance the predictive performance and scalability of text classification schemes, which is of practical importance in the application fields of text classification.

Introduction

Automatic keyword extraction is the process of identifying the key terms, key phrases, key segments or keywords of a document that can appropriately represent its subject (Beliga, Mestrovic, & Martincic-Ipsic, 2015). The Web is a very rich and progressively expanding source of information. As the number of available digital documents grows, manual keyword extraction becomes an infeasible task. Keyword extraction is an important research direction in text mining, natural language processing and information retrieval. Since keyword extraction provides a compact representation of a document, many applications, such as automatic indexing, automatic summarization, automatic classification, automatic clustering and automatic filtering, can benefit from the keyword extraction process (Zhang et al., 2008).

The automatic keyword generation process can be broadly divided into two categories: keyword assignment and keyword extraction (Siddiqi & Sharan, 2015). In keyword assignment, a set of possible keywords is selected from a controlled vocabulary of words, whereas keyword extraction identifies the most relevant words available in the examined document (Beliga et al., 2015). Keyword extraction methods can be broadly grouped into four categories: statistical approaches, linguistic approaches, machine learning approaches and other approaches (Han & Kamber, 2006).

Text classification is an important subfield of text mining which assigns a text document to one or more predefined classes or categories. Several forms of text collections, such as news articles, digital libraries and Web pages, are important sources of information (Han & Kamber, 2006). Hence, text classification is an important research direction in library science, information science and computer science (Jain, Raghuvanshi, & Shrivastava, 2012). Many applications of text mining can be modelled as text classification problems, including news filtering and organization, document organization and retrieval, opinion mining (sentiment analysis), and spam filtering (Aggarwal & Zhai, 2012).

A high-dimensional feature space is a typical challenge of text classification applications (Joachims, 2002). When all the words of the training documents are used as features, the text classification process becomes a computationally intensive task (Onan & Korukoğlu, 2015). Hence, the keywords of a text collection, which are the most important/relevant words about the content of the documents, are good candidates to select as features in classification model construction (Liu and Wang, 2007, Rossi et al., 2014). Machine learning algorithms, such as Naïve Bayes, the k-nearest neighbour algorithm, support vector machines and artificial neural networks, have been successfully applied to classifying text documents (Sebastiani, 2002). Ensemble methods combine the decisions of a set of such learning algorithms so that a more robust classification model with higher predictive performance can be built (Dietterich, 2000).
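To make this idea concrete, the following minimal sketch (not the authors' exact framework) reduces each document to its k most frequent content words and feeds the resulting bag-of-words vectors to a classifier. scikit-learn is assumed to be available, and the example documents, labels and the value of k are purely illustrative.

# Minimal sketch: keyword-based document representation for classification.
# Assumption: scikit-learn is installed; the corpus and labels are illustrative.
from collections import Counter
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.ensemble import RandomForestClassifier

def top_k_keywords(text, k=10):
    # Most-frequent-based extraction: keep only the k most frequent word tokens.
    tokens = [t for t in text.lower().split() if t.isalpha() and len(t) > 2]
    return " ".join(word for word, _ in Counter(tokens).most_common(k))

docs = [
    "support vector machines learn maximum margin hyperplanes from labelled documents",
    "random forests build many decision trees on bootstrap samples of the training data",
]
labels = [0, 1]

# Reducing documents to keywords shrinks the feature space compared
# with using the full vocabulary of the training collection.
keyword_docs = [top_k_keywords(d, k=10) for d in docs]
X = CountVectorizer().fit_transform(keyword_docs)
model = RandomForestClassifier(n_estimators=100).fit(X, labels)

In a realistic setting the keyword extractor would be applied to every training and test document before vectorization, so that the classifier only ever sees the reduced representation.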

Considering these issues, this paper examines the effectiveness of statistical keyword extraction methods, base learning algorithms and ensemble methods in scientific text document classification. To the best of our knowledge, this is the first attempt that empirically evaluates the effectiveness of statistical keyword extraction methods in conjunction with ensemble learning algorithms. In the comparative evaluation, five popular ensemble methods (Boosting, Bagging, Dagging, Random Subspace and Voting) are utilized. The Naïve Bayes algorithm, support vector machines, logistic regression and the Random Forest algorithm are utilized as the base learning algorithms. In the experimental analysis, the domain-independent statistical keyword extraction framework proposed in Rossi et al. (2014) is utilized. In summary, the experimental study aims to answer the following research questions:

  • (1) Which configuration of statistical keyword extraction, classification and ensemble learning algorithms yields the highest performance in scientific text document classification?

  • (2) Is there an optimal number of keywords to represent the text documents, and which number of keywords yields promising results? (An illustrative sketch of such a sweep is given below.)
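As a minimal illustration of how the second research question could be probed, the sketch below sweeps several candidate keyword counts and reports the mean cross-validated accuracy for each; the candidate values of k, the Naïve Bayes learner and the 10-fold protocol are illustrative assumptions rather than the paper's exact settings.

# Sketch: compare several keyword counts k by mean cross-validated accuracy.
from collections import Counter
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.model_selection import cross_val_score

def most_frequent_keywords(text, k):
    tokens = [t for t in text.lower().split() if t.isalpha()]
    return " ".join(w for w, _ in Counter(tokens).most_common(k))

def sweep_keyword_counts(docs, labels, candidate_ks=(5, 10, 15, 20), folds=10):
    # Return the mean accuracy obtained with each candidate number of keywords.
    results = {}
    for k in candidate_ks:
        reduced = [most_frequent_keywords(d, k) for d in docs]
        X = CountVectorizer().fit_transform(reduced)
        results[k] = cross_val_score(MultinomialNB(), X, labels, cv=folds).mean()
    return results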

To the best of our knowledge, this is the first extensive empirical analysis that examines the predictive performance of statistical keyword extraction methods in conjunction with ensemble learning algorithms. The presented classification scheme, which integrates the Bagging ensemble of Random Forest with the most-frequent-based keyword extraction method, yields very promising results for scientific text classification. The rest of this paper is organized as follows. Section 2 briefly reviews the literature on keyword extraction and ensemble methods. Section 3 presents the statistical keyword extraction methods utilized in the experimental evaluations. Section 4 briefly describes the classification algorithms and Section 5 describes the ensemble learning methods. Section 6 presents the experimental results, discussion and statistical analysis of the empirical results on the ACM document collection. Section 7 presents the results of the ensemble classification schemes on a larger text document collection. Finally, Section 8 presents the concluding remarks.

Section snippets

Literature review

This section briefly reviews the literature on keyword extraction methods and ensemble methods.

Keyword extraction methods

Keyword extraction methods can be broadly divided into two categories: domain-dependent and domain-independent keyword extraction methods. Domain-dependent keyword extraction methods require keeping track of all the words within the text collection, whereas domain-independent keyword extraction methods do not require analysis of the entire text collection (Rossi et al., 2014). Domain-independent keyword extraction methods can achieve comparably high performance and do not require using…
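As one concrete example of a domain-independent statistical method, the sketch below scores the terms of a single document with a term frequency-inverse sentence frequency (TF-ISF) weight and returns the highest-scoring terms; the tokenization, sentence splitting and exact weighting are simplifying assumptions, not necessarily the formulation used in the paper's framework.

# Sketch of TF-ISF (term frequency - inverse sentence frequency) keyword scoring.
import math
import re
from collections import Counter

def tf_isf_keywords(document, top_k=10):
    # Split the document into sentences, then into word tokens.
    sentences = [s.split() for s in re.split(r"[.!?]+", document.lower()) if s.strip()]
    tokens = [t for sentence in sentences for t in sentence if t.isalpha()]
    tf = Counter(tokens)                                        # term frequency in the document
    n_sentences = len(sentences)
    sf = {t: sum(1 for s in sentences if t in s) for t in tf}   # sentence frequency of each term
    # TF-ISF weight: frequent terms concentrated in few sentences score highest.
    scores = {t: tf[t] * math.log(n_sentences / sf[t]) for t in tf}
    return [t for t, _ in sorted(scores.items(), key=lambda kv: -kv[1])[:top_k]]

Because the scores are computed from a single document, no corpus-level statistics are needed, which is what makes such a method domain independent.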

Classification algorithms

Machine learning algorithms have been successfully utilized in text classification. Machine learning classifiers can be broadly classified as decision trees (such as C4.5, ID3 and Random Forest), rule-based methods (such as RIPPER, PART and genetic algorithms), perceptron-based methods (such as artificial neural networks and radial basis function networks), statistical learning methods (such as Bayesian Networks and the Naïve Bayes classifier), instance-based classifiers (such as k-nearest neighbour…
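For concreteness, the following sketch cross-validates the four base learners used in this study on a keyword-based document representation; the scikit-learn implementations, TF-IDF weighting and 10-fold protocol are assumptions made for illustration, not the paper's exact configuration.

# Sketch: cross-validated comparison of the four base learners on a
# keyword-based document representation (scikit-learn assumed).
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.svm import LinearSVC
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

def compare_base_learners(keyword_docs, labels, folds=10):
    # keyword_docs: documents already reduced to their extracted keywords.
    X = TfidfVectorizer().fit_transform(keyword_docs)
    learners = {
        "Naive Bayes": MultinomialNB(),
        "Support vector machine": LinearSVC(),
        "Logistic regression": LogisticRegression(max_iter=1000),
        "Random Forest": RandomForestClassifier(n_estimators=100),
    }
    return {name: cross_val_score(clf, X, labels, cv=folds).mean()
            for name, clf in learners.items()}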

Ensemble methods

Ensemble methods are a popular research direction in machine learning and pattern recognition (Onan, 2016, Ranawana and Palade, 2006). Ensemble methods aim to combine the decisions of a set of weak learning algorithms (base learners) so that the accuracy and robustness of the resulting classification model can be enhanced. The generalization ability of ensemble methods is better than that of the single base learners. There are statistical, computational and representational reasons to build multiple…
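As a minimal sketch of the ensemble schemes discussed here, the snippet below builds Bagging and Random Subspace ensembles around Random Forest and a heterogeneous majority-voting combination; scikit-learn is assumed (in versions prior to 1.2 the BaggingClassifier argument is named base_estimator rather than estimator), and the hyperparameters are illustrative rather than the settings used in the paper.

# Sketch: Bagging and Random Subspace ensembles of Random Forest, plus Majority Voting.
from sklearn.ensemble import BaggingClassifier, RandomForestClassifier, VotingClassifier
from sklearn.naive_bayes import MultinomialNB
from sklearn.linear_model import LogisticRegression

# Bagging: each base Random Forest is trained on a bootstrap sample of the instances.
bagging_rf = BaggingClassifier(estimator=RandomForestClassifier(n_estimators=100),
                               n_estimators=10)

# Random Subspace: each base Random Forest sees all instances but only a
# random subset of the features (bootstrap=False, max_features < 1.0).
random_subspace_rf = BaggingClassifier(estimator=RandomForestClassifier(n_estimators=100),
                                       n_estimators=10, bootstrap=False, max_features=0.5)

# Majority Voting over heterogeneous base learners.
majority_vote = VotingClassifier(
    estimators=[("nb", MultinomialNB()),
                ("lr", LogisticRegression(max_iter=1000)),
                ("rf", RandomForestClassifier(n_estimators=100))],
    voting="hard")

# Each ensemble exposes the usual fit/predict interface, e.g.
# bagging_rf.fit(X_train, y_train); predictions = bagging_rf.predict(X_test)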

ACM document collection

To conduct a comprehensive experimental evaluation of the performance of statistical keyword extraction methods on document collections in scientific text classification (categorization), eight collections from the ACM Digital Library are used. In the empirical analysis, the statistical keyword extraction framework presented in Rossi et al. (2014) is adopted. All eight datasets contain documents in five classes. In Table 1, the basic descriptive information (the number of classes, the…

Experimental results on Reuters-21578 document collection

To better understand the performance of the ensemble learning methods in keyword-based text classification, we divide the experimental analysis into two sections. In the first section (Section 6), the predictive performance of five statistical keyword extraction methods, classification algorithms and ensemble learning methods is extensively analysed on the ACM document collection. Text classification is characterized by high dimensionality of the feature space. In the second section (Section 7),…

Conclusion

This paper presents an empirical analysis of five statistical keyword extraction methods (most-frequent-based keyword extraction, term frequency-inverse sentence frequency based keyword extraction, co-occurrence statistical information based keyword extraction, eccentricity-based keyword extraction and the TextRank algorithm) in conjunction with classification algorithms and ensemble learning methods.

The main contributions of this study can be summarized as follows.

References (65)

  • Xia, R., et al. (2011). Ensemble of feature sets and classification algorithms for sentiment classification. Information Sciences.
  • Yang, B., et al. (2011). Classifying text streams by keywords using classifier ensemble. Data & Knowledge Engineering.
  • Aggarwal, C. C., et al. A survey of text classification algorithms.
  • Amancio, D. R., et al. (2014). A systematic comparison of supervised classifiers. PLoS One.
  • Asuncion, A., et al. (2007). UCI machine learning repository.
  • Beliga, S., et al. (2015). An overview of graph-based keyword extraction methods and approaches. Journal of Information and Organizational Sciences.
  • Breiman, L. (1996). Bagging predictors. Machine Learning.
  • Breiman, L. (2001). Random forests. Machine Learning.
  • De Silva, N. F. F., et al. (2014). Tweet sentiment analysis with classifier ensembles. Decision Support Systems.
  • Dietterich, T. G. Ensemble methods in machine learning.
  • Fiori, A. (2014). Innovative document summarization techniques: Revolutionizing knowledge understanding.
  • Grineva, M., et al. Extracting key terms from noisy and multi-theme documents.
  • HaCohen-Kerner, Y. (2003). Automatic extraction of keywords from abstracts. Lecture Notes in Computer Science.
  • HaCohen-Kerner, Y., et al. (2005). Automatic extraction and learning of keyphrases from scientific articles. Lecture Notes in Computer Science.
  • Han, J., et al. (2006). Data mining: Concepts and techniques.
  • Ho, T. K. (1998). The random subspace method for constructing decision forests. IEEE Transactions on Pattern Analysis and Machine Intelligence.
  • Huan, C., et al. Keyphrase extraction using semantic network structure analysis.
  • Hulth, A. Improved automatic keyword extraction given more linguistic knowledge.
  • Ikonomakis, M., et al. (2005). Text classification using machine learning techniques. WSEAS Transactions on Computers.
  • Jain, A., et al. (2012). Analysis of query based text classification approach. International Journal of Advanced Research in Computer Science and Software Engineering.
  • Joachims, T. Text categorization with support vector machines: Learning with many relevant features.
  • Joachims, T. (2002). Learning to classify text using support vector machines.