Exploring ensemble oversampling method for imbalanced keyword extraction learning in policy text based on three-way decisions and SMOTE

https://doi.org/10.1016/j.eswa.2021.116051

Highlights

  • Based on SMOTE, we propose a new oversampling method by utilizing 3WD.

  • We introduce classification confidence with 3WD to handle the unbalanced data.

  • Our proposed oversampling method can be applied to achieve keyword extraction.

  • We find supervised methods can achieve better performance in keyword extraction.

Abstract

The e-government platform not only enables government departments to publish policy texts online, but also makes it easier for users to access policies, in particular by letting them grasp a policy quickly through its keywords. In a given policy text, keywords account for only a small proportion of the words, so keyword extraction can be treated as learning from an unbalanced data set. In this paper, we therefore design an automatic keyword extraction method for policy text with unbalanced data. To achieve this goal, we first propose a new ensemble oversampling method to synthesize new data. We sample data from the training set by using the Bagging method. During each sampling round, we train a logistic regression model to classify the training set. Based on the predicted probabilities, we use the classification confidence to divide the training set into three regions by means of three-way decisions (3WD). We then apply a different strategy in each region to synthesize new data. In addition, for keyword extraction of policy text, we conduct a series of experiments with classical supervised and unsupervised methods. The experimental results show that, on both the public data sets and the manual data sets, our sampling method achieves better F-measure and G-mean values regardless of which supervised machine learning method is used. This result also illustrates the advantage of 3WD: different regions receive different strategies for synthesizing new data.

Introduction

With the development of information technology, e-government enables the government to provide more effective services to citizens and companies in China. Nowadays, local governments are willing to publish policy information on online platforms. Such a platform can provide a policy inquiry service for enterprises and individuals and improve the transparency of the government. Since the Chinese government attaches great importance to the development of small and medium-sized enterprises (SMEs), it has issued many new supporting policies. To publish these supporting policies online, some cities have launched policy service recommendation platforms, such as Xiamen city and Chengdu city. However, on these recommendation platforms, staff members manually extract the key information of the policy texts. Taking Chengdu city as an example, its policy service recommendation platform is developed as a WeChat applet (see Fig. 1). In Fig. 1, the words in the red box are the keywords, which staff extract manually from the corresponding policy texts. In practice, this manual extraction work is time-consuming and inefficient. Hence, in this paper, we develop an automated method to extract key information for these platforms.

When people read a policy text, they initially prefer to catch the key information rather than read through the whole text, because their time and energy are limited. Understanding the subjects of the policy text is the primary requirement of reading (Firoozeh, Nazarenko, Alizon, & Daille, 2020). Keywords contain the main information and can help people understand the content of the policy text (Zhang, Tuo et al., 2020). Moreover, keywords play an important role in many applications, such as Text Mining, Information Retrieval (IR), and Natural Language Processing (NLP) (Firoozeh et al., 2020). However, manually extracting keywords from a large volume of texts is a hard task in practice. Therefore, it is necessary to extract keywords automatically (Biswas et al., 2018, Onan et al., 2016).

Automatic keyword extraction is the process of identifying key terms, key phrases, key segments or keywords from documents (Onan et al., 2016). In general, there are two main directions in keyword extraction: supervised and unsupervised methods. In supervised methods, keyword extraction is regarded as a classification problem in which each candidate keyword is labeled as either a keyword or a non-keyword (Firoozeh et al., 2020). Several machine learning algorithms have been used for keyword extraction, such as the support vector machine (Zhang, Xu, Tang, & Li, 2006) and the Naïve Bayes classifier (Wang, Wang, Gao, & Yu, 2014). Unsupervised methods can be subdivided into two subcategories. One is based on statistical methods, such as TF-IDF (Yu & Shan, 2015). The other is based on graph methods, such as latent Dirichlet allocation (LDA) (Liu, Xie, & Wu, 2016), TextRank (Mihalcea & Tarau, 2004), and centrality measures (Vega-Oliveros, Gomes, Milios, & Berton, 2019). In these unsupervised approaches, candidate keywords are scored by using different kinds of techniques (Firoozeh et al., 2020). Usually, supervised methods achieve much better performance than unsupervised methods because they exploit labeled data. Therefore, in this paper, we mainly investigate supervised methods for the keyword extraction of policy text. However, the keyword extraction task involves an unbalanced data set: a given document contains few keywords but a great many non-keywords. Hence, it is necessary to address the imbalance problem for the keyword extraction of policy text.
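To make the unsupervised scoring idea concrete, the following minimal sketch ranks the terms of each document by a plain TF-IDF score. It is only an illustration under stated assumptions: the function names, the toy documents and the unsmoothed weighting tf × log(N / df) are hypothetical choices, not the configuration used in the cited works, and every document is assumed to be a non-empty list of tokens.

```python
import math
from collections import Counter


def tfidf_scores(docs):
    """Return one {term: score} dict per tokenized, non-empty document."""
    n_docs = len(docs)
    # Document frequency: number of documents that contain each term.
    df = Counter(term for doc in docs for term in set(doc))
    scores = []
    for doc in docs:
        tf = Counter(doc)
        scores.append({
            # Relative term frequency times unsmoothed inverse document frequency.
            term: (count / len(doc)) * math.log(n_docs / df[term])
            for term, count in tf.items()
        })
    return scores


def top_keywords(docs, k=3):
    """Rank each document's terms by score (ties broken alphabetically)."""
    return [
        sorted(s, key=lambda t: (-s[t], t))[:k]
        for s in tfidf_scores(docs)
    ]


# Hypothetical toy corpus standing in for tokenized policy texts.
docs = [
    ["tax", "subsidy", "for", "small", "enterprises"],
    ["loan", "guarantee", "for", "small", "enterprises"],
    ["talent", "housing", "subsidy", "for", "graduates"],
]
scores = tfidf_scores(docs)
best = top_keywords(docs, k=1)
```

A term that occurs in every document, such as "for" in the toy corpus, receives a score of zero, whereas a term unique to one document is ranked first.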

In real life, unbalanced data sets exist in many application fields, including biomedical science (Herndon & Caragea, 2016), finance (Zakaryazad & Duman, 2016), and information security (Zhong, Raahemi, & Liu, 2013). Unbalanced data means that the number of samples in the majority class is much larger than the number of samples in the minority class (Xu, Shen, Nie, & Kou, 2020). However, we are usually more interested in the minority class than in the majority class. In this paper, keywords of policy text are our minority class and non-keywords are our majority class. Traditional classification methods usually assume that the distribution of data categories is balanced and that the cost of misclassification is equal. When applied to policy text, the classification model is therefore inevitably biased towards the majority class and ignores the minority class, resulting in a low classification accuracy for the minority class (Li, Chai, Hu, & Yin, 2019).

To address the imbalance problem, two basic strategies are often employed: cost-sensitive learning and resampling (Guo, Li, Shang, Gu, Huang, & Gong, 2017). Cost-sensitive learning assigns a higher cost to misclassifying minority class samples than to misclassifying majority class samples. Although cost-sensitive learning is computationally efficient, it is still much less popular than resampling methods because the cost matrix values are difficult to set and the learning algorithm must be modified (Guo et al., 2017). Among resampling methods, undersampling reduces the imbalance of the data set by removing majority class samples, whereas oversampling adds minority class samples to the unbalanced data set (Liang, Jiang, Li, et al., 2020). Among all sampling techniques, the SMOTE algorithm is widely used and has been extended as a basic oversampling method (Fernández et al., 2018, Han et al., 2005, Hu et al., 2018, Liang et al., 2020, Pan et al., 2020). These variants divide samples into Danger or Inner areas and apply a different strategy to each. In recent years, three-way decisions (3WD), proposed by Yao (2007, 2010, 2012), have been widely used as a new decision model. 3WD offers three possible outcomes for a decision: acceptance, rejection, and deferment. When the information about an object is insufficient, 3WD does not make an immediate decision and instead puts the object into a deferment region. Thus, by using 3WD, we can reduce the decision risk and avoid misclassification (Jia et al., 2019, Li et al., 2017, Shen, Wei et al., 2020, Yu et al., 2020). Inspired by the results of Hu et al. (2018) and Yao (2007), we divide samples into three regions with the help of 3WD and design different strategies to synthesize new data in the framework of SMOTE.
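The three-region idea can be sketched as follows. This is a minimal illustration, not the exact rule of the proposed method: the thresholds alpha and beta, the function names and the sample probabilities are hypothetical, and the confidence is assumed here to be the absolute gap between the two predicted class probabilities. In the proposed method the probabilities would come from logistic regression models trained on Bagging samples; here they are simply given.

```python
def confidence(p_pos):
    """Gap between the positive and negative class probabilities, in [0, 1]."""
    return abs(p_pos - (1.0 - p_pos))


def three_way_regions(p_pos_list, alpha=0.7, beta=0.3):
    """Split sample indices into acceptance, deferment and rejection regions.

    alpha and beta are hypothetical thresholds with beta < alpha.
    """
    if not beta < alpha:
        raise ValueError("beta must be smaller than alpha")
    regions = {"accept": [], "defer": [], "reject": []}
    for i, p in enumerate(p_pos_list):
        if p >= alpha:
            regions["accept"].append(i)   # confidently positive
        elif p <= beta:
            regions["reject"].append(i)   # confidently negative
        else:
            regions["defer"].append(i)    # insufficient information
    return regions


# Hypothetical predicted probabilities of the positive (keyword) class.
probs = [0.95, 0.55, 0.10, 0.70, 0.31]
regions = three_way_regions(probs)
```

With symmetric thresholds such as 0.7 and 0.3, the deferment region contains exactly the samples whose confidence gap is below 0.4, so low-confidence samples can be given their own synthesis strategy.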

With regard to keyword extraction of policy text, this paper not only investigates different sampling methods, but also compares the performance of supervised methods with that of unsupervised methods, because keyword extraction can be seen as an imbalanced classification task. The main contributions of this paper are summarized as follows:

  • Based on SMOTE, we propose a new oversampling method by combining a machine learning method with 3WD.

  • In order to handle the unbalanced data, we introduce the classification confidence to divide data based on 3WD, which can apply different synthesis strategies for different sample data and enhance decision effectiveness.

  • Our proposed new oversampling method can be applied to achieve the automatic extraction of keywords for policy text.

  • According to the results of experimental analysis, we find that the supervised methods can achieve better performance in keyword extraction of policy text.

The rest of the paper is organized as follows. Section 2 introduces related work. In Section 3, we describe the research question and present our model in detail. In Section 4, we design experiments and provide the corresponding discussion. Finally, Section 5 draws conclusions and outlines possible future work.

Section snippets

Three-way decisions (3WD)

From a semantic perspective, 3WD can assign different actions to objects in different regions (Yao, 2007, Yao, 2010). Combined with a loss function, the action with the least expected loss is adopted. Thus, 3WD can always give the decision-making scheme with the minimum expected loss. Recently, 3WD has been widely applied in many fields, including medical diagnosis (Chu et al., 2020, Li et al., 2020, Wang et al., 2020), credit risk management (Maldonado et al., 2020, Shen,
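The minimum expected loss rule mentioned in this snippet can be illustrated with a short sketch. The loss values below are hypothetical and chosen only for illustration; the threshold formulas are the standard ones from decision-theoretic rough sets, which hold when the usual ordering of the losses is satisfied. Names such as LOSS and three_way_decide are assumptions of this sketch.

```python
def expected_losses(p, loss):
    """Expected loss of each action given Pr(positive | x) = p.

    loss[action] = (loss if x is positive, loss if x is negative).
    """
    return {a: lp * p + ln * (1.0 - p) for a, (lp, ln) in loss.items()}


def three_way_decide(p, loss):
    """Pick the action with the smallest expected loss."""
    risks = expected_losses(p, loss)
    return min(risks, key=risks.get)


def thresholds(loss):
    """Closed-form (alpha, beta) implied by the loss function."""
    l_pp, l_pn = loss["accept"]
    l_bp, l_bn = loss["defer"]
    l_np, l_nn = loss["reject"]
    alpha = (l_pn - l_bn) / ((l_pn - l_bn) + (l_bp - l_pp))
    beta = (l_bn - l_nn) / ((l_bn - l_nn) + (l_np - l_bp))
    return alpha, beta


# Hypothetical losses: wrong acceptance or rejection costs 8, deferment costs 2.
LOSS = {
    "accept": (0.0, 8.0),
    "defer": (2.0, 2.0),
    "reject": (8.0, 0.0),
}
alpha, beta = thresholds(LOSS)
decisions = [three_way_decide(p, LOSS) for p in (0.9, 0.5, 0.1)]
```

Under these assumed losses, an object is accepted when its probability is at least alpha, rejected when it is at most beta, and deferred in between, which agrees with choosing the action of least expected loss directly.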

Research question and methodology

In this section, we propose an ensemble imbalanced learning method for keyword extraction from policy text. In particular, to handle the imbalance problem of policy text, we focus on integrating oversampling methods. First, we briefly describe our research question. Then, we introduce the calculation process of the classic oversampling algorithm, i.e., SMOTE. Finally, we improve SMOTE based on a logistic regression model and 3WD. For clarity, we present the flow chart of the oversampling method in Fig. 2.
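As a reminder of how classic SMOTE synthesizes data, the sketch below interpolates between a minority sample and one of its k nearest minority neighbours. It is a simplified variant written for illustration: seed samples are drawn at random rather than visited in turn, the function name and parameters are hypothetical, and at least two minority samples are assumed. It does not include the logistic regression or 3WD extensions proposed in this paper.

```python
import math
import random


def smote(minority, n_new, k=5, seed=0):
    """Create n_new synthetic points by SMOTE-style interpolation.

    minority: list of feature vectors (lists of floats), at least two of them.
    """
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        i = rng.randrange(len(minority))
        x = minority[i]
        # Indices of the k nearest minority neighbours of x, excluding x itself.
        neighbours = sorted(
            (j for j in range(len(minority)) if j != i),
            key=lambda j: math.dist(x, minority[j]),
        )[:k]
        z = minority[rng.choice(neighbours)]
        u = rng.random()  # interpolation weight in [0, 1)
        # New point on the line segment between x and the chosen neighbour.
        synthetic.append([a + u * (b - a) for a, b in zip(x, z)])
    return synthetic


# Hypothetical two-dimensional minority samples.
minority = [[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
new_points = smote(minority, n_new=6, k=2)
```

Because each synthetic point lies on a segment between two existing minority samples, all new points stay inside the bounding box of the minority class.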

Experimental analysis and discussions

In this section, we first present our data sets, several keyword extraction methods and several classic sampling methods. Then, we compare our proposed sampling method with other sampling methods on these data sets. With respect to keyword extraction of policy text, we also compare supervised methods with unsupervised methods in depth in order to verify the effectiveness of our proposed method. Finally, we conduct a sensitivity analysis and discuss the results.
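The abstract reports results in terms of the F-measure and the G-mean. The following sketch shows one common way to compute both from binary predictions; the label vectors are hypothetical and the helper names are assumptions of this sketch, not part of the experimental code.

```python
import math


def confusion(y_true, y_pred, positive=1):
    """Return (tp, fp, tn, fn) for a binary problem."""
    tp = fp = tn = fn = 0
    for t, p in zip(y_true, y_pred):
        if p == positive:
            if t == positive:
                tp += 1
            else:
                fp += 1
        elif t == positive:
            fn += 1
        else:
            tn += 1
    return tp, fp, tn, fn


def f_measure(y_true, y_pred, beta=1.0):
    """Weighted harmonic mean of precision and recall on the positive class."""
    tp, fp, _, fn = confusion(y_true, y_pred)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    denom = beta ** 2 * precision + recall
    return (1 + beta ** 2) * precision * recall / denom if denom else 0.0


def g_mean(y_true, y_pred):
    """Geometric mean of recall (sensitivity) and specificity."""
    tp, fp, tn, fn = confusion(y_true, y_pred)
    recall = tp / (tp + fn) if tp + fn else 0.0
    specificity = tn / (tn + fp) if tn + fp else 0.0
    return math.sqrt(recall * specificity)


# Hypothetical labels: 2 keywords (1) among 10 candidates.
y_true = [1, 1, 0, 0, 0, 0, 0, 0, 0, 0]
y_pred = [1, 0, 0, 0, 0, 0, 0, 0, 0, 1]
f1 = f_measure(y_true, y_pred)
gm = g_mean(y_true, y_pred)
```

In this toy case the plain accuracy is 0.8 even though only half of the keywords are found, which is why metrics that account for the minority class are preferred on unbalanced data.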

Conclusions

On the e-government platform, keyword extraction of policy text can be regarded as a classification problem on unbalanced data sets. To address the unbalanced data challenge, we develop a new ensemble oversampling method for policy text analysis. Logistic regression can provide the predicted probabilities of the positive and negative classes. Based on the gap between these predicted probabilities, we introduce the classification confidence and divide the training data into three regions with the help of 3WD. For each

CRediT authorship contribution statement

Decui Liang: Conceptualization, Supervision, Writing – review & editing. Bochun Yi: Data curation, Investigation, Methodology, Writing – original draft. Wen Cao: Visualization, Writing – review & editing. Qiang Zheng: Experimental analysis, Writing – review & editing.

Declaration of Competing Interest

The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.

Acknowledgments

This work is partially supported by the National Key R&D Program of China (No. 2020YFB1711900), the National Natural Science Foundation of China (No. 72071030), the Planning Fund for the Humanities and Social Sciences of Ministry of Education of China (No. 19YJA630042) and the Social Science Planning Project of the Sichuan Province (No. SC20C007).

References (57)

  • Onan, A. et al. (2017). A hybrid ensemble pruning approach based on consensus clustering and multi-objective evolutionary algorithm for sentiment classification. Information Processing & Management.
  • Pan, T. et al. (2020). Learning imbalanced datasets based on SMOTE and Gaussian distribution. Information Sciences.
  • Shen, W. et al. (2020). Three-way decisions based blocking reduction models in hierarchical classification. Information Sciences.
  • Wang, Y. et al. (2020). BWM and MULTIMOORA-based multigranulation sequential three-way decision model for multi-attribute group decision-making problem. International Journal of Approximate Reasoning.
  • Xu, Y. et al. (2020). Three sequential multi-class three-way decision models. Information Sciences.
  • Yao, Y. Y. (2010). Three-way decisions with probabilistic rough sets. Information Sciences.
  • Yu, H. et al. (2020). An active three-way clustering method via low-rank matrices for multi-view data. Information Sciences.
  • Zakaryazad, A. et al. (2016). A profit-driven artificial neural network (ANN) with applications to fraud detection and direct marketing. Neurocomputing.
  • Zhang, Y. et al. (2020). Keywords extraction with deep neural network model. Neurocomputing.
  • Batista, G. et al. (2004). A study of the behavior of several methods for balancing machine learning training data. ACM SIGKDD Explorations Newsletter.
  • Chawla, N. V. et al. (2002). SMOTE: Synthetic minority over-sampling technique. Journal of Artificial Intelligence Research.
  • Fernández, A. et al. (2018). SMOTE for learning from imbalanced data: progress and challenges, marking the 15-year anniversary. Journal of Artificial Intelligence Research.
  • Firoozeh, N. et al. (2020). Keyword extraction: Issues and methods. Natural Language Engineering.
  • Frumosu, F. D. et al. (2020). Cost-sensitive learning classification strategy for predicting product failures. Expert Systems with Applications.
  • Gu, Y. J. et al. (2014). Study on keyword extraction with LDA and textrank combination. New Technology of Library and Information Service.
  • Guo, H. X. et al. (2017). Learning from class-imbalanced data: Review of methods and applications. Expert Systems with Applications.
  • Han, H., Wang, W. Y., & Mao, B. H. (2005). Borderline-SMOTE: A new over-sampling method in imbalanced data sets...
  • He, H. B., Bai, Y., Garcia, E. A., & Li, S. T. (2008). ADASYN: Adaptive synthetic sampling approach for imbalanced...