Sample imbalance disease classification model based on association rule feature selection
Introduction
Diagnostic decision support systems, as an application branch of the Clinical Decision Support System (CDSS) [2], have long been a highlight of research and application. Machine learning is widely used in medical information mining [11] and in diagnostic research [12], [20], [28], [31], [32]. According to the basis of its decision support, a CDSS falls into two categories: knowledge-based and non-knowledge-based. A knowledge-based CDSS is essentially an expert system in the medical field: it encodes a large amount of medical experts' knowledge and experience and usually consists of a human-machine interface, a knowledge base and an inference engine. Unlike a knowledge-based CDSS, a non-knowledge-based CDSS is integrated with the Electronic Medical Record (EMR) system, and its decision support rests on machine learning or other statistical pattern-recognition algorithms. Applying association rule mining, classification, regression and similar techniques to EMR datasets can continuously discover new knowledge that helps doctors make better decisions during disease diagnosis. With the continuous development of data mining, pattern recognition and machine learning algorithms, and the continuous accumulation of electronic medical record data, machine-learning-based CDSS research has gradually become the hotspot and mainstream direction of current diagnostic decision support systems.
Two issues remain in current research. First, it is difficult to construct a diagnostic model directly for medical conditions with few training samples. Moreover, as the diagnostic capability of a model improves, the required features keep expanding, inflating the dimensionality of the feature matrix; this introduces more redundant and uncorrelated features, excessive computation, sparse training samples and overfitting, all of which ultimately degrade the classification quality of the classifier. In a disease diagnosis model, when the number of diagnosable disease categories is small and the classification accuracy is low, the diagnostic decision model loses practical value. At the same time, electronic medical records contain a large number of feature attributes, and different diseases call for different feature subsets; how to select the optimal features is the main difficulty in improving the multi-class disease classification task. Second, most disease datasets exhibit an imbalanced class distribution [5]. In multi-class disease classification, sample imbalance is a key limiting factor: for some specific diseases the number of available samples is relatively small, which leads to sparse training samples during classification and hurts both the accuracy and the generalization performance of the multi-classification task.
First, we mined and analyzed the association rules between diabetes, its complications and symptoms in the EMR dataset. Then we ranked the symptom feature attributes by the confidence of the disease-symptom two-item frequent sets and applied sequential forward selection, using the classification performance of the classifier to guide the selection; this overcomes the heavy computation, training-sample sparseness and overfitting caused by the curse of dimensionality in the feature matrix. Finally, to handle the sample imbalance of the EMR dataset, the training set is partitioned into subsets by category and represented with the feature vectors selected in the previous stage. Random equalization sampling is then performed in each iteration of training: a base classifier is trained on each balanced sample, each base classifier is evaluated, and its F1 value is used as its weight in a weighted vote. The output is an ensemble classifier with the best classification performance, which completes the multi-disease classification task and thereby improves the quality of disease diagnosis decisions.
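The second stage described above can be sketched in code. This is a minimal illustration under our own assumptions (the function names, the decision-tree base learner and the number of rounds are illustrative choices, not the paper's exact algorithm): class-balanced subsamples are drawn repeatedly, a base classifier is trained on each, and each classifier's macro-F1 becomes its voting weight.

```python
import numpy as np
from collections import Counter
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import f1_score

def random_equalization_ensemble(X, y, n_rounds=10, random_state=0):
    """Train base classifiers on class-balanced random subsamples and
    weight each one by its macro-F1 on the full training set."""
    rng = np.random.default_rng(random_state)
    by_class = {c: np.where(y == c)[0] for c in np.unique(y)}
    n_min = min(len(idx) for idx in by_class.values())  # minority-class size
    members = []
    for _ in range(n_rounds):
        # draw an equal number of samples from every class (random equalization)
        sel = np.concatenate([rng.choice(idx, n_min, replace=False)
                              for idx in by_class.values()])
        clf = DecisionTreeClassifier(random_state=random_state).fit(X[sel], y[sel])
        weight = f1_score(y, clf.predict(X), average='macro')
        members.append((clf, weight))
    return members

def ensemble_predict(members, X):
    """F1-weighted vote over the ensemble members."""
    votes = [Counter() for _ in range(len(X))]
    for clf, w in members:
        for i, label in enumerate(clf.predict(X)):
            votes[i][label] += w
    return np.array([v.most_common(1)[0][0] for v in votes])
```

In practice the base learner and the number of sampling rounds would be tuned; the key point is that every base classifier sees a balanced view of an imbalanced training set.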
Overall, the main contributions of our work are:
- We propose a disease feature selection algorithm based on association rules, which can screen out feature vectors for multi-disease classification and effectively improve the quality of multi-disease classification.
- We propose an ensemble algorithm based on random equalization sampling, which can effectively improve multi-disease classification under sample imbalance.
The rest of this paper is organized as follows. Section II reviews related work on feature selection and imbalanced-data classification. Section III describes the data and its preprocessing. Section IV details the design of the proposed algorithms. Sections V and VI present and discuss the experimental settings and results, respectively. Finally, we draw conclusions in Section VII.
Related work
Feature Selection. John et al. [13] consider feature selection to be a process of reducing feature dimensionality without reducing classification accuracy. Koller et al. [17] define feature selection as choosing as small a feature subset as possible while ensuring that the resulting class distribution remains as similar as possible to the original class distribution. Dash et al. [7] gave a comprehensive overview of the feature selection problem in the field of data mining, and gave the basic
Association rules features selection
In order to reduce the size of the feature subset and improve the efficiency of feature selection without reducing classification accuracy, we propose an Association Rules Feature Selection (ARFS) algorithm (Algorithm 1). The ARFS algorithm first uses a maximum-value strategy to calculate the confidence between each feature and each category, and uses the confidence value to evaluate the correlation between the feature and the category. This correlation affects the selection of
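The snippet above is truncated, but the two ingredients it names, confidence ranking with a maximum-value strategy and sequential forward selection, can be sketched as follows. This is a hedged illustration assuming binary symptom features and a caller-supplied evaluation function, not a reproduction of the paper's Algorithm 1:

```python
import numpy as np

def confidence(feature_col, labels, label):
    """conf(feature=1 -> class=label): among samples having the feature,
    the fraction belonging to the class (support(f, c) / support(f))."""
    has_f = feature_col == 1
    return (has_f & (labels == label)).sum() / max(has_f.sum(), 1)

def rank_features_by_confidence(X, y):
    """Score each binary feature by its maximum confidence over all
    classes (maximum-value strategy), then rank in descending order."""
    classes = np.unique(y)
    scores = [max(confidence(X[:, j], y, c) for c in classes)
              for j in range(X.shape[1])]
    return np.argsort(scores)[::-1]

def sequential_forward_selection(X, y, ranked, evaluate):
    """Walk the ranked features; keep a feature only if adding it
    improves the classifier score returned by `evaluate`."""
    selected, best = [], -np.inf
    for j in ranked:
        score = evaluate(X[:, selected + [j]], y)
        if score > best:
            best, selected = score, selected + [j]
    return selected, best
```

Here `evaluate` stands in for training and scoring the downstream classifier on the candidate feature subset, which is where the classifier's performance guides the selection.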
Metrics
We use precision (P), recall (R) and the F1 value as our basic evaluation indicators, calculated as follows:

P = TP / (TP + FP),  R = TP / (TP + FN),  F1 = 2PR / (P + R)

where TP is True Positive, FP is False Positive, FN is False Negative, and TN is True Negative. To comprehensively assess the precision, recall and F-measure of a multi-class problem, we also use macro-averaging: the metrics are first calculated for each category and then averaged over all categories. The evaluation indicators are
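The macro-averaging just described can be computed directly from the TP/FP/FN counts. A small self-contained sketch (the function name is ours):

```python
import numpy as np

def macro_prf(y_true, y_pred):
    """Per-class precision/recall/F1 from TP, FP, FN counts,
    then an unweighted mean over classes (macro-averaging)."""
    classes = np.unique(np.concatenate([y_true, y_pred]))
    ps, rs, fs = [], [], []
    for c in classes:
        tp = np.sum((y_pred == c) & (y_true == c))
        fp = np.sum((y_pred == c) & (y_true != c))
        fn = np.sum((y_pred != c) & (y_true == c))
        p = tp / (tp + fp) if tp + fp else 0.0
        r = tp / (tp + fn) if tp + fn else 0.0
        f = 2 * p * r / (p + r) if p + r else 0.0
        ps.append(p); rs.append(r); fs.append(f)
    return np.mean(ps), np.mean(rs), np.mean(fs)
```

Because every class contributes equally to the mean, macro-averaging penalizes poor performance on minority classes, which is why it suits imbalanced multi-disease classification.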
Results and discussions
To verify the robustness of the algorithms, we also experimented with the unpublished diabetes dataset and with datasets from the UCI machine learning repository. For the feature selection algorithm, we chose the cmc (contraceptive method choice), monks-1, monks-2, spect-heart and spectf-heart datasets. The monks-1 and monks-2 datasets only use the test set; the spect-heart and spectf-heart datasets combine the test set and the training set for
Conclusion
First, to address the curse of feature-attribute dimensionality in diagnostic decision support systems, we propose a feature selection algorithm based on association rules. Compared with CART, ReliefF and RFE-SVM on the diabetes dataset and the UCI public datasets, the experimental results show that the proposed ARFS algorithm is superior to the baseline algorithms in both feature dimensionality and classification accuracy. Secondly, based on the sample imbalance problem
Declaration of Competing Interest
The authors declare that they have no conflict of interest with respect to this work, and no commercial or associative interest that represents a conflict of interest in connection with the submitted work.
Acknowledgments
This work is supported by the National Key Research and Development Project (No. 2019YFB2101600) and the National Natural Science Foundation of China (No. 61763031).
References (32)
- et al., Genetic algorithms combined with discriminant analysis for key variable identification, J. Process Control (2004)
- et al., Feature selection for classification, Intell. Data Anal. (1997)
- Genetic wrappers for feature selection in decision tree induction and variable ordering in Bayesian network structure learning, Inf. Sci. (2004)
- et al., Irrelevant features and the subset selection problem, Machine Learning Proceedings 1994 (1994)
- et al., A practical approach to feature selection, Machine Learning Proceedings 1992 (1992)
- et al., A novel ensemble method for classifying imbalanced data, Pattern Recognit. (2015)
- et al., Handling imbalanced data sets with synthetic boundary data generation using bootstrap re-sampling and AdaBoost techniques, Pattern Recognit. Lett. (2013)
- et al., Ensemble feature selection with the simple Bayesian classification, Information Fusion (2003)
- et al., Multiple sclerosis identification by convolutional neural network with dropout and parametric ReLU, J. Comput. Sci. (2018)
- et al., Mining association rules between sets of items in large databases, Proceedings of the 1993 ACM SIGMOD International Conference on Management of Data (1993)
- Clinical decision support systems
- Classification and regression trees
- SMOTE: synthetic minority over-sampling technique, Journal of Artificial Intelligence Research
- Research on disease prediction models based on imbalanced medical data sets, Chinese Journal of Computers
- Gene selection for cancer classification using support vector machines, Mach. Learn.