Sample imbalance disease classification model based on association rule feature selection
Introduction
Diagnostic decision support systems, as an application branch of the Clinical Decision Support System (CDSS) [2], have long been a highlight of research and application. Machine learning is widely used in medical information mining [11] and in diagnostic research [12], [20], [28], [31], [32]. According to the basis of its decision support, a CDSS falls into two categories: knowledge-based and non-knowledge-based. A knowledge-based CDSS is essentially an expert system in the medical field: it encodes a large amount of medical experts' knowledge and experience and usually consists of a human-machine interface, a knowledge base and an inference engine. Unlike a knowledge-based CDSS, a non-knowledge-based CDSS is integrated with the Electronic Medical Record (EMR) system, and its decision support rests on machine learning or other statistical pattern-recognition algorithms. Applying association rule mining, classification, regression and similar techniques to EMR datasets can continuously discover new knowledge that helps doctors make better decisions during disease diagnosis. With the continuous development of data mining, pattern recognition and machine learning algorithms, and the continuous accumulation of electronic medical record data, machine-learning-based CDSS research has gradually become the hotspot and mainstream direction of current diagnostic decision support systems.
Two issues remain in current research. First, it is difficult to construct a diagnostic model directly for medical conditions with few training samples. Moreover, as the diagnostic capability of a model improves, the required features keep expanding, inflating the dimensionality of the feature matrix; this introduces more redundant and uncorrelated features, excessive computation, sparse training samples and overfitting, all of which ultimately degrade the classification quality of the classifier. In a disease diagnosis model, when the number of diagnosable disease categories is small and the classification accuracy is low, the diagnostic decision model loses practical value. At the same time, electronic medical records contain a large number of feature attributes, and different diseases call for different feature subsets; how to select the optimal features is the main difficulty in improving the multi-class disease classification task. Second, most disease datasets exhibit an imbalanced class distribution [5]. In multi-class disease classification, sample imbalance is a key limiting factor: for some specific diseases the number of available samples is relatively small, which leads to sparse training samples during classification and hurts both the accuracy and the generalization performance of the multi-classification task.
First, we mined and analyzed the association rules between diabetes, its complications and symptoms in the EMR dataset. Then we ranked the symptom feature attributes by the confidence of the disease-symptom two-item frequent sets and applied sequential forward selection, using the classification performance of the classifier to guide the selection; this overcomes the heavy computation, training-sample sparseness and overfitting caused by the curse of dimensionality in the feature matrix. Finally, to handle the sample imbalance of the EMR dataset, the training set is partitioned into subsets by category and represented with the feature vectors selected in the previous stage. Random equalization sampling is then performed in each iteration of training: a base classifier is trained on each balanced sample, each base classifier is evaluated, and its F1 value is used as its weight in a weighted vote. The output is an ensemble classifier with the best classification performance, which completes the multi-disease classification task and thereby improves the quality of disease diagnosis decisions.
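The second stage described above can be sketched in code. This is a minimal illustration under our own assumptions (the function names, the decision-tree base learner and the number of rounds are illustrative choices, not the paper's exact algorithm): class-balanced subsamples are drawn repeatedly, a base classifier is trained on each, and each classifier's macro-F1 becomes its voting weight.

```python
import numpy as np
from collections import Counter
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import f1_score

def random_equalization_ensemble(X, y, n_rounds=10, random_state=0):
    """Train base classifiers on class-balanced random subsamples and
    weight each one by its macro-F1 on the full training set."""
    rng = np.random.default_rng(random_state)
    by_class = {c: np.where(y == c)[0] for c in np.unique(y)}
    n_min = min(len(idx) for idx in by_class.values())  # minority-class size
    members = []
    for _ in range(n_rounds):
        # draw an equal number of samples from every class (random equalization)
        sel = np.concatenate([rng.choice(idx, n_min, replace=False)
                              for idx in by_class.values()])
        clf = DecisionTreeClassifier(random_state=random_state).fit(X[sel], y[sel])
        weight = f1_score(y, clf.predict(X), average='macro')
        members.append((clf, weight))
    return members

def ensemble_predict(members, X):
    """F1-weighted vote over the ensemble members."""
    votes = [Counter() for _ in range(len(X))]
    for clf, w in members:
        for i, label in enumerate(clf.predict(X)):
            votes[i][label] += w
    return np.array([v.most_common(1)[0][0] for v in votes])
```

In practice the base learner and the number of sampling rounds would be tuned; the key point is that every base classifier sees a balanced view of an imbalanced training set.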
Overall, the main contributions of our work are:
- We propose a disease feature selection algorithm based on association rules, which can screen out feature vectors for multi-disease classification and effectively improve the quality of multi-disease classification.
- We propose an ensemble algorithm based on random equalization sampling, which can effectively improve multi-disease classification under sample imbalance.
The rest of this paper is organized as follows. Section II reviews related work on feature selection and imbalanced-data classification. Section III describes the data and its preprocessing. Section IV details the design of the proposed algorithms. Sections V and VI present and discuss the experimental settings and results, respectively. Finally, we draw conclusions in Section VII.
Related work
Feature Selection. John et al. [13] consider feature selection to be a process of reducing feature dimensionality without reducing classification accuracy. Koller et al. [17] define feature selection as choosing as small a feature subset as possible while ensuring that the resulting class distribution remains as similar as possible to the original class distribution. Dash et al. [7] gave a comprehensive overview of the feature selection problem in the field of data mining, and gave the basic
Association rules features selection
In order to reduce the size of the feature subset and improve the efficiency of feature selection without reducing classification accuracy, we propose an Association Rules Feature Selection (ARFS) algorithm (Algorithm 1). The ARFS algorithm first uses a maximum-value strategy to calculate the confidence between each feature and each category, and uses the confidence value to evaluate the correlation between the feature and the category. This correlation affects the selection of
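The snippet above is truncated, but the two ingredients it names, confidence ranking with a maximum-value strategy and sequential forward selection, can be sketched as follows. This is a hedged illustration assuming binary symptom features and a caller-supplied evaluation function, not a reproduction of the paper's Algorithm 1:

```python
import numpy as np

def confidence(feature_col, labels, label):
    """conf(feature=1 -> class=label): among samples having the feature,
    the fraction belonging to the class (support(f, c) / support(f))."""
    has_f = feature_col == 1
    return (has_f & (labels == label)).sum() / max(has_f.sum(), 1)

def rank_features_by_confidence(X, y):
    """Score each binary feature by its maximum confidence over all
    classes (maximum-value strategy), then rank in descending order."""
    classes = np.unique(y)
    scores = [max(confidence(X[:, j], y, c) for c in classes)
              for j in range(X.shape[1])]
    return np.argsort(scores)[::-1]

def sequential_forward_selection(X, y, ranked, evaluate):
    """Walk the ranked features; keep a feature only if adding it
    improves the classifier score returned by `evaluate`."""
    selected, best = [], -np.inf
    for j in ranked:
        score = evaluate(X[:, selected + [j]], y)
        if score > best:
            best, selected = score, selected + [j]
    return selected, best
```

Here `evaluate` stands in for training and scoring the downstream classifier on the candidate feature subset, which is where the classifier's performance guides the selection.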
Metrics
We use precision (P), recall (R) and the F1 value as our basic evaluation indicators, calculated as follows:

P = TP / (TP + FP),  R = TP / (TP + FN),  F1 = 2PR / (P + R)

where TP is True Positive, FP is False Positive, FN is False Negative, and TN is True Negative. To comprehensively assess the precision, recall and F-measure of a multi-class problem, we also use macro-averaging: the metrics are first calculated for each category and then averaged over all categories. The evaluation indicators are
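The macro-averaging just described can be computed directly from the TP/FP/FN counts. A small self-contained sketch (the function name is ours):

```python
import numpy as np

def macro_prf(y_true, y_pred):
    """Per-class precision/recall/F1 from TP, FP, FN counts,
    then an unweighted mean over classes (macro-averaging)."""
    classes = np.unique(np.concatenate([y_true, y_pred]))
    ps, rs, fs = [], [], []
    for c in classes:
        tp = np.sum((y_pred == c) & (y_true == c))
        fp = np.sum((y_pred == c) & (y_true != c))
        fn = np.sum((y_pred != c) & (y_true == c))
        p = tp / (tp + fp) if tp + fp else 0.0
        r = tp / (tp + fn) if tp + fn else 0.0
        f = 2 * p * r / (p + r) if p + r else 0.0
        ps.append(p); rs.append(r); fs.append(f)
    return np.mean(ps), np.mean(rs), np.mean(fs)
```

Because every class contributes equally to the mean, macro-averaging penalizes poor performance on minority classes, which is why it suits imbalanced multi-disease classification.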
Results and discussions
To verify the robustness of the algorithms, we also experimented with the unpublished diabetes dataset and with datasets from the UCI machine learning repository. For the feature selection algorithm, we chose the cmc (contraceptive method choice), monks-1, monks-2, spect-heart and spectf-heart datasets. The monks-1 and monks-2 datasets only use the test set; the spect-heart and spectf-heart datasets combine the test set and the training set for
Conclusion
First, to address the curse of feature-attribute dimensionality in diagnostic decision support systems, we propose a feature selection algorithm based on association rules. Compared with CART, ReliefF and RFE-SVM on the diabetes dataset and the UCI public datasets, the experimental results show that the proposed ARFS algorithm is superior to the baseline algorithms in both feature dimensionality and classification accuracy. Secondly, based on the sample imbalance problem
Declaration of Competing Interest
The authors declare that they have no conflict of interest with respect to this work, and no commercial or associative interest that represents a conflict of interest in connection with the submitted work.
Acknowledgments
This work is supported by the National Key Research and Development Project (No. 2019YFB2101600) and the National Natural Science Foundation of China (No. 61763031).
References (32)
- et al., Genetic algorithms combined with discriminant analysis for key variable identification, J. Process Control (2004)
- et al., Feature selection for classification, Intell. Data Anal. (1997)
- Genetic wrappers for feature selection in decision tree induction and variable ordering in Bayesian network structure learning, Inf. Sci. (2004)
- et al., Irrelevant features and the subset selection problem, Machine Learning Proceedings 1994 (1994)
- et al., A practical approach to feature selection, Machine Learning Proceedings 1992 (1992)
- et al., A novel ensemble method for classifying imbalanced data, Pattern Recognit. (2015)
- et al., Handling imbalanced data sets with synthetic boundary data generation using bootstrap re-sampling and AdaBoost techniques, Pattern Recognit. Lett. (2013)
- et al., Ensemble feature selection with the simple Bayesian classification, Information Fusion (2003)
- et al., Multiple sclerosis identification by convolutional neural network with dropout and parametric ReLU, J. Comput. Sci. (2018)
- et al., Mining association rules between sets of items in large databases, Proceedings of the 1993 ACM SIGMOD International Conference on Management of Data (1993)
- Clinical decision support systems
- Classification and regression trees
- SMOTE: synthetic minority over-sampling technique, Journal of Artificial Intelligence Research
- Research on disease prediction models based on imbalanced medical data sets, Chinese Journal of Computers
- Gene selection for cancer classification using support vector machines, Mach. Learn.