Medical data mining by fuzzy modeling with selected features

doi:10.1016/j.artmed.2008.04.004

Artificial Intelligence in Medicine

Volume 43, Issue 3, July 2008, Pages 195-206

Summary

Objective

Medical data is often very high dimensional. Depending upon the use, some data dimensions might be more relevant than others. In processing medical data, choosing the optimal subset of features is important, not only to reduce the processing cost but also to improve the usefulness of the model built from the selected data. This paper presents a data mining study of medical data with fuzzy modeling methods that use feature subsets selected by some indices/methods.

Methods

Specifically, three fuzzy modeling methods are employed: the fuzzy k-nearest neighbor algorithm, a fuzzy clustering-based modeling method, and the adaptive network-based fuzzy inference system. For feature selection, a total of 11 indices/methods are used. The medical data mined include the Wisconsin breast cancer dataset and the Pima Indians diabetes dataset. The classification accuracy and computational time are reported. To show how good the best performer is, the globally optimal subset was also found by exhaustively testing all possible feature subsets with three features.

Results

For the Wisconsin breast cancer dataset, the best accuracy of 97.17% was obtained, which is only 0.25% lower than that obtained by exhaustive testing. For the Pima Indians diabetes dataset, the best accuracy of 77.65% was obtained, which is only 0.13% lower than that obtained by exhaustive testing.

Conclusion

This paper has shown that feature selection is important to mining medical data for reducing processing time and for increasing classification accuracy. However, not all combinations of feature selection and modeling methods are equally effective and the best combination is often data-dependent, as supported by the breast cancer and diabetes data analyzed in this paper.

Introduction

Early detection of medical problems such as breast cancer and diabetes is important to increase the chance of successful treatment. Such detection is often formulated as a binary classification problem. Various soft computing methods have been used for the detection of a potential medical problem. This paper specifically focuses on the use of fuzzy modeling methods because of their advantage in discovering human comprehensible knowledge, which is important to the acceptance and usability of a solution derived from the model. Consider a multiple-input single-output system, in which A_i and C denote the predefined sets of fuzzy terms for the ith input space and the output, respectively. Assuming m input variables, the set of all possible rules constituting a fuzzy model may be represented by the Cartesian product R: A_1 × … × A_i × … × A_m × C.
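To make the size of this rule space concrete, the following minimal sketch enumerates the Cartesian product for a toy system with m = 3 inputs. The term labels (`low`, `medium`, `high`) and the output classes are illustrative assumptions and are not taken from the paper.

```python
from itertools import product
from math import prod

# Hypothetical fuzzy term sets for m = 3 inputs (A_1, A_2, A_3) and the output C.
input_terms = [
    ["low", "medium", "high"],  # A_1
    ["low", "medium", "high"],  # A_2
    ["low", "high"],            # A_3
]
output_terms = ["benign", "malignant"]  # C

# Every candidate rule is one element of A_1 x A_2 x A_3 x C.
rules = list(product(*input_terms, output_terms))

# The number of candidate rules is the product of the term-set sizes,
# so it grows multiplicatively with every input variable added.
rule_count = prod(len(terms) for terms in input_terms) * len(output_terms)

print(len(rules), rule_count)
print(rules[0])
```

Because each extra input multiplies the count by the size of its term set, restricting the If-part to a small subset of features keeps the rule base small enough to read.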

Guillaume [1] defined three necessary conditions for a set of fuzzy models to be interpretable as follows:

1. The fuzzy partition must be readable, in the sense that the fuzzy sets can be interpreted as linguistic labels.
2. The set of rules must be as small as possible.
3. The If-part of the rules should be derived from a subset of independent variables rather than the full set.

Various methods have been proposed to learn fuzzy models, but not all of them produce human interpretable models. For more details, refer to the review by Liao [2]. Two methods capable of producing human interpretable models are selected for this study and are described in Section 2. For comparison, the fuzzy k-nearest neighbor algorithm, which falls under the category of lazy learning, is also used.
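As a rough illustration of the lazy-learning baseline, the sketch below computes class memberships with a fuzzy k-nearest neighbor rule in its commonly cited inverse-distance form (usually attributed to Keller et al.), assuming crisp training labels. The exact variant used in the paper is not shown in this excerpt, and the function name, the toy data, and the defaults `k` and `m` are assumptions.

```python
import math

def fuzzy_knn_memberships(train_x, train_y, query, k=3, m=2.0):
    """Return {class: membership} for `query` using inverse-distance weighting."""
    # Keep the k training points closest to the query (Euclidean distance).
    neighbors = sorted(
        (math.dist(x, query), y) for x, y in zip(train_x, train_y)
    )[:k]
    exponent = 2.0 / (m - 1.0)  # m > 1 is the fuzzifier
    eps = 1e-12  # avoids division by zero if the query equals a training point
    weights = [(1.0 / (d ** exponent + eps), y) for d, y in neighbors]
    total = sum(w for w, _ in weights)
    # Each neighbor contributes its normalized weight to its own class.
    memberships = {label: 0.0 for label in set(train_y)}
    for w, y in weights:
        memberships[y] += w / total
    return memberships

# Toy two-class data (made up for illustration).
train_x = [(0, 0), (0, 1), (1, 0), (5, 5), (5, 6), (6, 5)]
train_y = [0, 0, 0, 1, 1, 1]
memberships = fuzzy_knn_memberships(train_x, train_y, query=(2, 2), k=4)
print(memberships)
```

Unlike crisp k-NN, the output is a graded membership per class, so a borderline case is visible as such instead of being hidden behind a majority vote.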

For high dimensional data, as is often the case for medical data, an interpretable fuzzy model must use a small subset of features, not the full set. To this end, feature selection methods must be employed together with fuzzy modeling methods. Feature selection methods can be grouped into four categories: the first type encompasses feature selection algorithms built into adaptive systems for data analysis, such as a decision tree method; the second type comprises algorithms wrapped around predictors, providing them with subsets of features and receiving their performance feedback; the third type comprises algorithms independent of any predictor, filtering out features that are irrelevant or redundant and not very useful in data analysis; and the fourth type comprises hybrids of the filter and wrapper approaches.

The filter approach evaluates and selects feature subsets based on general characteristics of the data and some statistical analysis, without employing any learning model. The wrapper technique, on the other hand, involves a learning model and uses its performance as the evaluation criterion. The wrapper approach is generally more accurate than the filter approach, but it is also computationally more expensive. The hybrid approach, which combines the filter and wrapper techniques, is designed to trade accuracy against computational speed by applying a wrapper technique only to those subsets pre-selected by a filter technique. For each category, many feature selection methods have been proposed in the past. Their use together with fuzzy modeling methods, however, has not been widely studied and reported (refer to Section 5 for details). Hence, it is unclear which combination of feature selection and fuzzy modeling method performs better for a particular dataset. Selecting a feature selection and data mining algorithm is one of the important steps in the entire knowledge discovery process. Unfortunately, reports of such a study are still a rarity.
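The distinction between the three approaches can be sketched as follows. This is a minimal illustration under stated assumptions, not the paper's procedure: absolute Pearson correlation with the class label stands in for the filter index, leave-one-out accuracy of a plain 1-nearest-neighbor classifier stands in for the learning model, and the data and all function names are made up. The wrapper here searches subsets exhaustively, in the spirit of the exhaustive three-feature test mentioned in the summary.

```python
import math
from itertools import combinations
from statistics import mean, pstdev

def filter_scores(X, y):
    """Filter: score each feature by |Pearson correlation| with the label; no model involved."""
    my, sy = mean(y), pstdev(y)
    scores = []
    for j in range(len(X[0])):
        col = [row[j] for row in X]
        mx, sx = mean(col), pstdev(col)
        cov = mean((a - mx) * (b - my) for a, b in zip(col, y))
        scores.append(abs(cov / (sx * sy)) if sx and sy else 0.0)
    return scores

def loo_1nn_accuracy(X, y, subset):
    """Wrapper criterion: leave-one-out accuracy of a 1-NN classifier on `subset`."""
    hits = 0
    for i, row in enumerate(X):
        q = [row[j] for j in subset]
        nearest = min(
            (k for k in range(len(X)) if k != i),
            key=lambda k: math.dist(q, [X[k][j] for j in subset]),
        )
        hits += y[nearest] == y[i]
    return hits / len(X)

def wrapper_select(X, y, candidates, size):
    """Wrapper: evaluate every `size`-feature subset of `candidates` with the model."""
    return max(combinations(candidates, size),
               key=lambda s: loo_1nn_accuracy(X, y, s))

def hybrid_select(X, y, keep, size):
    """Hybrid: the filter keeps the `keep` best features, then the wrapper searches only those."""
    scores = filter_scores(X, y)
    top = sorted(range(len(scores)), key=lambda j: -scores[j])[:keep]
    return wrapper_select(X, y, sorted(top), size)

# Toy data: features 0 and 3 separate the classes, features 1 and 2 are noise.
X = [
    [1.0, 5.0, 0.3, 2.0], [1.2, 4.0, 0.9, 2.1],
    [0.8, 6.0, 0.1, 1.9], [1.1, 5.5, 0.7, 2.2],
    [3.0, 5.2, 0.2, 4.0], [3.2, 4.1, 0.8, 4.1],
    [2.9, 5.9, 0.4, 3.9], [3.1, 5.4, 0.6, 4.2],
]
y = [0, 0, 0, 0, 1, 1, 1, 1]

scores = filter_scores(X, y)
wrapper_pick = wrapper_select(X, y, range(4), 2)  # searches all 6 pairs
hybrid_pick = hybrid_select(X, y, keep=2, size=2)  # searches 1 pair
print(scores, wrapper_pick, hybrid_pick)
```

The cost difference is visible in the number of model evaluations: the wrapper trains and tests once per candidate subset, whereas the hybrid spends that effort only on subsets built from the filter's short list.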

To fill this gap and shed light on the performance of different combinations of feature selection and fuzzy modeling methods, a study was carried out using two popular medical-related benchmark datasets and one industrial dataset, and the results are reported in this paper. Section 2 briefly describes each feature selection method and each fuzzy modeling method employed in this study. Section 3 presents the test results. Section 4 discusses the results and addresses other relevant issues. Related works concerning the use of fuzzy modeling in medical data mining are reviewed in Section 5. Finally, the paper is concluded.

Section snippets

Data mining methodologies

This section briefly describes the 11 feature selection methods and the three fuzzy modeling methods chosen for this study.

Test data and results

Two binary-class medical datasets available at the UCI Repository, the Wisconsin breast cancer data and the Pima Indians diabetes data, are used to evaluate the performance of each feature selection and fuzzy modeling method described in Section 2. In addition, an industrial dataset, specifically welding flaw data, is also used to further evaluate the performance of each method. Major characteristics of the three datasets are summarized in Table 1. For each dataset, the entire data was first randomized,

Discussion

Reporting only the accuracy values might be misleading and might fail to reveal other important information, as demonstrated by Cios and Moore [15]. To double-check, seven other performance measures, including sensitivity (a.k.a. recall in the information retrieval community), specificity, precision, class weighted accuracy, F-measure, geometric mean of accuracies, and area under the receiver operating characteristics (ROC) curve, were also computed for the top rank result obtained for each dataset in this
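Several of the measures named above follow directly from a binary confusion matrix, as the sketch below shows. The counts are made up for illustration, the F-measure is assumed to be the balanced F1 score, and the geometric mean of accuracies is assumed to be the square root of sensitivity times specificity. Class weighted accuracy and the area under the ROC curve are left out because their exact definitions and the required classifier scores do not appear in this excerpt.

```python
import math

def binary_metrics(tp, fn, fp, tn):
    """Compute threshold-based measures from confusion-matrix counts."""
    sensitivity = tp / (tp + fn)   # recall: share of positives detected
    specificity = tn / (tn + fp)   # share of negatives correctly rejected
    precision = tp / (tp + fp)     # share of positive calls that are correct
    accuracy = (tp + tn) / (tp + fn + fp + tn)
    # Balanced F-measure: harmonic mean of precision and sensitivity.
    f_measure = 2 * precision * sensitivity / (precision + sensitivity)
    # Geometric mean of the two per-class accuracies.
    g_mean = math.sqrt(sensitivity * specificity)
    return {
        "accuracy": accuracy,
        "sensitivity": sensitivity,
        "specificity": specificity,
        "precision": precision,
        "f_measure": f_measure,
        "g_mean": g_mean,
    }

# Hypothetical counts: 50 positive and 50 negative cases.
metrics = binary_metrics(tp=40, fn=10, fp=5, tn=45)
for name, value in metrics.items():
    print(f"{name}: {value:.4f}")
```

With these counts the accuracy is 0.85 while the sensitivity is only 0.80, which is the kind of gap a single accuracy figure would hide in a diagnostic setting.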

Related works

In this section, we review previous works that mine (specifically, classify, in the context of this study) medical data using some fuzzy modeling methods, with or without the use of some feature selection method. Belacel and Boulassel [18] developed a supervised fuzzy classification procedure, called PROAFTN, and applied it to assist diagnosis of three clinical entities, namely acute leukaemia, astrocytic, and bladder tumors. Variable neighborhood search metaheuristic was later proposed to

Conclusions

This paper has presented a study of medical data mining that involves the use of eleven feature selection methods and three fuzzy modeling methods; such methods are not all available in a commercial data mining package. The objective is to determine which combination of feature selection and fuzzy modeling method has the best performance for a given dataset.

Two medical datasets and one industrial dataset were tested with fivefold stratified cross-validation. All combinations of feature

References (41)

C.Y. Tsai et al. A case-based reasoning system for PCB principal process parameter identification. Expert Systems with Applications (2007)
I. Inza et al. Filter versus wrapper gene selection approaches in DNA microarray domains. Artificial Intelligence in Medicine (2004)
T.W. Liao et al. A fuzzy c-means variant for the generation of fuzzy term sets. Fuzzy Sets and Systems (2003)
K.J. Cios et al. Uniqueness of medical data mining. Artificial Intelligence in Medicine (2002)
N. Belacel et al. Multicriteria fuzzy assignment method: a useful tool to assist medical diagnosis. Artificial Intelligence in Medicine (2001)
N. Belacel et al. Learning multicriteria fuzzy classification method PROAFTN from data. Computers & Operations Research (2007)
P. Aruna et al. An investigation of neuro-fuzzy systems in psychosomatic disorders. Expert Systems with Applications (2005)
D. Nauck et al. Obtaining interpretable fuzzy classification rules from medical data. Artificial Intelligence in Medicine (1999)
A. Keles et al. Neuro-fuzzy classification of prostate cancer using NEFCLASS-J. Computers in Biology and Medicine (2007)
V. Ravi et al. Fuzzy rule based classification with FeatureSelector and modified threshold accepting. European Journal of Operational Research (2000)
V. Ravi et al. Pattern classification with principal component analysis and fuzzy rule bases. European Journal of Operational Research (2000)
J. Abonyi et al. Supervised fuzzy clustering for the identification of fuzzy classifiers. Pattern Recognition Letters (2003)
K. Polat et al. An expert system approach based on principal component analysis and adaptive neuro-fuzzy inference system to diagnosis of diabetes disease. Digital Signal Processing (2007)
P. Luukka et al. Similarity classifier with generalized mean applied to medical data. Computers in Biology and Medicine (2006)
P. Luukka. Similarity classifier using similarity measure derived from Yu's norms in classification of medical data sets. Computers in Biology and Medicine (2007)
T.W. Liao et al. Detection of welding flaws from radiographic images with fuzzy clustering methods. Fuzzy Sets and Systems (1999)
S. Guillaume. Designing fuzzy inference systems from data: an interpretability-oriented review. IEEE Transactions on Fuzzy Systems (2001)
T.W. Liao. Mining human interpretable knowledge using automatic data-driven fuzzy modeling methods—a review
M. Haindl et al. Feature selection based on mutual correlation
J.S. Park et al. Toward modeling lightweight intrusion detection system through correlation-based hybrid feature selection
