Medical data mining by fuzzy modeling with selected features
Introduction
Early detection of medical problems such as breast cancer and diabetes is important to increase the chance of successful treatment. Such detection is often formulated as a binary classification problem. Various soft computing methods have been used for the detection of a potential medical problem. This paper specifically focuses on the use of fuzzy modeling methods because of their advantage in discovering human comprehensible knowledge, which is important to the acceptance and usability of a solution derived from the model. Consider a multiple-input single-output system, in which Ai and C denote the predefined sets of fuzzy terms for the ith input space and the output, respectively. Assuming m input variables, the set of all possible rules constituting a fuzzy model may be represented by the Cartesian product R: A1 × … × Ai × … × Am × C.
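To make the size of this rule space concrete, the following minimal sketch enumerates the Cartesian product for a toy system. The term sets and labels are hypothetical placeholders chosen for illustration; they are not taken from the study.

```python
from itertools import product

# Hypothetical fuzzy term sets (assumptions for illustration only).
input_terms = [
    ["low", "medium", "high"],  # terms for input 1
    ["low", "high"],            # terms for input 2
    ["small", "large"],         # terms for input 3
]
output_terms = ["benign", "malignant"]  # terms for the output, C


def rule_space(input_terms, output_terms):
    """Enumerate every candidate rule in A1 x ... x Am x C."""
    return list(product(*input_terms, output_terms))


rules = rule_space(input_terms, output_terms)
print(len(rules))  # 3 * 2 * 2 * 2 = 24 candidate rules
print(rules[0])    # ('low', 'low', 'small', 'benign')
```

The count grows multiplicatively with each added input, which is one reason an interpretable model needs to draw on a small subset of the variables.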
Guillaume [1] defined three necessary conditions for a set of fuzzy models to be interpretable as follows:
1. The fuzzy partition must be readable, in the sense that the fuzzy sets can be interpreted as linguistic labels.
2. The set of rules must be as small as possible.
3. The If-part of the rules should be derived from a subset of independent variables rather than the full set.
Various methods have been proposed to learn fuzzy models, but not all of them produce human interpretable models. For more details, refer to the review by Liao [2]. Two methods capable of producing human interpretable models are selected for this study and are described in Section 2. For comparison, the fuzzy k-nearest neighbor algorithm, which falls under the category of lazy learning, is also used.
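As a rough illustration of the lazy-learning baseline, the sketch below implements one common formulation of fuzzy k-nearest neighbor, in which class memberships are inverse-distance-weighted votes of the k nearest training points. The excerpt does not state which variant the study used, so the crisp training labels, the fuzzifier `m`, the `eps` guard, and the toy data are all assumptions.

```python
import math


def fuzzy_knn(train_X, train_y, x, k=3, m=2.0, eps=1e-12):
    """Return a dict of class memberships for query point x.

    Assumes crisp training labels and weights each of the k nearest
    neighbors by 1 / d^(2 / (m - 1)); eps guards against zero distance.
    """
    nearest = sorted(
        (math.dist(row, x), label) for row, label in zip(train_X, train_y)
    )[:k]
    weights = [(1.0 / (d ** (2.0 / (m - 1.0)) + eps), label) for d, label in nearest]
    total = sum(w for w, _ in weights)
    memberships = {}
    for w, label in weights:
        memberships[label] = memberships.get(label, 0.0) + w / total
    return memberships


# Hypothetical two-class toy data.
train_X = [(0, 0), (0, 1), (1, 0), (5, 5), (5, 6), (6, 5)]
train_y = [0, 0, 0, 1, 1, 1]

memberships = fuzzy_knn(train_X, train_y, (4, 4), k=4)
print(memberships)
```

Unlike crisp k-nearest neighbor, the output is a graded membership per class rather than a single label, so a borderline case remains visible as such.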
For high dimensional data, as is often the case for medical data, an interpretable fuzzy model must use a small subset of features, not the full set. To this end, feature selection methods must be employed together with fuzzy modeling methods. Feature selection methods can be grouped into four categories: the first type encompasses feature selection algorithms built into adaptive systems for data analysis, such as a decision tree method; the second type consists of algorithms wrapped around predictors, providing them with subsets of features and receiving their performance feedback; the third type consists of algorithms independent of any predictor, which filter out features that are irrelevant or redundant and not very useful in data analysis; and the fourth type consists of hybrids of the filter and wrapper approaches.
The filter approach evaluates and selects feature subsets based on general characteristics of the data and some statistical analysis, without employing any learning model. The wrapper technique, on the other hand, involves a learning model and uses its performance as the evaluation criterion. The wrapper approach is known to be more accurate than the filter approach, but it is also computationally more expensive. The hybrid approach, which combines the filter and wrapper techniques, is designed to trade off accuracy against computational speed by applying a wrapper technique only to those subsets pre-selected by a filter technique. For each category, many feature selection methods have been proposed in the past. Their use together with fuzzy modeling methods, however, has not been widely studied or reported (refer to Section 5 for details). Hence, it is unclear which combination of feature selection and fuzzy modeling method performs better for a particular dataset. Selecting a feature selection and data mining algorithm is one of the important steps in the entire knowledge discovery process. Unfortunately, reports of such a study are still a rarity.
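The distinction between the three approaches can be sketched in a few lines. The example below is illustrative only: the correlation-based filter, the 1-nearest-neighbor learner, the greedy forward search, and the synthetic data are assumptions chosen for brevity, not the methods evaluated in the study.

```python
import math
import random


def pearson(xs, ys):
    """Pearson correlation; returns 0.0 when either variable is constant."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(xs, ys))
    sxx = sum((a - mx) ** 2 for a in xs)
    syy = sum((b - my) ** 2 for b in ys)
    return 0.0 if sxx == 0 or syy == 0 else sxy / math.sqrt(sxx * syy)


def filter_rank(X, y):
    """Filter: rank features by |correlation with the label|; no learner involved."""
    m = len(X[0])
    scores = [abs(pearson([row[j] for row in X], y)) for j in range(m)]
    return sorted(range(m), key=lambda j: scores[j], reverse=True)


def loo_accuracy(X, y, feats):
    """Leave-one-out accuracy of a 1-nearest-neighbor learner on the given features."""
    correct = 0
    for i in range(len(X)):
        nearest = min(
            (j for j in range(len(X)) if j != i),
            key=lambda j: sum((X[i][f] - X[j][f]) ** 2 for f in feats),
        )
        correct += y[nearest] == y[i]
    return correct / len(X)


def wrapper_forward(X, y, candidates):
    """Wrapper: greedy forward selection scored by the learner's own accuracy."""
    selected, best, remaining = [], 0.0, list(candidates)
    while remaining:
        scored = [(loo_accuracy(X, y, selected + [f]), f) for f in remaining]
        acc, f = max(scored, key=lambda t: t[0])
        if acc <= best:  # stop when no candidate improves accuracy
            break
        selected.append(f)
        remaining.remove(f)
        best = acc
    return selected, best


def hybrid(X, y, keep):
    """Hybrid: run the wrapper only on the top `keep` features from the filter."""
    return wrapper_forward(X, y, filter_rank(X, y)[:keep])


# Synthetic data (assumption): feature 0 separates the classes, features 1-4 are noise.
rng = random.Random(0)
y = [0] * 20 + [1] * 20
X = [[10 * lab + rng.uniform(-1, 1)] + [rng.random() for _ in range(4)] for lab in y]

print(filter_rank(X, y))
print(hybrid(X, y, keep=2))
```

The filter calls no learner at all, while the wrapper retrains and re-evaluates the learner for every candidate subset; the hybrid limits that cost by shrinking the candidate pool first.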
To fill this gap and shed light on the performance of different combinations of feature selection and fuzzy modeling methods, a study was carried out using two popular medical-related benchmark datasets and one industrial dataset, and the results are reported in this paper. Section 2 briefly describes each feature selection method and each fuzzy modeling method employed in this study. Section 3 presents the test results. Section 4 discusses the results and addresses other relevant issues. Related works concerning the use of fuzzy modeling in medical data mining are reviewed in Section 5. Finally, the paper is concluded.
Data mining methodologies
This section briefly describes the 12 feature selection methods and the three fuzzy modeling methods chosen for this study.
Test data and results
Two binary class medical datasets available at UCI Repository, Wisconsin breast cancer data and Pima Indians diabetes data, are used to evaluate the performance of each feature selection and fuzzy modeling method described in Section 2. In addition, an industrial dataset, specifically welding flaw data, is also used to further evaluate the performance of each method. Major characteristics of the three datasets are summarized in Table 1. For each dataset, the entire data was first randomized,
Discussion
Reporting only accuracy values can be misleading and can hide other important information, as demonstrated by Cios and Moore [15]. As a double check, seven other performance measures, including sensitivity (a.k.a. recall in the information retrieval community), specificity, precision, class weighted accuracy, F-measure, geometric mean of accuracies, and area under the receiver operating characteristic (ROC) curve, were also computed for the top-ranked result obtained for each dataset in this
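Several of these measures follow directly from the four cells of a binary confusion matrix, as the sketch below shows. The counts are hypothetical, the F-measure is assumed to be the balanced F1, and the geometric mean is assumed to be the square root of sensitivity times specificity. Class weighted accuracy and the area under the ROC curve are left out because the excerpt gives neither the weighting nor the classifier scores they require.

```python
import math


def binary_metrics(tp, fn, fp, tn):
    """Compute common measures from confusion-matrix counts.

    Assumes every denominator is nonzero (both classes present and
    at least one positive prediction).
    """
    sensitivity = tp / (tp + fn)  # recall on the positive class
    specificity = tn / (tn + fp)  # recall on the negative class
    precision = tp / (tp + fp)
    accuracy = (tp + tn) / (tp + fn + fp + tn)
    f_measure = 2 * precision * sensitivity / (precision + sensitivity)
    g_mean = math.sqrt(sensitivity * specificity)
    return {
        "accuracy": accuracy,
        "sensitivity": sensitivity,
        "specificity": specificity,
        "precision": precision,
        "f_measure": f_measure,
        "g_mean": g_mean,
    }


# Hypothetical counts, not results from the paper.
metrics = binary_metrics(tp=40, fn=10, fp=5, tn=45)
for name, value in metrics.items():
    print(f"{name}: {value:.3f}")
```

With these counts the accuracy is 0.85 while the sensitivity is only 0.80, which illustrates how a single accuracy figure can mask weaker performance on the positive class.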
Related works
In this section, we review previous works that mine (specifically, classify, in the context of this study) medical data using fuzzy modeling methods, with or without the use of a feature selection method. Belacel and Boulassel [18] developed a supervised fuzzy classification procedure, called PROAFTN, and applied it to assist the diagnosis of three clinical entities, namely acute leukaemia, astrocytic tumors, and bladder tumors. A variable neighborhood search metaheuristic was later proposed to
Conclusions
This paper has presented a study of medical data mining that involves the use of eleven feature selection methods and three fuzzy modeling methods; such methods are not all available in a commercial data mining package. The objective is to determine which combination of feature selection and fuzzy modeling method has the best performance for a given dataset.
Two medical datasets and one industrial dataset were tested with fivefold stratified cross-validation. All combinations of feature
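A stratified fivefold split can be sketched as follows. This is a minimal illustration assuming a simple shuffle-and-deal scheme per class; the function names, the seed, and the class counts are hypothetical, and the study's actual partitioning procedure is not described in this excerpt.

```python
import random
from collections import defaultdict


def stratified_folds(labels, k=5, seed=0):
    """Split sample indices into k folds that preserve class proportions."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)
    folds = [[] for _ in range(k)]
    for indices in by_class.values():
        rng.shuffle(indices)  # randomize within each class
        for pos, idx in enumerate(indices):
            folds[pos % k].append(idx)  # deal indices round-robin across folds
    return folds


def train_test_splits(labels, k=5, seed=0):
    """Yield (train_indices, test_indices), using each fold once as the test set."""
    folds = stratified_folds(labels, k, seed)
    for i, test in enumerate(folds):
        train = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        yield train, test


# Hypothetical class counts: 65 negatives and 35 positives.
labels = [0] * 65 + [1] * 35
folds = stratified_folds(labels)
for train, test in train_test_splits(labels):
    positives = sum(labels[i] for i in test)
    print(len(train), len(test), positives)
```

Each test fold here holds 20 samples with 7 positives, matching the 35% positive rate of the full set.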
References (41)
- et al. A case-based reasoning system for PCB principal process parameter identification. Expert Systems with Applications (2007).
- et al. Filter versus wrapper gene selection approaches in DNA microarray domains. Artificial Intelligence in Medicine (2004).
- et al. A fuzzy c-means variant for the generation of fuzzy term sets. Fuzzy Sets and Systems (2003).
- et al. Uniqueness of medical data mining. Artificial Intelligence in Medicine (2002).
- et al. Multicriteria fuzzy assignment method: a useful tool to assist medical diagnosis. Artificial Intelligence in Medicine (2001).
- et al. Learning multicriteria fuzzy classification method PROAFTN from data. Computers & Operations Research (2007).
- et al. An investigation of neuro-fuzzy systems in psychosomatic disorders. Expert Systems with Applications (2005).
- et al. Obtaining interpretable fuzzy classification rules from medical data. Artificial Intelligence in Medicine (1999).
- et al. Neuro-fuzzy classification of prostate cancer using NEFCLASS-J. Computers in Biology and Medicine (2007).
- et al. Fuzzy rule based classification with FeatureSelector and modified threshold accepting. European Journal of Operational Research (2000).
- Pattern classification with principal component analysis and fuzzy rule bases. European Journal of Operational Research.
- Supervised fuzzy clustering for the identification of fuzzy classifiers. Pattern Recognition Letters.
- An expert system approach based on principal component analysis and adaptive neuro-fuzzy inference system to diagnosis of diabetes disease. Digital Signal Processing.
- Similarity classifier with generalized mean applied to medical data. Computers in Biology and Medicine.
- Similarity classifier using similarity measure derived from Yu's norms in classification of medical data sets. Computers in Biology and Medicine.
- Detection of welding flaws from radiographic images with fuzzy clustering methods. Fuzzy Sets and Systems.
- Designing fuzzy inference systems from data: an interpretability-oriented review. IEEE Transactions on Fuzzy Systems.
- Mining human interpretable knowledge using automatic data-driven fuzzy modeling methods—a review.
- Feature selection based on mutual correlation.
- Toward modeling lightweight intrusion detection system through correlation-based hybrid feature selection.