Medical data mining by fuzzy modeling with selected features

https://doi.org/10.1016/j.artmed.2008.04.004

Summary

Objective

Medical data is often very high dimensional. Depending on the use, some data dimensions may be more relevant than others. In processing medical data, choosing the optimal subset of features is important, not only to reduce the processing cost but also to improve the usefulness of the model built from the selected data. This paper presents a data mining study of medical data with fuzzy modeling methods that use feature subsets selected by some indices/methods.

Methods

Specifically, three fuzzy modeling methods are employed: the fuzzy k-nearest neighbor algorithm, a fuzzy clustering-based modeling method, and the adaptive network-based fuzzy inference system. For feature selection, a total of 11 indices/methods are used. The medical data mined include the Wisconsin breast cancer dataset and the Pima Indians diabetes dataset. The classification accuracy and computational time are reported. To show how good the best performer is, the global optimum was also found by exhaustively testing all possible feature subsets with three features.

Results

For the Wisconsin breast cancer dataset, the best accuracy of 97.17% was obtained, which is only 0.25% lower than that obtained by exhaustive testing. For the Pima Indians diabetes dataset, the best accuracy of 77.65% was obtained, which is only 0.13% lower than that obtained by exhaustive testing.

Conclusion

This paper has shown that feature selection is important to mining medical data for reducing processing time and for increasing classification accuracy. However, not all combinations of feature selection and modeling methods are equally effective and the best combination is often data-dependent, as supported by the breast cancer and diabetes data analyzed in this paper.

Introduction

Early detection of medical problems such as breast cancer and diabetes is important to increase the chance of successful treatment. Such detection is often formulated as a binary classification problem. Various soft computing methods have been used for the detection of a potential medical problem. This paper specifically focuses on the use of fuzzy modeling methods because of their advantage in discovering human comprehensible knowledge, which is important to the acceptance and usability of a solution derived from the model. Consider a multiple-input single-output system, in which $A_i$ and $C$ denote the predefined sets of fuzzy terms for the $i$th input space and the output, respectively. Assuming $m$ input variables, the set of all possible rules constituting a fuzzy model may be presented by the Cartesian product: $R: A_1 \times \cdots \times A_i \times \cdots \times A_m \times C$.
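The size of the rule space defined by this Cartesian product can be illustrated with a short enumeration. The following is only a sketch: the term sets, the variable names, and the `describe` helper are hypothetical and are not taken from the paper.

```python
from itertools import product

# Hypothetical fuzzy term sets for m = 2 input variables and one output.
input_terms = [
    ["low", "high"],               # terms for input 1
    ["small", "medium", "large"],  # terms for input 2
]
output_terms = ["benign", "malignant"]  # terms for the output

# Each candidate rule is one element of the Cartesian product
# (terms of input 1) x (terms of input 2) x (terms of the output).
rules = list(product(*input_terms, output_terms))


def describe(rule):
    """Render one rule tuple as an IF-THEN sentence."""
    *antecedents, consequent = rule
    conditions = " AND ".join(
        f"x{i + 1} is {term}" for i, term in enumerate(antecedents)
    )
    return f"IF {conditions} THEN y is {consequent}"


for rule in rules[:3]:
    print(describe(rule))
print(len(rules), "candidate rules in total")
```

Because the number of candidate rules is the product of the term-set sizes, it grows multiplicatively with every added input variable, which is one reason a small feature subset helps keep a fuzzy model readable.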

Guillaume [1] defined three necessary conditions for a set of fuzzy models to be interpretable as follows:

  1. The fuzzy partition must be readable, in the sense that the fuzzy sets can be interpreted as linguistic labels.

  2. The set of rules must be as small as possible.

  3. The If-part of the rules should be derived from a subset of independent variables rather than the full set.

Various methods have been proposed to learn fuzzy models, but not all of them produce human interpretable models. For more details, see the review by Liao [2]. Two methods capable of producing human interpretable models are selected for this study and are described in Section 2. For comparison, the fuzzy k-nearest neighbor algorithm, which falls under the category of lazy learning, is also used.
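The introduction does not spell out the fuzzy k-nearest neighbor algorithm. As a rough illustration only, the sketch below assumes a common formulation in which the k nearest training samples vote with inverse-distance weights and the result is a membership degree per class rather than a single label. Crisp training labels, Euclidean distance, the fuzzifier `m`, and the function name `fuzzy_knn` are all assumptions made for this example.

```python
import math


def fuzzy_knn(train_X, train_y, x, k=3, m=2.0):
    """Return a dict of class memberships for the query point x.

    Assumes crisp training labels and m > 1. Each of the k nearest
    neighbors contributes a weight of 1 / distance ** (2 / (m - 1)).
    """
    # Sort training samples by Euclidean distance to x and keep the k nearest.
    neighbors = sorted(
        (math.dist(xi, x), yi) for xi, yi in zip(train_X, train_y)
    )[:k]
    exponent = 2.0 / (m - 1.0)
    eps = 1e-12  # guards against division by zero for coincident points
    weights = [(1.0 / (d ** exponent + eps), label) for d, label in neighbors]
    total = sum(w for w, _ in weights)
    memberships = {label: 0.0 for label in set(train_y)}
    for w, label in weights:
        memberships[label] += w / total
    return memberships


# Toy data: two well separated groups labeled 0 and 1.
train_X = [(0.0, 0.0), (0.0, 1.0), (5.0, 5.0), (5.0, 6.0)]
train_y = [0, 0, 1, 1]
memberships = fuzzy_knn(train_X, train_y, (0.2, 0.1), k=3)
predicted = max(memberships, key=memberships.get)
```

Because the method stores the training data and defers all computation to query time, it is a lazy learner and produces no explicit rule base, which is why it serves here as a comparison rather than as an interpretable model.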

For high dimensional data, as is often the case for medical data, an interpretable fuzzy model must use a small subset of features, not the full set. To this end, feature selection methods must be employed together with fuzzy modeling methods. Feature selection methods can be grouped into four categories: the first type encompasses feature selection algorithms built into adaptive systems for data analysis, such as decision tree methods; the second type comprises algorithms wrapped around predictors, providing them with subsets of features and receiving their performance feedback; the third type comprises algorithms independent of any predictor, filtering out features that are irrelevant or redundant and therefore not very useful in data analysis; and the fourth type comprises hybrids of the filter and wrapper approaches.

The filter approach evaluates and selects feature subsets based on general characteristics of the data and some statistical analysis, without employing any learning model. The wrapper technique, on the other hand, involves a learning model and uses its performance as the evaluation criterion. The wrapper approach is known to be more accurate than the filter approach, but it is also computationally more expensive. The hybrid approach, which combines the filter and wrapper techniques, is designed to trade accuracy for computational speed by applying a wrapper technique only to those subsets pre-selected by a filter technique. For each category, many feature selection methods have been proposed in the past. Their use together with fuzzy modeling methods, however, has not been studied and reported widely (refer to Section 5 for details). Hence, it is unclear which combination of feature selection and fuzzy modeling method performs better for a particular dataset. Selecting a feature selection and data mining algorithm is one of the important steps in the entire knowledge discovery process. Unfortunately, reports of such a study are still a rarity.
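The difference between the three approaches can be sketched in a few lines. This is an illustration under stated assumptions, not the indices or learners used in the paper: the filter index is the absolute Pearson correlation between a feature and the class label, the wrapped learner is a 1-nearest-neighbor classifier scored by leave-one-out accuracy, and all function names and the toy data are hypothetical.

```python
from itertools import combinations
import math


def filter_scores(X, y):
    """Filter: score each feature by |Pearson correlation| with the label."""
    n = len(y)
    mean_y = sum(y) / n
    sd_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    scores = []
    for j in range(len(X[0])):
        col = [row[j] for row in X]
        mean_x = sum(col) / n
        cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(col, y))
        sd_x = math.sqrt(sum((a - mean_x) ** 2 for a in col))
        scores.append(abs(cov / (sd_x * sd_y)) if sd_x and sd_y else 0.0)
    return scores


def loo_accuracy(X, y, subset):
    """Wrapper criterion: leave-one-out accuracy of 1-NN on the given features."""
    hits = 0
    for i, xi in enumerate(X):
        nearest = min(
            (k for k in range(len(X)) if k != i),
            key=lambda k: sum((X[k][j] - xi[j]) ** 2 for j in subset),
        )
        hits += y[nearest] == y[i]
    return hits / len(X)


def hybrid_select(X, y, keep=3, size=2):
    """Hybrid: the filter shortlists `keep` features, the wrapper picks `size`."""
    scores = filter_scores(X, y)
    shortlist = sorted(range(len(scores)), key=lambda j: -scores[j])[:keep]
    return max(
        combinations(sorted(shortlist), size),
        key=lambda s: loo_accuracy(X, y, s),
    )


# Toy data: features 0 and 2 track the label, features 1 and 3 are noise.
X = [
    (0.0, 5, 0.0, 1), (0.1, 1, 0.2, 0), (0.2, 3, 0.1, 1),
    (1.0, 2, 0.9, 0), (1.1, 4, 1.0, 1), (1.2, 0, 1.3, 0),
]
y = [0, 0, 0, 1, 1, 1]
selected = hybrid_select(X, y, keep=3, size=2)
```

The filter step costs one pass per feature, whereas the wrapper step retrains and scores the learner for every candidate subset, so shortlisting first reduces the number of expensive evaluations.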

To fill in this gap and shed light on the performance of different combinations of feature selection and fuzzy modeling methods, a study was carried out using two popular medical-related benchmark datasets and one industrial dataset, and the results are reported in this paper. Section 2 briefly describes each feature selection method and each fuzzy modeling method employed in this study. Section 3 presents the test results. Section 4 discusses the results and addresses other relevant issues. Related works concerning the use of fuzzy modeling in medical data mining are reviewed in Section 5. Finally, the paper is concluded.

Section snippets

Data mining methodologies

This section briefly describes the 12 feature selection methods and the three fuzzy modeling methods chosen for this study.

Test data and results

Two binary class medical datasets available at UCI Repository, Wisconsin breast cancer data and Pima Indians diabetes data, are used to evaluate the performance of each feature selection and fuzzy modeling method described in Section 2. In addition, an industrial dataset, specifically welding flaw data, is also used to further evaluate the performance of each method. Major characteristics of the three datasets are summarized in Table 1. For each dataset, the entire data was first randomized,

Discussion

Reporting only the accuracy values might be misleading and might fail to reveal other important information, as demonstrated by Cios and Moore [15]. To double check, seven other performance measures, including sensitivity (a.k.a. recall in the information retrieval community), specificity, precision, class weighted accuracy, F-measure, geometric mean of accuracies, and area under the receiver operating characteristic (ROC) curve, were also computed for the top rank result obtained for each dataset in this
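Several of the measures listed in this snippet follow directly from the four cells of a binary confusion matrix. The sketch below uses the standard textbook definitions with made-up counts; it leaves out class weighted accuracy and the area under the ROC curve, because the former depends on weights the snippet does not give and the latter needs ranked scores rather than counts. The function name `binary_metrics` is hypothetical.

```python
import math


def binary_metrics(tp, fn, fp, tn):
    """Compute common measures from binary confusion-matrix counts."""
    sensitivity = tp / (tp + fn)  # recall, or true positive rate
    specificity = tn / (tn + fp)  # true negative rate
    precision = tp / (tp + fp)
    accuracy = (tp + tn) / (tp + fn + fp + tn)
    f_measure = 2 * precision * sensitivity / (precision + sensitivity)
    g_mean = math.sqrt(sensitivity * specificity)
    return {
        "sensitivity": sensitivity,
        "specificity": specificity,
        "precision": precision,
        "accuracy": accuracy,
        "f_measure": f_measure,
        "g_mean": g_mean,
    }


# Made-up counts, for illustration only.
metrics = binary_metrics(tp=40, fn=10, fp=5, tn=45)
for name, value in metrics.items():
    print(f"{name}: {value:.4f}")
```

With these counts the accuracy is 0.85 while the sensitivity is only 0.80, a small example of how a single accuracy figure can hide a weaker result on the positive class.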

Related works

In this section, we review previous works that mine (specifically, classify, in the context of this study) medical data using fuzzy modeling methods, with or without the use of a feature selection method. Belacel and Boulassel [18] developed a supervised fuzzy classification procedure, called PROAFTN, and applied it to assist diagnosis of three clinical entities, namely acute leukaemia, astrocytic, and bladder tumors. Variable neighborhood search metaheuristic was later proposed to

Conclusions

This paper has presented a study of medical data mining that involves the use of eleven feature selection methods and three fuzzy modeling methods; such methods are not all available in a commercial data mining package. The objective is to determine which combination of feature selection and fuzzy modeling method has the best performance for a given dataset.

Two medical datasets and one industrial dataset were tested with fivefold stratified cross-validation. All combinations of feature
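The fivefold stratified cross-validation mentioned here can be sketched as follows. This is a minimal illustration, assuming a simple round-robin assignment within each class; the paper's exact splitting procedure is not given in this snippet, and the function names and toy labels are hypothetical.

```python
import random
from collections import defaultdict


def stratified_folds(labels, n_folds=5, seed=0):
    """Assign sample indices to folds so class proportions stay similar."""
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)
    rng = random.Random(seed)
    folds = [[] for _ in range(n_folds)]
    for indices in by_class.values():
        rng.shuffle(indices)
        # Deal indices round-robin so every fold gets a near-equal share
        # of each class.
        for pos, idx in enumerate(indices):
            folds[pos % n_folds].append(idx)
    return folds


def train_test_splits(folds):
    """Yield (train_indices, test_indices), holding out one fold at a time."""
    for i, test in enumerate(folds):
        train = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        yield train, test


# Toy labels: 10 samples of class 0 and 5 samples of class 1.
labels = [0] * 10 + [1] * 5
folds = stratified_folds(labels)
splits = list(train_test_splits(folds))
```

Stratifying keeps the class ratio of each test fold close to that of the full dataset, which matters when one class is much rarer than the other.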

References (41)

  • V. Ravi et al.

    Pattern classification with principal component analysis and fuzzy rule bases

    European Journal of Operational Research

    (2000)
  • J. Abonyi et al.

    Supervised fuzzy clustering for the identification of fuzzy classifiers

    Pattern Recognition Letters

    (2003)
  • K. Polat et al.

    An expert system approach based on principal component analysis and adaptive neuro-fuzzy inference system to diagnosis of diabetes disease

    Digital Signal Processing

    (2007)
  • P. Luukka et al.

    Similarity classifier with generalized mean applied to medical data

    Computers in Biology and Medicine

    (2006)
  • P. Luukka

    Similarity classifier using similarity measure derived from Yu's norms in classification of medical data sets

    Computers in Biology and Medicine

    (2007)
  • T.W. Liao et al.

    Detection of welding flaws from radiographic images with fuzzy clustering methods

    Fuzzy Sets and Systems

    (1999)
  • S. Guillaume

    Designing fuzzy inference systems from data: an interpretability-oriented review

    IEEE Transactions on Fuzzy Systems

    (2001)
  • T.W. Liao

    Mining human interpretable knowledge using automatic data-driven fuzzy modeling methods—a review

  • M. Haindl et al.

    Feature selection based on mutual correlation

  • J.S. Park et al.

    Toward modeling lightweight intrusion detection system through correlation-based hybrid feature selection
