HMV: A medical decision support framework using multi-layer classifiers for disease prediction

https://doi.org/10.1016/j.jocs.2016.01.001

Highlights

  • Extensive research has been conducted on disease prediction.

  • However, there is no agreement on which classifier produces the best results.

  • An optimal combination of classifiers is presented with multi-layer classification.

  • The ensemble approach uses bagging with multi-objective optimized weighted voting.

  • Comparison with existing techniques shows superiority of our ensemble.

Abstract

Decision support is a crucial function for decision makers in many industries. Typically, Decision Support Systems (DSS) help decision makers to gather and interpret information and build a foundation for decision-making. Medical Decision Support Systems (MDSS) play an increasingly important role in medical practice. By assisting doctors in making clinical decisions, DSS are expected to improve the quality of medical care. Conventional clinical decision support systems are based on individual classifiers or a simple combination of these classifiers, which tend to show moderate performance. In this research, a multi-layer classifier ensemble framework is proposed based on the optimal combination of heterogeneous classifiers. The proposed model, named “HMV”, overcomes the performance bottlenecks of conventional approaches by utilizing an ensemble of seven heterogeneous classifiers. The framework is evaluated on two heart disease datasets, two breast cancer datasets, two diabetes datasets, two liver disease datasets, one Parkinson's disease dataset and one hepatitis dataset obtained from public repositories. The effectiveness of the proposed ensemble is investigated by comparing its results with several well-known classifiers as well as ensemble techniques. The experimental evaluation shows that the proposed framework handles all types of attributes and achieves high diagnosis accuracy. A case study based on a real-time medical dataset is also presented in order to show the high performance and effectiveness of the proposed model.

Introduction

Data mining in the medical domain is the process of discovering hidden patterns and information in large medical datasets, analyzing them, and using them for disease prediction [1]. The basic goal of the data mining process is to extract hidden information from medical datasets and transform it into an understandable structure for future use [2]. Data mining techniques make it possible to develop a large number of predictive models for classification and prediction tasks. After knowledge has been discovered from the data, the learning phase starts, in which a scientific model is built. This learning method leads to the concept of machine learning, which can be formally defined as “the complex computation process of automatic pattern recognition and intelligent decision making based on training sample data” [3]. Machine learning classifiers are further categorized into supervised and unsupervised learning, depending on the availability of labeled data. In supervised learning, labeled training data is available and a learning model is trained on it. Examples include Artificial Neural Networks (ANN), Support Vector Machines (SVM), and Decision Trees (DT). In unsupervised learning, the sample data has no class label field. Examples include K-means clustering and the Self-Organizing Map (SOM). An ensemble approach combines the results of individual classifiers and thereby performs better than individual machine learning techniques [4], [5]. Multiple techniques can be used to construct the ensemble model, and each results in a different diagnosis accuracy. The most common ensemble approaches are bagging [6], boosting [7] and stacking [8].
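
To make the ensemble idea concrete, the following is a minimal sketch of bagging in plain Python. It is an illustration only and not the authors' implementation: the nearest-centroid base learner, the function names and the six 1-D data points are all assumptions introduced here. Each base learner is trained on a bootstrap sample (drawn with replacement), and the ensemble prediction is the most common vote.

```python
import random
from collections import Counter

def bootstrap_sample(rows, rng):
    """Draw len(rows) rows with replacement (one bootstrap replicate)."""
    return [rng.choice(rows) for _ in rows]

def train_nearest_centroid(rows):
    """Toy base learner on 1-D data: predict the class with the closest mean."""
    groups = {}
    for x, label in rows:
        groups.setdefault(label, []).append(x)
    centroids = {label: sum(xs) / len(xs) for label, xs in groups.items()}
    return lambda x: min(centroids, key=lambda label: abs(x - centroids[label]))

def bagging_predict(models, x):
    """Aggregate the base learners' predictions by plurality vote."""
    return Counter(model(x) for model in models).most_common(1)[0][0]

# Hypothetical 1-D measurements with labels (not taken from the paper).
data = [(0.1, "healthy"), (0.2, "healthy"), (0.3, "healthy"),
        (0.7, "sick"), (0.8, "sick"), (0.9, "sick")]

rng = random.Random(0)  # fixed seed so the sketch is reproducible
models = [train_nearest_centroid(bootstrap_sample(data, rng)) for _ in range(7)]
prediction = bagging_predict(models, 0.25)
print(prediction)
```

Boosting and stacking differ in how the base learners are trained and combined; only the bootstrap-and-vote pattern of bagging is sketched here.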

A significant amount of work has already been done on disease classification and prediction. However, no single methodology shows the highest performance for all datasets or diseases: while one classifier performs well on a given dataset, another approach outperforms it on a different dataset or disease. The proposed research focuses on a novel combination of heterogeneous classifiers for disease classification and prediction, thus overcoming the limitations of individual classifiers. The combination consists of Naïve Bayes, Linear Regression, Quadratic Discriminant Analysis, K-Nearest Neighbor, Support Vector Machine, a Decision Tree using Information Gain and a Decision Tree using the Gini Index. The classifiers are used at multiple layers to further enhance disease prediction accuracy. An application based on the proposed HMV ensemble framework has also been developed for disease prediction. It can help both doctors and patients in terms of data management and disease prediction.
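
The two decision tree variants named above differ in the impurity measure used to score candidate splits. The sketch below shows the standard definitions of entropy (on which information gain is based) and the Gini index. The label lists and function names are hypothetical and serve only to illustrate the arithmetic.

```python
from collections import Counter
from math import log2

def entropy(labels):
    """Shannon entropy in bits: -sum(p * log2(p)) over the class proportions."""
    n = len(labels)
    return -sum((c / n) * log2(c / n) for c in Counter(labels).values())

def gini(labels):
    """Gini index: 1 - sum(p^2) over the class proportions."""
    n = len(labels)
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())

def split_gain(parent, children, impurity):
    """Impurity of the parent minus the size-weighted impurity of the children."""
    n = len(parent)
    weighted = sum(len(child) / n * impurity(child) for child in children)
    return impurity(parent) - weighted

# Hypothetical split of eight cases into two branches.
parent = ["sick"] * 4 + ["healthy"] * 4
left = ["sick"] * 3 + ["healthy"] * 1
right = ["sick"] * 1 + ["healthy"] * 3

info_gain = split_gain(parent, [left, right], entropy)  # information gain
gini_gain = split_gain(parent, [left, right], gini)     # Gini-based gain
print(round(info_gain, 4), round(gini_gain, 4))
```

Both measures reward splits that make the branches purer, but they can rank candidate splits differently, which is one reason the two trees can behave as distinct ensemble members.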

The rest of the paper is organized as follows: Section 2 reviews the literature. Section 3 presents the proposed ensemble framework. Section 4 provides the results and discussion of the experiments carried out. Section 5 presents a case study of the proposed ensemble model, whereas Section 6 contains the discussion. The medical application for disease diagnosis is detailed in Section 7, and the conclusion is provided in Section 8.

Literature review

An extensive amount of work has already been done on disease classification and prediction. However, most of the literature has focused on using a single classifier for a specific disease.

Pattekari and Parveen [9] presented a heart disease prediction system based on the Naïve Bayes algorithm to predict the hidden patterns in a given dataset. The proposed technique is limited to categorical data and uses only a single classifier. Other data mining techniques such as ensembles, time series,

HMV ensemble framework

The proposed ensemble framework consists of three modules: data acquisition and preprocessing, classifier training, and the HMV (Hierarchical Majority Voting) ensemble model, which performs disease classification and prediction with a three-layered approach.
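
As an illustration of hierarchical majority voting, the sketch below votes layer by layer and passes each layer's winner on as an extra vote in the next layer. The grouping of seven classifiers into layers of three, two and two is an assumption made here for illustration, since this excerpt does not specify how the layers are composed. The function names and votes are likewise hypothetical.

```python
from collections import Counter

def majority_vote(votes):
    """Return the most common label among the votes."""
    return Counter(votes).most_common(1)[0][0]

def hierarchical_vote(layers):
    """Vote layer by layer; each layer's winner joins the next layer's ballot."""
    carried = None
    for votes in layers:
        ballot = list(votes) if carried is None else [carried, *votes]
        carried = majority_vote(ballot)
    return carried

# Hypothetical predictions from seven classifiers for a single patient.
layers = [
    ["sick", "sick", "sick"],  # layer 1: three classifiers
    ["sick", "healthy"],       # layer 2: two classifiers + layer-1 winner
    ["healthy", "healthy"],    # layer 3: two classifiers + layer-2 winner
]

final = hierarchical_vote(layers)
flat = majority_vote([vote for layer in layers for vote in layer])
print(final, flat)  # prints "healthy sick"
```

In this example the flat vote over all seven predictions yields "sick" (four votes to three), while the layered vote yields "healthy", because classifiers in later layers carry more weight than those whose votes are summarized earlier.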

Datasets description

The experimental evaluation of the HMV ensemble framework is performed on two heart disease datasets, two breast cancer datasets, two diabetes datasets, two liver disease datasets, one hepatitis dataset and one Parkinson's disease dataset. Each dataset contains a diverse set of attributes that are ultimately used to determine the disease classification and prediction, such as healthy or sick. The two heart disease datasets (Cleveland heart disease, Statlog) are taken from the UCI data repository [36]. One

Case study: real-time implementation of the proposed framework

The proposed DSS is evaluated on a real-time dataset of blood CP taken from the Pakistan Institute of Medical Sciences (PIMS) hospital. PIMS is located in Islamabad, Pakistan, and opened in 1985. The hospital provides patient health care and medical facilities, and also trains doctors and other health workers in the fields of medicine and surgery. It consists of multiple departments such as Cardiology, Dental, Urology, Dermatology, Blood bank,

Discussion

We have used a heterogeneous classifier ensemble model that combines entirely different types of classifiers and thereby achieves a higher level of diversity. Decision Trees, QDA, LR, SVM and Bayesian classification are eager evaluation methods, whereas kNN is a lazy evaluation method. Combining lazy and eager classification algorithms (a hybrid approach) overcomes the limitations of both. Eager methods suffer from the missing rule problem when no match exists for given
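
The difference between lazy and eager evaluation can be sketched as follows. The lazy learner is a small kNN that only stores the training data and defers all computation to prediction time. The eager learner is a nearest-centroid classifier, used here purely as a simple stand-in for the eager methods named above, and it builds its model during training. Class names and data points are hypothetical.

```python
from collections import Counter
from math import dist

class LazyKNN:
    """Lazy learner: fit() only stores the data; the work happens in predict()."""
    def __init__(self, k=3):
        self.k = k

    def fit(self, X, y):
        self.X, self.y = list(X), list(y)
        return self

    def predict(self, x):
        # Rank all stored points by distance to the query and keep the k nearest.
        nearest = sorted(zip(self.X, self.y), key=lambda p: dist(p[0], x))[: self.k]
        return Counter(label for _, label in nearest).most_common(1)[0][0]

class EagerCentroid:
    """Eager learner: fit() compresses the data into one centroid per class."""
    def fit(self, X, y):
        groups = {}
        for point, label in zip(X, y):
            groups.setdefault(label, []).append(point)
        self.centroids = {
            label: tuple(sum(col) / len(pts) for col in zip(*pts))
            for label, pts in groups.items()
        }
        return self

    def predict(self, x):
        # Only the precomputed centroids are consulted at prediction time.
        return min(self.centroids, key=lambda label: dist(self.centroids[label], x))

# Hypothetical 2-D feature vectors (not taken from the paper).
points = [(1.0, 1.0), (1.2, 0.8), (0.9, 1.1), (3.0, 3.2), (3.1, 2.9), (2.8, 3.0)]
labels = ["healthy", "healthy", "healthy", "sick", "sick", "sick"]

lazy = LazyKNN(k=3).fit(points, labels)
eager = EagerCentroid().fit(points, labels)
print(lazy.predict((1.1, 1.0)), eager.predict((3.0, 3.0)))
```

The lazy model keeps every training point and pays the cost at query time, whereas the eager model discards the raw data after training and answers queries from its compact summary.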

Application for disease diagnosis

A medical application has also been developed to assist physicians in diagnosing diseases. It is based on the HMV framework and is divided into different modules in order to maintain simplicity and consistency.

There are three main users of the application: Admin staff, Doctor and Patient. Each user has their own login ID and password in order to interact with the system. There are four main modules of the proposed application:

  1. Data Acquisition and Preprocessing Module

  2. Classifier Training

Conclusion

Accuracy plays a vital role in the medical field because it concerns the life of an individual. Data mining in the medical domain draws on past experiences and analyzes them to identify general trends and probable solutions to present situations. This research paper presents an ensemble framework using hierarchical majority voting and multi-layer classification for disease classification and prediction using data mining techniques. The proposed model overcomes the limitations of

Acknowledgements

This research uses real time data from PIMS (Pakistan Institute of Medical Sciences) hospital, Islamabad, Pakistan. We thank Doctor Lubna Naseem and PIMS hospital staff for the data collection and analysis files.

References (67)

  • M.A. King et al. Ensemble learning methods for pay-per-click campaign management. Expert Syst. Appl. (2015)
  • H. Parvin et al. Proposing a classifier ensemble framework based on classifier selection and decision tree. Eng. Appl. Artif. Intell. (2015)
  • J. Mendes-Moreira et al. Improving the accuracy of long-term travel time prediction using heterogeneous ensembles. Neurocomputing (2015)
  • S. Bose et al. Generalized quadratic discriminant analysis. Pattern Recognit. (2015)
  • D. Lin et al. Double-bootstrapping source data selection for instance-based transfer learning. Pattern Recognit. Lett. (2013)
  • S. Datta et al. Near-Bayesian support vector machines for imbalanced data classification with equal or unequal misclassification costs. Neural Netw. (2015)
  • J.H. Chen et al. OrderRex: clinical order decision support and outcome predictions by data-mining electronic medical records. J. Am. Med. Inform. Assoc. (2015)
  • S. Dua et al. Data Mining and Machine Learning in Cyber Security (2011)
  • F. Moretti et al. Urban traffic flow forecasting through statistical and neural network bagging ensemble hybrid modeling. Neurocomputing (2015)
  • S.A. Pattekari et al. Prediction system for heart disease using Naïve Bayes. Int. J. Adv. Comput. Math. Sci. (2012)
  • S. Ghumbre et al. Heart disease diagnosis using support vector machine
  • F.M. Ba-Alw et al. Comparative study for analysis the prognostic in hepatitis data: data mining approach. Int. J. Sci. Eng. Res. (2013)
  • R. Zolfaghari. Diagnosis of diabetes in female population of PIMA Indian heritage with ensemble of BP neural network and SVM. IJCEM (2012)
  • S. Sapna et al. Data mining-fuzzy neural genetic algorithm in predicting diabetes. Res. J. Comput. Eng. (2008)
  • An introduction to feature extraction
  • Combining SVMs with various feature selection strategies
  • D.R. Wilson et al. Improved heterogeneous distance functions. J. Artif. Intell. Res. (1997)
  • K. Chromiński et al. Comparison of outlier detection methods in biomedical data. J. Med. Inform. Technol. (2010)
  • K. Mardia. Multivariate Analysis (1979)
  • J.F. Díez-Pastor et al. Random balance: ensembles of variable priors classifiers for imbalanced data. Knowl. Based Syst. (2015)
  • L. Rokach. Ensemble-based classifiers. Artif. Intell. Rev. (2010)
  • S. Whalen et al. A comparative analysis of ensemble classifiers: case studies in genomics
Saba Bashir is an Assistant Professor in the Computer Science Department at Federal Urdu University of Arts, Science and Technology, Pakistan. She is also a Ph.D. research scholar at NUST, Pakistan. Her research interests lie in predictive systems, web services and object-oriented computing. She has published more than 8 research papers in international conferences and journals.

Usman Qamar is the head of the Knowledge and Data Engineering Research Centre (www.kdegroup.wordpress.com) at the Department of Computer Engineering, College of Electrical and Mechanical Engineering, NUST, Pakistan. He received his MS in Computer Systems from UMIST, UK, whereas his M.Phil., Ph.D. and Post-Doc in Data Engineering are from the University of Manchester, UK. His areas of expertise are Data and Text Mining, Expert Systems, Knowledge Discovery and Feature Selection.

Farhan Hassan Khan has been working as a Project Manager in a software development organization in Pakistan since 2005. He is also a Ph.D. research scholar at NUST, Pakistan. His research interests lie in text mining, web service computing and VoIP billing products. He has published many research papers in international conferences and journals.
