HMV: A medical decision support framework using multi-layer classifiers for disease prediction

doi:10.1016/j.jocs.2016.01.001

Journal of Computational Science

Volume 13, March 2016, Pages 10-25

https://doi.org/10.1016/j.jocs.2016.01.001

Highlights

• Extensive research has been conducted on disease prediction.
• However, there is no agreement on which classifier produces the best results.
• An optimal combination of classifiers is presented with multi-layer classification.
• The ensemble approach uses bagging with multi-objective optimized weighted.
• Comparison with existing techniques shows superiority of our ensemble.

Abstract

Decision support is a crucial function for decision makers in many industries. Typically, Decision Support Systems (DSS) help decision makers gather and interpret information and build a foundation for decision-making. Medical Decision Support Systems (MDSS) play an increasingly important role in medical practice. By assisting doctors in making clinical decisions, DSS are expected to improve the quality of medical care. Conventional clinical decision support systems are based on individual classifiers or a simple combination of these classifiers, and tend to show moderate performance. In this research, a multi-layer classifier ensemble framework is proposed based on the optimal combination of heterogeneous classifiers. The proposed model, named “HMV”, overcomes conventional performance bottlenecks by utilizing an ensemble of seven heterogeneous classifiers. The framework is evaluated on two heart disease datasets, two breast cancer datasets, two diabetes datasets, two liver disease datasets, one Parkinson's disease dataset and one hepatitis dataset obtained from public repositories. The effectiveness of the proposed ensemble is investigated by comparing its results with several well-known classifiers as well as ensemble techniques. The experimental evaluation shows that the proposed framework handles all types of attributes and achieves high diagnosis accuracy. A case study based on a real-time medical dataset is also presented to show the high performance and effectiveness of the proposed model.

Introduction

Data mining in the medical domain is the process of discovering hidden patterns and information in large medical datasets, analyzing them, and using them for disease prediction [1]. The basic goal of the data mining process is to extract hidden information from medical datasets and transform it into an understandable structure for future use [2]. Data mining techniques can be used to develop a large number of predictive models for classification and prediction tasks. After knowledge is discovered from the data, the learning phase starts, in which a scientific model is built. This learning method gives rise to the concept of machine learning, which can be formally defined as “the complex computation process of automatic pattern recognition and intelligent decision making based on training sample data” [3]. Machine learning methods are further categorized into supervised and unsupervised learning, depending on the availability of labeled data. In supervised learning, labeled training data is available and a learning model is trained on it. Examples include Artificial Neural Networks (ANN), Support Vector Machines (SVM), and Decision Trees (DT). In unsupervised learning, the sample data has no class label field. Examples include K-means clustering and the Self-Organizing Map (SOM). An ensemble approach can perform better than individual machine learning techniques by combining the results of individual classifiers [4], [5]. Multiple techniques can be used to construct the ensemble model, and each results in a different diagnosis accuracy. The most common ensemble approaches are bagging [6], boosting [7] and stacking [8].
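To make the idea of combining individual classifiers concrete, the following is a minimal sketch of bagging, one of the ensemble approaches named above. It is an illustration only and not the method used in this paper: the one-feature threshold "stump", the toy data and the function names are all assumptions introduced here.

```python
import random
from collections import Counter


def train_stump(samples):
    """Fit a one-feature threshold rule: predict 1 when x >= threshold."""
    best_threshold, best_correct = None, -1
    for threshold, _ in samples:
        correct = sum((x >= threshold) == bool(y) for x, y in samples)
        if correct > best_correct:
            best_threshold, best_correct = threshold, correct
    return best_threshold


def bagging_fit(samples, n_models=5, seed=0):
    """Train each stump on a bootstrap resample drawn with replacement."""
    rng = random.Random(seed)
    return [
        train_stump([rng.choice(samples) for _ in samples])
        for _ in range(n_models)
    ]


def bagging_predict(thresholds, x):
    """Combine the stumps' predictions by simple majority vote."""
    votes = Counter(int(x >= t) for t in thresholds)
    return votes.most_common(1)[0][0]


# Hypothetical toy data: (feature value, label), where label 1 means "sick".
data = [(1.0, 0), (2.0, 0), (3.0, 0), (6.0, 1), (7.0, 1), (8.0, 1)]
models = bagging_fit(data)
print(bagging_predict(models, 0.5), bagging_predict(models, 9.0))
```

Each base model sees a slightly different resample of the training data, so their errors are not perfectly correlated, and the vote smooths them out. Boosting and stacking differ in how the base models are trained and combined.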

A significant amount of work has already been done on disease classification and prediction. However, no single methodology shows the highest performance for all datasets or diseases: a classifier that performs well on one dataset may be outperformed by another approach on a different dataset or disease. The proposed research focuses on a novel combination of heterogeneous classifiers for disease classification and prediction, thus overcoming the limitations of individual classifiers. The combination consists of Naïve Bayes, Linear Regression, Quadratic Discriminant Analysis, K-Nearest Neighbor, Support Vector Machine, a Decision Tree using Information Gain and a Decision Tree using the Gini Index. The classifiers are used at multiple layers to further enhance disease prediction accuracy. An application based on the proposed HMV ensemble framework has also been developed for disease prediction. It can help both doctors and patients with data management and disease prediction.

The rest of the paper is organized as follows: Section 2 reviews the literature. Section 3 presents the proposed ensemble framework. Section 4 provides the results and discussion of the experiments carried out. Section 5 provides a case study of the proposed ensemble model, and Section 6 presents the discussion. The medical application for disease diagnosis is detailed in Section 7, and the conclusion is provided in Section 8.

Literature review

An extensive amount of work has already been done on disease classification and prediction. However, most of the literature has focused on using a single classifier for a specific disease.

Pattekari and Parveen [9] presented a heart disease prediction system based on the Naïve Bayes algorithm to predict the hidden patterns in a given dataset. The technique is limited to categorical data and uses only a single classifier. Other data mining techniques such as ensembles, time series,

HMV ensemble framework

The proposed ensemble framework consists of three modules, namely data acquisition and preprocessing, classifier training, and the HMV (Hierarchical Majority Voting) ensemble model for disease classification and prediction with a three-layered approach.
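The general idea of hierarchical majority voting can be sketched as follows. This is a hedged illustration, not the paper's actual layer design: how the seven classifiers are assigned to the three layers is not stated in this excerpt, so the grouping in `layer_tree` and the function names are hypothetical.

```python
from collections import Counter


def majority_vote(labels):
    """Return the most frequent label in a list of predicted labels."""
    return Counter(labels).most_common(1)[0][0]


def hierarchical_vote(node):
    """Resolve a nested structure of votes from the bottom up.

    A leaf is a single predicted label (a string). An inner node is a list
    whose members are resolved first and then combined by majority vote.
    """
    if isinstance(node, str):
        return node
    return majority_vote([hierarchical_vote(child) for child in node])


# Hypothetical grouping of seven classifier outputs for one patient:
# two groups of three vote first, and their winners are then combined
# with the seventh classifier's output in a final vote.
layer_tree = [
    ["sick", "healthy", "sick"],
    ["healthy", "healthy", "sick"],
    "sick",
]
print(hierarchical_vote(layer_tree))
```

Odd-sized groups are used at every level so that a two-class vote cannot tie. With other group sizes a tie-breaking rule would have to be chosen explicitly.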

Datasets description

The experimental evaluation of the HMV ensemble framework is performed on two heart disease datasets, two breast cancer datasets, two diabetes datasets, two liver disease datasets, one hepatitis dataset and one Parkinson's disease dataset. Each dataset contains a diverse set of attributes that are ultimately used to determine the disease classification and prediction, such as healthy or sick. The two heart disease datasets (Cleveland heart disease, Statlog) are taken from the UCI data repository [36]. One

Case study: real-time implementation of the proposed framework

The proposed DSS is evaluated on a real-time dataset of blood CP taken from the Pakistan Institute of Medical Sciences (PIMS) hospital. PIMS is located in Islamabad, Pakistan, and opened in 1985. The hospital provides patient health care and medical facilities, and also trains doctors and other health workers in the fields of medicine and surgery. It consists of multiple departments such as Cardiology, Dental, Urology, Dermatology, Blood bank,

Discussion

We have used a heterogeneous classifier ensemble model that combines entirely different types of classifiers and thereby achieves a higher level of diversity. Decision Trees, QDA, LR, SVM and Bayesian classification are eager evaluation methods, whereas kNN is a lazy evaluation method. Combining lazy and eager classification algorithms (a hybrid approach) overcomes the limitations of both eager and lazy methods. Eager methods suffer from the missing rule problem when no match exists for a given
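The structural difference between eager and lazy evaluation can be sketched as follows. This is a simplified illustration under stated assumptions: a nearest-centroid rule stands in for the eager methods listed above (it is not one of the paper's classifiers), and the class names, toy data and one-dimensional feature are hypothetical.

```python
from collections import Counter


class NearestCentroid:
    """Eager stand-in: fit() does the work and keeps only one mean per class."""

    def fit(self, xs, ys):
        totals, counts = Counter(), Counter()
        for x, y in zip(xs, ys):
            totals[y] += x
            counts[y] += 1
        self.centroids = {y: totals[y] / counts[y] for y in counts}
        return self

    def predict(self, x):
        return min(self.centroids, key=lambda y: abs(x - self.centroids[y]))


class KNearest:
    """Lazy learner: fit() only stores the data; work is deferred to predict()."""

    def __init__(self, k=3):
        self.k = k

    def fit(self, xs, ys):
        self.points = list(zip(xs, ys))
        return self

    def predict(self, x):
        nearest = sorted(self.points, key=lambda p: abs(p[0] - x))[: self.k]
        return Counter(y for _, y in nearest).most_common(1)[0][0]


# Hypothetical one-feature data: label 1 means "sick".
xs = [1.0, 2.0, 3.0, 6.0, 7.0, 8.0]
ys = [0, 0, 0, 1, 1, 1]
eager = NearestCentroid().fit(xs, ys)
lazy = KNearest(k=3).fit(xs, ys)
print(eager.predict(2.5), lazy.predict(2.5))
```

The eager model discards the training data after building a compact summary, while the lazy model keeps every training point and consults them at query time. A hybrid ensemble puts both kinds of model behind the same voting step.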

Application for disease diagnosis

A medical application has also been developed to assist physicians in diagnosing diseases. It is based on the HMV framework and is divided into modules in order to maintain simplicity and consistency.

The application has three main types of user: Admin staff, Doctor and Patient. Each user has their own login ID and password to interact with the system. There are four main modules of the proposed application:

1. Data Acquisition and Preprocessing Module
2. Classifier Training

Conclusion

Accuracy plays a vital role in the medical field, as it concerns the life of an individual. Data mining in the medical domain works on past experiences and analyzes them to identify general trends and probable solutions to present situations. This research paper presents an ensemble framework using hierarchical majority voting and multi-layer classification for disease classification and prediction using data mining techniques. The proposed model overcomes the limitations of

Acknowledgements

This research uses real time data from PIMS (Pakistan Institute of Medical Sciences) hospital, Islamabad, Pakistan. We thank Doctor Lubna Naseem and PIMS hospital staff for the data collection and analysis files.

Saba Bashir is an Assistant Professor in Computer Science Department at Federal Urdu University of Arts, Science and Technology, Pakistan. She is also a Ph.D. research scholar at NUST, Pakistan. Her research interest lies in predictive systems, web services and object oriented computing. She has published more than 8 research papers in international conferences and journals.

References (67)

A. Ahmad et al., Random ordinality ensembles: ensemble methods for multi-valued categorical data, Inf. Sci. (2015)
B. Sluban et al., Relating ensemble diversity and performance: a study in class noise detection, Neurocomputing (2015)
M.J. Kim et al., Geometric mean based boosting algorithm with over-sampling to resolve data imbalance problem for bankruptcy prediction, Expert Syst. Appl. (2015)
S. Kang et al., Multi-class classification via heterogeneous ensemble of one-class classifiers, Eng. Appl. Artif. Intell. (2015)
R. Prashanth et al., Automatic classification and prediction models for early Parkinson's disease diagnosis from SPECT imaging, Expert Syst. Appl. (2014)
E.D. Übeyli, Implementing automated diagnostic systems for breast cancer detection, Expert Syst. Appl. (2007)
F. Temurtas, A comparative study on thyroid disease diagnosis using neural networks, Expert Syst. Appl. (2009)
D.C. Li et al., A fuzzy-based data transformation for feature extraction to increase classification performance with small medical data sets, Artif. Intell. Med. (2011)
S.W. Lin et al., Particle swarm optimization for parameter determination and feature selection of support vector machines, Expert Syst. Appl. (2008)
P.J. García-Laencina et al., K nearest neighbours with mutual information for simultaneous classification and missing data imputation, Neurocomputing (2009)
M.A. King et al., Ensemble learning methods for pay-per-click campaign management, Expert Syst. Appl. (2015)
H. Parvin et al., Proposing a classifier ensemble framework based on classifier selection and decision tree, Eng. Appl. Artif. Intell. (2015)
J. Mendes-Moreira et al., Improving the accuracy of long-term travel time prediction using heterogeneous ensembles, Neurocomputing (2015)
S. Bose et al., Generalized quadratic discriminant analysis, Pattern Recognit. (2015)
D. Lin et al., Double-bootstrapping source data selection for instance-based transfer learning, Pattern Recognit. Lett. (2013)
S. Datta et al., Near-Bayesian support vector machines for imbalanced data classification with equal or unequal misclassification costs, Neural Netw. (2015)
J.H. Chen et al., OrderRex: clinical order decision support and outcome predictions by data-mining electronic medical records, J. Am. Med. Inform. Assoc. (2015)
S. Dua et al., Data Mining and Machine Learning in Cyber Security (2011)
F. Moretti et al., Urban traffic flow forecasting through statistical and neural network bagging ensemble hybrid modeling, Neurocomputing (2015)
S.A. Pattekari et al., Prediction system for heart disease using Naïve Bayes, Int. J. Adv. Comput. Math. Sci. (2012)
S. Ghumbre et al., Heart disease diagnosis using support vector machine
F.M. Ba-Alw et al., Comparative study for analysis the prognostic in hepatitis data: data mining approach, Int. J. Sci. Eng. Res. (2013)
R. Zolfaghari, Diagnosis of diabetes in female population of PIMA Indian heritage with ensemble of BP neural network and SVM, IJCEM (2012)
S. Sapna et al., Data mining-fuzzy neural genetic algorithm in predicting diabetes, Res. J. Comput. Eng. (2008)
An introduction to feature extraction
Combining SVMs with various feature selection strategies
D.R. Wilson et al., Improved heterogeneous distance functions, J. Artif. Intell. Res. (1997)
K. Chromiński et al., Comparison of outlier detection methods in biomedical data, J. Med. Inform. Technol. (2010)
K. Mardia, Multivariate Analysis (1979)
J.F. Díez-Pastor et al., Random balance: ensembles of variable priors classifiers for imbalanced data, Knowl. Based Syst. (2015)
L. Rokach, Ensemble-based classifiers, Artif. Intell. Rev. (2010)
S. Whalen et al., A comparative analysis of ensemble classifiers: case studies in genomics


Usman Qamar is the head of the Knowledge and Data Engineering Research Centre (www.kdegroup.wordpress.com) at the Department of Computer Engineering, College of Electrical and Mechanical Engineering, NUST, Pakistan. He received his MS in Computer Systems from UMIST, UK, and his M.Phil., Ph.D. and Post-Doc in Data Engineering from the University of Manchester, UK. His areas of expertise are Data and Text Mining, Expert Systems, Knowledge Discovery and Feature Selection.

Farhan Hassan Khan has been working as a Project Manager in a software development organization in Pakistan since 2005. He is also a Ph.D. research scholar at NUST, Pakistan. His research interest lies in text mining, web service computing and VoIP billing products. He has published many research papers in international conferences and journals.
