Elsevier

Expert Systems with Applications

Volume 67, January 2017, Pages 239-251

Performance analysis of classification algorithms on early detection of liver disease

https://doi.org/10.1016/j.eswa.2016.08.065

Highlights

  • This research uses the UCI Indian Liver Patient Dataset (ILPD).

  • Boosted C5.0 and CHAID algorithms are used to identify liver disease risk factors.

  • This research shows females have more chance of liver disease than males.

  • Common risk factors of liver disease were extracted by data mining.

  • This research produced quite simple rules.

Abstract

The human liver is one of the major organs in the body, and liver disease can cause many problems in human life. Fast and accurate prediction of liver disease allows early and effective treatment. In this regard, various data mining techniques help to predict this disease better. Because of the importance of liver disease and the increasing number of people who suffer from it, we studied liver disease using two well-known data mining methods.

In this paper, novel decision-tree-based algorithms are used, which leads to more factors being considered in general and to predictions with high accuracy compared to other studies of liver disease. In this application, the 583 instances of the liver disease dataset from the UCI repository are considered. This dataset consists of 416 records of liver disease and 167 records of healthy liver. The dataset is analyzed by two algorithms, Boosted C5.0 and CHAID. Until now, no work in the literature has used Boosted C5.0 and CHAID to create rules for liver disease. Our results show that in both algorithms the DB, ALB, SGPT, TB and A/G factors have a significant impact on predicting liver disease. According to the rules generated by both algorithms, the important ranges are DB = [10.900–1.200], ALB = [4.00–4.300], SGPT = [34–37], TB = [0.600–1.200] (by Boosted C5.0) and A/G = [1.180–1.390]. In the Boosted C5.0 algorithm, Alkphos, SGOT and Age also have a significant impact on the prediction of liver disease. Comparing the performance of these algorithms shows that the C5.0 algorithm with boosting reaches an accuracy of 93.75%, which is better than the 65.00% of the CHAID algorithm. Another important achievement of this paper concerns the ability of both algorithms to produce rules in one class for liver disease. The results of our assessment show that the Boosted C5.0 and CHAID algorithms are capable of producing rules for liver disease. Our results also show that Boosted C5.0 considers gender in liver disease, a factor which is missing in many other studies. Moreover, using the rules generated by the Boosted C5.0 algorithm, we obtained the important result that females have a lower susceptibility to liver disease than males. This factor is missing in other studies of liver disease.
Therefore, our proposed computer-aided diagnostic methods, as an expert and intelligent system, have an impressive impact on liver disease detection. Based on the obtained results, we observed that our model performed better than existing methods in the literature.
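To put the reported accuracies in context, the class counts given in the abstract (416 disease records, 167 healthy records) can be turned into a majority-class baseline, that is, the accuracy of a classifier that always predicts "liver disease". The sketch below is only an illustration: the variable names are hypothetical, and the baseline is computed on the full dataset, whereas the reported accuracies come from the authors' own evaluation split, so the comparison is indicative rather than exact.

```python
# Class counts as stated in the abstract.
n_disease = 416
n_healthy = 167
n_total = n_disease + n_healthy

# Accuracy of always predicting the majority class (computed on the full
# dataset; the authors' test split may have a different class balance).
majority_baseline = max(n_disease, n_healthy) / n_total

# Accuracies as reported in the abstract.
reported = {"Boosted C5.0": 0.9375, "CHAID": 0.65}

print(f"Total records: {n_total}")
print(f"Majority-class baseline: {majority_baseline:.2%}")
for name, accuracy in reported.items():
    difference = accuracy - majority_baseline
    print(f"{name}: {accuracy:.2%} ({difference:+.2%} vs. baseline)")
```

Because the classes are imbalanced, accuracy alone can be hard to interpret, which is one reason to read the reported figures alongside such a baseline.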

Introduction

In recent years, we have been faced with an increasing amount of data stored in various organizations such as banks, hospitals and universities, which encourages us to find a way to extract knowledge from this large amount of data and to use it efficiently. Data mining is defined as a method to discover and extract knowledge from large volumes of data that is useful, practical and understandable (Han, Kamber, & Pei, 2011). It is also defined as a semi-automated way to find hidden patterns among data (Han & Kamber, 2001). One of the most important uses of data mining is to extract knowledge from data more accurately, in less time and at lower cost, and possibly with more comprehensive and complete results. This knowledge is used in various fields such as medical applications, web mining, security, crime prevention and many others (Witten & Frank, 2005). Medical science is one of the important areas where data mining is used. Since this branch of science deals with human life, it is highly sensitive. In recent years, much research has been done on a variety of diseases using data mining. Looking more closely at recent research in the medical field, we can see many works that use data mining for forecasting, prevention and treatment of patients (Das, 2010; Riganello et al., 2010; Gauthier, Alemayehu & Berger, 2016; Kasabov and Capecci, 2015; Marateb et al., 2014; Nahar et al., 2013; Patidar et al., 2015; Souillard-Mandar et al., 2015; Tanha et al., 2015; Rodríguez-Jiménez et al., 2016; Tomczak and Zięba, 2015). In medical science, accuracy and speed are two important factors that should be considered first in dealing with any disease. In this regard, data mining techniques can be of great help to physicians.

The organization of this paper is as follows. Section 2 provides some background on data mining, liver disease, classification algorithms and related work. Section 3 describes our method for implementing the Boosted C5.0 and CHAID classification algorithms for the early detection of liver disease. Finally, we conclude the paper in Section 4 with some discussion and suggestions for future work.

Section snippets

Data mining

With advances in science, many machines have entered our lives. One of the best-known areas where computers, the most widely used of these machines, can be helpful is knowledge extraction with the help of a machine (machine learning). This approach, which can be of great help to all scientific fields, is called data mining or Knowledge Discovery in Databases (KDD). Supervised and unsupervised learning are the two main methods of machine learning (Han et al., 2011). The purpose of these methods is to

The implementation of classification algorithms

In this paper, we used the Boosted C5.0 and CHAID algorithms, both of which are decision tree algorithms, to discover hidden knowledge in the liver disease dataset from the UCI repository. To this end, we used IBM SPSS Modeler 14.2 software (Firat University license) to run and evaluate the algorithms. For our purpose, the data are divided into two groups: training and testing. For clarity, all stages of this research are presented in Fig. 3.
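As a rough illustration of the training/testing division described above, the following sketch splits record indices into two groups while keeping the class proportions the same in both. It is not the authors' procedure, which was carried out in IBM SPSS Modeler: the function name `stratified_split`, the 30% test fraction, the stratification itself and the fixed seed are all assumptions made for this example. Only the class counts (416 and 167) come from the text.

```python
import random


def stratified_split(labels, test_fraction, seed):
    """Split record indices into train/test sets, preserving class proportions.

    Hypothetical helper for illustration; the test fraction and the
    stratification are assumptions, not taken from the paper.
    """
    rng = random.Random(seed)

    # Group record indices by class label.
    by_class = {}
    for index, label in enumerate(labels):
        by_class.setdefault(label, []).append(index)

    train_idx, test_idx = [], []
    for label in sorted(by_class):
        indices = by_class[label][:]
        rng.shuffle(indices)
        # Take the same fraction of every class for the test set.
        n_test = round(len(indices) * test_fraction)
        test_idx.extend(indices[:n_test])
        train_idx.extend(indices[n_test:])
    return sorted(train_idx), sorted(test_idx)


# Placeholder labels with the class counts stated for the dataset.
labels = ["disease"] * 416 + ["healthy"] * 167
train_idx, test_idx = stratified_split(labels, test_fraction=0.3, seed=0)

print(f"training records: {len(train_idx)}, testing records: {len(test_idx)}")
```

Stratifying is a common choice for imbalanced data such as this, since a purely random split could leave the smaller class under-represented in the test group.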

In this regard, the implementation steps

Conclusion and discussion

According to the statistics published by the relevant agencies, liver disease is among the most fatal diseases and puts human life at risk. Decision trees are among the most important and best-known algorithms in data mining, and therefore in this paper we used two algorithms, C5.0 and CHAID, which are based on decision trees. One important feature of the C5.0 algorithm is that boosting techniques can be applied to it. Boosting techniques in C5.0 algorithm leads
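The general idea behind boosting a tree learner can be sketched with a small AdaBoost-style example. This is a stand-in and not the C5.0 implementation used in IBM SPSS Modeler: the weak learner here is a one-level threshold rule (a "stump") on a single feature, the data are a made-up toy set, and all function names are hypothetical. The sketch only shows how reweighting misclassified records lets several weak rules combine into a stronger classifier.

```python
import math


def stump_predict(x, threshold, polarity):
    """One-level rule: predict `polarity` if x <= threshold, else -polarity."""
    return polarity if x <= threshold else -polarity


def fit_stump(xs, ys, weights):
    """Return (weighted_error, threshold, polarity) of the best stump."""
    best = None
    for threshold in sorted(set(xs)):
        for polarity in (1, -1):
            error = sum(
                w for w, x, y in zip(weights, xs, ys)
                if stump_predict(x, threshold, polarity) != y
            )
            if best is None or error < best[0]:
                best = (error, threshold, polarity)
    return best


def adaboost(xs, ys, rounds):
    """AdaBoost-style training: returns a list of (alpha, threshold, polarity)."""
    n = len(xs)
    weights = [1.0 / n] * n
    ensemble = []
    for _ in range(rounds):
        error, threshold, polarity = fit_stump(xs, ys, weights)
        error = min(max(error, 1e-10), 1 - 1e-10)  # avoid log(0)
        alpha = 0.5 * math.log((1 - error) / error)
        ensemble.append((alpha, threshold, polarity))
        # Increase the weight of misclassified records, decrease the rest.
        weights = [
            w * math.exp(-alpha * y * stump_predict(x, threshold, polarity))
            for w, x, y in zip(weights, xs, ys)
        ]
        total = sum(weights)
        weights = [w / total for w in weights]
    return ensemble


def predict(ensemble, x):
    """Weighted vote of all stumps in the ensemble."""
    score = sum(a * stump_predict(x, t, p) for a, t, p in ensemble)
    return 1 if score >= 0 else -1


# Toy data (made up): no single threshold separates the two classes.
xs = [1, 2, 3, 4, 5, 6]
ys = [1, 1, -1, -1, 1, 1]

for rounds in (1, 3):
    ensemble = adaboost(xs, ys, rounds)
    correct = sum(predict(ensemble, x) == y for x, y in zip(xs, ys))
    print(f"rounds={rounds}: {correct}/{len(xs)} correct")
```

On this toy set a single stump misclassifies two of the six points, while three boosting rounds classify all six correctly, which illustrates why boosting can raise the accuracy of a simple tree learner.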

References (73)

  • C.M. Hunt et al.

    Age-related differences in reporting of drug-associated liver injury: Data-mining of WHO safety report database

    Regulatory Toxicology and Pharmacology

    (2014)
  • N. Kasabov et al.

    Spiking neural network methodology for modelling, classification and understanding of EEG spatio-temporal data measuring cognitive processes

    Information Sciences

    (2015)
  • R.H. Lin

    An intelligent model for liver disease diagnosis

    Artificial Intelligence in Medicine

    (2009)
  • H.R. Marateb et al.

    A hybrid intelligent system for diagnosing microalbuminuria in type 2 diabetes patients without having to measure urinary albumin

    Computers in Biology and Medicine

    (2014)
  • X.H. Meng et al.

    Comparison of three data mining models for predicting diabetes or prediabetes by risk factors

    The Kaohsiung Journal of Medical Sciences

    (2013)
  • J. Nahar et al.

    Association rule mining to detect factors which contribute to heart disease in males and females

    Expert Systems with Applications

    (2013)
  • S.L. Pang et al.

    C5.0 classification algorithm and application on individual credit evaluation of banks

    Systems Engineering-Theory & Practice

    (2009)
  • S. Patidar et al.

    Automated diagnosis of coronary artery disease using tunable-Q wavelet transform applied on heart rate signals

    Knowledge-Based Systems

    (2015)
  • F. Riganello et al.

    Heart rate variability: An index of brain processing in vegetative state? An artificial intelligence, data mining study

    Clinical Neurophysiology

    (2010)
  • C.H. Weng et al.

    Disease prediction with different types of neural network classifiers

    Telematics and Informatics

    (2016)
  • M. Abdar

    A survey and compare the performance of IBM SPSS modeler and rapid miner software for predicting liver disease by using various data mining algorithms

    Cumhuriyet Science Journal

    (2015)
  • D. Alemayehu et al.

    Big data: Transforming drug development and health policy decision making

    Health services and outcomes research methodology

    (2016)
  • S.D.N.N. Alfisahrin et al.

    Data mining techniques for optimization of liver disease classification

  • S. Alizadeh et al.

    Data mining and knowledge discovery

    (2011)
  • O.F. Althuwaynee et al.

    A novel integrated model for assessing landslide susceptibility mapping using CHAID and AHP pair-wise comparison

    International Journal of Remote Sensing

    (2016)
  • P.R. Anisha et al.

    A pragmatic approach for detecting liver cancer using image processing and data mining techniques

  • S. Bahramirad et al.

    Classification of liver disease diagnosis: A comparative study

  • Bittel, S., Kaiser, V., Teichmann, M., & Thoma, M. (2015). Pixel-wise segmentation of street with neural networks....
  • M.B. Boubekeur et al.

    A background subtraction algorithm for indoor monitoring surveillance systems

  • M. Chambers et al.

    Advanced analytics methodologies: Driving business value with analytics

    (2014)
  • S. Choudhury et al.

    Comparative analysis of machine learning algorithms along with classifiers for network intrusion detection

  • B.L. Deekshatulu et al.

    Classification of heart disease using K-nearest neighbor and genetic algorithm

    Procedia Technology

    (2013)
  • V. Figueiredo et al.

    An electric energy consumer characterization framework based on data mining techniques

    Power Systems, IEEE Transactions on

    (2005)
  • J. Fürnkranz

    Round robin classification

    The Journal of Machine Learning Research

    (2002)
  • J. Fürnkranz

    Pairwise classification as an ensemble technique

    Machine learning: ECML 2002

    (2002)
  • F. Gorunescu

    Data mining: Concepts, models and techniques

    (2011)