Elsevier

Cognitive Systems Research

Volume 54, May 2019, Pages 116-127

A novel machine learning approach for early detection of hepatocellular carcinoma patients

https://doi.org/10.1016/j.cogsys.2018.12.001

Highlights

  • A novel method to detect hepatocellular carcinoma (HCC) is presented.

  • The performance of ten machine learning methods is compared.

  • Genetic algorithm (GA) is used to select the features.

  • A new 2level genetic optimizer using stratified 5-fold cross-validation coupled with GA is employed.

  • A classification accuracy of 88.49% is obtained.

Abstract

Liver cancer is a common type of cancer worldwide. Hepatocellular carcinoma (HCC) is a malignancy of the liver. It has a high impact on patients' lives, and detecting it early can reduce the number of annual deaths. This study proposes a new machine learning approach to detect HCC using data from 165 patients. Ten well-known machine learning algorithms are employed. In the preprocessing step, a normalization approach is used. A genetic algorithm coupled with stratified 5-fold cross-validation is applied twice, first for parameter optimization and then for feature selection. In this work, a support vector machine (SVM) (type C-SVC) with the new 2level genetic optimizer (genetic training) and feature selection yielded the highest accuracy and F1-Score, 0.8849 and 0.8762 respectively. The proposed model can be tested on larger databases and may aid clinicians.

Introduction

The report published by the World Health Organization (WHO) in 2012 indicated approximately 14.1 million new cancer cases and about 8.2 million deaths worldwide (World Health Organization, 2012). Hepatocellular carcinoma (HCC) is a malignancy of the liver. It generally occurs in patients with chronic liver disease and/or cirrhosis. According to the latest evidence, HCC is one of the deadliest cancers in the world, causing over 600,000 deaths annually (Cabibbo et al., 2009, DeWaal et al., 2018, Singh et al., 2014). According to Santos, Abreu, García-Laencina, Simão, and Carvalho (2015), liver cancer is the sixth most commonly diagnosed cancer worldwide. This evidence shows that HCC has a high impact on human life and that it is essential to reduce the number of annual deaths due to HCC. Therefore, in the medical and healthcare domain, automated systems such as accurate clinical decision support systems (CDSSs), developed from patients' past records, can assist clinicians in making accurate and timely diagnoses.

Nowadays, CDSSs use different sets of data mining and machine learning methods to improve the quality of medical decisions and reduce diagnostic errors. Clinical evidence indicates that computer-based systems can be used to reduce and/or prevent diagnostic errors (Shimizu, Nemoto, & Tokuda, 2018). For instance, the Institute of Medicine (IOM) showed that diagnostic error rates in the United States of America (USA) were unacceptably high (Ball et al., 2015). Indeed, an accurate diagnosis is the key aspect of clinical decisions. To reduce errors during clinical diagnosis, a wide variety of CDSSs can be developed using various data mining and machine learning algorithms.

Knowledge discovery in large databases (KDD) is a well-known field in computer science and engineering that comprises theories, methods, and techniques (García-Díaz et al., 2018, Trajanov et al., 2018). Using KDD, useful information and knowledge can be extracted from large datasets. The steps involved in KDD are as follows: (1) selection, (2) preprocessing, (3) transformation, (4) data mining, and finally (5) interpretation-evaluation. Data mining is one of the major steps in the KDD process and analyzes datasets using machine learning methods. Machine learning (Pławiak and Maziarz, 2014, Pławiak and Rzecki, 2015, Pławiak, 2014, Rzecki et al., 2017, Rzecki et al., 2018, Yildirim et al., 2018, Yıldırım, 2018) is a field of computer science that deals with methods by which machines can learn from experience. Russell and Norvig (2002) stated that machine learning algorithms can be divided into three main categories: (1) supervised learning, (2) unsupervised learning, and (3) reinforcement learning. The popularity of data mining and machine learning techniques is increasing significantly, and they are being widely applied to various medical problems, such as cardiac disorders (Alkeshuosh et al., 2017, Pławiak, 2018), Parkinson's disease (Abdar and Zomorodi-Moghadam, 2017, Das, 2010), lung cancer (Lynch et al., 2017), breast cancer (Zheng, Yoon, & Lam, 2014), and Alzheimer's disease (Shi et al., 2018, Zhang et al., 2018).

In recent years, many studies have been conducted on HCC and liver diseases. Santos et al. (2015) studied HCC using a new cluster-based oversampling algorithm. A distance measure for heterogeneous and missing data (HEOM) and a clustering technique (K-means) were applied for preprocessing. The synthetic minority over-sampling technique (SMOTE) was applied to balance the dataset. Finally, the balanced data were used to train logistic regression and neural network algorithms. The results indicated that the proposed approach can detect HCC effectively. Hassoon, Kouhi, Zomorodi-Moghadam, and Abdar (2017) proposed a new approach that uses a genetic algorithm to optimize the rules generated by boosted C5.0 for timely detection of liver disease. The genetic algorithm removes unnecessary rules rather than producing new ones. The proposed approach increased the accuracy of C5.0 from 81% to 93%. Abdar, Yen, and Hung (2017) introduced a new approach to diagnose liver disease. In that paper, the base C5.0, Classification and Regression Trees (CART), and Chi-square Automatic Interaction Detector (CHAID) algorithms were applied, together with boosted C5.0, CART, and CHAID methods. To improve the performance of these methods, a multilayer perceptron neural network (MLPNN) algorithm was combined with the boosted decision trees. The MLPNN combined with boosted decision trees was efficient in the early detection of liver disease. The highest classification accuracy, 94.12%, was obtained using the MLPNNB-C5.0 method.

Liu, Soong, Lee, Chen, and Hsu (2018) described the use of orthotopic liver transplantation (OLT) as an end-stage treatment for liver disease. Even though OLT is a final solution, acute allograft rejection is a significant problem for these patients. Predictive models and expert systems can help physicians and medical staff take timely corrective action to prevent harmful effects on patients. Abdar, Zomorodi-Moghadam, Das, and Ting (2017) concentrated on liver disease detection using two well-known decision trees. For this purpose, the C5.0 and CHAID algorithms were applied to the Indian liver patient dataset (ILPD). They optimized the performance of C5.0 using a boosting technique. In addition, several simple and understandable rules were presented for both the C5.0 and CHAID methods. The boosted C5.0 detected liver disease with an accuracy of 93.75%. Abajian et al. (2018) used a machine learning approach to study HCC in 36 patients who were treated with transarterial chemoembolization. Their method obtained an overall accuracy of 78% using both logistic regression (LR) and random forest (RF) classifiers.

Non-alcoholic fatty liver disease (NAFLD) is another liver disease where machine learning has been applied for accurate and timely detection (Perveen, Shahbaz, Keshavjee, & Guergachi, 2018). The proposed method achieved a classification accuracy of 76% using the J48 method. Fan, Chang, Lin, and Hsieh (2011) developed a new hybrid model for medical data classification using a case-based data clustering model and a fuzzy decision tree method. The liver disorder and breast cancer Wisconsin datasets were used. Their proposed model obtained accuracies of 98.4% and 81.6% for the breast cancer and liver disease datasets respectively. Furthermore, Peng, Liu, Yan, Ren, and Xu (2017) applied regularized logistic regression (RLR) and support vector machine (SVM) methods for liver X receptor β (LXRβ) detection. An average accuracy of 84.76% was reported for the detection of LXRβ using the RLR classification method.

Moreover, Gui, Dong, Li, Li, and Wang (2015) concentrated on HCC gene expression using feature selection and machine learning techniques. Their results showed that the MT1X, BMI1, and CAP2 genes had a close relationship with HCC and that TACSTD2 has an impact on HCC. Wang et al. (2018) proposed two new methods, fixed sequential (FS) and two-step (TS), to detect HCC accurately. Random forest (RF), logistic regression (LR), and classification and regression trees (CART) were used as baselines for comparison. Their work showed that the RF and TS models outperformed the other methods. Sreeja and Hariharan (2018) introduced a new computer aided diagnostic (CAD) system for early detection of HCC from abdominal CT images. They reported a best accuracy of 95% using a Naïve Bayes model.

Wasyluk, Cianciara, Bobrowski, and Drapato (2010) presented a regression method to recognize liver disorders. The research was conducted on 200 cirrhotic patient records. In this study, each clinical trial comprised various types of variables, including both histopathological data and laboratory tests. Since the number of HCC records was low (5% of the cirrhotic patients), the outcomes were not good enough to validate the proposed system. In the works by Chiu et al. (2013) and Ho et al. (2012), an artificial neural network (ANN) performed better than both the logistic regression (LR) method and a decision tree (DT) method in detecting liver disease. Shi et al. (2012) compared the performance of ANN and LR in predicting in-hospital mortality after primary liver cancer surgery. In their work, ANN performed better than LR.

A clinical diagnostic process tries to find correlations between known patterns hidden in disease and cancer records belonging to different classes of medical data, extracted from physical examinations, past records, and clinical tests (Kakiashvili et al., 2008). Such detection processes help to make accurate diagnostic decisions. Hence, the goal of this study is to design a machine-based diagnostic approach for an HCC dataset using machine learning and data mining techniques. The first motivation is the early diagnosis of HCC patients, which matters because many patients around the world suffer from HCC; we are therefore motivated to design a new and efficient CDSS to help them. The second motivation is the power of machine learning methods to detect different diseases: since a high performance rate is essential in clinical research, a machine learning-based diagnostic model is used. The third motivation is the optimization of classical machine learning algorithms using other machine learning approaches, which encouraged us to improve the model's performance beyond that of previous models.

There are three major contributions in this study. The first contribution is the design of the data preprocessing and the testing of classical machine learning algorithms. Data preparation plays a valuable role in machine learning, because it provides an appropriate dataset before any learning algorithm is used. To address this, we apply a normalization approach to the HCC dataset. This research uses 10 common machine learning methods: linear SVM (LinSVM), SVM (C-SVC, called SVC), SVM (nu-SVC), multilayer perceptron (MLP1, with 1 hidden layer and max. 200 neurons in the hidden layer), K-nearest neighbor (KNN), Naive Bayes classifier (gaussNB), linear discriminant analysis (LDA), quadratic discriminant analysis (QDA), logistic regression (Reglog), and random forest (RF with 1000 trees). The second contribution is a two-level genetic algorithm with two purposes: (i) parameter optimization and (ii) feature optimization. The performance is evaluated using (1) accuracy and (2) F1-Score. The third contribution is the design of a new 2level genetic optimizer method based on cross-validation coupled with a genetic algorithm for the optimization of features and parameters. Thus, the paper introduces a new algorithm that combines a new genetic training method (here, the 2level genetic optimizer based on cross-validation and a genetic algorithm), an optimization approach (genetic algorithm), and a preprocessing technique (normalization) with classical machine learning algorithms to predict HCC with the highest performance. To the best of our knowledge, this is the first time the proposed solution (2level genetic optimizer) has been applied to detect HCC.
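The feature-selection level of such an optimizer can be illustrated with a small, dependency-free sketch. It is not the authors' implementation: the paper reports using the Deap and Sklearn libraries with a C-SVC classifier, whereas this sketch assumes a plain-Python genetic algorithm, a nearest-centroid classifier as a stand-in for the SVM, synthetic data, and hypothetical names (`stratified_folds`, `centroid_accuracy`, `fitness`, `ga_select`) and hyperparameters. Only the overall idea, a genetic algorithm whose fitness is the mean accuracy over stratified 5-fold cross-validation, comes from the text.

```python
import random

def stratified_folds(y, k, rng):
    """Split sample indices into k folds with similar class proportions."""
    folds = [[] for _ in range(k)]
    for label in sorted(set(y)):
        idx = [i for i, t in enumerate(y) if t == label]
        rng.shuffle(idx)
        for pos, i in enumerate(idx):
            folds[pos % k].append(i)
    return folds

def centroid_accuracy(X, y, train, test, mask):
    """Accuracy of a nearest-centroid classifier (assumed stand-in for C-SVC)."""
    cols = [j for j, keep in enumerate(mask) if keep]
    centroids = {}
    for label in set(y[i] for i in train):
        members = [X[i] for i in train if y[i] == label]
        centroids[label] = [sum(r[j] for r in members) / len(members) for j in cols]
    hits = 0
    for i in test:
        point = [X[i][j] for j in cols]
        pred = min(centroids, key=lambda c: sum(
            (a - b) ** 2 for a, b in zip(point, centroids[c])))
        hits += pred == y[i]
    return hits / len(test)

def fitness(mask, X, y, folds):
    """Mean cross-validated accuracy of one feature mask (0.0 if it is empty)."""
    if not any(mask):
        return 0.0
    scores = []
    for f, test in enumerate(folds):
        train = [i for g, fold in enumerate(folds) if g != f for i in fold]
        scores.append(centroid_accuracy(X, y, train, test, mask))
    return sum(scores) / len(scores)

def ga_select(X, y, pop_size=20, generations=15, p_mut=0.1, seed=0):
    """Evolve binary feature masks; returns (best mask, its score, best-per-generation)."""
    rng = random.Random(seed)
    n = len(X[0])
    folds = stratified_folds(y, 5, rng)
    cache = {}

    def score(mask):
        key = tuple(mask)
        if key not in cache:
            cache[key] = fitness(mask, X, y, folds)
        return cache[key]

    pop = [[rng.randint(0, 1) for _ in range(n)] for _ in range(pop_size)]
    history = []
    for _ in range(generations):
        elite = max(pop, key=score)
        history.append(score(elite))
        children = [elite[:]]  # elitism: the best mask always survives
        while len(children) < pop_size:
            a = max(rng.sample(pop, 3), key=score)  # tournament selection
            b = max(rng.sample(pop, 3), key=score)
            cut = rng.randrange(1, n)               # one-point crossover
            child = a[:cut] + b[cut:]
            children.append([bit ^ (rng.random() < p_mut) for bit in child])
        pop = children
    best = max(pop, key=score)
    return best, score(best), history

# Synthetic data (assumption): 60 samples, 8 features, only the first two informative.
data_rng = random.Random(1)
y = [i % 2 for i in range(60)]
X = [[data_rng.gauss(2.0 * t if j < 2 else 0.0, 1.0) for j in range(8)] for t in y]

best_mask, best_score, history = ga_select(X, y)
print("selected features:", best_mask, "CV accuracy:", round(best_score, 3))
```

The same loop could serve the parameter-optimization level by letting each individual encode classifier hyperparameters instead of a feature mask, although how the paper couples the two levels is not detailed in this excerpt.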

Section snippets

Material and methods

This section discusses the material and methods used for this work. The data used and the 10 well-known machine learning methods are described in this section.
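As a rough illustration of how the ten methods named in the introduction map onto scikit-learn estimators, the sketch below evaluates each one with stratified 5-fold cross-validation and reports accuracy and F1-Score. The class choices follow the names given in the text, but everything else is an assumption: the synthetic 165-sample dataset from `make_classification` stands in for the HCC data, all hyperparameters other than the 1000 trees and the 200-neuron upper bound are library defaults or arbitrary, and min-max scaling is used only as one plausible normalization.

```python
import warnings

from sklearn.datasets import make_classification
from sklearn.discriminant_analysis import (
    LinearDiscriminantAnalysis,
    QuadraticDiscriminantAnalysis,
)
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_validate
from sklearn.naive_bayes import GaussianNB
from sklearn.neighbors import KNeighborsClassifier
from sklearn.neural_network import MLPClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import MinMaxScaler
from sklearn.svm import SVC, LinearSVC, NuSVC

warnings.filterwarnings("ignore")  # silence convergence warnings in this demo

# Synthetic stand-in for the HCC data (assumption): 165 samples, 20 features.
X, y = make_classification(
    n_samples=165, n_features=20, n_informative=8, n_redundant=0, random_state=0
)

models = {
    "LinSVM": LinearSVC(),
    "SVC": SVC(),  # C-SVC
    "nu-SVC": NuSVC(),
    # One hidden layer; 200 neurons is the upper bound quoted in the text.
    "MLP1": MLPClassifier(hidden_layer_sizes=(200,), max_iter=500, random_state=0),
    "KNN": KNeighborsClassifier(),
    "gaussNB": GaussianNB(),
    "LDA": LinearDiscriminantAnalysis(),
    "QDA": QuadraticDiscriminantAnalysis(),
    "Reglog": LogisticRegression(max_iter=1000),
    "RF": RandomForestClassifier(n_estimators=1000, random_state=0),
}

cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
results = {}
for name, model in models.items():
    # Scaling sits inside the pipeline so it is fitted on training folds only.
    pipeline = make_pipeline(MinMaxScaler(), model)
    scores = cross_validate(pipeline, X, y, cv=cv, scoring=("accuracy", "f1"))
    results[name] = (scores["test_accuracy"].mean(), scores["test_f1"].mean())
    print(f"{name:8s} accuracy={results[name][0]:.3f} F1={results[name][1]:.3f}")
```

The numbers printed here describe the synthetic data only and say nothing about the results reported in the paper.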

Proposed methodology

This work provides an optimization approach for different machine learning techniques using the HCC database. As shown in Fig. 1, the work is performed in 7 steps. After the dataset is obtained, preprocessing is applied to it. For this purpose, we performed the following three steps:

  • a. complete missing categorical attributes with the modal value of a given attribute

  • b. complete missing numeric attributes with the average value of a given attribute

  • c. numerical and categorical attributes are normalized by
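The three preprocessing steps listed above could look roughly like the following. This is a minimal sketch, not the authors' code: the normalization formula is cut off in this excerpt, so min-max scaling to [0, 1] is an assumption, as are the function name `impute_and_normalize`, the toy rows, and the integer coding of categorical attributes.

```python
from statistics import mean, mode

def impute_and_normalize(rows, categorical):
    """Fill missing values (None) column by column, then rescale each column.

    rows        -- list of equal-length lists; None marks a missing value
    categorical -- set of column indices holding integer-coded categories
    """
    n_cols = len(rows[0])
    columns = [[row[j] for row in rows] for j in range(n_cols)]
    cleaned = []
    for j, col in enumerate(columns):
        observed = [v for v in col if v is not None]
        # (a) modal value for categorical columns, (b) average for numeric ones
        fill = mode(observed) if j in categorical else mean(observed)
        filled = [fill if v is None else v for v in col]
        # (c) min-max scaling to [0, 1]; an assumed choice of normalization
        lo, hi = min(filled), max(filled)
        span = hi - lo
        cleaned.append([(v - lo) / span if span else 0.0 for v in filled])
    # Transpose back from columns to rows
    return [list(r) for r in zip(*cleaned)]

# Toy records (hypothetical): columns 0 and 2 are categorical, column 1 is numeric.
rows = [
    [1, 50.0, 0],
    [0, None, 1],
    [None, 70.0, 1],
    [1, 60.0, None],
]
normalized = impute_and_normalize(rows, categorical={0, 2})
print(normalized)
# → [[1.0, 0.0, 0.0], [0.0, 0.5, 1.0], [1.0, 1.0, 1.0], [1.0, 0.5, 1.0]]
```

In a cross-validated experiment, the fill values and the minimum and maximum would normally be computed on the training folds only and then reused on the held-out fold, so that no information leaks from the test data.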

Results

The Python language along with the Pandas, Deap, and Sklearn libraries was used in this work. We used an Intel Core i5-7300HQ 3.5 GHz machine with 16 GB of RAM and a single core. We used the 10 algorithms mentioned above without feature selection in the first experiment and with feature selection in the second experiment. It may be noted that the MLP neural network and RF are the only non-deterministic methods used in this work; the other methods are deterministic. Hence, the training for this method

Discussion

As discussed in the previous sub-sections, our proposed methodology showed the highest performance for the diagnosis of HCC. The improvement in accuracy and F1-Score for SVC, the best classifier, is illustrated in Fig. 2.

It can be seen from Fig. 2 that our proposed algorithm (with feature selection) has improved the performance of SVC. As indicated, the proposed methodology can improve performance on the testing sets and prevent over-fitting on the training sets. This means that the models are too fit to

Conclusion

This study developed an algorithm to automatically detect HCC, which is the most prevalent type of primary liver cancer among adults. We used 10 machine learning techniques with and without feature selection on the HCC dataset. In our new algorithm, a normalization technique was applied during preprocessing. The genetic algorithm (GA) coupled with 5-fold cross-validation was performed twice. The first time, only parameter optimization was carried out. The GA was used in

References (74)

  • P. Pławiak

    Novel methodology of cardiac health recognition based on ECG signals and evolutionary-neural system

    Expert Systems with Applications

    (2018)
  • P. Pławiak et al.

    Classification of tea specimens using novel hybrid artificial intelligence methods

    Sensors and Actuators B: Chemical

    (2014)
  • K. Rzecki et al.

    Person recognition based on touch screen gestures using computational intelligence methods

    Information Sciences

    (2017)
  • M.S. Santos et al.

    A new cluster-based oversampling method for improving survival prediction of hepatocellular carcinoma patients

    Journal of Biomedical Informatics

    (2015)
  • T. Shimizu et al.

    Effectiveness of a clinical knowledge support system for reducing diagnostic errors in outpatient care in Japan: A retrospective study

    International Journal of Medical Informatics

    (2018)
  • S. Singh et al.

    Chemopreventive strategies in hepatocellular carcinoma

    Nature Reviews Gastroenterology and Hepatology

    (2014)
  • M. Sokolova et al.

    A systematic analysis of performance measures for classification tasks

    Information Processing & Management

    (2009)
  • C.F. Tsai et al.

    A class center based approach for missing value imputation

    Knowledge-Based Systems

    (2018)
  • Ö. Yıldırım

    A novel wavelet sequence based on deep bidirectional LSTM network model for ECG signal classification

    Computers in Biology and Medicine

    (2018)
  • Ö. Yıldırım et al.

    Arrhythmia detection using deep convolutional neural network with long duration ECG signals

    Computers in Biology and Medicine

    (2018)
  • O. Yildirim et al.

    An efficient compression of ECG signals using deep convolutional autoencoders

    Cognitive Systems Research

    (2018)
  • B. Zheng et al.

    Breast cancer diagnosis based on feature extraction using a hybrid of K-means and support vector machine algorithms

    Expert Systems with Applications

    (2014)
  • X. Zhi et al.

    Efficient discriminative clustering via QR decomposition-based linear discriminant analysis

    Knowledge-Based Systems

    (2018)
  • A. Abajian et al.

    Predicting treatment response to intra-arterial therapies for hepatocellular carcinoma with the use of supervised machine learning—An artificial intelligence concept

    Journal of Vascular and Interventional Radiology

    (2018)
  • M. Abdar et al.

    Improving the diagnosis of liver disease using multilayer perceptron neural network and boosted decision trees

    Journal of Medical and Biological Engineering

    (2017)
  • M. Abdar et al.

    Impact of patients’ gender on Parkinson’s disease using classification algorithms

    Journal of AI and Data Mining

    (2017)
  • M. Abdar et al.

    A new nested ensemble technique for automated diagnosis of breast cancer

    Pattern Recognition Letters

    (2018)
  • A. Aggarwal et al.

    Grid search analysis of nu-SVC for text-dependent speaker-identification

  • A.H. Alkeshuosh et al.

    Using PSO algorithm for producing best rules in diagnosis of heart disease

  • A. Askarzadeh

    A memory-based genetic algorithm for optimization of power generation in a microgrid

    IEEE Transactions on Sustainable Energy

    (2018)
  • J. Ball et al.

    Improving diagnosis in health care

    (2015)
  • G. Cabibbo et al.

    Multimodal approaches to the treatment of hepatocellular carcinoma

    Nature Reviews Gastroenterology and Hepatology

    (2009)
  • R. Chai et al.

    Improving EEG-based driver fatigue classification using sparse-deep belief networks

    Frontiers in Neuroscience

    (2017)
  • C.C. Chang et al.

    LIBSVM: A library for support vector machines

    ACM Transactions on Intelligent Systems and Technology (TIST)

    (2011)
  • H.C. Chiu et al.

    Mortality predicted accuracy for hepatocellular carcinoma patients with hepatic resection using artificial neural network

    The Scientific World Journal

    (2013)
  • D.R. Cox

    The regression analysis of binary sequences

    Journal of the Royal Statistical Society. Series B Methodological

    (1958)
  • D. DeWaal et al.

    Hexokinase-2 depletion inhibits glycolysis and induces oxidative phosphorylation in hepatocellular carcinoma and sensitizes to metformin

    Nature Communications

    (2018)