Neural Networks

Volume 118, October 2019, Pages 17-31

Cascade interpolation learning with double subspaces and confidence disturbance for imbalanced problems

https://doi.org/10.1016/j.neunet.2019.06.003

Abstract

In this paper, a new ensemble framework named Cascade Interpolation Learning with Double subspaces and Confidence disturbance (CILDC) is designed for imbalanced classification problems. Developed from the Cascade Forest of the Deep Forest, a stacking-based tree ensemble for large-scale problems with few hyper-parameters, CILDC generalizes the cascade model to a wider range of base classifiers. Specifically, CILDC integrates base classifiers through a double subspaces strategy and random under-sampling preprocessing. Further, a simple but effective confidence disturbance technique is introduced to CILDC to correct the threshold deviation caused by imbalanced samples. In detail, disturbance coefficients are multiplied with the confidence vectors before interpolation at each level of CILDC, so that the ideal threshold can be learned adaptively through the cascade structure. Furthermore, both Random Forests and Naive Bayes are suitable base classifiers for CILDC. Comprehensive comparison experiments on typical imbalanced datasets demonstrate both the effectiveness and the generality of CILDC.

Introduction

Imbalanced problems have attracted wide attention in the last decade. In imbalanced data, the minority class contains far fewer samples than the majority classes, yet in many real-world problems the accuracy on the minority class is the more important one (Krawczyk, 2016). However, imbalanced problems pose serious challenges to traditional classification methods, which are usually misled into favoring the majority classes (Estabrooks, Jo, & Japkowicz, 2004). To address this issue, related methods can be categorized into two groups, working at the data level and at the algorithm level (Chawla, Japkowicz, & Kotcz, 2004). The former includes under-sampling (Donoho & Tanner, 2010), over-sampling (Zhu, Lin, & Liu, 2017) and hybrid strategies (Ramentol, Caballero, Bello, & Herrera, 2012), which are collectively called re-sampling, while the latter contains ensemble learning (Galar et al., 2012, Yu et al.), cost-sensitive learning (Zhou, 2011) and decision threshold adjusting (Gao et al., 2014, Tian et al., 2015).

It is noteworthy that the approach named Data Processing based Ensemble (DPE) combines methods from both levels and can deal with imbalanced and noisy problems effectively (Galar et al., 2012, Sun et al., 2015, Yu et al.). In DPE, base classifiers are usually trained on samples processed by various re-sampling methods, leading to over-sampling based (Chawla et al., 2003, Wang and Yao, 2009), under-sampling based (Seiffert, Khoshgoftaar, Van Hulse, & Napolitano, 2010) and hybrid re-sampling based (Lu, Cheung, & Tang, 2016) ensemble approaches. According to the ensemble strategy adopted, DPE can be further classified into three groups: bagging-based, boosting-based, and hybrid methods (Galar et al., 2012). The ensemble strategy can partly compensate for the information lost during re-sampling. On the other hand, the Random Subspace Method (RSM) (Ho, 1998) was proposed to avoid overfitting a few predictive features while ignoring other underestimated ones; it extends the idea of bootstrap sampling to the feature level and is also called feature bagging or attribute bagging (Bryll, Gutierrez-Osuna, & Quek, 2003). Furthermore, Asymmetric Bagging and Random Subspace SVM (ABRS-SVM) (Tao, Tang, Li, & Wu, 2006) shows that bagging over samples and features simultaneously can largely improve classification accuracy. Yu et al. proposed a hybrid adaptive ensemble learning framework that adjusts the weights of each base classifier and searches the random subspace set for better performance (Yu, Li, Liu, & Han, 2015). In (Yu & Chen et al., 2016), local and global information is added to a hybrid KNN model to further address sparse, imbalanced and noisy problems.
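As a rough illustration of this double-subspace idea (sample-level under-sampling bagging combined with random feature subspaces), the following Python sketch builds such an ensemble. All helper names, the choice of Random Forest members, the assumption that label 1 is the minority class, and the parameter values are illustrative, not taken from the cited works.

```python
# Minimal sketch of a double-subspace ensemble: each member sees a random
# under-sampled (balanced) subset of the rows and a random subset of the columns.
import numpy as np
from sklearn.ensemble import RandomForestClassifier

def build_double_subspace_ensemble(X, y, n_estimators=10, feature_frac=0.7, seed=0):
    rng = np.random.default_rng(seed)
    minority = np.flatnonzero(y == 1)   # assumption: 1 = minority, 0 = majority
    majority = np.flatnonzero(y == 0)
    members = []
    for _ in range(n_estimators):
        # Random under-sampling: draw as many majority rows as there are minority rows.
        maj_sample = rng.choice(majority, size=len(minority), replace=False)
        rows = np.concatenate([minority, maj_sample])
        # Random feature subspace: keep a fraction of the columns.
        cols = rng.choice(X.shape[1], size=max(1, int(feature_frac * X.shape[1])), replace=False)
        clf = RandomForestClassifier(n_estimators=50, random_state=0)
        clf.fit(X[np.ix_(rows, cols)], y[rows])
        members.append((clf, cols))
    return members

def predict_proba_ensemble(members, X):
    # Average the positive-class probabilities over all ensemble members.
    probs = [clf.predict_proba(X[:, cols])[:, 1] for clf, cols in members]
    return np.mean(probs, axis=0)
```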

The crucial problem of a traditionally generated hyperplane on an imbalanced dataset is that the threshold may be excessively deviated under the influence of the majority samples. That is why, on imbalanced problems, the Area Under the ROC Curve (AUC) of most conventional algorithms is usually much better than the mean of the True Positive Rate (TPR) and the True Negative Rate (TNR) (Van Hulse, Khoshgoftaar, & Napolitano, 2007). This phenomenon indicates that these methods achieve acceptable ranking-based accuracy, but their default classification thresholds cannot be guaranteed to be optimal. In practice, these thresholds are often fine-tuned heuristically during the prediction stage. To alleviate this problem, Gao et al., 2014, Tian et al., 2015 proposed several new thresholds for Fisher and Pseudo-Inverse linear discriminants for imbalanced problems. Even so, an appropriate threshold can only be obtained through heuristic methods or defined empirically. Therefore, a learning mechanism is needed to find the most suitable classification threshold.
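To make the threshold-deviation effect concrete, the small sketch below (illustrative only, not from the paper) trains a plain logistic regression on a synthetic 95:5 imbalanced dataset: the AUC is typically high, while the default 0.5 cut-off yields a poor TPR, and a swept threshold improves the mean of TPR and TNR.

```python
# Illustration of threshold deviation on imbalanced data: good ranking (AUC)
# but a poorly placed default decision threshold.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score, confusion_matrix

X, y = make_classification(n_samples=5000, weights=[0.95, 0.05], random_state=0)
clf = LogisticRegression(max_iter=1000).fit(X, y)
scores = clf.predict_proba(X)[:, 1]

def tpr_tnr(y_true, y_pred):
    y_pred = np.asarray(y_pred).astype(int)
    tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
    return tp / (tp + fn), tn / (tn + fp)

print("AUC:", roc_auc_score(y, scores))
print("TPR/TNR at default 0.5:", tpr_tnr(y, scores >= 0.5))
# Sweep candidate thresholds and keep the one maximizing (TPR + TNR) / 2.
best = max(np.linspace(0.01, 0.99, 99), key=lambda t: sum(tpr_tnr(y, scores >= t)) / 2)
print("Best threshold:", best, "TPR/TNR:", tpr_tnr(y, scores >= best))
```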

Following the DPE methods with a double bagging strategy and the excessive threshold deviation problem, this paper proposes the Cascade Interpolation Learning with Double subspaces and Confidence disturbance (CILDC) for imbalanced problems. CILDC is inspired by the stacking ensemble framework called Cascade Forest (Zhou & Feng, 2017). Primarily, the double subspaces strategy generalizes CILDC as a cascade interpolation model that accommodates more base classifiers for imbalanced problems. In detail, base classifiers in CILDC are preprocessed with Random Under-sampling (He & Garcia, 2009) and a bagging strategy to balance the scales of the majority and minority classes. Meanwhile, random feature subspaces are selected to avoid the overfitting caused by interpolating the confidence vectors. Accordingly, CILDC is ensembled over random subspaces of both samples and features simultaneously, which we call the double subspaces ensemble strategy. Furthermore, a confidence disturbance technique is applied in the cascade interpolation model: the probabilities predicted by the previous cascade level are multiplied by a disturbance coefficient, and the altered probabilities are interpolated into the original features as the input of the next level. Owing to this layer-by-layer adjustment of the blended features, the cascade structure of CILDC can reach the ideal classification threshold effectively. As a more general framework than Cascade Forest, CILDC can accept more algorithms as base classifiers. In this paper, we use Random Forests and Naive Bayes as the base classifiers of CILDC to demonstrate its excellent performance on imbalanced data.
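The sketch below outlines one possible reading of this cascade interpolation with confidence disturbance, reusing the hypothetical build_double_subspace_ensemble helper from the earlier sketch. The coefficient eta, the in-sample confidence estimates, and all function names are assumptions made for illustration; the paper's exact formulation (e.g., cross-validated confidences as used in Cascade Forest) may differ.

```python
# Sketch of cascade interpolation: each level's (disturbed) class probabilities
# are concatenated to the current features and fed to the next level.
import numpy as np

def augment_with_confidence(X, members, eta=1.0):
    """Concatenate eta-scaled confidence vectors from the ensemble to X."""
    confidences = [clf.predict_proba(X[:, cols]) for clf, cols in members]
    blended = np.hstack(confidences) * eta        # confidence disturbance
    return np.hstack([X, blended])

def cascade_fit(X, y, n_levels=3, eta=1.2, seed=0):
    levels, X_aug = [], X
    for level in range(n_levels):
        # Confidences are computed in-sample here for brevity; the original
        # Cascade Forest uses cross-validated estimates instead.
        members = build_double_subspace_ensemble(X_aug, y, seed=seed + level)
        levels.append(members)
        X_aug = augment_with_confidence(X_aug, members, eta)  # input of the next level
    return levels

def cascade_predict_proba(levels, X, eta=1.2):
    X_aug = X
    for members in levels[:-1]:
        X_aug = augment_with_confidence(X_aug, members, eta)
    # Final prediction: average positive-class probability of the last level.
    return np.mean([clf.predict_proba(X_aug[:, cols])[:, 1] for clf, cols in levels[-1]], axis=0)
```

Note that eta must be the same at fit and prediction time so that the augmented feature layout seen by each level matches the one it was trained on.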

The characteristics of CILDC can be highlighted as follows:

  • CILDC can be seen as a generalization of the Cascade Forest learning framework equipped with the double subspaces strategy.

  • CILDC exploits the double subspaces strategy, which combines Random Under-sampling bagging based preprocessing with random feature subspace selection, to make all base classifiers competitive on imbalanced problems.

  • CILDC ingeniously combines the confidence disturbance technique with the cascade structure, improving on traditional DPE methods by giving the base classifiers optimal thresholds.

The rest of this paper is organized as follows. Section 2 provides essential preliminaries on Random Forests and Naive Bayes, which serve as the base classifiers in CILDC, together with a brief introduction to Cascade Forest. Section 3 describes the double subspaces strategy and the confidence disturbance technique in detail, and then presents the overall formulation and the time complexity of CILDC. Section 4 reports experimental results and related analysis on the KEEL imbalanced datasets. Finally, the paper is concluded in Section 5.

Section snippets

Preliminary

In this section, brief introductions to Random Forests and Naive Bayes are given first, followed by a short synopsis of the Cascade Forest.

Cascade interpolation learning with double subspaces and confidence disturbance

In this section, the overall architecture of CILDC is shown in Section 3.1 first. Since the double subspaces strategy and the confidence disturbance technique are introduced to improve the performance of CILDC on imbalanced problems, they are described in Section 3.2 (Training with the double subspaces strategy) and Section 3.3 (Confidence disturbance technique), respectively. Then, the testing rule and the selection of base classifiers in CILDC are summarized in Section 3.4. Finally, the

Experiments

In this section, we first introduce 30 binary and 12 multi-class KEEL datasets (Triguero & González et al., 2017). Then, the evaluation criteria and the compared algorithms are presented. Subsequently, experimental results, related discussions, and statistical tests with both the Friedman and Nemenyi tests (Demšar, 2006, Yu et al., 2017) are supplied. Finally, we discuss the influence of the hyper-parameter η and the cascade level L.
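As a hedged sketch of the statistical comparison mentioned here, the snippet below runs the Friedman test with scipy and a Nemenyi post-hoc test via the third-party scikit-posthocs package on a randomly generated score matrix that merely stands in for per-dataset results; it does not reproduce the paper's numbers.

```python
# Friedman test over per-dataset scores of several algorithms, followed by a
# Nemenyi post-hoc test (dummy scores used purely to show the calls).
import numpy as np
from scipy.stats import friedmanchisquare
import scikit_posthocs as sp

scores = np.random.default_rng(0).uniform(0.6, 0.9, size=(30, 4))  # 30 datasets x 4 algorithms
stat, p = friedmanchisquare(*scores.T)
print("Friedman statistic:", stat, "p-value:", p)
print(sp.posthoc_nemenyi_friedman(scores))  # pairwise p-values between algorithms
```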

Conclusions

In this paper, we propose a novel Cascade Interpolation Learning with Double subspaces and Confidence disturbance, called CILDC. Firstly, CILDC builds a generalization of Cascade Forest based on double random subspace selection over both samples and features. Then, the Random Under-sampling preprocessing makes all base classifiers in CILDC fit for imbalanced problems. Moreover, the confidence disturbance technique is introduced into CILDC to learn the ideal threshold dynamically

Acknowledgments

This work is supported by Natural Science Foundation of China under Grant No. 61672227, “Shuguang Program” supported by Shanghai Education Development Foundation and Shanghai Municipal Education Commission, and National Key R&D Program of China under Grant No. 2018YFC0910500.

References (45)

  • Chen, T., et al. Xgboost: A scalable tree boosting system.
  • Demšar, J. Statistical comparisons of classifiers over multiple data sets. Journal of Machine Learning Research (JMLR) (2006).
  • Donoho, D. L., et al. Precise undersampling theorems. Proceedings of the IEEE (2010).
  • Džeroski, S., et al. Is combining classifiers with stacking better than selecting the best one? Machine Learning (2004).
  • Estabrooks, A., et al. A multiple resampling method for learning from imbalanced data sets. Computational Intelligence (2004).
  • Fernández-Delgado, M., et al. Do we need hundreds of classifiers to solve real world classification problems? Journal of Machine Learning Research (JMLR) (2014).
  • Freund, Y., et al. A desicion-theoretic generalization of on-line learning and an application to boosting.
  • Friedman, M. The use of ranks to avoid the assumption of normality implicit in the analysis of variance. Journal of the American Statistical Association (1937).
  • Friedman, N., et al. Bayesian network classifiers. Machine Learning (1997).
  • Galar, M., et al. A review on ensembles for the class imbalance problem: Bagging-, boosting-, and hybrid-based approaches. IEEE Transactions on Systems, Man, and Cybernetics, Part C (Applications and Reviews) (2012).
  • Gao, D., et al. Integrated Fisher linear discriminants: An empirical study. Pattern Recognition (2014).
  • He, H., et al. Learning from imbalanced data. IEEE Transactions on Knowledge and Data Engineering (2009).