M2SPL: Generative multiview features with adaptive meta-self-paced sampling for class-imbalance learning

https://doi.org/10.1016/j.eswa.2021.115999

Highlights

  • A multiview adaptive meta-self-paced learning method is presented for imbalanced classification.

  • Multiple feature selection approaches are used to generate multiview features.

  • M2SPL performs multiple high-quality sampling procedures across multiple views.

  • M2SPL outperforms competing methods, demonstrating the superiority of our approach.

Abstract

Class-imbalance learning is an important research area that has drawn continued attention in various real-world applications for many years. Undersampling is a key class-imbalance learning method for obtaining a balanced class distribution, but it may discard potentially crucial samples and can be influenced by outliers or noise in imbalanced data. Multiview learning methods have shown that models trained on different views can help each other improve their performance and robustness, yet existing imbalance learning approaches rely only on single-view samples. In this paper, we propose a multiview feature imbalance sampling method via meta self-paced learning (M2SPL) to effectively choose high-quality samples and separate adjacent features, thereby improving the robustness of the trained model. Our proposed method has two advantages: (1) An adaptive reweight generation process is a pivotal part of M2SPL; the adaptive density-based sample reweighting mechanism accounts for noisy and intractable samples to improve the robustness of the model. (2) Multiview feature learning avoids large loss values when learning a robust model from the original data and enhances the discrimination capability of the model. Comparison with existing sampling approaches shows that our proposed M2SPL approach significantly improves classification performance, with average increases in the F1-score and G-mean of 15.4% and 12.5%, respectively. Finally, our experimental results pass the Friedman and Holm tests, indicating that our model yields a significant improvement over existing methods.

Introduction

In recent years, imbalance learning has attracted significant interest from academia, industry, and government funding agencies because most standard learning algorithms assume or expect balanced class distributions or equal misclassification costs (González et al., 2019, Guzmán-Ponce et al., 2021). In particular, in machine learning methods, imbalance learning is typically used to identify relationships among instance features. However, datasets may include many majority-class samples and only a few minority-class samples, which can introduce overfitting during the learning phase and cause training failures (Razavi-Far, Farajzadeh-Zanajni, Wang, Saif, & Chakrabarti, 2021). Current mainstream imbalance learning approaches introduce sampling methods such as synthetic sampling with data generation, random oversampling and undersampling, and informed undersampling (He & Garcia, 2009). By altering the size of the training sets, typically by picking fewer samples from the majority class or repeating samples from the minority class, these approaches create more balanced training datasets and achieve reasonable learning performance. However, these sampling methods commonly choose samples at random, which may lead to crucial data being discarded.
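As a brief illustration (our addition, not part of the original paper), the two basic resampling strategies mentioned above are available in the imbalanced-learn library; a minimal sketch:

```python
# Minimal sketch of random undersampling and oversampling,
# using the imbalanced-learn library (assumed installed as `imblearn`).
import numpy as np
from collections import Counter
from imblearn.over_sampling import RandomOverSampler
from imblearn.under_sampling import RandomUnderSampler

rng = np.random.default_rng(0)
# Toy imbalanced data: 950 majority samples, 50 minority samples.
X = rng.normal(size=(1000, 5))
y = np.array([0] * 950 + [1] * 50)

# Random undersampling: drop majority samples until the classes balance
# (this is exactly where crucial majority samples may be discarded).
X_under, y_under = RandomUnderSampler(random_state=0).fit_resample(X, y)
# Random oversampling: repeat minority samples until the classes balance.
X_over, y_over = RandomOverSampler(random_state=0).fit_resample(X, y)

print(Counter(y_under))  # Counter({0: 50, 1: 50})
print(Counter(y_over))   # Counter({0: 950, 1: 950})
```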

Recently, the self-paced learning (SPL) approach, which mimics the cognitive mechanism by which humans and animals gradually learn from easy to difficult tasks, was first proposed (Kumar, Packer, & Koller, 2010) to retain high-quality samples and remove noisy samples. It has since been widely applied in disease diagnosis (Wang et al., 2020), molecular descriptor selection (Xia, Wang, Cao and Liang, 2019), and image recognition (Yin, Liu, & Sun, 2021). When setting the self-paced parameter, an initialization value that is too large or too small will miss the best optimization value and increase the computational cost. A recent work (Yao, Wei, Huang, & Li, 2019) has shown that a meta-learner can efficiently overcome the parameter-setting problem of self-paced learning (Khodak, Balcan, & Talwalkar, 2019). Hence, in this work, a meta-learner is used to effectively set the parameters of the self-paced function.
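For context, a brief illustration (our addition, following Kumar et al., 2010, rather than text reproduced from this paper): the classic hard-weighting SPL objective is optimized jointly over the model parameters and binary per-sample weights, and the pace parameter λ is exactly the quantity the meta-learner is asked to set:

```latex
% Hard-weighting self-paced objective (Kumar et al., 2010), jointly over
% model parameters w and binary sample weights v:
\min_{\mathbf{w},\,\mathbf{v}\in\{0,1\}^{n}}
  \sum_{i=1}^{n} v_{i}\, L\bigl(y_{i}, f(x_{i};\mathbf{w})\bigr)
  \;-\; \lambda \sum_{i=1}^{n} v_{i}
% For fixed w, the optimum is v_i = 1 iff L(y_i, f(x_i; w)) < lambda,
% so lambda admits only low-loss ("easy") samples and is gradually
% increased to include harder ones as training proceeds.
```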

Multiview learning is a rapidly developing direction of machine learning research that uses multiple distinct feature sets, and it has great potential for practical use. The development of the multiview learning mechanism was primarily motivated by the properties of data from multiple templates, in which data samples are described by various feature domains or “views”. The applications of multiview learning range from dimensionality reduction (White, Zhang, Schuurmans, & Yu, 2012) and active learning (Zhang & Sun, 2010) to clustering (Saini, Bansal, Saha, & Bhattacharyya, 2021). Recent works (Huang et al., 2019, Xiao et al., 2021, Zhang et al., 2018) have fused similarity measurements from diverse views to construct a graph for multiview samples, successfully extending conventional multiview spectral clustering. These works inspire us to bring the multiview idea into supervised learning, i.e., to generate multiview samples by analyzing the correlations between features in imbalance learning.

Zhang and Shen (2012) showed that learning with datasets that contain joint multiple-style data from multiview learning performs better than learning with single-view data, because multiview datasets contain more valuable information and can lead to more accurate classification. However, directly mining the useful information contained in multiview data has drawbacks; for example, if all views are combined directly, random noise in the data will degrade the performance of every view in the combined representation. Additionally, high-dimensional data may lead to an invalid model and overfitting due to the curse of dimensionality (Cheng, Zhu, Song, Wen, & Wei, 2017). Moreover, neighboring features may have similar characteristics, which can easily cause overfitting during model training. Even if dimensionality is reduced, some key features may be discarded; thus, new methods are needed to carefully transform single-view features into multiview features so as to train a more robust model. Below, we review, to the best of our knowledge, the available methods for generating multiview features and then combine them with meta-SPL for imbalance learning problems.

Feature selection methods are efficient techniques for improving model performance and have been employed to generate multiview features. Depending on how the various feature subsets are generated, feature selection techniques can be divided into three categories: filter (Yu & Liu, 2003), wrapper (Maldonado & Weber, 2009), and embedded (Geng et al., 2020, Maldonado and López, 2018). They achieve higher learning accuracy even when individual domain knowledge is unavailable or unexplainable. Hence, these methods can be used to generate multiview features.
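As a hedged sketch (ours; the paper's exact selectors are defined in Section 3), scikit-learn provides one representative of each of the three categories, so three "views" of the same data could be produced as follows:

```python
# Illustrative only: one filter, one wrapper, and one embedded selector,
# each producing a distinct feature "view" of the same dataset.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import RFE, SelectFromModel, SelectKBest, f_classif
from sklearn.linear_model import LogisticRegression

X, y = make_classification(n_samples=200, n_features=30, random_state=0)

# Filter: rank features by a univariate statistic (ANOVA F-score).
view_filter = SelectKBest(f_classif, k=10).fit_transform(X, y)

# Wrapper: recursively eliminate features using a model's performance.
view_wrapper = RFE(LogisticRegression(max_iter=1000),
                   n_features_to_select=10).fit_transform(X, y)

# Embedded: selection happens inside model training (tree importances).
view_embedded = SelectFromModel(RandomForestClassifier(random_state=0),
                                max_features=10,
                                threshold=-np.inf).fit_transform(X, y)

print(view_filter.shape, view_wrapper.shape, view_embedded.shape)
# (200, 10) (200, 10) (200, 10)
```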

In this study, we exploit the benefits of multiview adaptive sampling and meta-SPL to address the imbalanced distribution of samples. Specifically, we use three feature selection methods to produce multiview feature data from the original datasets. Then, we apply the meta-SPL technique to select initial high-quality samples and thereby avoid the effect of noisy sampling. When multiview adaptive sampling and meta-SPL are carried out together to select high-quality samples, a productive and robust training model can be achieved.

The rest of this paper is organized as follows. We first provide a brief survey of the work relevant to this paper in Section 2. Then, in Section 3, a multiview imbalance meta-SPL framework is proposed to produce multiview features and process noisy samples to improve the classification performance. Section 4 presents the experimental results obtained on 20 real datasets. The conclusion and future work are discussed in Section 5.

Section snippets

Related works

Sampling is the fundamental way to obtain a balanced data distribution from imbalanced samples (Jing et al., 2021). For example, Hoyos-Osorio et al. (2021) investigated a Relevant Information-based UnderSampling (RIUS) approach that selects the most relevant examples from the majority class to improve learning performance on imbalanced datasets. Chawla et al. (2002) developed the Synthetic Minority Oversampling Technique (SMOTE), which increases the number of new synthetic minority-class data
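To make SMOTE's generation step concrete (our illustration, not the authors' code), each synthetic point is interpolated between a minority sample and one of its minority-class nearest neighbors:

```python
# Minimal sketch of SMOTE's core interpolation step:
# x_new = x_i + u * (x_neighbor - x_i), with u ~ Uniform(0, 1).
import numpy as np
from sklearn.neighbors import NearestNeighbors

def smote_like(X_min, n_new, k=5, seed=0):
    """Generate n_new synthetic minority samples (illustrative only)."""
    rng = np.random.default_rng(seed)
    nn = NearestNeighbors(n_neighbors=k + 1).fit(X_min)  # +1: self is a neighbor
    _, idx = nn.kneighbors(X_min)
    new = []
    for _ in range(n_new):
        i = rng.integers(len(X_min))
        j = idx[i][rng.integers(1, k + 1)]  # skip column 0 (the sample itself)
        u = rng.random()
        new.append(X_min[i] + u * (X_min[j] - X_min[i]))
    return np.asarray(new)

# Usage sketch: X_syn = smote_like(X[y == 1], n_new=100)
```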

Methods

Multiview learning involves learning with multiple views, which is superior to the naive approach of using one view or concatenating all views (Cheng et al., 2017). Even when multiple natural views are unavailable, manually generating multiple views can still favorably improve learning (Huang et al., 2019). Here, we use multiple feature selection approaches to generate multiview features and thereby improve the effectiveness of the training model.
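The snippet below is our hedged sketch of how hard self-paced selection could operate within each generated view; the names `views`, `lam`, and `grow` are hypothetical stand-ins for M2SPL's meta-learned pace schedule, which the paper defines:

```python
# Illustrative hard self-paced selection within one feature view.
import numpy as np
from sklearn.linear_model import LogisticRegression

def self_paced_subset(X, y, lam=0.7, grow=1.3, rounds=5):
    """Keep low-loss samples each round, gradually admitting harder ones.

    Assumes y is a NumPy array of integer labels 0..K-1 and that every
    class stays represented. `lam`/`grow` are illustrative values, not
    the meta-learned schedule used by M2SPL.
    """
    keep = np.ones(len(y), dtype=bool)
    for _ in range(rounds):
        clf = LogisticRegression(max_iter=1000).fit(X[keep], y[keep])
        # Per-sample negative log-likelihood under the current model.
        proba = clf.predict_proba(X)[np.arange(len(y)), y]
        losses = -np.log(np.clip(proba, 1e-12, None))
        selected = losses < lam  # hard SPL weights: v_i = 1 iff loss_i < lam
        if not selected.any():   # guard against an empty selection
            break
        keep = selected
        lam *= grow              # "age" the learner: admit harder samples
    return keep

# One denoised subset per generated view (hypothetical `views` list):
# subsets = [self_paced_subset(X_view, y) for X_view, y in views]
```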

Datasets

The original datasets containing the expression profiles were obtained from publicly available online repositories, such as the US National Library of Medicine and the UC Irvine Machine Learning Repository. The details of all datasets are given in Table 1.

Table 1 shows the details of the training samples, covering multiple types of datasets, such as biomedical, finance, and recognition datasets. The largest imbalance ratio is 578.88, which occurs in the credit fraud
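For clarity (our note, using the standard definition rather than one quoted from the paper), the imbalance ratio is the majority-to-minority count ratio:

```python
from collections import Counter

def imbalance_ratio(y):
    """Majority-class count divided by minority-class count (standard definition)."""
    counts = Counter(y)
    return max(counts.values()) / min(counts.values())

# An imbalance ratio of 578.88 means roughly 579 majority samples
# for every minority sample.
```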

Conclusion

This paper proposed a multiview meta-SPL sampling method for imbalanced classification, evaluated on 20 datasets. The M2SPL method reduces the noise of imbalanced samples, allowing us to remove some irrelevant and redundant samples and identify a suitable subset, thus improving classification performance. Compared with existing methods, our proposed M2SPL approach achieves average improvements in the F1-score and G-mean of 15.4% and 12.5%, respectively. Notably, even if the imbalance
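For reference (our addition; both metrics are standard and not specific to this paper), the reported scores can be computed as follows, where the G-mean is the geometric mean of the per-class recalls:

```python
# Standard computation of the two reported metrics (illustrative).
from imblearn.metrics import geometric_mean_score  # geometric mean of recalls
from sklearn.metrics import f1_score

y_true = [0, 0, 0, 0, 1, 1]
y_pred = [0, 0, 0, 1, 1, 0]

print(f1_score(y_true, y_pred))              # 0.5: harmonic mean of precision/recall
print(geometric_mean_score(y_true, y_pred))  # sqrt(recall_0 * recall_1) for binary
```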

CRediT authorship contribution statement

Qingyong Wang: Conceptualization, Methodology, Writing – original draft. Yun Zhou: Writing – review & editing, Methodology, Supervision, Validation. Zehong Cao: Writing – review & editing, Validation. Weiming Zhang: Writing – review & editing, Validation.

Declaration of Competing Interest

The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.

Acknowledgments

This work was supported by the National Natural Science Foundation of China (Grant No. 61703416), the Training Program for Excellent Young Innovators of Changsha (Grant No. KQ2009009), the Huxiang Youth Talent Support Program (Grant No. 2021RC3076), and the Postgraduate Scientific Research Innovation Project of Hunan Province (Grant Nos. CX20190040 and CX20190038).

References

  • Jia, C., et al. S-SulfPred: A sensitive predictor to capture S-sulfenylation sites based on a resampling one-sided selection undersampling-synthetic minority oversampling technique. Journal of Theoretical Biology (2017).

  • Kohavi, R., et al. Wrappers for feature subset selection. Artificial Intelligence (1997).

  • Lu, M. Embedded feature selection accounting for unknown data heterogeneity. Expert Systems with Applications (2019).

  • Maldonado, S., et al. Dealing with high-dimensional class-imbalanced datasets: Embedded feature selection for SVM classification. Applied Soft Computing (2018).

  • Maldonado, S., et al. A wrapper method for feature selection using support vector machines. Information Sciences (2009).

  • Saini, N., et al. Multi-objective multi-view based search result clustering using differential evolution framework. Expert Systems with Applications (2021).

  • Urbanowicz, R. J., et al. Relief-based feature selection: Introduction and review. Journal of Biomedical Informatics (2018).

  • Wang, Q., et al. Adaptive sampling using self-paced learning for imbalanced cancer data pre-diagnosis. Expert Systems with Applications (2020).

  • Yin, T., et al. Self-paced active learning for deep CNNs via effective loss function. Neurocomputing (2021).

  • Zhang, D., et al. Multi-modal multi-task learning for joint prediction of multiple regression and classification variables in Alzheimer’s disease. Neuroimage (2012).

  • Zhang, Q., et al. Multiple-view multiple-learner active learning. Pattern Recognition (2010).

  • Zhao, J., et al. Multi-view learning overview: Recent progress and new challenges. Information Fusion (2017).

  • Bechtle, S., et al. Meta learning via learned loss.

  • Chawla, N. V., et al. SMOTE: Synthetic minority over-sampling technique. Journal of Artificial Intelligence Research (2002).

  • Chen, Z., et al. Class-imbalanced deep learning via a class-balanced ensemble. IEEE Transactions on Neural Networks and Learning Systems (2021).

  • Chen, X., et al. Enhanced recursive feature elimination.

  • Demšar, J. Statistical comparisons of classifiers over multiple data sets. Journal of Machine Learning Research (2006).

  • Deng, Z., et al. Robust relief-feature weighting, margin maximization, and fuzzy optimization. IEEE Transactions on Fuzzy Systems (2010).

  • Deng, H., et al. Feature selection via regularized trees.