M2SPL: Generative multiview features with adaptive meta-self-paced sampling for class-imbalance learning
Introduction
In recent years, imbalance learning has attracted significant interest from academia, industry, and government funding agencies because most standard learning algorithms assume balanced class distributions or equal misclassification costs (González et al., 2019, Guzmán-Ponce et al., 2021). In machine learning, imbalance learning is typically concerned with identifying relationships among instance features when datasets contain many majority-class samples and only a few minority-class samples, which may introduce overfitting during the learning phase and cause training failures (Razavi-Far, Farajzadeh-Zanajni, Wang, Saif, & Chakrabarti, 2021). Mainstream imbalance learning approaches introduce sampling methods such as synthetic sampling with data generation, random oversampling and undersampling, and informed undersampling (He & Garcia, 2009). By altering the size of the training set, typically by removing samples from the majority class or repeating samples from the minority class, these approaches create more balanced training datasets and achieve reasonable learning performance. However, the common practice of choosing samples at random may discard crucial data.
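The two baseline sampling strategies mentioned above can be sketched in a few lines. This is a minimal illustration, not the paper's method; the function and parameter names (`rebalance`, `minority_label`, `strategy`) are our own, and the undersampling branch makes the data-loss risk explicit.

```python
import random

def rebalance(X, y, minority_label, strategy="oversample", seed=0):
    """Rebalance a binary dataset by random over- or undersampling.

    Illustrative sketch only: oversampling repeats minority samples,
    undersampling randomly discards majority samples.
    """
    rng = random.Random(seed)
    minority = [(x, l) for x, l in zip(X, y) if l == minority_label]
    majority = [(x, l) for x, l in zip(X, y) if l != minority_label]
    if strategy == "oversample":
        # repeat minority samples until the classes match in size
        extra = [rng.choice(minority) for _ in range(len(majority) - len(minority))]
        data = majority + minority + extra
    else:
        # randomly drop majority samples -- crucial data may be discarded
        data = rng.sample(majority, len(minority)) + minority
    rng.shuffle(data)
    Xb, yb = zip(*data)
    return list(Xb), list(yb)
```

Both branches yield a balanced class distribution, but only oversampling preserves every original example.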
Recently, the self-paced learning (SPL) approach, which mimics the cognitive mechanism by which humans and animals gradually learn from easy to difficult tasks, was first proposed (Kumar, Packer, & Koller, 2010) to retain high-quality samples and discard noisy ones. It has since been widely applied in disease diagnosis (Wang et al., 2020), molecular descriptor selection (Xia, Wang, Cao, & Liang, 2019), and image recognition (Yin, Liu, & Sun, 2021). When setting the self-paced parameter, an initialization value that is too large or too small will cause the optimizer to miss the best value and increase the computational cost. A recent work (Yao, Wei, Huang, & Li, 2019) has shown that a meta-learner can efficiently overcome the parameter setting problem of self-paced learning (Khodak, Balcan, & Talwalkar, 2019). Hence, in this work, a meta-learner is used to effectively set the parameters of the self-paced function.
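The easy-to-difficult mechanism of standard SPL with the hard weighting scheme can be sketched as an alternating loop: fix the model and admit only samples whose loss falls below a pace parameter λ, retrain on that subset, then enlarge λ so harder samples enter later. This is a generic sketch of vanilla SPL, not the paper's meta-learned variant; `fit`, `loss_fn`, and `growth` are placeholder names.

```python
import numpy as np

def spl_weights(losses, lam):
    """Hard self-paced weights: v_i = 1 if loss_i < lambda, else 0."""
    return (losses < lam).astype(float)

def self_paced_train(X, y, fit, loss_fn, lam, growth=1.1, rounds=3):
    """Alternating SPL optimization (vanilla hard-threshold sketch)."""
    model = fit(X, y)                        # warm start on all data
    for _ in range(rounds):
        v = spl_weights(loss_fn(model, X, y), lam)
        if v.sum() > 0:
            model = fit(X[v == 1], y[v == 1])  # retrain on "easy" subset
        lam *= growth                          # admit harder samples next round
    return model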
Multiview learning is a rapidly developing direction of machine learning research that uses multiple distinct feature sets, and it has great potential for practical use. The development of the multiview learning mechanism was primarily motivated by the properties of data from multiple templates, in which data samples are described by various feature domains or “views”. The applications of multiview learning range from dimensionality reduction (White, Zhang, Schuurmans, & Yu, 2012) and active learning (Zhang & Sun, 2010) to clustering (Saini, Bansal, Saha, & Bhattacharyya, 2021). Recent works (Huang et al., 2019, Xiao et al., 2021, Zhang et al., 2018) have fused similarity measurements from diverse views to construct a graph for multiview samples, successfully extending conventional multiview spectral clustering. These works inspire us to apply the multiview idea to supervised learning, i.e., to generate multiview samples by analyzing the correlations between features in imbalance learning.
Zhang and Shen (2012) showed that learning with datasets that contain joint multiple-style data from multiview learning performs better than learning with single-view data, because multiview datasets contain more valuable information and can lead to more accurate classification. However, directly mining the useful information contained in multiview data has drawbacks; for example, if all views are simply concatenated, random noise in the data will degrade the performance of every view in the combined representation. Additionally, high-dimensional data may lead to the construction of an invalid model and to overfitting due to the curse of dimensionality (Cheng, Zhu, Song, Wen, & Wei, 2017). Moreover, neighboring features may have similar characteristics, which can easily cause overfitting during model training. Even if dimensionality is reduced, some key features may be discarded; thus, new methods are needed to carefully transform single-view features into multiview features to train a more robust model. Below, we review, to the best of our knowledge, the available methods for the generation of multiview features and then combine them with meta-SPL for imbalance learning problems.
Feature selection methods are efficient techniques for improving model performance and have been employed to generate multiview features. Depending on how the various feature subsets are generated, feature selection techniques can be divided into three categories: filter (Yu & Liu, 2003), wrapper (Maldonado & Weber, 2009), and embedded (Geng et al., 2020, Maldonado and López, 2018). They are used to achieve higher learning accuracy when individual domain knowledge is unavailable or unexplainable. Hence, these methods can be used to generate multiview features.
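The three families can be illustrated with simple proxies that each return a feature-index set, i.e., one "view" of the same dataset. These are deliberately simplified stand-ins (a correlation filter, a greedy centroid-gap wrapper, and a least-squares-weight embedded criterion), assumed for illustration; they are not the specific selectors used in the paper.

```python
import numpy as np

def filter_view(X, y, k):
    """Filter: rank features by |correlation| with the label, keep top k."""
    yc = y - y.mean()
    scores = np.abs((X - X.mean(0)).T @ yc) / (X.std(0) * np.sqrt(len(y)) + 1e-12)
    return list(np.argsort(scores)[::-1][:k])

def embedded_view(X, y, k):
    """Embedded proxy: fit a linear model, keep k largest-|weight| features."""
    w, *_ = np.linalg.lstsq(X, y - y.mean(), rcond=None)
    return list(np.argsort(np.abs(w))[::-1][:k])

def wrapper_view(X, y, k):
    """Wrapper proxy: greedy forward selection scored by class-centroid gap."""
    chosen = []
    for _ in range(k):
        best, best_s = None, -np.inf
        for j in range(X.shape[1]):
            if j in chosen:
                continue
            cols = chosen + [j]
            mu0 = X[y == 0][:, cols].mean(0)
            mu1 = X[y == 1][:, cols].mean(0)
            s = np.linalg.norm(mu1 - mu0)  # larger gap = better subset
            if s > best_s:
                best, best_s = j, s
        chosen.append(best)
    return chosen
```

Training one model per index set, instead of one model on the concatenation of all features, is what turns a single-view dataset into a multiview one.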
In this study, we exploit the benefits of multiview adaptive sampling and meta-SPL to solve the imbalanced distribution of samples. Specifically, we use three feature selection methods to produce multiview feature data from the original datasets. Then, we apply the meta-SPL technique to select the initial high-quality datasets and thereby avoid the noisy sampling effect. When multiview adaptive sampling and meta-SPL are carried out to select high-quality samples, a productive and robust training model can be achieved.
The rest of this paper is organized as follows. We first provide a brief survey of the work relevant to this paper in Section 2. Then, in Section 3, a multiview imbalance meta-SPL framework is proposed to produce multiview features and process noisy samples to improve the classification performance. Section 4 presents the experimental results obtained on 20 real datasets. The conclusion and future work are discussed in Section 5.
Related works
Sampling from imbalanced samples is the fundamental way to achieve a balanced data distribution (Jing et al., 2021). For example, Hoyos-Osorio et al. (2021) investigated a Relevant Information-based UnderSampling (RIUS) approach to select the most relevant examples from the majority class to improve the learning performance on imbalanced datasets. Chawla et al. (2002) developed the Synthetic Minority Oversampling Technique (SMOTE), which increases the number of new synthetic minority-class data
Methods
Multiview learning involves learning with multiple views, which is superior to the naive approach of using one view or concatenating all views (Cheng et al., 2017). Even when multiple natural views are unavailable, the manual generation of multiple views can still improve learning favorably (Huang et al., 2019). Here, we use multiple feature selection approaches to generate multiview features to improve the effectiveness of the training model.
Datasets
The original datasets containing the expression profiles were obtained from publicly available online repositories, such as the US National Library of Medicine and the UC Irvine Machine Learning Repository. The details of all of the datasets are given in Table 1.
Table 1 shows the details of the training samples, including multiple types of datasets, such as biomedical, finance, and recognition datasets. The largest imbalanced ratio is 578.88, which happens in the credit fraud
Conclusion
This paper proposed a multiview meta-SPL sampling method for imbalanced classification based on 20 datasets. The MSPL method reduces the noise of imbalanced samples, allowing us to remove some irrelevant and redundant samples and identify a suitable subset, thus improving the classification performance. Compared with the existing methods, our proposed MSPL approach can achieve improvements in the F1-score and G-mean of 15.4% and 12.5%, respectively, on average. Notably, even if the imbalance
CRediT authorship contribution statement
Qingyong Wang: Conceptualization, Methodology, Writing – original draft. Yun Zhou: Writing – review & editing, Methodology, Supervision, Validation. Zehong Cao: Writing – review & editing, Validation. Weiming Zhang: Writing – review & editing, Validation.
Declaration of Competing Interest
The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.
Acknowledgments
This work was supported by the National Natural Science Foundation of China (Grant No. 61703416), Training Program for Excellent Young Innovators of Changsha (Grant No. KQ2009009), Huxiang Youth Talent Support Program (Grant No. 2021RC3076), and the Postgraduate Scientific Research Innovation Project of Hunan Province (Grant no. CX20190040 and CX20190038).
References (59)
- et al., A novel low-rank hypergraph feature selection for multi-view classification, Neurocomputing (2017)
- et al., Consistency-based search in feature selection, Artificial Intelligence (2003)
- et al., Semantic relation extraction using sequential and tree-structured LSTM with attention, Information Sciences (2020)
- et al., Joint entity and relation extraction model based on rich semantics, Neurocomputing (2021)
- et al., Chain based sampling for monotonic imbalanced classification, Information Sciences (2019)
- et al., Nearest neighbor editing aided by unlabeled data, Information Sciences (2009)
- et al., DBIG-US: A two-stage under-sampling algorithm to face the class imbalance problem, Expert Systems with Applications (2021)
- et al., Relevant information undersampling to support imbalanced data classification, Neurocomputing (2021)
- et al., Multi-view intact space clustering, Pattern Recognition (2019)
- et al., Dual self-paced multi-view clustering, Neural Networks (2021)
- S-SulfPred: A sensitive predictor to capture S-sulfenylation sites based on a resampling one-sided selection undersampling-synthetic minority oversampling technique, Journal of Theoretical Biology
- Wrappers for feature subset selection, Artificial Intelligence
- Embedded feature selection accounting for unknown data heterogeneity, Expert Systems with Applications
- Dealing with high-dimensional class-imbalanced datasets: Embedded feature selection for SVM classification, Applied Soft Computing
- A wrapper method for feature selection using support vector machines, Information Sciences
- Multi-objective multi-view based search result clustering using differential evolution framework, Expert Systems with Applications
- Relief-based feature selection: Introduction and review, Journal of Biomedical Informatics
- Adaptive sampling using self-paced learning for imbalanced cancer data pre-diagnosis, Expert Systems with Applications
- Self-paced active learning for deep CNNs via effective loss function, Neurocomputing
- Multi-modal multi-task learning for joint prediction of multiple regression and classification variables in Alzheimer's disease, NeuroImage
- Multiple-view multiple-learner active learning, Pattern Recognition
- Multi-view learning overview: Recent progress and new challenges, Information Fusion
- Meta learning via learned loss
- SMOTE: Synthetic minority over-sampling technique, Journal of Artificial Intelligence Research
- Class-imbalanced deep learning via a class-balanced ensemble, IEEE Transactions on Neural Networks and Learning Systems
- Enhanced recursive feature elimination
- Statistical comparisons of classifiers over multiple data sets, Journal of Machine Learning Research
- Robust relief-feature weighting, margin maximization, and fuzzy optimization, IEEE Transactions on Fuzzy Systems
- Feature selection via regularized trees