A robust outlier control framework for classification designed with family of homotopy loss function
Introduction
Noise introduced at the data collection stage refers to data whose recorded values deviate from the true underlying attributes. Such noisy data may cause overfitting and a decline in generalization ability. For classification learning, a dataset is usually composed of two parts: attributes and category labels. The quality of the attributes indicates whether they accurately describe the samples, and the quality of the category labels reflects whether each sample is assigned to the correct category. In classification, samples with similar attributes are assigned the same category label. Accordingly, noise can be divided into two types, feature noise and label noise, depending on whether it occurs in the attributes or in the labels (Feng, Yang, Huang, Mehrkanoon, & Suykens, 2016).
In this work, we concentrate on mitigating the impact of label noise on the model, a problem that many researchers have studied. Generally speaking, existing methods can be roughly divided into two branches: one deals with the noisy data and outliers directly, while the other builds robust classifiers that aim to reduce their effect on the model.
For the first branch, three main directions are discussed: clustering, ensemble learning and graph embedding. These methods are used to detect noisy data and outliers. For noise detection, cluster analysis is often used to eliminate samples that lie far from those of the correct class, thereby achieving a robust effect (Christy et al., 2015, Du et al., 2016, Hautamaki et al., 2004, Maiywan and Kashyap, 2002). Ensemble learning is another robust learning technique in which the data are divided into several parts and multiple base classifiers are constructed and combined to complete the learning task. If the outliers are concentrated in a certain base learner, that learner may perform poorly, and the ensemble system removes base learners with poor performance during integration. Corresponding algorithms include boosting (Galar, Fernandez, Barrenechea, Bustince, & Herrera, 2012), bagging (Breiman, 1996) and AdaBoost (Freund & Schapire, 1996). Graph embedding (Goyal and Ferrara, 2018, Sheng, 1973, Zhang et al., 2013) is also an effective way to resist noisy data and outliers: by defining an intrinsic graph, penalty graphs and the corresponding edge weight matrices, sample points with low similarity (noise points and outliers) are detected and assigned small weights, which yields robustness. However, methods based on noise detection also increase the complexity of the overall algorithm.
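To make the cluster-based filtering idea concrete, the following minimal Python sketch removes the samples farthest from their cluster centres before training; the cluster count, quantile threshold and the helper name `cluster_filter` are illustrative assumptions, not the procedure of any specific cited work.

```python
import numpy as np
from sklearn.cluster import KMeans

def cluster_filter(X, y, n_clusters=10, keep_quantile=0.95):
    """Drop samples that lie far from their assigned cluster centre
    (illustrative sketch of cluster-based outlier removal)."""
    km = KMeans(n_clusters=n_clusters, n_init=10).fit(X)
    # Distance from each sample to the centre of its own cluster.
    d = np.linalg.norm(X - km.cluster_centers_[km.labels_], axis=1)
    mask = d <= np.quantile(d, keep_quantile)  # keep the closest 95%
    return X[mask], y[mask]
```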
To reduce the impact of noise, building a robust classifier is an alternative way to resist adversarial perturbations. Here we also discuss three directions: convolutional neural networks, classifiers induced by robust statistics and classifiers induced by robust loss functions, the last of which is most relevant to our work. The convolutional neural network (Gu et al., 2018, Krizhevsky et al., 2012, Tran et al., 2018, Zhang et al., 2019) is a robust network topology: neurons in a convolutional layer are connected to local receptive fields, i.e. small areas of the input samples, and this structure can tolerate noisy inputs. However, convolutional neural networks need a large number of training samples in practice and may still overfit. The squared loss function is most appropriate for data following a normal distribution, whereas noisy data and outliers lie far from the correct class; the resulting mean estimate can be unreliable, since even a single outlier may heavily affect the final decision function. Robust statistics are therefore used to suppress noise, such as the median (Ma, 2011), quantiles (Xu, Zhang, Jiang, Huang, & He, 2015), data divergence (Zhang, Jiang, & Chai, 2010) and the maximum correntropy criterion (Du et al., 2018, Liu et al., 2007).
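The fragility of the squared loss can be seen in one line: its minimizer is the mean, which a single outlier can drag arbitrarily far, whereas the median (the minimizer of the absolute loss) barely moves. A small numeric illustration:

```python
import numpy as np

clean = np.array([1.0, 1.1, 0.9, 1.05, 0.95])
noisy = np.append(clean, 100.0)            # one gross outlier

# Mean = minimizer of the squared loss; median = minimizer of the absolute loss.
print(np.mean(clean), np.mean(noisy))      # ~1.0 vs 17.5
print(np.median(clean), np.median(noisy))  # ~1.0 vs ~1.025
```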
Building robust classifiers from robust loss functions is an effective way to mitigate the impact of noise, and it is the approach most closely related to our work. In the following, we introduce robust classifiers modeled by different kinds of robust loss functions. In practical applications, because noise varies in intensity and type, it is difficult to detect all anomalies directly through robust optimization algorithms based on noise detection. To reduce the effect of outliers on the classifier, a series of truncated loss functions are often used in the literature to induce robust classifiers, such as the Ramp loss, also known as the truncated hinge loss (Gimpel and Smith, 2012, Huang et al., 2014, Wu and Liu, 2007), the truncated least squares loss (Wang & Zhong, 2014) and the truncated logistic loss (Park & Liu, 2011). Classifiers induced by these truncated losses intuitively restrict the upper bound of the penalty assigned to outliers. However, they are neither convex nor smooth, which makes the models difficult to solve. Two important remedies are generally pursued. The first is to design effective algorithms such as the re-weighted least squares algorithm (Perez-Cruz, Navia-Vazquez, Alarcon-Diana, & Artes-Rodriguez, 2000), the concave–convex procedure (CCCP) (Yuille & Rangarajan, 2003) and the outlier path (OP) algorithm (Suzumura, Ogawa, Sugiyama, Karasuyama, & Takeuchi, 2015). The other is to smooth the loss function to reduce algorithmic complexity (Lee and Mangasarian, 2001, Wang et al., 2007), as in the Geman–Reynolds loss (Yu, Aslan, & Schuurmans, 2012), the Geman–McClure loss (Geman, 1987), the lncosh loss (Karal, 2017) and the general robust loss (Barron, 2017). In 2016, Feng et al. (2016) gave two families of non-convex and smooth classification losses: correntropy-based and logarithm-based. Singh and Principe (2010) claim that correntropy (Liu et al., 2007), as a good similarity measure, can be used as a robust loss function; based on this, Singh, Pokharel, and Principe (2014) and Xu, Cao, Hu, and Principe (2016) design the C-loss and the rescaled hinge loss, respectively. Another robust loss function can be found in Chen, Zhou, Chen, Shao, and Gu (2017).
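As a concrete example of truncation, the Ramp (truncated hinge) loss caps the hinge penalty at a cut-off level so that one gross outlier cannot dominate the objective; a minimal sketch, with the cut-off value s = -1 chosen only for illustration:

```python
import numpy as np

def hinge(u):
    """Hinge loss on the margin u = y * f(x)."""
    return np.maximum(0.0, 1.0 - u)

def ramp(u, s=-1.0):
    """Truncated hinge (Ramp) loss: the penalty is capped at 1 - s."""
    return np.minimum(hinge(u), 1.0 - s)

margins = np.array([2.0, 0.5, -3.0])  # well classified, marginal, outlier
print(hinge(margins))                 # [0.  0.5 4. ]
print(ramp(margins))                  # [0.  0.5 2. ]  <- outlier penalty capped
```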
However, from another point of view, practitioners must choose the cut-off level of a truncated loss function carefully, and introducing this cut-off parameter also adds computational or tuning cost. On the other hand, loss functions based on the squared loss and the absolute loss are particularly effective for Gaussian noise and Laplace noise, respectively, whereas in applications the noise type is usually unknown and tends to be more complex and uncontrollable; it is therefore meaningful to design a more flexible loss function to handle it. In addition, Barron (2017) claims that smooth non-convex losses are “plug and play”. Changing the model is expensive when practitioners have already spent time, manpower and material resources tuning its parameters only to find the result unsatisfactory. Thus, it is more effective to provide practitioners with a superset that contains different types of loss functions. Motivated by this, this work aims to find a simple and feasible way to establish a superset of loss functions controlled by a homotopy parameter, so that practitioners can “achieve once and for all”.
In this work, we present a two-parameter loss function in which one parameter is a scale parameter and the other enables practitioners to search for the optimal penalty function class, so that the performance of the final classifier can be improved. In topology, such penalty function classes are called homotopic, and the continuous deformation between them is called a homotopy. The main contributions of this work can be summarized as follows:
(1) A homotopy loss is proposed to continuously explore a wider family of loss functions for practitioners. The $\ell_1$-$\ell_2$ loss, logarithmic loss, Geman–Reynolds loss (Yu et al., 2012), Geman–McClure loss (Geman, 1987) and correntropy-based loss (Yang, Ren, Wang, & Dong, 2017) are all special cases of the homotopy loss. The Fisher consistency of the homotopy loss is proved, which ensures that classifiers induced by this loss yield the Bayes decision boundary asymptotically. Furthermore, a re-weighted least squares algorithm (sketched after this list) is used to obtain an approximate optimal solution, and the resulting algorithms converge globally.
(2) We analyze the robustness of the proposed homotopy loss from different perspectives: M-estimation (Koltchinskii, 1997) and adversarial perturbations. In particular, to further establish robustness to adversarial perturbations, we present a new evaluation criterion that measures robustness quantitatively and provide an upper bound that ensures the validity of this measure.
(3) The proposed robust LSSVM and ELM models are implemented on various datasets with different noise intensities. Compared with traditional methods, experimental results on real-world datasets show that the proposed models have good anti-interference ability against outliers.
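Since this preview does not reproduce the exact form of the homotopy loss, the following sketch illustrates the re-weighted least squares idea of contribution (1) using the correntropy-based (Welsch) loss, one of the stated special cases, on a ridge-regularized linear model; the function names and parameter values are assumptions for illustration only.

```python
import numpy as np

def welsch_weight(r, sigma=1.0):
    """IRLS weight induced by the correntropy-based (Welsch) loss:
    large residuals receive exponentially small weights."""
    return np.exp(-(r / sigma) ** 2)

def irls_fit(X, y, lam=1e-2, sigma=1.0, iters=20):
    """Re-weighted least squares for a linear model (illustrative sketch).
    Each iteration solves a weighted ridge problem; outliers are damped."""
    n, d = X.shape
    w = np.zeros(d)
    for _ in range(iters):
        r = y - X @ w                    # current residuals
        v = welsch_weight(r, sigma)      # per-sample weights in (0, 1]
        A = X.T @ (v[:, None] * X) + lam * np.eye(d)
        w = np.linalg.solve(A, X.T @ (v * y))
    return w
```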
The remainder of this paper is organized as follows. In Section 2, we give a brief overview of the LSSVM and ELM models. A new robust homotopy loss function, derived from the $\ell_1$-$\ell_2$ loss and satisfying Fisher consistency, is presented in Section 3 together with its properties. Section 4 gives the new robust classification frameworks with the homotopy loss, namely the robust LSSVM and robust ELM models, and the corresponding algorithms. Section 5 presents the analysis of robustness to adversarial perturbations and its upper bound. Numerical experiments and parameter analysis are given in Section 6 to illustrate the validity of the proposed models, and conclusions are given in Section 7.
Section snippets
Background
In this section, we briefly introduce the LSSVM (Feng et al., 2016) and ELM (Huang et al., 2006, Yang and Zhang, 2016) models, two popular machine learning methods.
For a binary classification problem in an $n$-dimensional Euclidean space, suppose that the training set $T=\{(x_i, y_i)\}_{i=1}^{m}$ consists of $m$ labeled samples, where $x_i \in \mathbb{R}^n$ is the input vector whose components represent the features and $y_i \in \{-1, +1\}$ is the label of sample $x_i$.
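For orientation, an ELM fixes a random hidden layer and learns only the output weights by regularized least squares; below is a minimal sketch for the binary setting above, with the hidden size, activation and regularization chosen arbitrarily for illustration:

```python
import numpy as np

def elm_train(X, y, n_hidden=100, lam=1e-2, seed=0):
    """Minimal ELM: random input weights stay fixed; the output weights
    beta are obtained in closed form by regularized least squares."""
    rng = np.random.default_rng(seed)
    W = rng.normal(size=(X.shape[1], n_hidden))  # random input weights
    b = rng.normal(size=n_hidden)                # random biases
    H = np.tanh(X @ W + b)                       # hidden-layer output matrix
    beta = np.linalg.solve(H.T @ H + lam * np.eye(n_hidden), H.T @ y)
    return W, b, beta

def elm_predict(X, W, b, beta):
    return np.sign(np.tanh(X @ W + b) @ beta)    # predicted labels in {-1, +1}
```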
A robust homotopy loss function
In this section, we design a robust loss function based on the $\ell_1$-$\ell_2$ loss. It is well known that the $\ell_1$ loss is more robust than the $\ell_2$ loss: as the residual increases, the $\ell_1$ loss grows more slowly than the $\ell_2$ loss, so the influence of outliers on the classifier is weakened when the $\ell_1$ loss is used. We first give the definition of a proper loss function as defined in Shewhart and Wilks (2006):
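The growth-rate comparison can be stated in one line: for a residual $u$,
$$\ell_2(u) = u^2,\qquad \ell_1(u) = |u|,\qquad \ell_2'(u) = 2u \ \ (\text{unbounded}),\qquad |\ell_1'(u)| = 1 \ \ (\text{bounded}),$$
so under the $\ell_1$ loss the influence of a single sample on the fit is bounded no matter how far it lies from the decision boundary.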
Definition 3.1 A function $\ell(\cdot)$ is called proper as a loss function if it satisfies the following conditions: C1:
Robust classification framework with homotopy loss
In this section, we present two robust classification frameworks with homotopy loss.
Analysis of robustness to adversarial perturbations
In the previous sections, we illustrated the robustness of the homotopy loss from the viewpoint of M-estimation. Motivated by a recent contribution (Fawzi, Fawzi, & Frossard, 2018), this section mainly examines the robustness of classifiers to adversarial perturbations quantitatively. When a noise point is present, the classification hyperplane moves toward it, so such classifiers are sensitive to noise.
Define the minimum distance that is needed to switch one
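For a linear classifier $f(x) = w^{\top}x + b$, the minimum $\ell_2$ perturbation needed to switch the predicted label of $x$ has the closed form $|f(x)|/\|w\|$, the distance from $x$ to the hyperplane; the snippet below computes it. This is the standard geometric fact, not the paper's full criterion, which this preview does not reproduce.

```python
import numpy as np

def min_flip_distance(x, w, b):
    """Smallest l2 perturbation that changes the sign of f(x) = w.x + b:
    the Euclidean distance from x to the separating hyperplane."""
    return abs(w @ x + b) / np.linalg.norm(w)

w, b = np.array([3.0, 4.0]), -1.0
x = np.array([1.0, 1.0])
print(min_flip_distance(x, w, b))  # |3 + 4 - 1| / 5 = 1.2
```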
Numerical experiments
To illustrate the validity of the established model with the new homotopy loss function, numerical experiments are presented in this section.
In the first experiment, we compare the ELM with the proposed homotopy loss (HELM for short), the ELM with the Laplace-kernel-based homotopy loss (LKELM for short), the ELM with the general robust loss (Barron, 2017), the ELM with the C-loss (Singh et al., 2014) (CELM for short), the ELM with the lncosh loss (Karal, 2017) (LnELM for short) and the classical ELM (ELM for short) and
Conclusion
In this investigation, we have presented a two-parameter loss function that unifies a number of robust loss functions and generalizes many existing one-parameter robust losses: the $\ell_1$-$\ell_2$ loss, logarithmic loss, Geman–Reynolds loss, Geman–McClure loss and correntropy-based loss. The proposed homotopy loss is thus more convenient for practitioners to operate and much more flexible than classical loss functions. The Fisher consistency of the proposed homotopy loss is proved to
Acknowledgments
This work is supported by the National Natural Science Foundation of China (Nos. 11471010 and 11271367). Moreover, the authors thank the referees and editor for their constructive comments to improve the paper.
References (51)
- Christy et al. (2015). Cluster based outlier detection algorithm for healthcare data. Procedia Computer Science.
- Fawcett, T. (2006). An introduction to ROC analysis. Pattern Recognition Letters.
- Gu et al. (2018). Recent advances in convolutional neural networks. Pattern Recognition.
- Huang et al. (2006). Extreme learning machine: Theory and applications. Neurocomputing.
- Karal (2017). Maximum likelihood optimal and robust Support Vector Regression with lncosh loss function. Neural Networks.
- Park & Liu (2011). Robust penalized logistic regression with truncated loss functions. The Canadian Journal of Statistics / Revue Canadienne de Statistique.
- Singh, Pokharel, & Principe (2014). The C-loss function for pattern classification. Pattern Recognition.
- Tran et al. (2018). Improving efficiency in convolutional neural networks with multilinear filters. Neural Networks.
- Wang & Zhong (2014). Robust non-convex least squares loss function for regression with outliers.
- Xu, Zhang, Jiang, Huang, & He (2015). Weighted quantile regression via support vector machine. Expert Systems with Applications.
- Yang & Zhang (2016). A sparse extreme learning machine framework by continuous optimization algorithms and its application in pattern recognition, Vol. 53.
- Zhang, Jiang, & Chai (2010). Penalized Bregman divergence for large-dimensional regression and classification. Biometrika.
- Zhang et al. (2019). Recent advances in convolutional neural network acceleration. Neurocomputing.
- Barron (2017). A more general robust loss function.
- Breiman (1996). Bagging predictors. Machine Learning.
- Supervised multiview feature selection exploring homogeneity and heterogeneity with $\ell_{1,2}$-norm and automatic view generation. IEEE Transactions on Geoscience and Remote Sensing.
- Du et al. (2018). Robust graph-based semisupervised learning for noisy labeled data via maximum correntropy criterion. IEEE Transactions on Cybernetics.
- Du et al. (2016). Novel clustering-based approach for local outlier detection. In: Computer Communications Workshops.
- Fawzi, Fawzi, & Frossard (2018). Analysis of classifiers' robustness to adversarial perturbations. Machine Learning.
- Feng, Yang, Huang, Mehrkanoon, & Suykens (2016). Robust support vector machines for classification with nonconvex and smooth losses. Neural Computation.
- Freund & Schapire (1996). Experiments with a new boosting algorithm. In: Thirteenth International Conference on Machine Learning.
- Galar, Fernandez, Barrenechea, Bustince, & Herrera (2012). A review on ensembles for the class imbalance problem: Bagging-, boosting-, and hybrid-based approaches. IEEE Transactions on Systems, Man, and Cybernetics, Part C.
- Geman (1987). Statistical methods for tomographic image reconstruction.
- Gimpel & Smith (2012). Structured ramp loss minimization for machine translation. In: Proceedings of the 2012 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies.