A sparse extreme learning machine framework by continuous optimization algorithms and its application in pattern recognition

https://doi.org/10.1016/j.engappai.2016.04.003

Highlights

  • A sparse ELM framework is proposed based on zero-norm regularization.

  • Four sparse ELM formulations with zero-norm are built based on LSE and LAD.

  • We develop two continuous approaches to solve the problems.

  • The first is a DC (difference of convex functions) approximation approach.

  • The second is an exact penalty technique for the zero-norm.

  • All the resulting problems are posed as DC programs.

Abstract

Extreme learning machine (ELM) has demonstrated great potential in machine learning owing to its simplicity, rapidity and good generalization performance. In this investigation, based on the least-squares estimate (LSE) and least absolute deviation (LAD), we propose four sparse ELM formulations with zero-norm regularization to automatically choose the optimal hidden nodes. Furthermore, we develop two continuous optimization methods to solve the proposed problems. The first is a DC (difference of convex functions) approximation approach that approximates the zero-norm by a DC function, and the resulting optimization problems are posed as DC programs. The second is an exact penalty technique for the zero-norm, by which the resulting problems are reformulated as DC programs whose corresponding DCAs converge finitely. Moreover, the proposed framework is applied directly to recognizing the hardness of licorice seeds using near-infrared spectral data. Experiments in different spectral regions illustrate that the proposed approaches can reduce the number of hidden nodes (or output features) while either improving or showing no significant difference in generalization compared with the traditional ELM methods and the support vector machine (SVM). Experiments on several benchmark data sets demonstrate that the proposed framework is competitive with the traditional approaches in generalization but selects fewer output features.

Introduction

Extreme learning machine (ELM) (Huang et al., 2006, Huang et al., 2010) is a popular and important learning algorithm for single-hidden-layer feedforward neural networks (SLFNs) (Huang et al., 2006). With good generalization performance, ELM has been applied successfully in regression and classification applications. Compared with traditional neural networks, the main advantages of ELM are that it runs fast and is easy to implement: its hidden nodes and input weights are randomly generated, and the output weights are expressed analytically. Moreover, ELM overcomes some drawbacks of traditional neural networks, such as local minima, imprecise learning rates and slow convergence. However, the traditional ELM does not explicitly couple the output features (or hidden nodes) with the generalization of the model, which makes it difficult to automatically control the balance between prediction accuracy and the number of selected features.
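
To make this training scheme concrete, the following is a minimal sketch of a basic ELM in Python, assuming a sigmoid activation and the Moore–Penrose pseudo-inverse; the function and parameter names (e.g. `elm_train`, `n_hidden`) are illustrative and are not taken from the paper.

```python
import numpy as np

def elm_train(X, T, n_hidden, seed=None):
    """Basic ELM training: random hidden-layer parameters, analytic output weights.

    X : (n_samples, n_features) input matrix
    T : (n_samples, n_outputs) target matrix (e.g. one-hot class labels)
    """
    rng = np.random.default_rng(seed)
    n_features = X.shape[1]
    # Input weights and hidden biases are drawn at random and never updated.
    W = rng.uniform(-1.0, 1.0, size=(n_features, n_hidden))
    b = rng.uniform(-1.0, 1.0, size=n_hidden)
    H = 1.0 / (1.0 + np.exp(-(X @ W + b)))   # hidden-layer output matrix
    beta = np.linalg.pinv(H) @ T             # output weights via the pseudo-inverse
    return W, b, beta

def elm_predict(X, W, b, beta):
    H = 1.0 / (1.0 + np.exp(-(X @ W + b)))
    return H @ beta                          # predicted outputs; take argmax for class labels
```

Because every column of H is generated from randomly drawn (W, b), some hidden nodes can be nearly redundant, which is precisely the issue the sparse formulations discussed below are meant to address.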

According to statistical learning theory (SLT) (Vapnik, 1998), to ensure better generalization performance on the test set, an algorithm should not only achieve a low training error on the training set but also have a low Vapnik–Chervonenkis (VC) dimension. Recent studies (Liu et al., 2012, Huang et al., 2015) indicate that the VC dimension of ELM has a specific value and depends strongly on the number of hidden-layer nodes. In addition, according to these theories (Liu et al., 2012, Huang et al., 2015), ELM has a universal approximation capability and can achieve a low approximation error on the training set. Therefore, ELM is a promising learning method, and its hidden-layer neurons are important for building an ELM network with good generalization.

However, some hidden nodes might be closely correlated owing to the randomness of the input weights and hidden-node biases in ELM. Regularization is therefore necessary to prevent over-fitting and enhance the generalization capability. In addition, ELM computes its output weights via the least-squares estimate (LSE) (Xiang et al., 2012), and the resulting weights lack sparseness. Therefore, constructing compact ELM networks and choosing the optimal hidden nodes remain important for achieving good performance. Recently, several techniques have been developed to obtain sparse ELM networks; they are summarized as follows:

(1) Double-regularized ELM models such as TROP-ELM (Miche et al., 2011) and regularized ELM with missing data (Yu et al., 2013).

TROP-ELM is an improvement of the optimally pruned extreme learning machine (OP-ELM) (Miche et al., 2010). It uses a cascade of two regularization penalties, the l1-norm and the l2-norm: TROP-ELM first constructs an SLFN like ELM, then ranks the neurons by l1 regularization, and finally selects the optimal number of neurons by l2 regularization. TROP-ELM introduces the l2 regularization into the calculation of the pseudo-inverse through the singular value decomposition (SVD) and uses the leave-one-out (LOO) error to select the optimal number of neurons, which makes it complicated to implement. The regularized ELM with missing data (Yu et al., 2013) is a modified version of TROP-ELM; it uses a cascade of l1 and l2 penalties in ELM to handle the missing-data problem.

(2) Robust ELM models such as RELM (Horata et al., 2013) and robust ELM with outliers (Barreto and Barros, 2016).

RELM (Horata et al., 2013) proposes an extended complete orthogonal decomposition (ECOD) method to compute the weights of the ELM. The same paper also proposes three further algorithms to address outlier robustness: the iteratively reweighted least-squares ELM (IRWLS-ELM), the ELM based on multivariate least-trimmed squares (MLTS-ELM) and the ELM based on one-step reweighted MLTS (RMLTS-ELM). The robust ELM with outliers (Barreto and Barros, 2016) applies M-estimators (Bai and Wu, 1997) to the computation of the output weights instead of the standard ordinary least-squares method.

(3) Sparse ELM models such as the l1/2-regularized ELM (Khan et al., 2014; Han et al., 2015) and the l1-regularization approach (Balasundaram and Kapil, 2014; Luo and Zhang, 2014).

The use of l1-norm regularization results in sparse solutions, thereby helping with feature selection. However, the l1-norm regularization scheme is consistent in feature selection only under restrictive assumptions, and there exist cases where the l1 penalty is inconsistent in feature selection (Zou, 2006, Lin et al., 2009, Le Thi et al., 2015). Note also that the l1-norm regularization criterion generates many components that are close to zero but not exactly equal to zero. The l1/2-norm regularization is easier to solve than the l0-norm regularization and yields sparser solutions than the l1-norm regularization, but the sparse representations obtained with the l0-norm regularization are stronger than those obtained with the l1/2-norm regularization.

(4) Regularized ELM models such as the pre-fitting and back-fitting approach (Li et al., 2014) and the regularized ELM (Iosifidis et al., 2015).

The pre-fitting and back-fitting approach (Li et al., 2014) uses l2-norm regularization; this two-stage approach is a greedy algorithm and is time-consuming. RELM (Iosifidis et al., 2015) is based on the Frobenius norm of a matrix; this approach can choose the hidden-layer weights appropriately and allows the dimension of the ELM space to vary across training samples. Usually, these methods cannot automatically produce sparse representations.

Feature selection for classification and regression problems is an important topic with many applications, and its objectives are two-fold: selecting a small feature subset while maintaining high accuracy. Specifically, feature selection for a linear decision function $f(x) = \operatorname{sgn}(\beta^T x)$ can be posed as searching for a sparse weight vector $\beta$ such that most elements of $\beta$ are zero. This implies that when the i-th component of $\beta$ is zero, the i-th component of an observation vector x is irrelevant to the class of x. The zero-norm of the vector $\beta$, $\|\beta\|_0 = \operatorname{card}\{i \mid \beta_i \neq 0\}$, is defined as the number of nonzero elements of $\beta$, so the zero-norm regularization criterion allows us to reduce the number of representative features in the decision function f(x). Thus feature selection for the decision function f(x) is usually posed as minimizing $\|\beta\|_0$ under appropriate constraints. Nevertheless, sparse ELM models based on the zero-norm have rarely been discussed in the literature, the main reason being the discontinuity and nonconvexity of the zero-norm. Therefore, most work dealing with feature selection has focused on effective approximations of the zero-norm; the l1-norm is only a convex approximation of the zero-norm (see Fig. 1). The main questions for feature selection are therefore how to approximate the zero-norm effectively and which computational method to use for solving the resulting optimization problem.
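
As a small illustration of this definition (not taken from the paper), the zero-norm and the corresponding index set of retained features can be computed as follows; the tolerance `tol` is an implementation detail needed because numerical solvers rarely return exact zeros.

```python
import numpy as np

def zero_norm(beta, tol=1e-10):
    """||beta||_0 = card{i : beta_i != 0}, with a small tolerance to
    treat numerically negligible coefficients as zero."""
    return int(np.sum(np.abs(beta) > tol))

def selected_features(beta, tol=1e-10):
    """Indices i with beta_i != 0, i.e. the components that remain
    relevant in the decision function f(x) = sgn(beta^T x)."""
    return np.flatnonzero(np.abs(beta) > tol)
```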

In this paper, based on the least-squares estimate (LSE) and least absolute deviation (LAD) (Cao and Liu, 2009, Yang et al., 2011), we present two sparse ELM frameworks with zero-norm regularization to select output features automatically. Moreover, we present four continuous methods to solve the proposed problems. The first is a DC (Tao and An, 1997, Le Thi et al., 2014, Le Thi et al., 2008, Le Thi et al., 2015) approximation approach that approximates the zero-norm by a DC function. The second applies a new exact penalty technique (Le Thi et al., 2014) to reformulate the original problems equivalently as DC programs. The resulting problems are all posed as DC programs, and the corresponding DC optimization algorithms converge linearly or finitely while requiring the solution of only one quadratic program at each iteration.

Throughout the paper we adopt the following notation. The scalar product of two vectors x and y in the n-dimensional real space is denoted by $x^T y$ or $x \cdot y$. For an n-dimensional vector x, $\|x\|_1$ denotes the l1-norm of x, $\|x\|_1 = \sum_{i=1}^{n} |x_i|$, where $|\cdot|$ is the absolute value, and $\|x\|_2$ denotes the l2-norm of x, $\|x\|_2 = \sqrt{x^T x}$. The base of the natural logarithm is denoted by ϵ. A vector of zeros in a real space of arbitrary dimension is denoted by 0, and a vector of ones of arbitrary dimension is denoted by e.

The rest of the paper is organized as follows. Section 2 briefly reviews DC programming and ELM. In Section 3, we propose a sparse ELM framework with zero-norm regularization and develop four nonconvex optimization algorithms to solve the resulting problems. The experimental results are analyzed in Section 4, and Section 5 gives the concluding remarks, summarizes the main contributions of this work, and presents future directions of investigation.

Section snippets

DC programming

DC programming and DCA, introduced by Pham Dinh Tao, constitute the backbone of nonconvex continuous programming (Tao and An, 1997, Le Thi et al., 2014, Le Thi et al., 2008, Le Thi et al., 2015). Generally speaking, a DC program takes the form
$$\inf\{f(x) = g(x) - h(x),\ x \in \mathbb{R}^n\} \qquad (P_{dc})$$
where g and h are lower semicontinuous proper convex functions on $\mathbb{R}^n$. Such a function f is called a DC function, and g and h are the DC components of f. A function $\pi(x)$ is said to be polyhedral convex if
$$\pi(x) = \max\{\varpi_i^T x - \sigma_i,\ i = 1, 2, \ldots, m\} + \chi_\Omega(x)$$
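
For reference, the generic DCA iteration from the standard DCA literature is sketched below (it is not reproduced verbatim from this paper): at each step the concave part −h is linearized at the current iterate and the resulting convex program is solved.

```latex
% Generic DCA scheme for  inf { f(x) = g(x) - h(x) : x in R^n }
\begin{align*}
  y^{k}   &\in \partial h(x^{k}),
     && \text{(choose a subgradient of } h \text{ at } x^{k}\text{)} \\
  x^{k+1} &\in \operatorname*{arg\,min}_{x \in \mathbb{R}^{n}}
             \bigl\{ g(x) - \langle x,\, y^{k} \rangle \bigr\}.
     && \text{(solve the convex subproblem)}
\end{align*}
% Equivalently, h is replaced by its affine minorization
% h_k(x) = h(x^k) + <x - x^k, y^k> and f is minimized with h_k in place of h.
```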

Sparse ELM formulations with zero-norm regularization

In this section, we consider a new sparse ELM classification framework with l0-norm regularization. Two sparse regularized ELM formulations based on the l0-norm are presented.

In particular, we incorporate the zero-norm into the objective function of ELM (12) by weighting $\|Q\beta - T\|_2^2$ with a suitable parameter λ, which leads to a regularized ELM framework with a sparse solution (called l0-norm ELM)
$$\min_{\beta}\ \|\beta\|_0 + \lambda \|Q\beta - T\|_2^2$$
where the parameter λ controls a tradeoff between empirical errors and the
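
The snippet above does not show which surrogate of the zero-norm the paper adopts, but one widely used DC approximation in the DCA literature (e.g. Le Thi et al., 2015) is the concave exponential function sketched below; it is given here only as a representative choice under that assumption.

```latex
% A common DC surrogate of the zero-norm (illustrative choice, alpha > 0):
\[
  \|\beta\|_0 \;\approx\; \sum_{i} \eta_\alpha(\beta_i),
  \qquad
  \eta_\alpha(t) = 1 - e^{-\alpha |t|},
\]
% which admits the DC decomposition  eta_alpha = g - h  with both parts convex:
\[
  \eta_\alpha(t)
  = \underbrace{\alpha |t|}_{\text{convex}}
  - \underbrace{\bigl(\alpha |t| - 1 + e^{-\alpha |t|}\bigr)}_{\text{convex}} .
\]
% Substituting this surrogate into  min_beta ||beta||_0 + lambda ||Q beta - T||_2^2
% gives a DC program; each DCA iteration then reduces to an l1-regularized
% least-squares subproblem with an additional linear term.
```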

Experiments

To evaluate the proposed algorithms, numerical simulations are carried out on various datasets. In the first part, we run the proposed algorithms on ten benchmark datasets from the UCI Machine Learning Repository (Blake and Merz, 1998). In the second part, experiments are conducted on a practical application dataset. We perform ten-fold cross-validation on all considered datasets. In other words, each dataset is split randomly into ten subsets, and one of those subsets is reserved as a test set. This
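
A minimal sketch of this ten-fold protocol is given below, assuming NumPy arrays and scikit-learn's KFold splitter; `train_and_score` is a hypothetical placeholder for any of the compared classifiers, not a function from the paper.

```python
import numpy as np
from sklearn.model_selection import KFold

def ten_fold_cv(X, T, train_and_score, seed=0):
    """Split the data into ten random folds; each fold serves once as the
    test set while the remaining nine folds are used for training."""
    kf = KFold(n_splits=10, shuffle=True, random_state=seed)
    scores = []
    for train_idx, test_idx in kf.split(X):
        scores.append(train_and_score(X[train_idx], T[train_idx],
                                      X[test_idx], T[test_idx]))
    return float(np.mean(scores)), float(np.std(scores))
```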

Conclusions

The number of hidden-layer nodes is an important index for designing and training ELM networks. In this investigation, based on the LSE and LAD methods, we propose a sparse ELM learning framework that automatically chooses the optimal hidden nodes so as to ensure better generalization performance with a smaller VC dimension. However, the discontinuity of the zero-norm makes the resulting problems difficult to solve. The main contributions of this work are:

(1) We propose four sparse ELM formulations with zero-norm

Acknowledgements

This work is supported by the National Natural Science Foundation of China (Grants 11471010 and 11271367).

