A sparse extreme learning machine framework by continuous optimization algorithms and its application in pattern recognition

https://doi.org/10.1016/j.engappai.2016.04.003

Highlights

  • A sparse ELM framework is proposed based on zero-norm regularization.

  • Four sparse ELM formulations with zero-norm are built based on LSE and LAD.

  • We develop two continuous approaches to solve the problems.

  • The first is a DC (difference of convex functions) approximation approach.

  • The second is an exact penalty technique for the zero-norm.

  • All the resulting problems are posed as DC programs.

Abstract

Extreme learning machine (ELM) has demonstrated great potential in machine learning owing to its simplicity, rapidity and good generalization performance. In this investigation, based on the least-squares estimate (LSE) and least absolute deviation (LAD), we propose four sparse ELM formulations with zero-norm regularization to automatically choose the optimal hidden nodes. Furthermore, we develop two continuous optimization methods to solve the proposed problems. The first is a DC (difference of convex functions) approximation approach that approximates the zero-norm by a DC function, and the resulting optimization problems are posed as DC programs. The second is an exact penalty technique for the zero-norm, by which the resulting problems are reformulated as DC programs whose corresponding DCAs converge finitely. Moreover, the proposed framework is applied directly to recognizing the hardness of licorice seeds using near-infrared spectral data. Experiments in different spectral regions illustrate that the proposed approaches can reduce the number of hidden nodes (or output features) while either improving or showing no significant difference in generalization compared with the traditional ELM methods and the support vector machine (SVM). Experiments on several benchmark data sets demonstrate that the proposed framework is competitive with the traditional approaches in generalization but selects fewer output features.

Introduction

Extreme learning machine (ELM) (Huang et al., 2006, Huang et al., 2010) is a popular and important learning algorithm for single-hidden-layer feedforward neural networks (SLFNs) (Huang et al., 2006). With good generalization performance, ELM has been applied successfully in regression and classification applications. Compared with traditional neural networks, the main advantages of ELM are that it runs fast and is easy to implement: its hidden nodes and input weights are randomly generated, and the output weights are expressed analytically. Moreover, ELM overcomes some drawbacks of traditional neural networks, such as local minima, imprecise learning rates and slow convergence. However, the traditional ELM does not explicitly couple the output features (or hidden nodes) with the generalization of the model, which makes it difficult to automatically control the balance between prediction accuracy and the number of selected features.
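
To make this training scheme concrete, the following is a minimal sketch of a basic ELM in Python, assuming a sigmoid activation and the Moore–Penrose pseudo-inverse; the function and parameter names (e.g. `elm_train`, `n_hidden`) are illustrative and are not taken from the paper.

```python
import numpy as np

def elm_train(X, T, n_hidden, seed=None):
    """Basic ELM training: random hidden-layer parameters, analytic output weights.

    X : (n_samples, n_features) input matrix
    T : (n_samples, n_outputs) target matrix (e.g. one-hot class labels)
    """
    rng = np.random.default_rng(seed)
    n_features = X.shape[1]
    # Input weights and hidden biases are drawn at random and never updated.
    W = rng.uniform(-1.0, 1.0, size=(n_features, n_hidden))
    b = rng.uniform(-1.0, 1.0, size=n_hidden)
    H = 1.0 / (1.0 + np.exp(-(X @ W + b)))   # hidden-layer output matrix
    beta = np.linalg.pinv(H) @ T             # output weights via the pseudo-inverse
    return W, b, beta

def elm_predict(X, W, b, beta):
    H = 1.0 / (1.0 + np.exp(-(X @ W + b)))
    return H @ beta                          # predicted outputs; take argmax for class labels
```

Because every column of H is generated from randomly drawn (W, b), some hidden nodes can be nearly redundant, which is precisely the issue the sparse formulations discussed below are meant to address.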

According to statistical learning theory (SLT) (Vapnik, 1998), to ensure better generalization performance on the test set, an algorithm should not only achieve a low training error on the training set but also have a low Vapnik–Chervonenkis (VC) dimension. Recent studies (Liu et al., 2012, Huang et al., 2015) indicate that the VC dimension of ELM has a specific value and depends strongly on the number of hidden-layer nodes. In addition, according to these theories (Liu et al., 2012, Huang et al., 2015), ELM has a universal approximation capability and can achieve a low approximation error on the training set. Therefore, ELM is a promising learning method, and its hidden-layer neurons are important for building an ELM network with good generalization.

However, some hidden nodes might be closely correlated owing to the randomness of the input weights and hidden-node biases in ELM. Regularization is therefore necessary to prevent over-fitting and enhance the generalization capability. In addition, ELM computes its output weights via the least-squares estimate (LSE) (Xiang et al., 2012), and the resulting weights lack sparseness. Therefore, constructing compact ELM networks and choosing the optimal hidden nodes remain important for achieving good performance. Recently, several techniques have been developed to obtain sparse ELM networks; they are summarized as follows:

(1) Double-regularized ELM models such as TROP-ELM (Miche et al., 2011) and regularized ELM with missing data (Yu et al., 2013).

TROP-ELM is an improvement of the optimally pruned extreme learning machine (OP-ELM) (Miche et al., 2010). It uses a cascade of two regularization penalties, the l1-norm and the l2-norm: TROP-ELM first constructs an SLFN like ELM, then ranks the neurons by l1 regularization, and finally selects the optimal number of neurons by l2 regularization. TROP-ELM introduces the l2 regularization into the calculation of the pseudo-inverse through the singular value decomposition (SVD) and uses the leave-one-out (LOO) error to select the optimal number of neurons, which makes it complicated to implement. The regularized ELM with missing data (Yu et al., 2013) is a modified version of TROP-ELM; it uses a cascade of l1 and l2 penalties in ELM to handle the missing-data problem.

(2) Robust ELM models such as RELM (Horata et al., 2013) and robust ELM with outliers (Barreto and Barros, 2016).

RELM (Horata et al., 2013) proposes an extended complete orthogonal decomposition (ECOD) method to compute the weights of the ELM. The same paper also proposes three further algorithms to address outlier robustness: the iteratively reweighted least-squares ELM (IRWLS-ELM), the ELM based on multivariate least-trimmed squares (MLTS-ELM) and the ELM based on one-step reweighted MLTS (RMLTS-ELM). The robust ELM with outliers (Barreto and Barros, 2016) applies M-estimators (Bai and Wu, 1997) to the computation of the output weights instead of the standard ordinary least-squares method.

(3) Sparse ELM models such as the l1/2-regularized ELM (Khan et al., 2014; Han et al., 2015) and the l1-regularization approach (Balasundaram and Kapil, 2014; Luo and Zhang, 2014).

The use of l1-norm regularization results in sparse solutions, thereby helping with feature selection. However, the l1-norm regularization scheme is consistent in feature selection only under restrictive assumptions, and there exist cases where the l1 penalty is inconsistent in feature selection (Zou, 2006, Lin et al., 2009, Le Thi et al., 2015). Note also that the l1-norm regularization criterion generates many components that are close to zero but not exactly equal to zero. The l1/2-norm regularization is easier to solve than the l0-norm regularization and yields sparser solutions than the l1-norm regularization, but the sparse representations obtained with the l0-norm regularization are stronger than those obtained with the l1/2-norm regularization.

(4) Regularized ELM models such as the pre-fitting and back-fitting approach (Li et al., 2014) and the regularized ELM (Iosifidis et al., 2015).

The pre-fitting and back-fitting approach (Li et al., 2014) uses l2-norm regularization; this two-stage approach is a greedy algorithm and is time-consuming. RELM (Iosifidis et al., 2015) is based on the Frobenius norm of a matrix; this approach can choose the hidden-layer weights appropriately and allows the dimension of the ELM space to vary across training samples. Usually, these methods cannot automatically produce sparse representations.

Feature selection for classification and regression problems is an important topic with many applications, and its objectives are two-fold: selecting a small feature subset while maintaining high accuracy. Specifically, feature selection for a linear decision function $f(x) = \operatorname{sgn}(\beta^T x)$ can be posed as searching for a sparse weight vector $\beta$ such that most elements of $\beta$ are zero. This implies that when the i-th component of $\beta$ is zero, the i-th component of an observation vector x is irrelevant to the class of x. The zero-norm of the vector $\beta$, $\|\beta\|_0 = \operatorname{card}\{i \mid \beta_i \neq 0\}$, is defined as the number of nonzero elements of $\beta$, so the zero-norm regularization criterion allows us to reduce the number of representative features in the decision function f(x). Thus feature selection for the decision function f(x) is usually posed as minimizing $\|\beta\|_0$ under appropriate constraints. Nevertheless, sparse ELM models based on the zero-norm have rarely been discussed in the literature, the main reason being the discontinuity and nonconvexity of the zero-norm. Therefore, most work dealing with feature selection has focused on effective approximations of the zero-norm; the l1-norm is only a convex approximation of the zero-norm (see Fig. 1). The main questions for feature selection are therefore how to approximate the zero-norm effectively and which computational method to use for solving the resulting optimization problem.
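
As a small illustration of this definition (not taken from the paper), the zero-norm and the corresponding index set of retained features can be computed as follows; the tolerance `tol` is an implementation detail needed because numerical solvers rarely return exact zeros.

```python
import numpy as np

def zero_norm(beta, tol=1e-10):
    """||beta||_0 = card{i : beta_i != 0}, with a small tolerance to
    treat numerically negligible coefficients as zero."""
    return int(np.sum(np.abs(beta) > tol))

def selected_features(beta, tol=1e-10):
    """Indices i with beta_i != 0, i.e. the components that remain
    relevant in the decision function f(x) = sgn(beta^T x)."""
    return np.flatnonzero(np.abs(beta) > tol)
```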

In this paper, based on the least-squares estimate (LSE) and least absolute deviation (LAD) (Cao and Liu, 2009, Yang et al., 2011), we present two sparse ELM frameworks with zero-norm regularization to select output features automatically. Moreover, we present four continuous methods to solve the proposed problems. The first is a DC (Tao and An, 1997, Le Thi et al., 2014, Le Thi et al., 2008, Le Thi et al., 2015) approximation approach that approximates the zero-norm by a DC function. The second applies a new exact penalty technique (Le Thi et al., 2014) to reformulate the original problems equivalently as DC programs. The resulting problems are all posed as DC programs, and the corresponding DC optimization algorithms converge linearly or finitely while requiring the solution of only one quadratic program at each iteration.

Throughout the paper we adopt the following notation. The scalar product of two vectors x and y in the n-dimensional real space is denoted by $x^T y$ or $x \cdot y$. For an n-dimensional vector x, $\|x\|_1$ denotes the l1-norm of x, $\|x\|_1 = \sum_{i=1}^{n} |x_i|$, where $|\cdot|$ is the absolute value, and $\|x\|_2$ denotes the l2-norm of x, $\|x\|_2 = \sqrt{x^T x}$. The base of the natural logarithm is denoted by ϵ. A vector of zeros in a real space of arbitrary dimension is denoted by 0, and a vector of ones of arbitrary dimension is denoted by e.

The rest of the paper is organized as follows. Section 2 briefly reviews DC programming and ELM. In Section 3, we propose a sparse ELM framework with zero-norm regularization and develop four nonconvex optimization algorithms to solve the resulting problems. The experimental results are analyzed in Section 4, and Section 5 gives the concluding remarks, summarizes the main contributions of this work, and presents future directions of investigation.

Section snippets

DC programming

DC programming and DCA, introduced by Pham Dinh Tao, constitute the backbone of nonconvex continuous programming (Tao and An, 1997, Le Thi et al., 2014, Le Thi et al., 2008, Le Thi et al., 2015). Generally speaking, a DC program takes the form
$$\inf\{f(x) = g(x) - h(x),\ x \in \mathbb{R}^n\} \qquad (P_{dc})$$
where g and h are lower semicontinuous proper convex functions on $\mathbb{R}^n$. Such a function f is called a DC function, and g and h are the DC components of f. A function $\pi(x)$ is said to be polyhedral convex if
$$\pi(x) = \max\{\varpi_i^T x - \sigma_i,\ i = 1, 2, \ldots, m\} + \chi_\Omega(x)$$
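
For reference, the generic DCA iteration from the standard DCA literature is sketched below (it is not reproduced verbatim from this paper): at each step the concave part −h is linearized at the current iterate and the resulting convex program is solved.

```latex
% Generic DCA scheme for  inf { f(x) = g(x) - h(x) : x in R^n }
\begin{align*}
  y^{k}   &\in \partial h(x^{k}),
     && \text{(choose a subgradient of } h \text{ at } x^{k}\text{)} \\
  x^{k+1} &\in \operatorname*{arg\,min}_{x \in \mathbb{R}^{n}}
             \bigl\{ g(x) - \langle x,\, y^{k} \rangle \bigr\}.
     && \text{(solve the convex subproblem)}
\end{align*}
% Equivalently, h is replaced by its affine minorization
% h_k(x) = h(x^k) + <x - x^k, y^k> and f is minimized with h_k in place of h.
```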

Sparse ELM formulations with zero-norm regularization

In this section, we consider a new sparse ELM classification framework with l0-norm regularization. Two sparse regularized ELM formulations based on the l0-norm are presented.

In particular, we incorporate the zero-norm into the objective function of ELM (12) by weighting $\|Q\beta - T\|_2^2$ with a suitable parameter λ, which leads to a regularized ELM framework with a sparse solution (called l0-norm ELM)
$$\min_{\beta}\ \|\beta\|_0 + \lambda \|Q\beta - T\|_2^2$$
where the parameter λ controls a tradeoff between empirical errors and the
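
The snippet above does not show which surrogate of the zero-norm the paper adopts, but one widely used DC approximation in the DCA literature (e.g. Le Thi et al., 2015) is the concave exponential function sketched below; it is given here only as a representative choice under that assumption.

```latex
% A common DC surrogate of the zero-norm (illustrative choice, alpha > 0):
\[
  \|\beta\|_0 \;\approx\; \sum_{i} \eta_\alpha(\beta_i),
  \qquad
  \eta_\alpha(t) = 1 - e^{-\alpha |t|},
\]
% which admits the DC decomposition  eta_alpha = g - h  with both parts convex:
\[
  \eta_\alpha(t)
  = \underbrace{\alpha |t|}_{\text{convex}}
  - \underbrace{\bigl(\alpha |t| - 1 + e^{-\alpha |t|}\bigr)}_{\text{convex}} .
\]
% Substituting this surrogate into  min_beta ||beta||_0 + lambda ||Q beta - T||_2^2
% gives a DC program; each DCA iteration then reduces to an l1-regularized
% least-squares subproblem with an additional linear term.
```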

Experiments

To evaluate the proposed algorithms, numerical simulations are carried out on various datasets. In the first part, we run the proposed algorithms on ten benchmark datasets from the UCI Machine Learning Repository (Blake and Merz, 1998). In the second part, experiments are conducted on a practical application dataset. We perform ten-fold cross-validation on all considered datasets. In other words, each dataset is split randomly into ten subsets, and one of those subsets is reserved as a test set. This
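
A minimal sketch of this ten-fold protocol is given below, assuming NumPy arrays and scikit-learn's KFold splitter; `train_and_score` is a hypothetical placeholder for any of the compared classifiers, not a function from the paper.

```python
import numpy as np
from sklearn.model_selection import KFold

def ten_fold_cv(X, T, train_and_score, seed=0):
    """Split the data into ten random folds; each fold serves once as the
    test set while the remaining nine folds are used for training."""
    kf = KFold(n_splits=10, shuffle=True, random_state=seed)
    scores = []
    for train_idx, test_idx in kf.split(X):
        scores.append(train_and_score(X[train_idx], T[train_idx],
                                      X[test_idx], T[test_idx]))
    return float(np.mean(scores)), float(np.std(scores))
```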

Conclusions

The number of hidden-layer nodes is an important index for designing and training ELM networks. In this investigation, based on the LSE and LAD methods, we propose a sparse ELM learning framework that automatically chooses the optimal hidden nodes so as to ensure better generalization performance with a smaller VC dimension. However, the discontinuity of the zero-norm makes the resulting problems difficult to solve. The main contributions of this work are:

(1) We propose four sparse ELM formulations with zero-norm

Acknowledgements

This work is supported by the National Natural Science Foundation of China (Grants 11471010 and 11271367).

