BY-NC-ND 3.0 license Open Access Published by De Gruyter September 29, 2015

SVM–ELM: Pruning of Extreme Learning Machine with Support Vector Machines for Regression

  • Saif F. Mahmood, Mohammad H. Marhaban, Fakhrul Z. Rokhani, Khairulmizam Samsudin and Olasimbo Ayodeji Arigbabu

Abstract

Extreme Learning Machine provides very competitive performance to other related classical predictive models for solving problems such as regression, clustering, and classification. An ELM possesses the advantage of faster computational time in both training and testing. However, one of the main challenges of an ELM is the selection of the optimal number of hidden nodes. This paper presents a new approach to node selection of an ELM based on a 1-norm support vector machine (SVM). In this method, the targets of SVM yi ∈{+1, –1} are derived using the mean or median of ELM training errors as a threshold for separating the training data, which are projected to SVM dimensions. We present an integrated architecture that exploits the sparseness in solution of SVM to prune out the inactive hidden nodes in ELM. Several experiments are conducted on real-world benchmark datasets, and the results attained attest to the efficiency of the proposed method.

1 Introduction

The extreme learning machine (ELM) [16], a new paradigm in the machine learning community, has attracted a significant amount of research interest in the past few years. The predictive model is a generalized form of the single-layer feedforward neural network (SLFN), which has only one hidden layer with randomly generated hidden neurons. SLFNs are networks with no backward or lateral connections between nodes. One of the main advantages of the ELM is its fast computational time during both training and testing. In addition, the classifier can attain good generalization without the need to tune the network parameters, unlike conventional neural networks. ELMs have been widely applied in several supervised, semi-supervised, and unsupervised learning problems [14, 15, 25, 34]. Conceptually, the main idea behind the ELM algorithm is that the weights between the input and hidden layer nodes can be randomly assigned, while the weights between the hidden layer and output nodes can be derived by solving a system of linear equations. Furthermore, Huang et al. [4] pointed out that an ELM with randomly assigned hidden neurons and a continuous, bounded, and non-constant activation function can approximate any continuous target function.

In spite of the several contributions on the ELM, there are some underlying issues that need to be investigated to make the implementation of the ELM more feasible for real-world applications. Specifically, one of the open research problems in the ELM is finding a compact and optimal network architecture, which can eventually result in a robust model applicable to various learning tasks. The selection of an optimal number of nodes is closely related to the problem of curve fitting using polynomials [17]: intuitively, too many coefficients result in overfitting, while fewer coefficients improve fitting of the generative function to the input data. In practice, the ELM tends to require a larger number of hidden nodes than conventional tuning-based algorithms. Analytically, however, a certain number of the hidden nodes in such networks may have only a minor influence on the output of the network while increasing the network complexity [27]. Moreover, when the number of hidden nodes exceeds the number of training samples, the ELM may face the singularity problem and the system may become unstable [23].

Therefore, in most cases, the suitable number of hidden neurons can be selected by gradually increasing the number of nodes with respect to the prediction error, which is time consuming and tedious. However, researchers have suggested that the network structure should not be computed using the trial-and-error method, but should rather be optimized or selected using learning algorithms [1]. There are two main approaches popularly adopted in determining the appropriate structure of SLFN, namely, the growing and pruning methods [18]. In this context, the growing method can be described as an incremental learning approach, where the number of hidden layer nodes is gradually increased one by one or group by group, while updating the output weights, until the optimal number of nodes is derived [4]. Such techniques have been extensively investigated in previous reports, for instance, the error minimization approach for incremental learning suggested by Feng et al. [9] and its extensions [4, 9].

The pruning-based method is a destructive learning technique that initially considers a large number of hidden nodes. Then, using statistical analysis, the non-contributing nodes can be pruned out or removed from the network [21, 26]. Rong et al. [26] proposed the fast pruned ELM (P-ELM), which starts with a large number of nodes and, using the χ2 criterion and information gain (IG), removes the nodes with low correlation. Miche et al. [21] proposed the optimally pruned ELM (OP-ELM), an improvement over P-ELM that applies a multiresponse sparse regression (MRSR) algorithm to rank the hidden nodes; leave-one-out (LOO) validation is then utilized to choose the optimal number of hidden nodes. However, one of the drawbacks of the pruning approach is that increasing the initial number of nodes increases the model complexity and eventually the training time. Using a hybrid approach, the pruning and growing methods have been successfully applied in fuzzy neural networks [29, 31, 32]. In addition, constructive and destructive parsimonious ELM (CP- and DP-ELM) for multi-input multi-output (MIMO) SLFNs has been proposed, with exemplar application to noisy data samples [30]. Also, adopting the concept of principal component analysis (PCA), a deterministic approach named PCA-ELM has been proposed to address the selection of the ELM architecture by relying on the information retrieved from the PCA to determine the hidden nodes [7]. However, as PCA follows the fundamental assumption that the data are normally distributed, this deterministic method may be limited in providing an efficient solution in the case of noisy data.

It is evident that hidden node selection is an interesting problem and still very much an open area of research. Our investigation in this paper is inspired by the model proposed by Han and Yin [11], where a 1-norm support vector machine (SVM) is utilized to pre-select the hidden neurons, followed by a stepwise selection algorithm based on ridge regression for choosing the optimal hidden neurons of wavelet networks. The main objective of that algorithm is to eliminate the problem of including a large number of irrelevant neurons in wavelet networks.

In this paper, a destructive hidden node selection technique for the ELM is introduced to increase the system's generalization performance and reduce the computational time required for finding a sufficient number of nodes. We base the application of the proposed model on regression tasks. Huang et al. [15] have pointed out that using a large number of hidden nodes enhances the generalization performance; however, there may be many redundant and irrelevant nodes among the initial random nodes. This can be treated as a feature selection problem using a 1-norm SVM [11]. By projecting the approximated hidden node matrix of the ELM onto the SVM dimension, the resulting support vectors of the SVM identify the significant hidden layer nodes of the ELM. This method is named SVM–ELM. The rest of the paper is organized as follows. In Section 2, a brief review of the ELM and SVM is presented. Section 3 describes in detail the proposed node selection method based on a 1-norm SVM. Section 4 provides the experimental results for validating the efficiency of the proposed model. Finally, a conclusion is provided in Section 5.

2 Brief Overview of ELM and 1-Norm SVM

This section provides a brief mathematical illustration of the concepts of the ELM and 1-norm SVM.

2.1 Extreme Learning Machine

The ELM is a simple yet efficient method of approximating input data using an SLFN. Recently, the ELM has been extensively adopted in many applications as the learning algorithm for SLFNs. This is basically due to its fast learning and to the aspects of the ELM algorithm that overcome shortcomings of the back-propagation artificial neural network (BPANN), such as parameter tuning and the local minima problem.

For N arbitrary distinct samples $(x_i, \tau_i) \in \mathbb{R}^n \times \mathbb{R}^m$, the standard SLFN with M hidden nodes and an activation function $\Omega(\cdot)$ can be expressed as [15]:

(1) $\sum_{i=1}^{M} Q_i \Omega_i(x_j) = \sum_{i=1}^{M} Q_i \,\Omega(\langle x_j, c_i \rangle + d_i) = \tau_j, \quad j = 1, \ldots, N$

where $c_i = [c_{i1}, c_{i2}, \ldots, c_{in}]^T$ is the randomly generated weight vector connecting the input nodes to the ith hidden node, $Q_i = [Q_{i1}, Q_{i2}, \ldots, Q_{im}]^T$ is the output weight vector connecting the ith hidden node to the output nodes, and $d_i$ is the bias of the ith hidden node. $\langle x_j, c_i \rangle$ denotes the inner product of $x_j$ and $c_i$. The output of the ith hidden node with respect to the input sample $x_j$ is $\Omega(\langle x_j, c_i \rangle + d_i)$. Equation (1) can be written compactly as [15]:

(2) $KQ = T$

where

$K = \begin{bmatrix} \Omega(\langle x_1, c_1 \rangle + d_1) & \cdots & \Omega(\langle x_1, c_M \rangle + d_M) \\ \vdots & \ddots & \vdots \\ \Omega(\langle x_N, c_1 \rangle + d_1) & \cdots & \Omega(\langle x_N, c_M \rangle + d_M) \end{bmatrix}_{N \times M}$, $Q = \begin{bmatrix} Q_1 \\ Q_2 \\ \vdots \\ Q_M \end{bmatrix}_{M \times 1}$ and $T = \begin{bmatrix} \tau_1 \\ \tau_2 \\ \vdots \\ \tau_N \end{bmatrix}_{N \times 1}$.

K is called the hidden layer output matrix of the SLFN. The ith column of K represents the output of the ith hidden node with respect to the inputs $x_1, x_2, \ldots, x_N$, and the row vector $[\Omega(\langle x_j, c_1 \rangle + d_1), \ldots, \Omega(\langle x_j, c_M \rangle + d_M)]$ is called the hidden layer feature mapping. In the ELM, to train an SLFN, one seeks a solution such that [16]:

(3) $\|K(\hat{c}_1, \ldots, \hat{c}_M, \hat{d}_1, \ldots, \hat{d}_M)\hat{Q} - T\| = \min_{c_i, d_i, Q} \|K(c_1, \ldots, c_M, d_1, \ldots, d_M)Q - T\|$.

Equation (3) is equivalent to minimizing the cost function

(4) $E = \sum_{j=1}^{N} \left( \sum_{i=1}^{M} Q_i \,\Omega(\langle x_j, c_i \rangle + d_i) - \tau_j \right)^2$.

Based on the theories of the ELM, almost any nonlinear piecewise continuous function used as the feature mapping enables the ELM to attain universal approximation capability [19]. For fixed input weights $c_i$ and hidden layer biases $d_i$, training an SLFN is simply equivalent to finding the specific output weight vector $\hat{Q}$ such that [19]:

(5) $\|K\hat{Q} - T\| = \|K K^{\dagger} T - T\| = \min_{Q} \|KQ - T\|$.

The ELM algorithm can be summarized as follows [19]:

Input: $(x_i, \tau_i) \in \mathbb{R}^n \times \mathbb{R}^m$, $i = 1, \ldots, N$ // Training dataset

$\Omega(\langle x_j, c_i \rangle + d_i)$ // Activation function of the hidden layer nodes

M // Number of hidden nodes

Output: Q // The output weight vector

  1. // Step 1: randomly assign the hidden node parameters $(c_i, d_i)$, $i = 1, \ldots, M$
  2. for i = 1 : M do
  3.   $c_i$, $d_i$ ← randomly generated
  4. end
  5. // Step 2: compute the hidden layer output matrix K
  6. for i = 1 : M do
  7.   for j = 1 : N do
  8.     $K(j, i) = \Omega(\langle x_j, c_i \rangle + d_i)$
  9.   end
  10. end
  11. // Step 3: calculate the output weight vector Q
  12. $Q = K^{\dagger} T$

The most important aspect of the ELM is that it learns the output weight vector Q by minimizing the cost function in equation (4), which is equivalent to finding the specific $\hat{Q}$ in equation (5). Thus, the ELM learns Q by:

(6) $\hat{Q} = K^{\dagger} T$

where $K^{\dagger}$ is the Moore–Penrose generalized inverse of the matrix K. Therefore, the linear system of the ELM in equation (2) can be solved by finding the least-squares solution in equation (6) to obtain the best value of the output weights. According to Bartlett's theory [3], networks that reach the smallest training error with the smallest norm of output weights tend to have better generalization performance, which can be achieved through the ELM.

The Moore–Penrose generalized inverse of K can be computed using various numerical methods. Orthogonal projection is one example used in the ELM: $K^{\dagger} = (K^T K)^{-1} K^T$ if $K^T K$ is nonsingular, or $K^{\dagger} = K^T (K K^T)^{-1}$ if $K K^T$ is nonsingular. Nonetheless, other approaches such as singular value decomposition and the fast Moore–Penrose pseudo-inverse have also been studied [12].
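For concreteness, the following is a minimal NumPy sketch of the ELM training and prediction steps described above (random hidden layer, sigmoid activation, output weights via the Moore–Penrose pseudo-inverse). The function names and the choice of a sigmoid activation are illustrative and are not taken from the authors' implementation.

```python
import numpy as np

def elm_train(X, T, M, seed=0):
    """Train a basic ELM: X is an N x n input matrix, T is the target vector
    (or N x m target matrix), M is the number of hidden nodes."""
    rng = np.random.default_rng(seed)
    C = rng.standard_normal((M, X.shape[1]))   # random input weights c_i
    d = rng.standard_normal(M)                 # random biases d_i
    K = 1.0 / (1.0 + np.exp(-(X @ C.T + d)))   # hidden layer output matrix, N x M
    Q = np.linalg.pinv(K) @ T                  # output weights, eq. (6): Q = K^dagger T
    return C, d, Q

def elm_predict(X, C, d, Q):
    """Evaluate the trained ELM on new inputs."""
    K = 1.0 / (1.0 + np.exp(-(X @ C.T + d)))
    return K @ Q
```

When $K^T K$ is well conditioned, the pseudo-inverse line could equivalently be replaced by the orthogonal-projection form `np.linalg.solve(K.T @ K, K.T @ T)` mentioned above.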

2.2 Support Vector Machine

The SVM is a learning approach motivated by statistical learning theory [8]. The SVM algorithm is based on three main aspects [6, 24]:

  • Training: used to estimate the parameters from a dataset.

  • Testing: to determine the function value.

  • Evaluation: to derive generalization ability and performance.

The SVM involves projecting the input data $\{x_1, x_2, \ldots, x_N\}$ into a high-dimensional feature space using a mapping function $G(\cdot)$. A linear hyperplane can then be found in the feature space to separate the data. The generalized equation of the SVM can be expressed as follows [8]:

(7) $G(x) = w^T x + b$

where w and b are the normal vector and bias that define the hyperplane, and $w^T x$ denotes the dot product between the normal and the input x. To solve for the two unknowns (w, b) in equation (7), the following primal problem has to be minimized [8]:

(8) $\min \; \frac{1}{2} w^T w \quad \text{s.t.} \quad y_i (w \cdot x_i + b) \ge 1$

The problem in equation (8) is a convex quadratic program, and the constraints can be incorporated using Lagrange multipliers $\alpha_i \ge 0$ via the Lagrangian formulation. Therefore, the optimization problem can be formulated as [8]:

(9) $\text{Minimize: } L_P(w, b, \alpha) = \frac{1}{2} w^T w - \sum_i \left( \alpha_i y_i (w \cdot x_i + b) - \alpha_i \right)$
(10) $\text{s.t.} \quad \sum_i \alpha_i y_i = 0, \qquad w = \sum_i \alpha_i y_i x_i$.

Theoretically, there are two optimal formulations of the SVM, namely, the dual and the primal. For the dual form, the Wolfe dual expression [6] is substituted into equation (9) and expressed as:

(11) $L_D = \sum_i \alpha_i - \frac{1}{2} \sum_{i,j} \alpha_i \alpha_j y_i y_j \, x_i \cdot x_j$.

After training, the SVM assigns a new point $x^*$ to a class (i.e., a side of the decision boundary) based on $\mathrm{sgn}(w^T x^* + b)$; the points with $\alpha_i > 0$, which lie closest to the hyperplane, are the support vectors. In the primal form, the Karush–Kuhn–Tucker (KKT) conditions are introduced, which are expressed as follows [6, 24]:

(12) $\partial L_P / \partial w = 0$
(13) $\partial L_P / \partial b = 0$
(14) $y_i (w \cdot x_i + b) - 1 \ge 0, \quad i = 1, \ldots, d$
(15) $\alpha_i \ge 0 \quad \forall i$
(16) $\alpha_i \left( y_i (w \cdot x_i + b) - 1 \right) = 0 \quad \forall i$.

The KKT conditions are crucial for w, b, and α to be computed optimally. Moreover, it is important to mention that while w is determined explicitly by the training procedure, the value of b is determined using the KKT complementarity condition (16) [6].

Furthermore, SVMs can be regularized using the penalty parameter C. For the 1-norm SVM with penalty C, the tradeoff between the training error and the margin can be formulated as:

(17) $\frac{1}{2} \|w\|_1 + C \sum_{i=1}^{l} \xi_i$

where $\|w\|_1$ denotes the 1-norm of w, and $\xi_i \ge 0$, $i = 1, 2, \ldots, l$, are the slack variables that measure the margin violations [20].
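As a rough illustration of the sparsity that an L1 penalty on w induces, the snippet below fits scikit-learn's L1-regularized linear SVM, which uses a squared hinge loss rather than the linear-programming formulation of equation (17), on synthetic data and compares the number of non-zero weights against the usual 2-norm penalty. The dataset and parameter values are arbitrary.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.svm import LinearSVC

# Synthetic two-class data: 50 features, only 5 of which are informative.
X, y = make_classification(n_samples=200, n_features=50, n_informative=5,
                           random_state=0)

# L1 penalty on w (sparse solution) vs. the standard L2 penalty.
svm_l1 = LinearSVC(penalty="l1", loss="squared_hinge", dual=False,
                   C=0.1, max_iter=10000).fit(X, y)
svm_l2 = LinearSVC(penalty="l2", dual=False, C=0.1, max_iter=10000).fit(X, y)

print("non-zero weights with L1 penalty:", np.count_nonzero(svm_l1.coef_))
print("non-zero weights with L2 penalty:", np.count_nonzero(svm_l2.coef_))
```

With the L1 penalty, most entries of the weight vector are driven exactly to zero, which is the property exploited in the next section.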

3 Proposed SVM–ELM Node Selection

The selection of the number of nodes is one of the underlying challenges when utilizing an ELM for pattern analysis. For instance, if the number of hidden nodes M is equal to the number N of distinct training samples, the matrix K is square and invertible; however, the resulting model overfits.

In this section, we present an SVM–ELM method for node selection of the ELM regression model based on the SVM classification with 1-norm, as shown in Figure 1. This can be related to the problem of finding the best number of features with SVM [11] or the optimal rule selection [22].

Figure 1: Structure of the Proposed SVM–ELM.

Our aim is to optimize the number of hidden nodes by projecting the hidden layer matrix (already approximated with an activation function) onto the dimension of the 1-norm SVM. The focus of this paper is on the regression problem; thus, to enable the SVM (a binary classification method) to perform node selection, the targets and the input (approximated) matrix for the SVM are generated using either the mean or the median of the ELM regression model's training error. Each node in the approximated matrix K fed to the SVM geometry is assigned to either −1 or +1. The sparsity of the 1-norm SVM solution is then exploited to eliminate inactive nodes. The set of support vectors that lie on the optimal separating hyperplane of the 1-norm SVM is used to select the corresponding candidate hidden neurons that are sufficient for training the ELM.

Generally, in machine learning, better sparseness corresponds to a smaller number of non-zero coefficients [33]. The standard SVM tends to have a number of non-zero coefficients equal to the number of support vectors [10]. In the 1-norm SVM formulation, which can be solved with linear programming, the objective is to minimize ‖w‖1, and related studies have shown that this yields better sparsity. In essence, a significant number of the input data points play only a minor role in determining the coefficients of the model. The sparse coefficients of the 1-norm SVM are the support vectors, which correspond directly to the candidate hidden neurons (active nodes) in the ELM.

The network architecture of the proposed SVM–ELM is assigned an initial number of hidden nodes based on the training dataset, and, as in the standard ELM, all parameters of the network are randomly generated. Our method is composed of three main phases:

  1. Regression analysis and error evaluation of ELM by calculating the root mean square error (RMSE) for the training model.

    Consider input data D, which, combined with the randomly generated hidden nodes, forms the input training matrix R consisting of N-dimensional vectors $x_1, \ldots, x_N$ with associated targets $t_1, \ldots, t_N$. The formulation of ELM training is simply the linear system represented by equation (2). The training error vector E can be computed as the difference between the actual output $y_a$ and the desired output $y_d$ as follows:

    (18) $E = y_a - y_d$.
  2. Definition of a criteria value (mean or median of training error) to divide the approximated hidden node matrix into positive and negative classes.

    A criteria value V (mean or median of training error) is used to separate the prediction error into two classes based on the following expression:

    (19) $E(x_N) \le V, \qquad E(x_N) > V$.

    Hence, we generate a new target vector T with respect to the distribution of the errors by formalizing the expression as follows:

    (20) $T(t_1, \ldots, t_N) = \begin{cases} 1 & \text{if } E(x_N) \le V \\ -1 & \text{otherwise} \end{cases}$

    Subsequently, T is utilized to divide the R matrix into two classes, positive and negative. As the matrix R consists of N-dimensional vectors, its dimension is simply the number of random hidden neurons M by the number of input training samples N; therefore, the newly derived matrix $R_{\text{new}}$ can be represented as:

    $R_{\text{new}} = \begin{bmatrix} x_{1,1} & \cdots & x_{1,N} \\ \vdots & \ddots & \vdots \\ x_{M,1} & \cdots & x_{M,N} \end{bmatrix} \begin{bmatrix} 1 \\ \vdots \\ -1 \end{bmatrix}.$

  3. Training of the 1-norm SVM to obtain the sparse coefficients. Discard all zero elements from the generated model and consider the support vectors as the active nodes, as shown in Figure 1.

    To obtain the number of hidden neurons, we now consider SVM classification for a two-class problem, where the input data are $R_{\text{new}}$ with class targets T. The decision function of the SVM can be computed using equation (7), and the optimization of the primal of the standard SVM to obtain the output coefficients is achieved using equation (9). Note, however, that in this paper we exploit the 1-norm SVM; therefore, the minimization is based on equation (17), and the resulting solution is very sparse. The non-zero coefficients (w) are regarded as the candidate (active) hidden neurons of the ELM. Finally, we can formally express the formulation of SVM–ELM as:

    (21) $\min \; \frac{1}{2} \|w\|_1 + C \sum_{i=1}^{N} \xi_i$

    $\text{s.t.} \quad w^T x + b = \begin{cases} 1 & E(x_N)_{\mathrm{ELM}} \le V \\ -1 & E(x_N)_{\mathrm{ELM}} > V \end{cases}$

We summarize the steps of the hidden neuron generation in the following (a minimal code sketch follows the list):

  1. Input: data matrix D.

  2. Add random initial weights and biases to D to derive the matrix R.

  3. Run ELM on the matrix R and get the prediction errors E.

  4. Set a criteria value V.

  5. Evaluate the set of errors [e1, e2, …, en ] in the prediction error vector E.

  6. Find error values that satisfy equation (20), to generate a new target vector T.

  7. Divide the matrix R with respect to T into a new matrix Rnew containing both positive and negative classes.

  8. Consider matrix Rnew as the input matrix to 1-norm SVM.

  9. Train 1-norm SVM to generate a model.

  10. Find the non-zero elements and discard the solutions containing zeroes.

  11. Output: Total number of non-zero elements, which is equivalent to the number of active hidden neurons.
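The following is a minimal end-to-end sketch of these steps under one possible reading of the procedure: each training sample (a row of the hidden layer output matrix K) becomes an SVM example labeled by comparing its ELM training error with the criteria value V, the hidden nodes play the role of features, and the non-zero entries of the sparse weight vector returned by an L1-penalized linear SVM (used here as a stand-in for the linear-programming 1-norm SVM) indicate the active nodes. All function and variable names are illustrative, not the authors' code.

```python
import numpy as np
from sklearn.svm import LinearSVC

def svm_elm_select_nodes(X, T, M_init, C_reg=1.0, use_median=True, seed=0):
    """X: N x n training inputs, T: N-dimensional regression targets,
    M_init: initial number of random hidden nodes."""
    rng = np.random.default_rng(seed)
    # Steps 1-3: random hidden layer, ELM regression, training errors (eq. (18)).
    Cw = rng.standard_normal((M_init, X.shape[1]))
    d = rng.standard_normal(M_init)
    K = 1.0 / (1.0 + np.exp(-(X @ Cw.T + d)))        # N x M_init
    Q = np.linalg.pinv(K) @ T
    E = K @ Q - T
    # Steps 4-6: criteria value V and binary targets (eq. (20)).
    V = np.median(E) if use_median else np.mean(E)
    y = np.where(E <= V, 1, -1)
    # Steps 7-10: sparse L1-penalized linear SVM over the hidden-node "features".
    svm = LinearSVC(penalty="l1", loss="squared_hinge", dual=False,
                    C=C_reg, max_iter=10000).fit(K, y)
    active = np.flatnonzero(svm.coef_.ravel())       # indices of active nodes
    # Step 11: retrain the ELM output weights using only the selected nodes.
    Q_new = np.linalg.pinv(K[:, active]) @ T
    return Cw[active], d[active], Q_new, active
```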

4 Experimental Evaluation

In this section, the proposed SVM–ELM is evaluated on 10 real-world regression problems from the UCI ML repository [5] and StatLib [28]. Table 1 shows the composition of the 10 regression datasets.

In all the experiments, the input data are divided into training and testing sets by random permutation without replacement, and each experiment is repeated 50 times. The training and testing data are normalized to the range between 0 and 1. All simulations are carried out in a MATLAB 12.b environment running on a desktop with a 2.66 GHz CPU and 4 GB RAM. In this paper, the root mean square error (RMSE) is used as the evaluation metric for the regressor.

Table 1

Specification of Benchmark Datasets.

Data sets | Attributes | Observations | Training | Testing
CCPP | 4 | 9568 | 7000 | 2568
δ Elevators | 6 | 9517 | 4000 | 5517
Machine CPU | 10 | 103 | 75 | 28
Istanbul stock | 8 | 536 | 350 | 186
Bank domains | 32 | 4500 | 3000 | 1500
Airfoil self-noise | 8 | 1503 | 800 | 379
Body fat | 15 | 252 | 176 | 76
Autompg | 8 | 392 | 300 | 92
Yacht hydrodynamics | 7 | 308 | 200 | 108
Concrete slump | 10 | 103 | 70 | 33

To investigate the robustness of the proposed SVM–ELM method for designing a compact ELM network structure, we evaluate the impact of the initial number of hidden nodes on the resulting network architecture complexity and prediction accuracy using different activation functions. We illustrate an example of this experiment with the bank domains dataset, as shown in Table 2.

Table 2

Investigation of the Effect of Initial Number of Nodes on the Final SVM–ELM Network Architecture Complexity and Testing Results.

Activation function | No. of initial nodes | No. of SVM–ELM nodes | Testing result (RMSE) | Training time (s)
Sigmoid | 500 | 378 | 0.6327 | 1.1250
Sigmoid | 1000 | 587 | 0.6883 | 3.0469
Sigmoid | 2000 | 1047 | 0.7395 | 8.5938
Sigmoid | 3000 | 1592 | 0.9687 | 19.7813
RBF | 500 | 419 | 1.0169 | 1.7656
RBF | 1000 | 825 | 0.7166 | 5.7969
RBF | 2000 | 1585 | 0.9925 | 22.8281
RBF | 3000 | 2204 | 1.6892 | 40.1719
Hard lim | 500 | 414 | 0.6306 | 1.4219
Hard lim | 1000 | 815 | 0.7052 | 5.7031
Hard lim | 2000 | 1529 | 0.9745 | 21.1406
Hard lim | 3000 | 2032 | 2.2077 | 35.6719

The results indicated that SVM–ELM optimizes the number of nodes effectively by selecting the support vectors as the active nodes, which also corresponds to feature selection with 1-norm SVM, as the set of features that lie on the margin carry the most valuable information [2]. For example, in the network with sigmoid function in Table 2, it can be seen that the initial number of nodes is significantly reduced from 500 to 378, and from 3000 to 1592. Consequently, it can be observed that the computational time for training is very low.

4.1 Benchmark Regression Problems

Here, we present the performance of SVM–ELM for the regression task on real-world application datasets. Each experiment was repeated 50 times, and the results presented in this section are the averages over the 50 trials. Following the approach described in [26], the initial number of nodes was selected by trial and error depending on the size of the input data. In terms of performance accuracy, SVM–ELM, as shown in Table 3, provides better accuracy owing to the removal of the redundant or inactive hidden nodes. For instance, on the delta elevators dataset with 3000 initial nodes, the standard ELM attained an RMSE of 0.3050, while SVM–ELM yielded an RMSE of 0.0027.

Table 3

The Performance Comparison between ELMsig and SVM–ELMsig Methods.

Data sets | Methods | No. of initial nodes | No. SV (nodes) | Testing accuracy (RMSE) | Training time (s)
Istanbul stock | SVM–ELM | 250 | 4 | 0.0142 | 0.4219
Istanbul stock | ELM | 250 | – | 0.0829 | 0.1563
Machine CPU | SVM–ELM | 80 | 16 | 0.1507 | 0.1251
Machine CPU | ELM | 80 | – | 3.3719 | 0.0313
CCPP | SVM–ELM | 3000 | 436 | 5.8430 | 187.9
CCPP | ELM | 3000 | – | 6.8399 | 118.3
Bank domains | SVM–ELM | 2000 | 1047 | 1.0552 | 50.57
Bank domains | ELM | 2000 | – | 1.6045 | 32.95
Delta Elevators | SVM–ELM | 3000 | 1852 | 0.0027 | 61.8750
Delta Elevators | ELM | 3000 | – | 0.3050 | 80.6563
Airfoil self-noise | SVM–ELM | 800 | 526 | 1.09e-6 | 1.75
Airfoil self-noise | ELM | 800 | – | 2.25e-5 | 1.2031
Body fat | SVM–ELM | 100 | 8 | 0.6726 | 0.0613
Body fat | ELM | 100 | – | 0.1644 | 0.1074
Autompg | SVM–ELM | 300 | 142 | 2.34e-5 | 0.1563
Autompg | ELM | 300 | – | 4.37e-4 | 0.0938
Yacht hydrodynamics | SVM–ELM | 100 | 35 | 2.4908 | 0.2187
Yacht hydrodynamics | ELM | 100 | – | 9.5863 | 0.0156
Concrete slump | SVM–ELM | 60 | 7 | 0.1488 | 0.0315
Concrete slump | ELM | 60 | – | 0.2483 | 0.0313

Moreover, the additional time required for training SVM–ELM compared with the standard ELM is marginal. For a small dataset (Machine CPU) with a training size of 75, the training time for ELM is 0.0313 s, while that of SVM–ELM is 0.1251 s. For a larger dataset (CCPP) with a training size of 7000, the training time for ELM is 118.3 s and that of SVM–ELM is 187.9 s. Hence, the difference between the training times of ELM and SVM–ELM is the time spent on hidden node optimization with the 1-norm SVM. The SVM is able to prune redundant nodes while maintaining the good performance of the ELM within a small additional training time. In essence, from the results in Table 3, it can be observed that SVM–ELM minimizes overfitting of the model.

Furthermore, as the tradeoff between a wider margin and the training error in the SVM algorithm is determined by the parameter C, we present in Figure 2 the optimal number of nodes derived with the best value of C. Based on the results, the proposed SVM–ELM method selects the C value that increases the accuracy of the selected network while minimizing the complexity of the model.

Figure 2: Impact of Regularization Factor on Number of Nodes.

4.2 Comparison Between Mean and Median

In this paper, the targets of SVM yi ∈{+1, –1} are derived by using the mean or median of the ELM training errors as a threshold. One issue that may arise is the presence of outliers, which can severely bias the threshold and create an imbalanced dataset (with respect to class distribution). To evaluate the most effective method for data separation, we performed different experiments using either mean or median for data separation, as shown in Table 4.

Table 4

The Performance Comparison between Mean and Median of ELM Training Error.

Datasets | No. of nodes | Mean | Median
Istanbul stock | 250 | 0.0142 | 0.0130
Machine CPU | 16 | 0.1507 | 0.1490
CCPP | 436 | 5.8430 | 187.9
Bank domains | 1047 | 1.0552 | 1.0293
δ Elevators | 1852 | 0.0027 | 0.0021
Airfoil self-noise | 526 | 1.09e-6 | 1.08e-6
Body fat | 8 | 0.0613 | 0.0611
Autompg | 142 | 2.34e-5 | 1.82e-5
Yacht hydrodynamics | 35 | 2.4908 | 2.0311
Concrete slump | 17 | 0.1488 | 0.1406

We discovered that the most effective statistic to use is the median of errors, which would always provide a balanced dataset for classification, as illustrated in Table 4. Based on the results, the use of the median as the threshold for partitioning the R matrix provides a slightly better performance than the mean.
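A quick numeric illustration of this point (synthetic error values, not from the paper's experiments): when a few large outliers are present, the mean threshold produces a highly imbalanced split, whereas the median splits the errors evenly by construction.

```python
import numpy as np

rng = np.random.default_rng(0)
# 95 "ordinary" training errors plus 5 large outliers.
E = np.concatenate([rng.normal(0.0, 0.1, 95), rng.normal(5.0, 0.1, 5)])

for name, V in [("mean", E.mean()), ("median", np.median(E))]:
    y = np.where(E <= V, 1, -1)
    print(f"{name:6s} threshold: {np.sum(y == 1)} positive, {np.sum(y == -1)} negative")
# mean   threshold: ~95 positive, ~5 negative (imbalanced)
# median threshold: 50 positive, 50 negative  (balanced)
```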

5 Conclusion

This paper presented a new approach for selecting the optimal number of hidden nodes for the ELM. In practical applications, the selection of the hidden neurons of the ELM is considered crucial, especially for real-time problems. We introduced a new algorithm (SVM–ELM) for selecting the candidate neurons of the ELM by optimizing the network architecture with a 1-norm SVM. Initially, an ELM regression model is trained to obtain the model error. Then, a criteria value, based on either the mean or the median of the model's error, is used to separate the approximated data matrix for training the 1-norm SVM. The small number of support vectors that lie on the optimal separating hyperplane are considered as the candidate hidden neurons of the ELM. SVM–ELM has been evaluated using 10 regression datasets from the UCI ML repository and StatLib. The results indicated the effectiveness of the proposed node selection method, both in terms of training time and prediction performance.


Corresponding author: Saif F. Mahmood, Faculty of Engineering, Universiti Putra Malaysia, 43400 UPM Serdang, Selangor, Malaysia, e-mail:

Bibliography

[1] E. Alpaydin, Gal: networks that grow when they learn and shrink when they forget, Int. J. Pattern Recognit. Artif. Intell. 8 (1994), 391–414. doi:10.1142/S021800149400019X.

[2] E. Alpaydin, Introduction to machine learning, MIT Press, 2004.

[3] P. L. Bartlett, The sample complexity of pattern classification with neural networks: the size of the weights is more important than the size of the network, IEEE Trans. Inf. Theory 44 (1998), 525–536. doi:10.1109/18.661502.

[4] G.-B. Huang, L. Chen and C. K. Siew, Universal approximation using incremental constructive feedforward networks with random hidden nodes, IEEE Trans. Neural Networks 17 (2006), 879–892. doi:10.1109/TNN.2006.875977.

[5] C. Blake and C. Merz, UCI repository of machine learning databases, School of Information and Computer Sciences, University of California, Irvine, 2007. http://www.ics.uci.edu/mlearn/MLRepository.html.

[6] C. J. C. Burges, A tutorial on support vector machines for pattern recognition, Data Min. Knowl. Discov. 2 (1998), 121–167.

[7] A. Castaño, F. Fernández-Navarro and C. Hervás-Martínez, PCA-ELM: a robust and pruned extreme learning machine approach based on principal component analysis, Neural Process. Lett. 37 (2012), 377–392. doi:10.1007/s11063-012-9253-x.

[8] N. Cristianini and J. Shawe-Taylor, An introduction to support vector machines and other kernel-based learning methods, Cambridge University Press, Cambridge, 2000. doi:10.1017/CBO9780511801389.

[9] G. Feng, G. Huang, Q. Lin and R. Gay, Error minimized extreme learning machine with growth of hidden nodes and incremental learning, IEEE Trans. Neural Networks 20 (2009), 1352–1357. doi:10.1109/TNN.2009.2024147.

[10] G. M. Fung and O. L. Mangasarian, A feature selection Newton method for support vector machine classification, Comput. Optim. Appl. 28 (2004), 185–202. doi:10.1023/B:COAP.0000026884.66338.df.

[11] M. Han and J. Yin, The hidden neurons selection of the wavelet networks using support vector machines and ridge regression, Neurocomputing 72 (2008), 471–479. doi:10.1016/j.neucom.2007.12.009.

[12] P. Horata, S. Chiewchanwattana and K. Sunat, A comparative study of pseudo-inverse computing for the extreme learning machine classifier, in: Proceedings of the 3rd International Conference on Data Mining and Intelligent Information Applications, pp. 40–45, 2011.

[13] G.-B. Huang and L. Chen, Enhanced random search based incremental extreme learning machine, Neurocomputing 71 (2008), 3460–3468. doi:10.1016/j.neucom.2007.10.008.

[14] G. Huang, S. Song, J. N. D. Gupta and C. Wu, Semi-supervised and unsupervised extreme learning machines, IEEE Trans. Cybern. 44 (2014), 1–13. doi:10.1109/TCYB.2014.2307349.

[15] G.-B. Huang, H. Zhou, X. Ding and R. Zhang, Extreme learning machine for regression and multiclass classification, IEEE Trans. Syst. Man Cybern. Part B 42 (2012), 513–529. doi:10.1109/TSMCB.2011.2168604.

[16] G.-B. Huang, Q.-Y. Zhu and C.-K. Siew, Extreme learning machine: theory and applications, Neurocomputing 70 (2006), 489–501. doi:10.1016/j.neucom.2005.12.126.

[17] T. Y. Kwok and D. Y. Yeung, Objective functions for training new hidden units in constructive neural networks, IEEE Trans. Neural Networks 8 (1997), 1131–1148. doi:10.1109/72.623214.

[18] Y. Lan, Y. C. Soh and G.-B. Huang, Constructive hidden nodes selection of extreme learning machine for regression, Neurocomputing 73 (2010), 3191–3199. doi:10.1016/j.neucom.2010.05.022.

[19] S. Liao and C. Feng, Meta-ELM: ELM with ELM hidden nodes, Neurocomputing 128 (2014), 81–87. doi:10.1016/j.neucom.2013.01.060.

[20] O. Mangasarian, Exact 1-norm support vector machines via unconstrained convex differentiable minimization, J. Mach. Learn. Res. 7 (2006), 1517–1530.

[21] Y. Miche, A. Sorjamaa and A. Lendasse, OP-ELM: theory, experiments and a toolbox, Lect. Notes Comput. Sci. 5163 (2008), 145–154. doi:10.1007/978-3-540-87536-9_16.

[22] H. Núñez, C. Angulo and A. Català, Rule extraction from support vector machines, in: Proceedings of the European Symposium on Artificial Neural Networks (ESANN), pp. 107–112, 2002.

[23] O. F. Alcin, A. Sengur, J. Qian and M. C. Ince, OMP-ELM: orthogonal matching pursuit-based extreme learning machine for regression, J. Intell. Syst. 24 (2015), 135–143. doi:10.1515/jisys-2014-0095.

[24] C. M. Rocco and J. A. Moreno, Fast Monte Carlo reliability evaluation using support vector machine, Reliab. Eng. Syst. Saf. 76 (2002), 237–243. doi:10.1016/S0951-8320(02)00015-7.

[25] H.-J. Rong, Y.-X. Jia and G.-S. Zhao, Aircraft recognition using modular extreme learning machine, Neurocomputing 128 (2014), 166–174. doi:10.1016/j.neucom.2012.12.064.

[26] H.-J. Rong, Y.-S. Ong, A.-H. Tan and Z. Zhu, A fast pruned-extreme learning machine for classification problem, Neurocomputing 72 (2008), 359–366. doi:10.1016/j.neucom.2008.01.005.

[27] K. G. Sheela and S. N. Deepa, Review on methods to fix number of hidden neurons in neural networks, Math. Probl. Eng. 2013 (2013), 1–11. doi:10.1155/2013/425740.

[28] StatLib – datasets archive. http://lib.stat.cmu.edu/datasets/.

[29] N. Wang, A generalized ellipsoidal basis function based online self-constructing fuzzy neural network, Neural Process. Lett. 34 (2011), 13–37. doi:10.1007/s11063-011-9181-1.

[30] N. Wang, M. J. Er and M. Han, Parsimonious extreme learning machine using recursive orthogonal least squares, IEEE Trans. Neural Netw. Learn. Syst. 25 (2014), 1–14. doi:10.1109/TNNLS.2013.2296048.

[31] N. Wang, M. J. Er and X. Y. Meng, A fast and accurate online self-organizing scheme for parsimonious fuzzy neural networks, Neurocomputing 72 (2009), 3818–3829. doi:10.1016/j.neucom.2009.05.006.

[32] N. Wang, M. J. Er, X.-Y. Meng and X. Li, An online self-organizing scheme for parsimonious and accurate fuzzy neural networks, Int. J. Neural Syst. 20 (2010), 389–403. doi:10.1142/S0129065710002486.

[33] L. Zhang and W. Zhou, On the sparseness of 1-norm support vector machines, Neural Networks 23 (2010), 373–385. doi:10.1016/j.neunet.2009.11.012.

[34] W. Zong and G.-B. Huang, Face recognition based on extreme learning machine, Neurocomputing 74 (2011), 2541–2551. doi:10.1007/978-3-642-21538-4_26.

Received: 2015-3-21
Published Online: 2015-9-29
Published in Print: 2016-10-1

©2016 Walter de Gruyter GmbH, Berlin/Boston

This article is distributed under the terms of the Creative Commons Attribution Non-Commercial License, which permits unrestricted non-commercial use, distribution, and reproduction in any medium, provided the original work is properly cited.
