
Neurocomputing

Volume 156, 25 May 2015, Pages 280-296

Parsimonious regularized extreme learning machine based on orthogonal transformation

https://doi.org/10.1016/j.neucom.2014.12.046

Abstract

Recently, two parsimonious algorithms were proposed to sparsify the extreme learning machine (ELM): the constructive parsimonious ELM (CP-ELM) and the destructive parsimonious ELM (DP-ELM). In this paper, the ideas behind CP-ELM and DP-ELM are extended to the regularized ELM (RELM), yielding CP-RELM and DP-RELM. Each of CP-RELM and DP-RELM can be realized by two schemes, namely CP-RELM-I and CP-RELM-II (DP-RELM-I and DP-RELM-II). Generally speaking, CP-RELM-II (DP-RELM-II) outperforms CP-RELM-I (DP-RELM-I) in terms of parsimoniousness. Under nearly the same generalization performance, CP-RELM-II (DP-RELM-II) usually needs fewer hidden nodes than CP-ELM (DP-ELM). In addition, unlike CP-ELM and DP-ELM, CP-RELM and DP-RELM allow the number of candidate hidden nodes to exceed the number of training samples, which helps select better hidden nodes and construct more compact networks. Finally, experiments on eleven benchmark data sets, divided into two groups, demonstrate the usefulness of the proposed algorithms.

Introduction

As a new learning algorithm for single-hidden-layer feed-forward neural networks (SLFNs), the extreme learning machine (ELM) [1], [2], [3] has recently become increasingly popular in a wide range of applications such as function approximation [4], [5], classification [5], [6], and density estimation [7]. Its popularity stems mainly from its low computational cost, good generalization ability, and ease of implementation. Traditionally, all the parameters of feed-forward networks are tuned with gradient descent-based methods, which are generally very slow owing to improper learning steps or easily become stuck in local minima. In contrast, in ELM the input weights and hidden layer biases are chosen randomly, and the output weights are determined analytically via a generalized inverse of the hidden layer output matrix. Hence, the learning speed of ELM is thousands of times faster than that of traditional feed-forward network learning algorithms such as back-propagation [8]. Moreover, ELM aims not only at the smallest training error but also at the smallest norm of the output weights. According to Bartlett's theory [9], the smaller the norm of the weights, the better the generalization performance the network tends to have, so ELM tends to generalize well. Another well-known machine learning algorithm, the support vector machine (SVM) [10], [11], has been used extensively in classification and other fields over the last two decades because of its outstanding learning performance. In [12], ELM and SVM are compared from two viewpoints: the Vapnik–Chervonenkis (VC) dimension, and their performance under different training sample sizes. It is shown that the VC dimension of an ELM equals its number of hidden nodes with probability one, and the generalization ability and computational complexity of the two methods are examined as the training sample size varies. ELMs generalize more weakly than SVMs on small samples but as well as SVMs on large samples, while offering a clear advantage in computational speed, especially for large-scale problems.
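
To make the training procedure just described concrete, here is a minimal Python sketch (our own illustration, not code from the paper): random input weights and biases are drawn, the hidden-layer output matrix H is formed with a sigmoid activation, and the output weights are computed with the Moore–Penrose pseudoinverse. All names and the toy data are assumptions for illustration only.

```python
import numpy as np

def elm_train(X, d, L, seed=0):
    """Minimal ELM sketch: X is (N, n) inputs, d is (N,) targets, L hidden nodes."""
    rng = np.random.default_rng(seed)
    n = X.shape[1]
    A = rng.uniform(-1.0, 1.0, size=(n, L))   # random input weights
    b = rng.uniform(-1.0, 1.0, size=L)        # random hidden-layer biases
    H = 1.0 / (1.0 + np.exp(-(X @ A + b)))    # sigmoid hidden-layer output matrix, shape (N, L)
    beta = np.linalg.pinv(H) @ d              # output weights via Moore-Penrose pseudoinverse
    return A, b, beta

def elm_predict(X, A, b, beta):
    H = 1.0 / (1.0 + np.exp(-(X @ A + b)))
    return H @ beta

# toy usage: fit a noisy sine curve with 30 random hidden nodes
X = np.linspace(-1, 1, 200).reshape(-1, 1)
d = np.sin(3 * X[:, 0]) + 0.05 * np.random.default_rng(1).standard_normal(200)
A, b, beta = elm_train(X, d, L=30)
print("train MSE:", np.mean((elm_predict(X, A, b, beta) - d) ** 2))
```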

In theory, an ELM with N hidden nodes and randomly chosen input weights and hidden layer biases can exactly learn N distinct samples [13]. In practice, investigations show that N hidden nodes are usually not needed to construct a good ELM. On one hand, randomly chosen input weights and biases may generate redundant or unimportant hidden nodes, so the hidden layer output matrix may not have full column rank. In this case ELM is ill-posed and often suffers from deteriorated generalization performance. On the other hand, according to Occam's razor, "plurality should not be posited without necessity" [14], a practical nonlinear modeling principle is to find the smallest model that generalizes well. Sparse models are preferable in engineering applications since a model's computational cost scales with its complexity, and a sparse model is easier to interpret from the viewpoint of knowledge extraction [15]. Hence, it is necessary and important to find, among the randomly generated hidden nodes, a sparse model that generalizes well while needing the fewest hidden nodes. Additionally, many learning algorithms such as online sequential learning [16], [17], [18] also need a compact model before they start, since they were developed on the basis of an ELM with an a priori fixed network size.

Generally speaking, two approaches are used to sparsify ELM. The first comprises constructive algorithms, which begin with a small initial network (even a null model) and gradually recruit hidden nodes until a satisfactory solution is found. The incremental ELM [3] and its two variants [19], [20] are typical constructive algorithms. Subsequently, an error-minimized extreme learning machine [21] was proposed, which can add random hidden nodes one by one or group by group and automatically determine the number of hidden nodes. Based on multi-response sparse regression and the unbiased risk-estimation-based criterion Cp, two constructive algorithms [22], [23] for ELM were developed to cope with single- and multi-output regression problems, respectively. Recently, a constructive parsimonious extreme learning machine (CP-ELM) [24] was presented, which first transforms the hidden node output matrix equivalently into an upper triangular matrix using Givens rotations, and then recruits the hidden nodes that yield the largest additional reductions of the residual error to construct the final network; a simplified sketch of this greedy recruitment is given below. By contrast, the second group, destructive algorithms (also known as pruning algorithms), starts by training a larger-than-necessary network and then removes redundant or unimportant hidden nodes. For classification problems, a pruned ELM [25] was proposed to obtain a compact network classifier, in which irrelevant nodes are pruned from an initially large set of hidden nodes according to their relevance to the class labels. In [26], an optimally pruned ELM (OP-ELM) methodology was presented, which first builds the original ELM, then ranks the hidden nodes by multi-response sparse regression and selects them through leave-one-out validation. In parallel with CP-ELM, a destructive parsimonious ELM (DP-ELM) was developed in [24].
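
As a rough illustration of the constructive idea (not the actual CP-ELM procedure, which uses Givens rotations for efficiency), the following sketch greedily recruits, at each step, the candidate hidden node whose inclusion yields the largest additional reduction of the residual error; all names are our own.

```python
import numpy as np

def greedy_forward_selection(H, d, k):
    """Naive sketch of constructive selection: pick k columns of the candidate
    hidden-output matrix H, each time adding the column whose inclusion most
    reduces the least-squares residual ||H[:, S] beta - d||^2."""
    N, L = H.shape
    selected = []
    for _ in range(k):
        best_j, best_res = None, np.inf
        for j in range(L):
            if j in selected:
                continue
            cols = selected + [j]
            beta, *_ = np.linalg.lstsq(H[:, cols], d, rcond=None)
            res = np.sum((H[:, cols] @ beta - d) ** 2)
            if res < best_res:
                best_res, best_j = res, j
        selected.append(best_j)
    return selected
```

CP-ELM obtains the same kind of ranking far more cheaply by maintaining an upper triangular factor with Givens rotations instead of refitting the least-squares problem at every step.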

To improve on the original ELM, ELM was regularized using Tikhonov's method [27], yielding the regularized extreme learning machine (RELM) [28]. A sequential learning algorithm for RELM [29] was developed as well, and experimental studies [30], [31], [32], [33] demonstrated that RELM can overcome the ill-posed problem and control the learning-machine complexity to enhance generalization performance. Like the original ELM, however, the solution of RELM is dense. In contrast to the many efforts mentioned above on sparsifying the solution of ELM, much less attention has so far been paid to RELM. An improvement of OP-ELM was proposed to determine a compact network for RELM; it uses a cascade of two regularization penalties: first an L1 penalty that ranks the neurons of the hidden layer using least angle regression [34], followed by a Tikhonov (L2) penalty on the regression weights for numerical stability and efficient pruning of the neurons, hence the name TROP-ELM [35]. To obtain a low-complexity, sparse solution, a fast sparse approximation scheme for RELM (FSA-RELM) [36] was recently introduced; it begins with a null solution and gradually selects new hidden nodes according to certain criteria, repeating this procedure until a stopping criterion is met. FSA-RELM is thus a typical constructive algorithm.
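
In the common formulation of RELM, the output weights minimize $\|H\beta-d\|_2^2+\lambda\|\beta\|_2^2$ and hence satisfy $(H^{T}H+\lambda I)\beta=H^{T}d$. The following is a minimal sketch of this solve (the value of $\lambda$ and the solver choice are our own assumptions, not prescribed by the paper).

```python
import numpy as np

def relm_output_weights(H, d, lam=1e-3):
    """RELM-style output weights: solve (H^T H + lam * I) beta = H^T d."""
    L = H.shape[1]
    return np.linalg.solve(H.T @ H + lam * np.eye(L), H.T @ d)
```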

Hence, in this paper we study the sparsification of RELM by extending CP-ELM and DP-ELM to it. The main contributions are as follows:

  • (1) Inspired by the ideas of CP-ELM and DP-ELM, we extend them to RELM, yielding CP-RELM and DP-RELM, respectively. Under nearly the same prediction accuracy, CP-RELM and DP-RELM require fewer hidden nodes than CP-ELM and DP-ELM, respectively.

  • (2) During the sparsification of RELM, two schemes are adopted. Although the two schemes are completely equivalent from a theoretical viewpoint, their empirical behaviors differ to some extent.

  • (3) In CP-ELM and DP-ELM, the number of candidate hidden nodes must be kept small because the candidate hidden node output matrix is transformed into an upper triangular matrix by an orthogonal decomposition; with too many candidates, that matrix becomes singular or nearly singular. Thanks to Tikhonov regularization, CP-RELM and DP-RELM can be initialized with more candidate hidden nodes (even more than the number of training samples), so that better hidden nodes can be recruited to improve generalization and produce a more compact network, as the sketch following this list illustrates.

The rest of the paper is organized as follows. Section 2 introduces ELM and RELM, and Section 3 describes the ideas behind CP-ELM and DP-ELM. In Section 4, we extend these ideas to sparsify RELM with two schemes, yielding CP-RELM and DP-RELM. To confirm the effectiveness and feasibility of the proposed algorithms, experiments on eleven benchmark data sets, divided into two groups, are reported in Section 5. Finally, conclusions are drawn.

Section snippets

ELM and RELM

In this section, we first briefly describe the essence of ELM and then introduce RELM. As for the theoretical foundation of ELM, Huang and Babri [37] studied the learning performance of SLFNs on small data sets and showed that an SLFN with at most N hidden neurons can learn N distinct samples with zero error using any bounded nonlinear activation function. On the basis of this result, Huang et al. [1], [2], [3] proposed the ELM algorithm. The main concept behind ELM lies in the …

CP-ELM and DP-ELM

From the preliminary work in [2], it is known that the original ELM, with at most N hidden nodes, can approximate N distinct samples to any accuracy. In real-world applications, however, the number of hidden nodes L is usually smaller than the number of training samples N, i.e., L<N, and H has full column rank. Therefore, the following theorem is obtained:

Theorem 1

The minimizer of Eq. (4) is equivalent to the following minimizer:
$$\min_{\beta}\left\{\left\|R_{L\times L}\,\beta-\hat{d}_{L\times 1}\right\|_{2}^{2}\right\}$$
where
$$H=Q\begin{bmatrix}R_{L\times L}\\ 0_{(N-L)\times L}\end{bmatrix},\qquad Q^{T}d=\begin{bmatrix}\hat{d}_{L\times 1}\\ \check{d}_{(N-L)\times 1}\end{bmatrix}$$
with $\check{d}_{(N-L)\times 1}$ collecting the remaining $(N-L)$ entries of $Q^{T}d$.
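
Theorem 1 can be checked numerically with the thin QR factorization: writing $H=QR$ and $\hat d=Q^{T}d$, solving the $L\times L$ triangular system $R\beta=\hat d$ gives the same solution as the original least-squares problem. The sketch below is our own verification under the theorem's assumptions ($L<N$, full column rank); the sizes are arbitrary.

```python
import numpy as np

rng = np.random.default_rng(0)
N, L = 100, 10
H = rng.standard_normal((N, L))        # full-column-rank hidden output matrix, L < N
d = rng.standard_normal(N)

# least squares on the original problem: min ||H beta - d||^2
beta_ls, *_ = np.linalg.lstsq(H, d, rcond=None)

# equivalent reduced problem from the orthogonal transformation
Q, R = np.linalg.qr(H)                 # thin QR: Q is (N, L), R is (L, L) upper triangular
d_hat = Q.T @ d                        # corresponds to the first L rotated target entries
beta_qr = np.linalg.solve(R, d_hat)    # R beta = d_hat

print(np.allclose(beta_ls, beta_qr))   # True
```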

CP-RELM and DP-RELM

In order to apply the ideas behind CP-ELM and DP-ELM to sparsify RELM, some preliminary transformations of Eq. (7) are needed. Hence, the following theorems are introduced first.

Theorem 2

The minimizer of Eq. (7) is completely equivalent to the following minimizer:
$$\min_{\beta}\left\{\dot{J}=\left\|\dot{H}\beta-\dot{d}\right\|_{2}^{2}\right\}$$
where $\dot{H}=H^{T}H+\lambda I$ and $\dot{d}=H^{T}d$.

Proof

The optimal solution of Eq. (7) is found by solving
$$(H^{T}H+\lambda I)\beta=H^{T}d$$
Then, the solution of Eq. (32) in the least-squares sense can be obtained from the following optimization problem:
$$\min_{\beta}\left\{\left\|(H^{T}H+\lambda I)\beta-H^{T}d\right\|_{2}^{2}=\left\|\dot{H}\beta-\dot{d}\right\|_{2}^{2}\right\}$$
This …
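
The equivalence stated in Theorem 2 can be verified numerically: since $\dot H=H^{T}H+\lambda I$ is symmetric positive definite, the least-squares minimizer of $\|\dot H\beta-\dot d\|_2^2$ coincides with the solution of the RELM normal equations. A small check of our own (arbitrary sizes and $\lambda$, with $L$ allowed to exceed $N$):

```python
import numpy as np

rng = np.random.default_rng(0)
N, L, lam = 80, 120, 1e-2              # L may even exceed N here
H = rng.standard_normal((N, L))
d = rng.standard_normal(N)

H_dot = H.T @ H + lam * np.eye(L)      # \dot{H}
d_dot = H.T @ d                        # \dot{d}

beta_relm = np.linalg.solve(H_dot, d_dot)                 # RELM normal equations
beta_lsq, *_ = np.linalg.lstsq(H_dot, d_dot, rcond=None)  # least squares on the reduced problem

print(np.allclose(beta_relm, beta_lsq))  # True: the two minimizers coincide
```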

Experiments

To empirically show the usefulness of the proposed methods, regression experiments are performed on the 11 benchmark data sets specified in Table 1, which are divided into two groups. In our experiments, all inputs (attributes) are normalized into the range [−1, 1], while the outputs (targets) are normalized into [0, 1]. Two typical hidden node types are used as hidden-layer activation functions, namely the sigmoid $h(\mathbf{x})=1/(1+\exp\{\mathbf{x}^{T}\mathbf{a}_i\})$ and the RBF $h(\mathbf{x})=$ …
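
For reference, the preprocessing and the two hidden-node types mentioned above can be sketched as follows. This is our own illustration: the helper names are invented, and since the RBF definition is truncated in the snippet, the Gaussian form and its width parameter are assumptions.

```python
import numpy as np

def scale_to(x, lo, hi):
    """Column-wise linear rescaling, used to map inputs to [-1, 1] and targets to [0, 1]."""
    xmin, xmax = x.min(axis=0), x.max(axis=0)
    return lo + (hi - lo) * (x - xmin) / (xmax - xmin)

def sigmoid_hidden(X, A, b):
    """Sigmoid hidden nodes, h(x) = 1 / (1 + exp(-(x^T a_i + b_i))) (sign/bias convention assumed)."""
    return 1.0 / (1.0 + np.exp(-(X @ A + b)))

def rbf_hidden(X, C, gamma):
    """Gaussian RBF hidden nodes (assumed form): h(x) = exp(-gamma * ||x - c_i||^2)."""
    sq_dists = ((X[:, None, :] - C[None, :, :]) ** 2).sum(axis=2)
    return np.exp(-gamma * sq_dists)
```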

Conclusions

The extreme learning machine, as an emerging learning tool in the machine learning community, has attracted much attention in the past few years. Because of its simple formulation and high computational efficiency, ELM has rapidly found wide application in many fields. In theory, it has been shown that an ELM with N random hidden nodes can exactly learn N distinct samples. In practice, there are two potential risks in doing so. On one hand, N hidden nodes may make the network too complicated and thus …

Acknowledgments

This research was partially supported by the National Natural Science Foundation of China under Grant no. 51006052 and the NUST Outstanding Scholar Supporting Program. Moreover, the authors wish to thank the anonymous reviewers for their constructive comments and great help in the writing process, which improved the manuscript significantly.


References (44)

  • H.T. Huynh et al., Regularized online sequential learning algorithm for single-hidden layer feedforward neural networks, Pattern Recognit. Lett. (2011)
  • Y. Miche et al., TROP-ELM: a double-regularized ELM using LARS and Tikhonov regularization, Neurocomputing (2011)
  • X. Li et al., Fast sparse approximation of extreme learning machine, Neurocomputing (2014)
  • G.-B. Huang et al., Optimization method based extreme learning machine for classification, Neurocomputing (2010)
  • G.-B. Huang, Q.-Y. Zhu, C.-K. Siew, Extreme learning machine: a new learning scheme of feedforward neural networks, in: ...
  • G.-B. Huang et al., Universal approximation using incremental constructive feedforward networks with random hidden nodes, IEEE Trans. Neural Netw. (2006)
  • G. Feng et al., Evolutionary selection extreme learning machine optimization for regression, Soft Comput. (2012)
  • G.-B. Huang et al., Extreme learning machine for regression and multiclass classification, IEEE Trans. Syst. Man Cybern. Part B: Cybern. (2012)
  • M. Han, D. Li, A fast two stage density estimation based on extreme learning machine, in: Proceedings of the 2009 ...
  • D.E. Rumelhart et al., Learning representations by back-propagating errors, Nature (1986)
  • P.L. Bartlett, Sample complexity of pattern classification with neural networks: the size of the weights is more important than the size of the network, IEEE Trans. Inf. Theory (1998)
  • C. Cortes et al., Support-vector networks, Mach. Learn. (1995)

Yong-Ping Zhao received his B.S. degree in thermal energy and power engineering from Nanjing University of Aeronautics and Astronautics, Nanjing, China, in July 2004. He then worked toward the M.S. and Ph.D. degrees in kernel methods at Nanjing University of Aeronautics and Astronautics, receiving the Ph.D. degree in December 2009. Currently, he is an associate professor with the School of Mechanical Engineering, Nanjing University of Science & Technology. His research interests include machine learning and kernel methods.

Kang-Kang Wang was born in 1989. He received the B.Eng. degree from Heilongjiang University of Science and Technology, China, in 2012. He is currently pursuing the M.Eng. degree at Nanjing University of Science and Technology, China. His research interests include machine learning, pattern recognition, etc.

Ye-Bo Li received his B.S. degree in thermal energy and power engineering from Shenyang Aerospace University, Shenyang, China, in July 2008. From September 2008 to April 2011, he studied at Nanjing University of Aeronautics and Astronautics, China, majoring in Aerospace Propulsion Theory and Engineering, and received his M.S. degree. He is now pursuing the Ph.D. degree at Nanjing University of Aeronautics and Astronautics. His research interests include aero-engine control system design and the establishment of aero-engine mathematical models.
