Elsevier

Neurocomputing

Volume 167, 1 November 2015, Pages 671-687

An accelerating scheme for destructive parsimonious extreme learning machine

https://doi.org/10.1016/j.neucom.2015.04.002

Abstract

Constructive and destructive parsimonious extreme learning machines (CP-ELM and DP-ELM) were recently proposed to sparsify ELM. In comparison with CP-ELM, DP-ELM has the advantage in the number of hidden nodes, but it loses the edge with respect to training time. Hence, in this paper an equivalent measure is proposed to accelerate DP-ELM (ADP-ELM). As a result, ADP-ELM not only keeps the same number of hidden nodes as DP-ELM but also needs less training time than CP-ELM, which is especially important in training-time-sensitive scenarios. The same idea is extended to the regularized ELM (RELM), yielding ADP-RELM. ADP-RELM further accelerates the training process of DP-RELM, and it outperforms CP-RELM in terms of both the number of hidden nodes and the training time. In addition, the computational complexity of the proposed accelerating scheme is analyzed theoretically. The reported results on ten benchmark data sets confirm the effectiveness and usefulness of the proposed accelerating scheme.

Introduction

The widespread popularity of single-hidden layer feedforward networks (SLFNs) in a wide range of fields is mainly due to their power to approximate complex nonlinear mappings and their simple structure. As a specific type of SLFN, the extreme learning machine (ELM) [1], [2], [3] has recently drawn considerable interest from researchers and engineers. Generally, training an ELM consists of two main stages [4]: (1) random feature mapping and (2) linear parameter solving. In the first stage, ELM randomly initializes the hidden layer to map the input samples into a so-called ELM feature space with some nonlinear mapping functions, which can be any nonlinear piecewise continuous functions, such as the sigmoid and the RBF. Since the hidden node parameters in ELM are randomly generated according to any continuous probability distribution (independent of the training samples) rather than being explicitly trained or tuned, ELM enjoys a remarkable computational advantage over regular gradient-descent backpropagation [5], [6]. That is, unlike conventional learning methods, which must see the training samples before generating the hidden node parameters, ELM can generate the hidden node parameters before seeing the training samples. In the second stage, the linear parameters are obtained from the Moore-Penrose generalized inverse of the hidden layer output matrix, which achieves both the smallest training error and the smallest norm of output weights [7]. Hence, unlike most algorithms proposed for feedforward neural networks, which did not consider generalization performance when they were first proposed, ELM can achieve better generalization performance.
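To make the two training stages concrete, here is a minimal numpy sketch of a sigmoid ELM; the function names and shapes are illustrative assumptions, not taken from the paper. The hidden layer parameters are drawn at random, and the output weights are then obtained with the Moore-Penrose pseudoinverse.

```python
import numpy as np

def elm_train(X, T, L, seed=0):
    """Minimal ELM fit: X is N x n inputs, T is N x m targets, L is the number of hidden nodes."""
    rng = np.random.default_rng(seed)
    n = X.shape[1]
    A = rng.standard_normal((n, L))          # stage 1: random input weights
    b = rng.standard_normal(L)               # stage 1: random hidden biases
    H = 1.0 / (1.0 + np.exp(-(X @ A + b)))   # hidden layer output matrix (N x L)
    beta = np.linalg.pinv(H) @ T             # stage 2: Moore-Penrose solution for output weights
    return A, b, beta

def elm_predict(X, A, b, beta):
    H = 1.0 / (1.0 + np.exp(-(X @ A + b)))
    return H @ beta
```

Everything after drawing the random hidden layer is a single linear least-squares solve, which is the source of ELM's training speed.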

From the interpolation perspective, Huang et al. [3] showed that for any given training set there exists, with probability one, an ELM network that gives a sufficiently small training error in the squared error sense, with the number of hidden nodes no larger than the number of distinct training samples. If the number of hidden nodes equals the number of distinct training samples, the training error decreases to zero with probability one. In practice, implementations of ELM show that its generalization performance is not sensitive to the number of hidden nodes, and good performance can be reached as long as the number of hidden nodes is large enough [8]. In addition, unlike traditional SLFNs, such as radial basis function neural networks and multilayer perceptrons with one hidden layer, where the activation functions are required to be continuous or differentiable, ELM can choose the threshold function and many others as activation functions without sacrificing the universal approximation capability. In statistical learning theory, Vapnik–Chervonenkis (VC) dimension theory is one of the most widely used frameworks for generalization bound analysis. According to the structural risk minimization principle, to obtain better generalization performance on the testing set, an algorithm should not only achieve a low training error on the training set but also have a low VC dimension. For classification tasks, Liu et al. [9] proved that the VC dimension of an ELM equals its number of hidden nodes with probability one, which implies that ELM has a relatively low VC dimension. For regression problems, the generalization ability of ELM has been comprehensively studied in [10] and [11], leading to the conclusion that ELM with suitable activation functions, such as polynomials, the sigmoid and the Nadaraya–Watson function, can achieve the same optimal generalization bound as an SLFN in which all parameters are tunable. From the above analyses, it is known that ELM has excellent universal approximation capability and generalization performance. However, because ELM generates hidden nodes randomly, it usually requires more hidden nodes than a traditional SLFN to reach matched performance. A large network size always means more running time in the testing phase. In cost-sensitive learning, the testing time should be minimized, which requires a compact network to meet the testing time budget. Thus, the topic of improving the compactness of ELM has recently attracted great interest.

First, Huang et al. [12] proposed an incremental ELM (I-ELM), which randomly generates the hidden nodes and analytically calculates the output weights. Because I-ELM does not recalculate the output weights of the existing nodes when a new node is added, its convergence rate can be further improved by recalculating the output weights of the existing nodes with a convex optimization method whenever a new hidden node is randomly added [13]. In I-ELM, there may be some hidden nodes that play only a minor role in the network output and eventually increase the network complexity. To avoid this problem and obtain a more compact network, an enhanced version of I-ELM was presented [14], where in each learning step several hidden nodes are randomly generated and, among them, the hidden node leading to the largest decrease in residual error is added to the existing network. In [15], an error-minimized ELM was proposed, which can add random hidden nodes one by one or group by group; during the growth of the network, the output weights are updated incrementally. In [16], an optimally pruned extreme learning machine was presented, which is based on the original ELM with additional steps to make it more robust and compact. Deng et al. [17] adopted a two-stage stepwise strategy to improve the compactness of ELM. At the first stage, the selection procedure is automatically terminated based on the leave-one-out error. At the second stage, the contribution of each hidden node is reviewed, and insignificant ones are replaced; this procedure does not terminate until no insignificant hidden nodes exist in the final model. Applying a Bayesian approach to ELM yields the Bayesian ELM [18], which can not only optimize the network architecture but also exploit a priori knowledge and provide confidence intervals. Subsequently, a new learning algorithm called bidirectional ELM was presented [19], which can greatly enhance the learning effectiveness, reduce the number of hidden nodes, and eventually further increase the learning speed. In addition, the traditional ELM was extended by Zhang et al. [20] to use Lebesgue p-integrable hidden activation functions, which can approximate any Lebesgue p-integrable function on a compact input set. Further, a dynamic ELM was proposed [21], in which hidden nodes can be recruited or deleted dynamically according to their significance to network performance, so that not only the parameters are adjusted but also the architecture is self-adapted simultaneously. Castano et al. [22] proposed a robust and pruned ELM approach, where principal component analysis is utilized to select the hidden nodes from the input features while the corresponding input weights are deterministically defined as principal components rather than random ones. To handle large and complex sample problems, a stacked ELM was designed [23], which divides a single large ELM network into multiple serially connected small ELMs. In addition, to improve the performance of ELM on high-dimensional, small-sample problems, a projection vector machine [24] was proposed. For classification tasks [7], [25], [26], [27], [28], several algorithms were proposed to optimize the network size of ELM.

Recently, novel constructive and destructive parsimonious ELMs (CP-ELM and DP-ELM) were proposed based on recursive orthogonal least squares [29]. CP-ELM starts with a small initial network and gradually adds new hidden nodes until a satisfactory solution is found. In contrast, DP-ELM starts by training a larger than necessary network and then removes insignificant hidden nodes one by one. Further, Zhao et al. [30] extended CP-ELM and DP-ELM to the regularized ELM (RELM), correspondingly obtaining CP-RELM and DP-RELM. Generally speaking, compared with CP-ELM, DP-ELM can obtain a more compact network while reaching nearly the same accuracy, but it loses the edge in training time. The main reason is that each time DP-ELM removes an insignificant hidden node, a series of Givens rotations is carried out to compute the additional residual error reduction, and the computational burden of Givens rotations is high. Hence, a scheme is proposed to replace the Givens rotations and thereby accelerate DP-ELM (ADP-ELM). ADP-ELM not only needs fewer hidden nodes than CP-ELM but also beats it in training time. When this accelerating scheme is extended to DP-RELM, ADP-RELM is obtained. ADP-RELM further reduces the training time of DP-RELM and outperforms CP-RELM in terms of both the number of hidden nodes and the training time. In theory, the proposed accelerating scheme achieves the same effect as the Givens rotations but requires less training time when removing insignificant hidden nodes. Finally, extensive experiments are conducted on benchmark data sets, and the results verify the effectiveness of the proposed scheme.
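For orientation only, the sketch below outlines the generic greedy backward-elimination loop that destructive methods follow. The `residual` scoring here naively re-solves a least-squares problem per candidate; DP-ELM obtains the same quantities through Givens rotations on the triangular factor, and ADP-ELM through the cheaper equivalent measure of Section 4, neither of which is reproduced here.

```python
import numpy as np

def residual(Hs, T):
    """Squared Frobenius norm of the least-squares residual for the submatrix Hs."""
    beta = np.linalg.lstsq(Hs, T, rcond=None)[0]
    return float(np.linalg.norm(Hs @ beta - T) ** 2)

def destructive_prune(H, T, n_keep):
    """Greedy backward elimination of hidden nodes (generic outline, not the paper's
    recursive-orthogonal-least-squares implementation)."""
    keep = list(range(H.shape[1]))
    while len(keep) > n_keep:
        # score each remaining node by the residual left after its removal
        scores = [residual(H[:, [j for j in keep if j != s]], T) for s in keep]
        keep.remove(keep[int(np.argmin(scores))])   # drop the least significant node
    beta = np.linalg.lstsq(H[:, keep], T, rcond=None)[0]
    return keep, beta
```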

The remainder of this paper is organized as follows. In Section 2, ELM and RELM are briefly introduced. As two destructive algorithms, DP-ELM and DP-RELM are described in Section 3. In Section 4, an equivalent measure is proposed in place of Givens rotations to accelerate DP-ELM and DP-RELM, yielding ADP-ELM and ADP-RELM respectively, and their computational complexity is analyzed. Experimental results on ten benchmark data sets are presented in Section 5. Finally, conclusions follow.

Section snippets

ELM

For N arbitrary distinct samples $\{(x_i, t_i)\}_{i=1}^{N}$, where $x_i \in \mathbb{R}^n$ and $t_i \in \mathbb{R}^m$, the traditional SLFN with L hidden nodes and activation function $h(x)$ can be mathematically modeled as
$$y = \sum_{i=1}^{L} \beta_i h(x; a_i, b_i)$$
where $a_i \in \mathbb{R}^n$ is the input weight vector connecting the ith hidden node and the input nodes, $b_i$ is the bias of the ith hidden node, and $\beta_i \in \mathbb{R}^m$ is the output weight vector connecting the ith hidden node and the output nodes.

If the outputs of the SLFN equal the targets, the following compact formulation is obtained:
$$H\beta = T$$
where $H \in \mathbb{R}^{N\times L}$ is the hidden layer output matrix, $\beta \in \mathbb{R}^{L\times m}$ is the output weight matrix, and $T \in \mathbb{R}^{N\times m}$ is the target matrix.

Preliminary work

It has been proved in [3] that an ELM with at most N hidden nodes and with almost any nonlinear activation function can exactly learn N distinct samples. In fact, in real-world applications, the number of hidden nodes is always less than the number of training samples, i.e., L < N, and the matrix H has full column rank with probability one. Hence, the following theorem is obtained.

Theorem 1

The minimizer of Eq. (5) is identical to the minimizer of the following problem:

$$\min_{\beta}\left\{ G_{\mathrm{ELM}} = \left\| R_{L\times L}\,\beta - \hat{T}_{L\times m} \right\|_F^2 \right\}$$
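The reduction in Theorem 1 is the standard orthogonal least-squares step: factoring the hidden layer output matrix as $H = QR$ with orthonormal $Q$ turns $\min_\beta \|H\beta - T\|_F^2$ into $\min_\beta \|R\beta - \hat{T}\|_F^2$ plus a constant, where $\hat{T} = Q^\top T$. A small numpy check of this equivalence, assuming a full-column-rank H as in the text:

```python
import numpy as np

rng = np.random.default_rng(1)
N, L, m = 200, 10, 2
H = rng.standard_normal((N, L))      # stands in for the hidden layer output matrix
T = rng.standard_normal((N, m))      # stands in for the targets

Q, R = np.linalg.qr(H)               # economy-size QR: Q is N x L, R is L x L
T_hat = Q.T @ T                      # reduced L x m target matrix

beta_full = np.linalg.lstsq(H, T, rcond=None)[0]   # minimizer of ||H beta - T||_F^2
beta_reduced = np.linalg.solve(R, T_hat)           # minimizer of ||R beta - T_hat||_F^2
print(np.allclose(beta_full, beta_reduced))        # True: the two problems share a minimizer
```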

An equivalent measure

Assume that at the (i−1)th iteration of DP-ELM or DP-RELM, $R^{(i-1)}$ and $T^{(i-1)}$ have been obtained, and let
$$\hat{G}^{(i-1)} = \min_{\beta^{(i-1)}}\left\{ G^{(i-1)} = \left\| R^{(i-1)}\beta^{(i-1)} - T^{(i-1)} \right\|_F^2 \right\} \quad (24)$$
Evidently, $\hat{G}^{(i-1)} = 0$. When the regressor $r_s^{(i-1)}$ is removed at the ith iteration, Eq. (24) becomes
$$\hat{G}_s^{(i)} = \min_{\beta^{(i)}}\left\{ G_s^{(i)} = \left\| R_s^{(i)}\beta^{(i)} - T^{(i-1)} \right\|_F^2 \right\}$$
Then, define
$$\Delta_s^{(i)} = \hat{G}_s^{(i)} - \hat{G}^{(i-1)} = \hat{G}_s^{(i)}$$
where $\Delta_s^{(i)}$ represents the increase in the objective function in (24) due to the removal of the regressor $r_s^{(i-1)}$. Therefore, the following theorem is obtained.

Theorem 3

The
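To illustrate what $\Delta_s^{(i)}$ measures (only the criterion itself; the paper's contribution is evaluating it without Givens rotations, which is not reproduced here), one can compute, for each remaining column of R, the residual that results when that regressor is deleted:

```python
import numpy as np

def removal_increases(R, T):
    """Delta_s for each regressor: the value of min ||R_s beta - T||_F^2 after deleting
    column s, which equals the objective increase since the full system has zero residual."""
    deltas = []
    for s in range(R.shape[1]):
        R_s = np.delete(R, s, axis=1)                    # drop the s-th regressor
        beta_s = np.linalg.lstsq(R_s, T, rcond=None)[0]  # re-solve the reduced problem
        deltas.append(float(np.linalg.norm(R_s @ beta_s - T) ** 2))
    return deltas

# the least significant node is the one whose removal increases the objective least:
# s_star = int(np.argmin(removal_increases(R, T)))
```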

Experiments

In this section, we experimentally study the usefulness of the proposed scheme in accelerating the training processes of DP-ELM and DP-RELM. The experimental environment consists of a Windows 7 32-bit operating system, an Intel® Core™ i3-2310M CPU @ 2.10 GHz, 2.00 GB RAM, and the Matlab 2013b platform. Ten benchmark data sets (three multi-output data sets plus seven single-output ones) are utilized in the experiments, including Energy efficiency, Sml2010, Parkinsons, Airfoil, Abalone, Winequality white, CCPP,

Conclusions

Single-hidden layer feedforward networks have recently drawn much attention due to their excellent performance and simple structure. Unlike traditional SLFNs, where the input weights and the biases of the hidden layer need to be tuned, ELM randomly generates these parameters independently of the training samples, and its output weights can be analytically determined. In this situation, ELM may generate redundant hidden nodes. Hence, it is necessary to improve the compactness of the ELM network. CP-ELM

Acknowledgment

This research was partially supported by the National Natural Science Foundation of China under Grant no. 51006052 and the NUST Outstanding Scholar Supporting Program. Moreover, the authors wish to thank the anonymous reviewers for their constructive comments and great help in the writing process, which improved the manuscript significantly.


References (33)

  • Y.-P. Zhao et al., Parsimonious regularized extreme learning machine based on orthogonal transformation, Neurocomputing (2015)
  • G.B. Huang, Q.Y. Zhu, C.K. Siew, Extreme learning machine: a new learning scheme of feedforward neural networks, in:...
  • G.-B. Huang, C.-K. Siew, Extreme learning machine: RBF network case, in: Proceedings of 8th International Conference on...
  • D.E. Rumelhart et al., Learning representations by back-propagating errors, Nature (1986)
  • Y.A. LeCun et al., Efficient backprop, Lect. Notes Comput. Sci. (2012)
  • G.-B. Huang et al., Extreme learning machine for regression and multiclass classification, IEEE Trans. Syst. Man Cybern. Part B: Cybern. (2012)
Yong-Ping Zhao received his B.S. degree in thermal energy and power engineering from Nanjing University of Aeronautics and Astronautics, Nanjing, China, in July 2004. He then worked toward the M.S. and Ph.D. degrees in kernel methods at Nanjing University of Aeronautics and Astronautics, receiving the Ph.D. degree in December 2009. Currently, he is an associate professor with the School of Mechanical Engineering, Nanjing University of Science & Technology. His research interests include machine learning and kernel methods.

Bing Li was born in 1990. He received the B.S. degree from Xinxiang University, China, in 2014. He is currently pursuing the M.Eng. degree at Nanjing University of Science and Technology, China. His research interests include machine learning, pattern recognition, etc.

Ye-Bo Li studied at Shenyang Aerospace University, China, from September 2004 to June 2008, majoring in aircraft power engineering, and received his B.S. degree. From September 2008 to October 2014, he studied at Nanjing University of Aeronautics and Astronautics, China, majoring in aerospace propulsion theory and engineering, and received his M.S. and Ph.D. degrees. He is now an engineer at the AVIC Aeroengine Control Research Institute of China. His research interests include modelling, control and fault diagnosis of aircraft engines.
