Elsevier

Neurocomputing

Volume 167, 1 November 2015, Pages 671-687

An accelerating scheme for destructive parsimonious extreme learning machine

https://doi.org/10.1016/j.neucom.2015.04.002

Abstract

Constructive and destructive parsimonious extreme learning machines (CP-ELM and DP-ELM) were recently proposed to sparsify ELM. In comparison with CP-ELM, DP-ELM has the advantage in the number of hidden nodes, but it loses the edge with respect to training time. Hence, in this paper an equivalent measure is proposed to accelerate DP-ELM (ADP-ELM). As a result, ADP-ELM not only keeps the same number of hidden nodes as DP-ELM but also needs less training time than CP-ELM, which is especially important in training-time-sensitive scenarios. The same idea is extended to the regularized ELM (RELM), yielding ADP-RELM. ADP-RELM further accelerates the training process of DP-RELM, and it outperforms CP-RELM in terms of both the number of hidden nodes and the training time. In addition, the computational complexity of the proposed accelerating scheme is analyzed theoretically. The reported results on ten benchmark data sets confirm the effectiveness and usefulness of the proposed accelerating scheme.

Introduction

The widespread popularity of single-hidden layer feedforward networks (SLFNs) in a wide range of fields is mainly due to their power to approximate complex nonlinear mappings and their simple structure. As a specific type of SLFN, the extreme learning machine (ELM) [1], [2], [3] has recently drawn considerable interest from researchers and engineers. Generally, training an ELM consists of two main stages [4]: (1) random feature mapping and (2) linear parameter solving. In the first stage, ELM randomly initializes the hidden layer to map the input samples into a so-called ELM feature space with some nonlinear mapping functions, which can be any nonlinear piecewise continuous functions, such as the sigmoid and the RBF. Since the hidden node parameters in ELM are randomly generated according to any continuous probability distribution (independent of the training samples) rather than being explicitly trained or tuned, ELM enjoys a remarkable computational advantage over regular gradient-descent backpropagation [5], [6]. That is, unlike conventional learning methods, which must see the training samples before generating the hidden node parameters, ELM can generate the hidden node parameters before seeing the training samples. In the second stage, the linear parameters are obtained from the Moore-Penrose generalized inverse of the hidden layer output matrix, which achieves both the smallest training error and the smallest norm of output weights [7]. Hence, unlike most algorithms proposed for feedforward neural networks, which did not consider generalization performance when they were first proposed, ELM can achieve better generalization performance.
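To make the two training stages concrete, here is a minimal numpy sketch of a sigmoid ELM; the function names and shapes are illustrative assumptions, not taken from the paper. The hidden layer parameters are drawn at random, and the output weights are then obtained with the Moore-Penrose pseudoinverse.

```python
import numpy as np

def elm_train(X, T, L, seed=0):
    """Minimal ELM fit: X is N x n inputs, T is N x m targets, L is the number of hidden nodes."""
    rng = np.random.default_rng(seed)
    n = X.shape[1]
    A = rng.standard_normal((n, L))          # stage 1: random input weights
    b = rng.standard_normal(L)               # stage 1: random hidden biases
    H = 1.0 / (1.0 + np.exp(-(X @ A + b)))   # hidden layer output matrix (N x L)
    beta = np.linalg.pinv(H) @ T             # stage 2: Moore-Penrose solution for output weights
    return A, b, beta

def elm_predict(X, A, b, beta):
    H = 1.0 / (1.0 + np.exp(-(X @ A + b)))
    return H @ beta
```

Everything after drawing the random hidden layer is a single linear least-squares solve, which is the source of ELM's training speed.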

From the interpolation perspective, Huang et al. [3] showed that for any given training set there exists, with probability one, an ELM network that gives a sufficiently small training error in the squared error sense, with the number of hidden nodes no larger than the number of distinct training samples. If the number of hidden nodes equals the number of distinct training samples, the training error decreases to zero with probability one. In practice, implementations of ELM show that its generalization performance is not sensitive to the number of hidden nodes, and good performance can be reached as long as the number of hidden nodes is large enough [8]. In addition, unlike traditional SLFNs, such as radial basis function neural networks and multilayer perceptrons with one hidden layer, where the activation functions are required to be continuous or differentiable, ELM can choose the threshold function and many others as activation functions without sacrificing the universal approximation capability. In statistical learning theory, Vapnik–Chervonenkis (VC) dimension theory is one of the most widely used frameworks for generalization bound analysis. According to the structural risk minimization principle, to obtain better generalization performance on the testing set, an algorithm should not only achieve a low training error on the training set but also have a low VC dimension. For classification tasks, Liu et al. [9] proved that the VC dimension of an ELM equals its number of hidden nodes with probability one, which implies that ELM has a relatively low VC dimension. For regression problems, the generalization ability of ELM has been comprehensively studied in [10] and [11], leading to the conclusion that ELM with suitable activation functions, such as polynomials, the sigmoid and the Nadaraya–Watson function, can achieve the same optimal generalization bound as an SLFN in which all parameters are tunable. From the above analyses, it is known that ELM has excellent universal approximation capability and generalization performance. However, because ELM generates hidden nodes randomly, it usually requires more hidden nodes than a traditional SLFN to reach matched performance. A large network size always means more running time in the testing phase. In cost-sensitive learning, the testing time should be minimized, which requires a compact network to meet the testing time budget. Thus, the topic of improving the compactness of ELM has recently attracted great interest.

First, Huang et al. [12] proposed an incremental ELM (I-ELM), which randomly generates the hidden nodes and analytically calculates the output weights. Because I-ELM does not recalculate the output weights of the existing nodes when a new node is added, its convergence rate can be further improved by recalculating the output weights of the existing nodes with a convex optimization method whenever a new hidden node is randomly added [13]. In I-ELM, there may be some hidden nodes that play only a minor role in the network output and eventually increase the network complexity. To avoid this problem and obtain a more compact network, an enhanced version of I-ELM was presented [14], where in each learning step several hidden nodes are randomly generated and, among them, the hidden node leading to the largest decrease in residual error is added to the existing network. In [15], an error-minimized ELM was proposed, which can add random hidden nodes one by one or group by group; during the growth of the network, the output weights are updated incrementally. In [16], an optimally pruned extreme learning machine was presented, which is based on the original ELM with additional steps to make it more robust and compact. Deng et al. [17] adopted a two-stage stepwise strategy to improve the compactness of ELM. At the first stage, the selection procedure is automatically terminated based on the leave-one-out error. At the second stage, the contribution of each hidden node is reviewed, and insignificant ones are replaced; this procedure does not terminate until no insignificant hidden nodes exist in the final model. Applying a Bayesian approach to ELM yields the Bayesian ELM [18], which can not only optimize the network architecture but also exploit a priori knowledge and provide confidence intervals. Subsequently, a new learning algorithm called bidirectional ELM was presented [19], which can greatly enhance the learning effectiveness, reduce the number of hidden nodes, and eventually further increase the learning speed. In addition, the traditional ELM was extended by Zhang et al. [20] to use Lebesgue p-integrable hidden activation functions, which can approximate any Lebesgue p-integrable function on a compact input set. Further, a dynamic ELM was proposed [21], in which hidden nodes can be recruited or deleted dynamically according to their significance to network performance, so that not only the parameters are adjusted but also the architecture is self-adapted simultaneously. Castano et al. [22] proposed a robust and pruned ELM approach, where principal component analysis is utilized to select the hidden nodes from the input features while the corresponding input weights are deterministically defined as principal components rather than random ones. To handle large and complex sample problems, a stacked ELM was designed [23], which divides a single large ELM network into multiple serially connected small ELMs. In addition, to improve the performance of ELM on high-dimensional, small-sample problems, a projection vector machine [24] was proposed. For classification tasks [7], [25], [26], [27], [28], several algorithms were proposed to optimize the network size of ELM.

Recently, novel constructive and destructive parsimonious ELMs (CP-ELM and DP-ELM) were proposed based on recursive orthogonal least squares [29]. CP-ELM starts with a small initial network and gradually adds new hidden nodes until a satisfactory solution is found. In contrast, DP-ELM starts by training a larger than necessary network and then removes insignificant hidden nodes one by one. Further, Zhao et al. [30] extended CP-ELM and DP-ELM to the regularized ELM (RELM), correspondingly obtaining CP-RELM and DP-RELM. Generally speaking, compared with CP-ELM, DP-ELM can obtain a more compact network while reaching nearly the same accuracy, but it loses the edge in training time. The main reason is that each time DP-ELM removes an insignificant hidden node, a series of Givens rotations is carried out to compute the additional residual error reduction, and the computational burden of Givens rotations is high. Hence, a scheme is proposed to replace the Givens rotations and thereby accelerate DP-ELM (ADP-ELM). ADP-ELM not only needs fewer hidden nodes than CP-ELM but also beats it in training time. When this accelerating scheme is extended to DP-RELM, ADP-RELM is obtained. ADP-RELM further reduces the training time of DP-RELM and outperforms CP-RELM in terms of both the number of hidden nodes and the training time. In theory, the proposed accelerating scheme achieves the same effect as the Givens rotations but requires less training time when removing insignificant hidden nodes. Finally, extensive experiments are conducted on benchmark data sets, and the results verify the effectiveness of the proposed scheme.
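For orientation only, the sketch below outlines the generic greedy backward-elimination loop that destructive methods follow. The `residual` scoring here naively re-solves a least-squares problem per candidate; DP-ELM obtains the same quantities through Givens rotations on the triangular factor, and ADP-ELM through the cheaper equivalent measure of Section 4, neither of which is reproduced here.

```python
import numpy as np

def residual(Hs, T):
    """Squared Frobenius norm of the least-squares residual for the submatrix Hs."""
    beta = np.linalg.lstsq(Hs, T, rcond=None)[0]
    return float(np.linalg.norm(Hs @ beta - T) ** 2)

def destructive_prune(H, T, n_keep):
    """Greedy backward elimination of hidden nodes (generic outline, not the paper's
    recursive-orthogonal-least-squares implementation)."""
    keep = list(range(H.shape[1]))
    while len(keep) > n_keep:
        # score each remaining node by the residual left after its removal
        scores = [residual(H[:, [j for j in keep if j != s]], T) for s in keep]
        keep.remove(keep[int(np.argmin(scores))])   # drop the least significant node
    beta = np.linalg.lstsq(H[:, keep], T, rcond=None)[0]
    return keep, beta
```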

The remainder of this paper is organized as follows. In Section 2, ELM and RELM are briefly introduced. As two destructive algorithms, DP-ELM and DP-RELM are described in Section 3. In Section 4, an equivalent measure is proposed in place of Givens rotations to accelerate DP-ELM and DP-RELM, yielding ADP-ELM and ADP-RELM respectively, and their computational complexity is analyzed. Experimental results on ten benchmark data sets are presented in Section 5. Finally, conclusions follow.

Section snippets

ELM

For N arbitrary distinct samples $\{(x_i, t_i)\}_{i=1}^{N}$, where $x_i \in \mathbb{R}^n$ and $t_i \in \mathbb{R}^m$, the traditional SLFN with L hidden nodes and activation function $h(x)$ can be mathematically modeled as
$$y = \sum_{i=1}^{L} \beta_i h(x; a_i, b_i)$$
where $a_i \in \mathbb{R}^n$ is the input weight vector connecting the ith hidden node and the input nodes, $b_i$ is the bias of the ith hidden node, and $\beta_i \in \mathbb{R}^m$ is the output weight vector connecting the ith hidden node and the output nodes.

If the outputs of the SLFN equal the targets, the following compact formulation is obtained:
$$H\beta = T$$
where $H \in \mathbb{R}^{N\times L}$ is the hidden layer output matrix, $\beta \in \mathbb{R}^{L\times m}$ is the output weight matrix, and $T \in \mathbb{R}^{N\times m}$ is the target matrix.

Preliminary work

It has been proved in [3] that an ELM with at most N hidden nodes and with almost any nonlinear activation function can exactly learn N distinct samples. In fact, in real-world applications, the number of hidden nodes is always less than the number of training samples, i.e., L < N, and the matrix H has full column rank with probability one. Hence, the following theorem is obtained.

Theorem 1

The minimizer of Eq. (5) is identical to the minimizer of the following problem:

$$\min_{\beta}\left\{ G_{\mathrm{ELM}} = \left\| R_{L\times L}\,\beta - \hat{T}_{L\times m} \right\|_F^2 \right\}$$
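The reduction in Theorem 1 is the standard orthogonal least-squares step: factoring the hidden layer output matrix as $H = QR$ with orthonormal $Q$ turns $\min_\beta \|H\beta - T\|_F^2$ into $\min_\beta \|R\beta - \hat{T}\|_F^2$ plus a constant, where $\hat{T} = Q^\top T$. A small numpy check of this equivalence, assuming a full-column-rank H as in the text:

```python
import numpy as np

rng = np.random.default_rng(1)
N, L, m = 200, 10, 2
H = rng.standard_normal((N, L))      # stands in for the hidden layer output matrix
T = rng.standard_normal((N, m))      # stands in for the targets

Q, R = np.linalg.qr(H)               # economy-size QR: Q is N x L, R is L x L
T_hat = Q.T @ T                      # reduced L x m target matrix

beta_full = np.linalg.lstsq(H, T, rcond=None)[0]   # minimizer of ||H beta - T||_F^2
beta_reduced = np.linalg.solve(R, T_hat)           # minimizer of ||R beta - T_hat||_F^2
print(np.allclose(beta_full, beta_reduced))        # True: the two problems share a minimizer
```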

An equivalent measure

Assume that at the (i−1)th iteration of DP-ELM or DP-RELM, $R^{(i-1)}$ and $T^{(i-1)}$ have been obtained, and let
$$\hat{G}^{(i-1)} = \min_{\beta^{(i-1)}}\left\{ G^{(i-1)} = \left\| R^{(i-1)}\beta^{(i-1)} - T^{(i-1)} \right\|_F^2 \right\} \quad (24)$$
Evidently, $\hat{G}^{(i-1)} = 0$. When the regressor $r_s^{(i-1)}$ is removed at the ith iteration, Eq. (24) becomes
$$\hat{G}_s^{(i)} = \min_{\beta^{(i)}}\left\{ G_s^{(i)} = \left\| R_s^{(i)}\beta^{(i)} - T^{(i-1)} \right\|_F^2 \right\}$$
Then, define
$$\Delta_s^{(i)} = \hat{G}_s^{(i)} - \hat{G}^{(i-1)} = \hat{G}_s^{(i)}$$
where $\Delta_s^{(i)}$ represents the increase in the objective function in (24) due to the removal of the regressor $r_s^{(i-1)}$. Therefore, the following theorem is obtained.

Theorem 3

The
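To illustrate what $\Delta_s^{(i)}$ measures (only the criterion itself; the paper's contribution is evaluating it without Givens rotations, which is not reproduced here), one can compute, for each remaining column of R, the residual that results when that regressor is deleted:

```python
import numpy as np

def removal_increases(R, T):
    """Delta_s for each regressor: the value of min ||R_s beta - T||_F^2 after deleting
    column s, which equals the objective increase since the full system has zero residual."""
    deltas = []
    for s in range(R.shape[1]):
        R_s = np.delete(R, s, axis=1)                    # drop the s-th regressor
        beta_s = np.linalg.lstsq(R_s, T, rcond=None)[0]  # re-solve the reduced problem
        deltas.append(float(np.linalg.norm(R_s @ beta_s - T) ** 2))
    return deltas

# the least significant node is the one whose removal increases the objective least:
# s_star = int(np.argmin(removal_increases(R, T)))
```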

Experiments

In this section, we experimentally study the usefulness of the proposed scheme in accelerating the training processes of DP-ELM and DP-RELM. The experimental environment consists of a Windows 7 32-bit operating system, an Intel® Core™ i3-2310M CPU @ 2.10 GHz, 2.00 GB RAM, and the Matlab 2013b platform. Ten benchmark data sets (three multi-output data sets plus seven single-output ones) are utilized in the experiments, including Energy efficiency, Sml2010, Parkinsons, Airfoil, Abalone, Winequality white, CCPP,

Conclusions

Single-hidden layer feedforward networks have recently drawn much attention due to their excellent performance and simple structure. Unlike traditional SLFNs, where the input weights and the biases of the hidden layer need to be tuned, ELM randomly generates these parameters independently of the training samples, and its output weights can be analytically determined. In this situation, ELM may generate redundant hidden nodes. Hence, it is necessary to improve the compactness of the ELM network. CP-ELM

Acknowledgment

This research was partially supported by the National Natural Science Foundation of China under Grant no. 51006052 and the NUST Outstanding Scholar Supporting Program. Moreover, the authors wish to thank the anonymous reviewers for their constructive comments and great help in the writing process, which improved the manuscript significantly.


References (33)

  • Y.-P. Zhao et al., Parsimonious regularized extreme learning machine based on orthogonal transformation, Neurocomputing (2015)
  • G.B. Huang, Q.Y. Zhu, C.K. Siew, Extreme learning machine: a new learning scheme of feedforward neural networks, in:...
  • G.-B. Huang, C.-K. Siew, Extreme learning machine: RBF network case, in: Proceedings of 8th International Conference on...
  • D.E. Rumelhart et al., Learning representations by back-propagating errors, Nature (1986)
  • Y.A. LeCun et al., Efficient backprop, Lect. Notes Comput. Sci. (2012)
  • G.-B. Huang et al., Extreme learning machine for regression and multiclass classification, IEEE Trans. Syst. Man Cybern. Part B: Cybern. (2012)
Yong-Ping Zhao received his B.S. degree in thermal energy and power engineering from Nanjing University of Aeronautics and Astronautics, Nanjing, China, in July 2004. He then worked toward the M.S. and Ph.D. degrees in kernel methods at Nanjing University of Aeronautics and Astronautics, receiving the Ph.D. degree in December 2009. Currently, he is an associate professor with the School of Mechanical Engineering, Nanjing University of Science & Technology. His research interests include machine learning and kernel methods.

Bing Li was born in 1990. He received the B.S. degree from Xinxiang University, China, in 2014. He is currently pursuing the M.Eng. degree at Nanjing University of Science and Technology, China. His research interests include machine learning, pattern recognition, etc.

Ye-Bo Li studied at Shenyang Aerospace University, China, from September 2004 to June 2008, majoring in aircraft power engineering, and received his B.S. degree. From September 2008 to October 2014, he studied at Nanjing University of Aeronautics and Astronautics, China, majoring in aerospace propulsion theory and engineering, and received his M.S. and Ph.D. degrees. He is now an engineer at the AVIC Aeroengine Control Research Institute of China. His research interests include modelling, control and fault diagnosis of aircraft engines.
