An accelerating scheme for destructive parsimonious extreme learning machine
Introduction
The widespread popularity of single-hidden-layer feedforward networks (SLFNs) in many fields is mainly due to their power to approximate complex nonlinear mappings and their simple forms. As a specific type of SLFN, the extreme learning machine (ELM) [1], [2], [3] has recently drawn considerable interest from researchers and engineers. Generally, training an ELM consists of two main stages [4]: (1) random feature mapping and (2) solving for the linear parameters. In the first stage, ELM randomly initializes the hidden layer to map the input samples into a so-called ELM feature space through nonlinear mapping functions, which can be any nonlinear piecewise continuous functions, such as the sigmoid and the RBF. Since the hidden node parameters in ELM are randomly generated according to a continuous probability distribution, independent of the training samples and without any tuning, ELM enjoys a remarkable computational advantage over regular gradient-descent backpropagation [5], [6]. That is, unlike conventional learning methods, which must see the training samples before generating the hidden node parameters, ELM can generate the hidden node parameters before seeing the training samples. In the second stage, the linear parameters are obtained through the Moore-Penrose generalized inverse of the hidden layer output matrix, which attains both the smallest training error and the smallest norm of output weights [7]. Hence, unlike most algorithms for feedforward neural networks, which did not consider generalization performance when they were first proposed, ELM can achieve good generalization performance.
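The two-stage procedure can be sketched in a few lines of NumPy. The following is a minimal illustration under our own assumptions (function names, sigmoid activation, and settings are ours, not the authors' code):

```python
import numpy as np

def elm_train(X, T, L, seed=0):
    """Basic ELM training: random hidden layer, analytic output weights."""
    rng = np.random.default_rng(seed)
    n = X.shape[1]
    # Stage 1: random feature mapping -- input weights and biases are
    # drawn from a continuous distribution and never tuned.
    W = rng.standard_normal((n, L))
    b = rng.standard_normal(L)
    H = 1.0 / (1.0 + np.exp(-(X @ W + b)))   # hidden layer output matrix
    # Stage 2: output weights via the Moore-Penrose generalized inverse,
    # the minimum-norm least-squares solution of H @ beta = T.
    beta = np.linalg.pinv(H) @ T
    return W, b, beta

def elm_predict(X, W, b, beta):
    H = 1.0 / (1.0 + np.exp(-(X @ W + b)))
    return H @ beta
```

Here the sigmoid is used as the activation function; any nonlinear piecewise continuous function could be substituted.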
From the interpolation perspective, Huang et al. [3] showed that for any given training set there exists an ELM network that achieves an arbitrarily small training error in the squared error sense with probability one, with the number of hidden nodes no larger than the number of distinct training samples. If the number of hidden nodes equals the number of distinct training samples, the training error decreases to zero with probability one. In practice, it is found that the generalization performance of ELM is not sensitive to the number of hidden nodes, and good performance can be reached as long as the number of hidden nodes is large enough [8]. In addition, unlike traditional SLFNs, such as radial basis function networks and multilayer perceptrons with one hidden layer, where the activation functions are required to be continuous or differentiable, ELM can choose the threshold function and many others as activation functions without sacrificing universal approximation capability. In statistical learning theory, Vapnik–Chervonenkis (VC) dimension theory is one of the most widely used frameworks for generalization bound analysis. From the structural risk minimization perspective, to obtain better generalization performance on the testing set, an algorithm should not only achieve a low training error but also have a low VC dimension. For classification tasks, Liu et al. [9] proved that the VC dimension of an ELM equals its number of hidden nodes with probability one, which shows that ELM has a relatively low VC dimension. For regression problems, the generalization ability of ELM has been comprehensively studied in [10] and [11], leading to the conclusion that ELM with suitable activation functions, such as polynomials, the sigmoid, and the Nadaraya–Watson function, can achieve the same optimal generalization bound as an SLFN in which all parameters are tunable.
From the above analyses, it is known that ELM enjoys excellent universal approximation capability and generalization performance. However, because ELM generates hidden nodes randomly, it usually requires more hidden nodes than a traditional SLFN to achieve matched performance. A large network size implies longer running time in the testing phase. In cost-sensitive learning, the testing time should be minimized, which requires a compact network to meet the testing time budget. Thus, improving the compactness of ELM has recently attracted great interest.
First, Huang et al. [12] proposed an incremental ELM (I-ELM), which randomly generates the hidden nodes and analytically calculates the output weights. Because I-ELM does not recalculate the output weights of the existing nodes when a new node is added, its convergence rate can be further improved by recalculating those weights with a convex optimization method each time a new hidden node is randomly added [13]. In I-ELM, some hidden nodes may play a very minor role in the network output and needlessly increase the network complexity. To avoid this problem and obtain a more compact network, an enhanced version of I-ELM was presented [14], in which several hidden nodes are randomly generated at each learning step and the one leading to the largest residual error reduction is added to the existing network. In [15], an error-minimized ELM was proposed, which can add random hidden nodes one by one or group by group; during the growth of the network, the output weights are updated incrementally. In [16], an optimally pruned extreme learning machine was presented, which augments the original ELM with additional steps to make it more robust and compact. Deng et al. [17] adopted a two-stage stepwise strategy to improve the compactness of ELM. At the first stage, the selection procedure is automatically terminated based on the leave-one-out error. At the second stage, the contribution of each hidden node is reviewed and insignificant ones are replaced; this procedure does not terminate until no insignificant hidden nodes remain in the final model. Applying the Bayesian approach to ELM yields the Bayesian ELM [18], which can not only optimize the network architecture but also exploit a priori knowledge and provide confidence intervals.
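To make the incremental idea concrete, here is a hedged I-ELM-style sketch (our own simplified illustration, not the code of [12]): each new random node's output weight is computed analytically from the current residual and never revisited.

```python
import numpy as np

def ielm_train(X, t, L_max, seed=0):
    """I-ELM-style sketch: add random hidden nodes one at a time."""
    rng = np.random.default_rng(seed)
    N, n = X.shape
    e = t.astype(float).copy()     # residual error, initialized to the target
    nodes = []
    for _ in range(L_max):
        w = rng.standard_normal(n)
        b = rng.standard_normal()
        g = np.tanh(X @ w + b)     # new node's output on all samples
        beta = (e @ g) / (g @ g)   # least-squares weight for this node alone
        e = e - beta * g           # residual norm never increases
        nodes.append((w, b, beta))
    return nodes, e
```

Because existing output weights are frozen, each step is cheap, but low-contribution nodes can accumulate, which is exactly the compactness issue the enhanced and pruning methods above address.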
Subsequently, a new learning algorithm called bidirectional ELM was presented [19], which can greatly enhance learning effectiveness, reduce the number of hidden nodes, and further increase the learning speed. In addition, the traditional ELM was extended by Zhang et al. [20] to use Lebesgue p-integrable hidden activation functions, with which it can approximate any Lebesgue p-integrable function on a compact input set. Further, a dynamic ELM was proposed [21] in which hidden nodes can be recruited or deleted dynamically according to their significance to network performance, so that not only the parameters but also the architecture can be adapted simultaneously. Castano et al. [22] proposed a robust and pruned ELM approach, in which principal component analysis is utilized to select the hidden nodes from the input features, while the corresponding input weights are deterministically defined as principal components rather than random ones. To handle large and complex sample sets, a stacked ELM was designed [23], which divides a single large ELM network into multiple serially connected small ELMs. In addition, to improve the performance of ELM on high-dimension, small-sample problems, a projection vector machine [24] was proposed. For classification tasks [7], [25], [26], [27], [28], several algorithms have been proposed to optimize the network size of ELM.
Recently, novel constructive and destructive parsimonious ELMs (CP-ELM and DP-ELM) were proposed based on recursive orthogonal least squares [29]. CP-ELM starts with a small initial network and gradually adds new hidden nodes until a satisfactory solution is found. By contrast, DP-ELM starts by training a larger-than-necessary network and then removes insignificant nodes one by one. Further, Zhao et al. [30] extended CP-ELM and DP-ELM to the regularized ELM (RELM), obtaining CP-RELM and DP-RELM. Generally speaking, compared with CP-ELM, DP-ELM obtains a more compact network at nearly the same accuracy, but it loses the edge in training time. The main reason is that whenever DP-ELM removes an insignificant hidden node, a series of Givens rotations is performed to compute the resulting residual error reduction, and the computational burden of Givens rotations is high. Hence, in this paper a scheme that replaces the Givens rotations is proposed to accelerate DP-ELM, yielding ADP-ELM. ADP-ELM not only needs fewer hidden nodes than CP-ELM but also beats it in training time. Extending this accelerating scheme to DP-RELM yields ADP-RELM, which further reduces the training time of DP-RELM and outperforms CP-RELM in terms of both the number of hidden nodes and the training time. In theory, the proposed accelerating scheme achieves the same effect as the Givens rotations but requires less training time when removing insignificant hidden nodes. Finally, extensive experiments are conducted on benchmark data sets, and the results verify the effectiveness of the proposed scheme.
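For context on the cost being avoided: a single Givens rotation is a 2×2 orthogonal transform that zeroes one matrix entry, and removing a node from a triangular factor triggers a whole cascade of them, each touching two full rows. A minimal, purely illustrative sketch:

```python
import numpy as np

def givens(a, b):
    """Return (c, s) such that [[c, s], [-s, c]] @ [a, b]^T = [r, 0]^T."""
    if b == 0.0:
        return 1.0, 0.0
    r = np.hypot(a, b)
    return a / r, b / r

# Zero the (1, 0) entry of a small matrix with one rotation of rows 0 and 1.
A = np.array([[3.0, 1.0],
              [4.0, 2.0]])
c, s = givens(A[0, 0], A[1, 0])
G = np.array([[c, s],
              [-s, c]])
R = G @ A   # R[1, 0] is zeroed; R[0, 0] becomes hypot(3, 4) = 5
```

Restoring triangularity after deleting one column of an upper-triangular factor requires a sequence of such rotations, which is the per-removal cost the proposed equivalent measure is designed to avoid.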
The remainder of this paper is organized as follows. In Section 2, ELM and RELM are briefly introduced. As two destructive algorithms, DP-ELM and DP-RELM are described in Section 3. In Section 4, an equivalent measure is proposed in place of the Givens rotations to accelerate DP-ELM and DP-RELM, yielding ADP-ELM and ADP-RELM, and their computational complexity is analyzed. Experimental results on ten benchmark data sets are presented in Section 5. Finally, conclusions follow.
ELM
For $N$ arbitrary distinct samples $(\mathbf{x}_j, \mathbf{t}_j)$, where $\mathbf{x}_j \in \mathbb{R}^n$ and $\mathbf{t}_j \in \mathbb{R}^m$, the traditional SLFN with $L$ hidden nodes and activation function $g(\cdot)$ can be mathematically modeled as
$$\sum_{i=1}^{L} \boldsymbol{\beta}_i\, g(\mathbf{w}_i \cdot \mathbf{x}_j + b_i) = \mathbf{o}_j, \qquad j = 1, \ldots, N,$$
where $\mathbf{w}_i$ is the input weight vector connecting the $i$th hidden node and the input nodes, $b_i$ is the bias of the $i$th hidden node, and $\boldsymbol{\beta}_i$ is the output weight vector connecting the $i$th hidden node and the output nodes.
If the outputs of the SLFN equal the targets, i.e., $\mathbf{o}_j = \mathbf{t}_j$ for all $j$, the following compact formulation is obtained:
$$\mathbf{H}\boldsymbol{\beta} = \mathbf{T},$$
where $\mathbf{H}$ is the hidden layer output matrix, $\boldsymbol{\beta}$ the output weight matrix, and $\mathbf{T}$ the target matrix.
Preliminary work
According to the conclusion in [3], an ELM with at most N hidden nodes and almost any nonlinear activation function can exactly learn N distinct samples. In fact, in real-world applications the number of hidden nodes is usually less than the number of training samples, i.e., L < N, and the matrix H has full column rank with probability one. Hence, the following theorem is obtained. Theorem 1. The minimizer of Eq. (5) equals the following minimizer:
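When H has full column rank, the Moore-Penrose solution coincides with the normal-equations solution $\boldsymbol{\beta} = (\mathbf{H}^{\mathrm{T}}\mathbf{H})^{-1}\mathbf{H}^{\mathrm{T}}\mathbf{T}$. A quick NumPy check of this equivalence (our own illustration, with arbitrary random data):

```python
import numpy as np

rng = np.random.default_rng(1)
H = rng.standard_normal((50, 10))   # tall matrix: L < N, full column rank w.p. 1
T = rng.standard_normal((50, 3))

beta_pinv = np.linalg.pinv(H) @ T              # Moore-Penrose generalized inverse
beta_ne = np.linalg.solve(H.T @ H, H.T @ T)    # normal equations (H^T H) beta = H^T T
```

The two solutions agree to machine precision for a well-conditioned H; the pseudoinverse form additionally handles the rank-deficient case.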
An equivalent measure
Assume that at the (i−1)th iteration of DP-ELM or DP-RELM the quantities … and … are obtained, and let … . Evidently, … . When the regressor is removed at the ith iteration, Eq. (24) becomes … . Then define …, where … represents the increase in the objective function in (24) due to the removal of the regressor. Therefore, the following theorem is obtained. Theorem 3. The
Experiments
In this section, we experimentally study the usefulness of the proposed scheme in accelerating the training processes of DP-ELM and DP-RELM. The experimental environment is: Windows 7 32-bit operating system, Intel® Core™ i3-2310M CPU @ 2.10 GHz, 2.00 GB RAM, and the Matlab 2013b platform. Ten benchmark data sets (three multi-output data sets plus seven single-output ones) are used in the experiments: Energy efficiency, Sml2010, Parkinsons, Airfoil, Abalone, Winequality white, CCPP,
Conclusions
Single-hidden-layer feedforward networks have recently drawn much attention due to their excellent performance and simple forms. Unlike traditional SLFNs, where the input weights and the biases of the hidden layer need to be tuned, ELM randomly generates these parameters independently of the training samples, and its output weights can be analytically determined. In this situation, ELM may generate redundant hidden nodes. Hence, it is necessary to improve the compactness of the ELM network. CP-ELM
Acknowledgment
This research was partially supported by the National Natural Science Foundation of China under Grant no. 51006052 and the NUST Outstanding Scholar Supporting Program. Moreover, the authors wish to thank the anonymous reviewers for their constructive comments and great help in the writing process, which improved the manuscript significantly.
References (33)
- et al., Extreme learning machine: theory and applications, Neurocomputing (2006)
- et al., Trends in extreme learning machines: a review, Neural Netw. (2015)
- et al., Optimization method based extreme learning machine for classification, Neurocomputing (2010)
- et al., A comparative analysis of support vector machines and extreme learning machines, Neural Netw. (2012)
- et al., Convex incremental extreme learning machine, Neurocomputing (2007)
- et al., Enhanced random search based incremental extreme learning machine, Neurocomputing (2008)
- et al., Fast automatic two-stage nonlinear model identification based on the extreme learning machine, Neurocomputing (2011)
- et al., Projection vector machine, Neurocomputing (2013)
- et al., A fast pruned-extreme learning machine for classification problem, Neurocomputing (2008)
- et al., A novel automatic two-stage locally regularized classifier construction method using the extreme learning machine, Neurocomputing (2013)
- Parsimonious regularized extreme learning machine based on orthogonal transformation, Neurocomputing
- Learning representations by back-propagating errors, Nature
- Efficient backprop, Lect. Notes Comput. Sci.
- Extreme learning machine for regression and multiclass classification, IEEE Trans. Syst. Man Cybern. Part B: Cybern.
Yong-Ping Zhao received his B.S. degree in thermal energy and power engineering from Nanjing University of Aeronautics and Astronautics, Nanjing, China, in July 2004. He then pursued the M.S. and Ph.D. degrees in kernel methods at Nanjing University of Aeronautics and Astronautics, receiving the Ph.D. degree in December 2009. Currently, he is an associate professor with the School of Mechanical Engineering, Nanjing University of Science & Technology. His research interests include machine learning and kernel methods.
Bing Li was born in 1990. He received the B.S. degree from Xinxiang University, China, in 2014. He is currently pursuing the M.Eng. degree from Nanjing University of Science and Technology, China. His research interests include machine learning, pattern recognition, etc.
Ye-Bo Li studied at Shenyang Aerospace University, China, from September 2004 to June 2008, majoring in aircraft power engineering, and received the B.S. degree. From September 2008 to October 2014, he studied at Nanjing University of Aeronautics and Astronautics, China, majoring in aerospace propulsion theory and engineering, and received the M.S. and Ph.D. degrees. He is now an engineer at the AVIC Aeroengine Control Research Institute of China. His research interests include modelling, control, and fault diagnosis of aircraft engines.