Parsimonious regularized extreme learning machine based on orthogonal transformation
Introduction
As a learning algorithm for single-hidden-layer feed-forward neural networks (SLFNs), the extreme learning machine (ELM) [1], [2], [3] has recently become increasingly popular in a wide range of applications such as function approximation [4], [5], classification [5], [6], and density estimation [7], mainly owing to its low computational cost, good generalization ability, and ease of implementation. Traditionally, all the parameters of a feed-forward network are tuned with gradient-descent-based methods, which are generally very slow because of improper learning steps and easily get stuck in local minima. In contrast, ELM chooses the input weights and hidden-layer biases randomly, and the output weights can then be determined analytically via a simple generalized-inverse operation on the hidden-layer output matrix. Hence, the learning speed of ELM is thousands of times faster than that of traditional feed-forward network learning algorithms such as back-propagation [8]. Moreover, ELM seeks not only the smallest training error but also the smallest norm of the output weights. According to Bartlett's theory [9], the smaller the norm of the weights, the better the generalization performance the network tends to have; therefore, ELM tends to generalize well. Another well-known machine learning algorithm, the support vector machine (SVM) [10], [11], has been used extensively in classification and other fields over the last two decades because of its outstanding learning performance. In [12], ELM and SVM are compared from two viewpoints: the Vapnik–Chervonenkis (VC) dimension, and their performance under different training-sample sizes. It is shown that the VC dimension of an ELM equals its number of hidden nodes with probability one.
In addition, their generalization ability and computational complexity are examined as the training-sample size varies: ELMs generalize more weakly than SVMs on small samples but as well as SVMs on large samples, and they show a clear superiority in computational speed, especially on large-scale problems.
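For illustration, the ELM training scheme described above can be sketched in a few lines of NumPy. This is a minimal sketch, not the authors' code: the function names, the sigmoid activation, the uniform weight range, and the toy regression task are all my choices.

```python
import numpy as np

def elm_train(X, T, n_hidden, rng=np.random.default_rng(0)):
    """Minimal ELM regressor: random input weights and biases are drawn once
    and never tuned; output weights come from a generalized-inverse solve."""
    n_features = X.shape[1]
    W = rng.uniform(-1.0, 1.0, size=(n_features, n_hidden))  # random input weights
    b = rng.uniform(-1.0, 1.0, size=n_hidden)                # random hidden biases
    H = 1.0 / (1.0 + np.exp(-(X @ W + b)))   # hidden-layer output matrix (sigmoid)
    beta = np.linalg.pinv(H) @ T             # output weights via Moore-Penrose inverse
    return W, b, beta

def elm_predict(X, W, b, beta):
    H = 1.0 / (1.0 + np.exp(-(X @ W + b)))
    return H @ beta

# Toy usage: fit y = sin(x) on [-3, 3] with 30 random hidden nodes.
X = np.linspace(-3, 3, 200).reshape(-1, 1)
T = np.sin(X).ravel()
W, b, beta = elm_train(X, T, n_hidden=30)
err = np.mean((elm_predict(X, W, b, beta) - T) ** 2)
```

The single least-squares solve replaces the iterative gradient-descent training loop, which is where ELM's speed advantage comes from.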
In theory, an ELM with N hidden nodes and randomly chosen input weights and hidden-layer biases can exactly learn N distinct samples [13]. In practice, investigations show that N hidden nodes are usually not needed to construct a good ELM. On the one hand, randomly chosen input weights and biases may generate redundant or unimportant hidden nodes, so the hidden-layer output matrix may not have full column rank. In this case, ELM is ill-posed and often suffers from degraded generalization performance. On the other hand, following Occam's razor, "plurality should not be posited without necessity" [14], a practical nonlinear-modeling principle is to find the smallest model that generalizes well. Sparse models are preferable in engineering applications since a model's computational cost scales with its complexity; moreover, a sparse model is easier to interpret from the viewpoint of knowledge extraction [15]. Hence, it is necessary and important to find, among randomly generated hidden nodes, a sparse model that generalizes well while needing the fewest hidden nodes. Additionally, many learning algorithms, such as online sequential learning [16], [17], [18], also need a compact model before they start; that is, they were developed on the basis of an ELM with an a priori fixed network size.
Generally speaking, two approaches are used to sparsify ELM. The first comprises constructive algorithms, which begin with a small initial network (even a null model) and gradually recruit hidden nodes until a satisfactory solution is found. The incremental ELM [3] and its two variants [19], [20] are typical constructive algorithms. Subsequently, an error-minimized extreme learning machine [21] was proposed, which can add random hidden nodes one by one or group by group and automatically determines the number of hidden nodes. Based on multi-response sparse regression and the unbiased-risk-estimation criterion Cp, two constructive algorithms [22], [23] for ELM were developed to cope with single- and multi-output regression problems, respectively. Recently, a constructive parsimonious extreme learning machine (CP-ELM) [24] was presented, which first transforms the hidden-node output matrix equivalently into an upper triangular matrix using Givens rotations, and then recruits the hidden nodes incurring the largest additional residual-error reductions to build the final network. By contrast, the second approach, destructive algorithms (a.k.a. pruning algorithms), starts by training a larger-than-necessary network and then removes the redundant or unimportant hidden nodes. For classification problems, a pruned ELM [25] was proposed to reach a compact network classifier, in which irrelevant nodes are pruned from an initially large set of hidden nodes according to their relevance to the class labels. In [26], an optimally pruned ELM (OP-ELM) methodology was presented, which first builds the original ELM, then ranks the hidden nodes by multi-response sparse regression and selects them through leave-one-out validation. In parallel with CP-ELM, a destructive parsimonious ELM (DP-ELM) was correspondingly developed in [24].
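The constructive strategy can be sketched generically as a greedy forward selection over candidate hidden nodes. This is an illustrative sketch only, with names of my choosing: it refits by brute-force least squares at each step, whereas CP-ELM performs the equivalent selection efficiently via Givens rotations on an upper triangular factor.

```python
import numpy as np

def forward_select(H, t, max_nodes, tol=1e-6):
    """Greedy constructive selection: starting from a null model, repeatedly
    add the candidate hidden node (column of H) whose inclusion most reduces
    the squared residual error, until max_nodes are chosen or the improvement
    falls below tol."""
    selected = []
    best_err = float(np.dot(t, t))           # error of the null model
    for _ in range(max_nodes):
        remaining = [j for j in range(H.shape[1]) if j not in selected]
        errs = []
        for j in remaining:                  # try each remaining candidate
            Hs = H[:, selected + [j]]
            beta, *_ = np.linalg.lstsq(Hs, t, rcond=None)
            r = t - Hs @ beta
            errs.append(float(np.dot(r, r)))
        k = int(np.argmin(errs))
        if best_err - errs[k] < tol:         # no meaningful improvement: stop
            break
        selected.append(remaining[k])
        best_err = errs[k]
    return selected, best_err

# Toy usage: the target lies exactly in the span of two candidate columns,
# so greedy selection should drive the residual to (numerically) zero.
rng = np.random.default_rng(1)
H = rng.standard_normal((50, 10))            # candidate hidden-node outputs
t = 2.0 * H[:, 0] - H[:, 3]
selected, err = forward_select(H, t, max_nodes=5)
```

A destructive (pruning) algorithm would run the mirror image of this loop: start from all columns and repeatedly drop the node whose removal increases the residual least.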
To improve on the original ELM, it was regularized using Tikhonov's method [27], giving the regularized extreme learning machine (RELM) [28]; a sequential learning algorithm for RELM [29] was developed as well. Experimental studies [30], [31], [32], [33] demonstrated that RELM can overcome the ill-posedness and control the machine-learning complexity so as to enhance generalization performance. Like the original ELM, however, the solution of RELM is dense. Compared with the many efforts mentioned above on sparsifying ELM, far less attention has been paid to RELM. An improvement of OP-ELM was proposed to determine a compact network for RELM, using a cascade of two regularization penalties: first an L1 penalty to rank the neurons of the hidden layer via least angle regression [34], followed by a Tikhonov regularization penalty on the regression weights for numerical stability and efficient pruning of the neurons; this destructive algorithm is named TROP-ELM for short [35]. To obtain a low-complexity, sparse solution, a fast sparse approximation scheme for RELM (FSA-RELM) [36] was recently introduced, which begins with a null solution and gradually selects a new hidden node according to certain criteria, repeating the procedure until a stopping criterion is met. FSA-RELM is thus a typical constructive algorithm.
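The dense RELM solution can be sketched as a ridge-regression solve on the hidden-layer output matrix. This assumes the usual Tikhonov objective ||H·beta − t||² + lam·||beta||² with regularization parameter `lam`; the paper's own notation (e.g. its Eq. (7)) may differ, and the function name is mine.

```python
import numpy as np

def relm_weights(H, t, lam=1e-2):
    """Dense RELM output weights: minimize ||H @ beta - t||^2 + lam * ||beta||^2.
    The ridge term lam * I makes the normal equations well-posed even when H
    has linearly dependent (redundant) columns."""
    L = H.shape[1]
    return np.linalg.solve(H.T @ H + lam * np.eye(L), H.T @ t)

# Toy usage: duplicate a column so that H is rank-deficient; the plain
# normal equations would be singular, but the regularized solve is stable.
rng = np.random.default_rng(0)
H = rng.standard_normal((30, 5))
H = np.hstack([H, H[:, :1]])     # duplicated column: H^T H is singular
t = rng.standard_normal(30)
beta = relm_weights(H, t, lam=1e-2)
```

Note that every entry of `beta` is generally nonzero, i.e. every hidden node stays active, which is exactly the density that CP-RELM and DP-RELM set out to remove.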
Hence, this paper investigates the sparsification of RELM by extending CP-ELM and DP-ELM to it. The main contributions are as follows:
- (1)
Enlightened by the ideas of CP-ELM and DP-ELM, we extend them to RELM, yielding CP-RELM and DP-RELM, respectively. CP-RELM and DP-RELM require fewer hidden nodes than CP-ELM and DP-ELM, respectively, at nearly the same prediction accuracy.
- (2)
During the process of sparsifying RELM, two schemes are adopted. Although the two schemes are completely equivalent from a theoretical viewpoint, their empirical behavior differs to some extent.
- (3)
In CP-ELM and DP-ELM, the number of candidate hidden nodes must be kept small because the candidate hidden-node output matrix is transformed into an upper triangular matrix by an orthogonal decomposition; if the number of candidate hidden nodes is large, the hidden-node output matrix becomes singular or nearly singular. Thanks to the Tikhonov regularization, we can initialize RELM with many more candidate hidden nodes (even more than the number of training samples), so that better hidden nodes can be recruited to improve generalization performance and produce a more compact network.
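The third point above, that Tikhonov regularization admits more candidate hidden nodes L than training samples N, rests on a standard ridge identity: (HᵀH + lam·I_L)⁻¹Hᵀ = Hᵀ(HHᵀ + lam·I_N)⁻¹, so the over-parameterized case only ever requires an N×N solve. The sketch below illustrates this identity; it is my illustration of the general principle, not the paper's algorithm, and the names and `lam` are assumptions.

```python
import numpy as np

def relm_weights_many_nodes(H, t, lam=1e-2):
    """Ridge solution when hidden nodes L exceed samples N. By the identity
    (H^T H + lam I_L)^{-1} H^T = H^T (H H^T + lam I_N)^{-1}, only an N x N
    system is solved, even though H^T H (L x L) is singular for L > N."""
    N = H.shape[0]
    return H.T @ np.linalg.solve(H @ H.T + lam * np.eye(N), t)

# Toy usage: L = 50 candidate hidden nodes, only N = 20 training samples.
rng = np.random.default_rng(2)
H = rng.standard_normal((20, 50))
t = rng.standard_normal(20)
beta_dual = relm_weights_many_nodes(H, t, lam=1e-2)
# The direct (primal) L x L regularized solve gives the same weights.
beta_primal = np.linalg.solve(H.T @ H + 1e-2 * np.eye(50), H.T @ t)
```

Without the lam·I term, neither form would be solvable here, since both HᵀH and the interpolation problem itself are rank-deficient when L > N.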
The rest of the paper is organized as follows. In Section 2, ELM and RELM are introduced. In Section 3, the ideas behind CP-ELM and DP-ELM are described. In Section 4, we extend these ideas to sparsify RELM with two schemes, yielding CP-RELM and DP-RELM. To confirm the effectiveness and feasibility of the proposed algorithms, eleven benchmark data sets divided into two groups are used for experiments in Section 5. Finally, conclusions follow.
Section snippets
ELM and RELM
In this section, we briefly describe the essence of ELM first, and then introduce RELM. As for the theoretical foundation of ELM, Huang and Babri [37] studied the learning performance of SLFNs on small data sets and found that an SLFN with at most N hidden neurons can learn N distinct samples with zero error using any bounded nonlinear activation function. On this basis, Huang et al. [1], [2], [3] proposed the ELM algorithm. The main concept behind ELM lies in the
CP-ELM and DP-ELM
From the preliminary work in [2], it is easily seen that the original ELM, needing at most N hidden nodes, can approximate N distinct samples to any accuracy. In real-world applications, however, the number of hidden nodes is usually less than the number of training samples, i.e., , and is of full column rank. Therefore, the following theorem is obtained: Theorem 1 The minimizer of Eq. (4) is equivalent to the following minimizer: where
CP-RELM and DP-RELM
In order to use the ideas behind CP-ELM and DP-ELM to sparsify RELM, some preliminaries must be established for Eq. (7). Hence, the following theorems are introduced in advance. Theorem 2 The minimizer of Eq. (7) is completely equivalent to the following minimizer: where , . Proof The optimal solution of Eq. (7) is found by solving Then, the solution of Eq. (32) in the least-squares sense can be obtained via the following optimization problem: This
Experiments
To show the usefulness of the proposed methods empirically, regression experiments are performed on the 11 benchmark data sets specified in Table 1, which are divided into two groups. In our experiments, all the inputs (attributes) are normalized into the range [−1, 1] and the outputs (targets) into [0, 1]. Two typical activation functions are chosen for the hidden nodes, i.e. the Sigmoid and the RBF
Conclusions
The extreme learning machine, as an emerging learning tool in the machine learning community, has attracted much attention in the past few years. Because of its simple formulation and high computational efficiency, ELM has rapidly found wide application in many fields. In theory, it has been shown that an ELM with N random hidden nodes can exactly learn N distinct samples. In practice, there are two potential risks in doing this. On one hand, N hidden nodes may make the network too complicated and thus
Acknowledgments
This research was partially supported by the National Natural Science Foundation of China under Grant no. 51006052 and the NUST Outstanding Scholar Supporting Program. Moreover, the authors wish to thank the anonymous reviewers for their constructive comments and great help in the writing process, which improved the manuscript significantly.
References (44)
- et al., Extreme learning machine: theory and applications, Neurocomputing (2006)
- et al., Fast learning circular complex-valued extreme learning machine (CC–ELM) for real-valued classification problems, Inf. Sci. (2012)
- et al., A comparative analysis of support vector machines and extreme learning machines, Neural Netw. (2012)
- et al., Ensemble of online sequential extreme learning machine, Neurocomputing (2009)
- et al., Online sequential extreme learning machine with forgetting mechanism, Neurocomputing (2012)
- et al., Convex incremental extreme learning machine, Neurocomputing (2007)
- et al., Enhanced random search based incremental extreme learning machine, Neurocomputing (2008)
- et al., Constructive hidden nodes selection of extreme learning machine for regression, Neurocomputing (2010)
- et al., Constructive multi-output extreme learning machine with application to large tanker motion dynamics identification, Neurocomputing (2014)
- et al., A fast pruned-extreme learning machine for classification problem, Neurocomputing (2008)
- Regularized online sequential learning algorithm for single-hidden layer feedforward neural networks, Pattern Recognit. Lett.
- TROP-ELM: a double-regularized ELM using LARS and Tikhonov regularization, Neurocomputing
- Fast sparse approximation of extreme learning machine, Neurocomputing
- Optimization method based extreme learning machine for classification, Neurocomputing
- Universal approximation using incremental constructive feedforward networks with random hidden nodes, IEEE Trans. Neural Netw.
- Evolutionary selection extreme learning machine optimization for regression, Soft Comput.
- Extreme learning machine for regression and multiclass classification, IEEE Trans. Syst. Man Cybern. Part B: Cybern.
- Learning representations by back-propagating errors, Nature
- Sample complexity of pattern classification with neural networks: the size of the weights is more important than the size of the network, IEEE Trans. Inf. Theory
- Support-vector networks, Mach. Learn.
Cited by (12)
- Soft extreme learning machine for fault detection of aircraft engine, 2019, Aerospace Science and Technology
- C-loss based extreme learning machine for estimating power of small-scale turbojet engine, 2019, Aerospace Science and Technology
- A robust extreme learning machine for modeling a small-scale turbojet engine, 2018, Applied Energy
- Feature selection of generalized extreme learning machine for regression problems, 2018, Neurocomputing. Citation excerpt: "Moreover, an error-minimization-based method [51] was presented to grow hidden nodes for ELM one by one or group by group, and output weights are incrementally updated, which significantly reduces the computational complexity. Recently, the constructive parsimonious ELMs [52–54] were developed based on the orthogonal transformation. These aforementioned algorithms are inside the range of the forward learning algorithms."
- Retargeting extreme learning machines for classification and their applications to fault diagnosis of aircraft engine, 2017, Aerospace Science and Technology. Citation excerpt: "Thereinto, the constructive and destructive algorithms play an important role in this respect. The constructive algorithms start with a small initial size and gradually recruit new hidden nodes until a required solution is found, such as the incremental ELMs [10–14]. In contrast, the destructive algorithms, also known as pruning algorithms [15,16], are initialized with a network of a larger than necessary size, and then the redundant or less effective hidden nodes are gradually removed."
- The selection of input weights of extreme learning machine: A sample structure preserving point of view, 2017, Neurocomputing. Citation excerpt: "This point of view makes the classical results (e.g. Koksma–Hlawka inequality) of the QMC method applicable and the theoretical analysis more direct, while in [30,31] a new discrepancy measure, called box discrepancy, had to be defined because the variation of the deduced integrand corresponding to the shift-invariant kernel is not bounded. In previous works [34–37], the orthogonalization technique has been used in ELM. In [34,35], the orthogonalization operations are conducted on the output matrix of the hidden layer of ELM, not on the input weights."
Yong-Ping Zhao received his B.S. degree in thermal energy and power engineering from Nanjing University of Aeronautics and Astronautics, Nanjing, China, in July 2004. He then worked toward the M.S. and Ph.D. degrees in kernel methods at Nanjing University of Aeronautics and Astronautics, receiving the Ph.D. degree in December 2009. Currently, he is an associate professor with the School of Mechanical Engineering, Nanjing University of Science & Technology. His research interests include machine learning and kernel methods.
Kang-Kang Wang was born in 1989. He received the B.Eng. degree from Heilongjiang University of Science and Technology, China, in 2012. He is currently pursuing the M.Eng. degree at Nanjing University of Science and Technology, China. His research interests include machine learning, pattern recognition, etc.
Ye-Bo Li received his B.S. degree in thermal energy and power engineering from Shenyang Aerospace University, Shenyang, China, in July 2008. From September 2008 to April 2011, he studied at Nanjing University of Aeronautics and Astronautics, China, majoring in Aerospace Propulsion Theory and Engineering, and received his M.S. degree. He is now pursuing the Ph.D. degree at Nanjing University of Aeronautics and Astronautics. His research interests include aero-engine control system design and the establishment of aero-engines' mathematical models.