
Neurocomputing

Volume 156, 25 May 2015, Pages 280-296

Parsimonious regularized extreme learning machine based on orthogonal transformation

https://doi.org/10.1016/j.neucom.2014.12.046

Abstract

Recently, two parsimonious algorithms were proposed to sparsify the extreme learning machine (ELM): the constructive parsimonious ELM (CP-ELM) and the destructive parsimonious ELM (DP-ELM). In this paper, the ideas behind CP-ELM and DP-ELM are extended to the regularized ELM (RELM), yielding CP-RELM and DP-RELM. Each of CP-RELM and DP-RELM can be realized by two schemes, namely CP-RELM-I and CP-RELM-II (DP-RELM-I and DP-RELM-II). Generally speaking, CP-RELM-II (DP-RELM-II) outperforms CP-RELM-I (DP-RELM-I) in terms of parsimoniousness. Under nearly the same generalization performance, CP-RELM-II (DP-RELM-II) usually needs fewer hidden nodes than CP-ELM (DP-ELM). In addition, unlike CP-ELM and DP-ELM, CP-RELM and DP-RELM allow the number of candidate hidden nodes to exceed the number of training samples, which helps select better hidden nodes and construct more compact networks. Finally, experiments on eleven benchmark data sets, divided into two groups, demonstrate the usefulness of the proposed algorithms.

Introduction

As a new learning algorithm for single-hidden-layer feed-forward neural networks (SLFNs), the extreme learning machine (ELM) [1], [2], [3] has recently become increasingly popular in a wide range of applications such as function approximation [4], [5], classification [5], [6], and density estimation [7]. Its popularity stems mainly from its low computational cost, good generalization ability, and ease of implementation. Traditionally, all the parameters of feed-forward networks are tuned with gradient descent-based methods, which are generally very slow owing to improper learning steps or easily become stuck in local minima. In contrast, in ELM the input weights and hidden layer biases are chosen randomly, and the output weights are determined analytically via a generalized inverse of the hidden layer output matrix. Hence, the learning speed of ELM is thousands of times faster than that of traditional feed-forward network learning algorithms such as back-propagation [8]. Moreover, ELM aims not only at the smallest training error but also at the smallest norm of the output weights. According to Bartlett's theory [9], the smaller the norm of the weights, the better the generalization performance the network tends to have, so ELM tends to generalize well. Another well-known machine learning algorithm, the support vector machine (SVM) [10], [11], has been used extensively in classification and other fields over the last two decades because of its outstanding learning performance. In [12], ELM and SVM are compared from two viewpoints: the Vapnik–Chervonenkis (VC) dimension, and their performance under different training sample sizes. It is shown that the VC dimension of an ELM equals its number of hidden nodes with probability one, and the generalization ability and computational complexity of the two methods are examined as the training sample size varies. ELMs generalize more weakly than SVMs on small samples but as well as SVMs on large samples, while offering a clear advantage in computational speed, especially for large-scale problems.
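
To make the training procedure just described concrete, here is a minimal Python sketch (our own illustration, not code from the paper): random input weights and biases are drawn, the hidden-layer output matrix H is formed with a sigmoid activation, and the output weights are computed with the Moore–Penrose pseudoinverse. All names and the toy data are assumptions for illustration only.

```python
import numpy as np

def elm_train(X, d, L, seed=0):
    """Minimal ELM sketch: X is (N, n) inputs, d is (N,) targets, L hidden nodes."""
    rng = np.random.default_rng(seed)
    n = X.shape[1]
    A = rng.uniform(-1.0, 1.0, size=(n, L))   # random input weights
    b = rng.uniform(-1.0, 1.0, size=L)        # random hidden-layer biases
    H = 1.0 / (1.0 + np.exp(-(X @ A + b)))    # sigmoid hidden-layer output matrix, shape (N, L)
    beta = np.linalg.pinv(H) @ d              # output weights via Moore-Penrose pseudoinverse
    return A, b, beta

def elm_predict(X, A, b, beta):
    H = 1.0 / (1.0 + np.exp(-(X @ A + b)))
    return H @ beta

# toy usage: fit a noisy sine curve with 30 random hidden nodes
X = np.linspace(-1, 1, 200).reshape(-1, 1)
d = np.sin(3 * X[:, 0]) + 0.05 * np.random.default_rng(1).standard_normal(200)
A, b, beta = elm_train(X, d, L=30)
print("train MSE:", np.mean((elm_predict(X, A, b, beta) - d) ** 2))
```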

In theory, an ELM with N hidden nodes and randomly chosen input weights and hidden layer biases can exactly learn N distinct samples [13]. In practice, investigations show that N hidden nodes are usually not needed to construct a good ELM. On one hand, randomly chosen input weights and biases may generate redundant or unimportant hidden nodes, so the hidden layer output matrix may not have full column rank. In this case ELM is ill-posed and often suffers from deteriorated generalization performance. On the other hand, according to Occam's razor, "plurality should not be posited without necessity" [14], a practical nonlinear modeling principle is to find the smallest model that generalizes well. Sparse models are preferable in engineering applications since a model's computational cost scales with its complexity, and a sparse model is easier to interpret from the viewpoint of knowledge extraction [15]. Hence, it is necessary and important to find, among the randomly generated hidden nodes, a sparse model that generalizes well while needing the fewest hidden nodes. Additionally, many learning algorithms such as online sequential learning [16], [17], [18] also need a compact model before they start, since they were developed on the basis of an ELM with an a priori fixed network size.

Generally speaking, two approaches are used to sparsify ELM. The first comprises constructive algorithms, which begin with a small initial network (even a null model) and gradually recruit hidden nodes until a satisfactory solution is found. The incremental ELM [3] and its two variants [19], [20] are typical constructive algorithms. Subsequently, an error-minimized extreme learning machine [21] was proposed, which can add random hidden nodes one by one or group by group and automatically determine the number of hidden nodes. Based on multi-response sparse regression and the unbiased risk-estimation-based criterion Cp, two constructive algorithms [22], [23] for ELM were developed to cope with single- and multi-output regression problems, respectively. Recently, a constructive parsimonious extreme learning machine (CP-ELM) [24] was presented, which first transforms the hidden node output matrix equivalently into an upper triangular matrix using Givens rotations, and then recruits the hidden nodes that yield the largest additional reductions of the residual error to construct the final network; a simplified sketch of this greedy recruitment is given below. By contrast, the second group, destructive algorithms (also known as pruning algorithms), starts by training a larger-than-necessary network and then removes redundant or unimportant hidden nodes. For classification problems, a pruned ELM [25] was proposed to obtain a compact network classifier, in which irrelevant nodes are pruned from an initially large set of hidden nodes according to their relevance to the class labels. In [26], an optimally pruned ELM (OP-ELM) methodology was presented, which first builds the original ELM, then ranks the hidden nodes by multi-response sparse regression and selects them through leave-one-out validation. In parallel with CP-ELM, a destructive parsimonious ELM (DP-ELM) was developed in [24].
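
As a rough illustration of the constructive idea (not the actual CP-ELM procedure, which uses Givens rotations for efficiency), the following sketch greedily recruits, at each step, the candidate hidden node whose inclusion yields the largest additional reduction of the residual error; all names are our own.

```python
import numpy as np

def greedy_forward_selection(H, d, k):
    """Naive sketch of constructive selection: pick k columns of the candidate
    hidden-output matrix H, each time adding the column whose inclusion most
    reduces the least-squares residual ||H[:, S] beta - d||^2."""
    N, L = H.shape
    selected = []
    for _ in range(k):
        best_j, best_res = None, np.inf
        for j in range(L):
            if j in selected:
                continue
            cols = selected + [j]
            beta, *_ = np.linalg.lstsq(H[:, cols], d, rcond=None)
            res = np.sum((H[:, cols] @ beta - d) ** 2)
            if res < best_res:
                best_res, best_j = res, j
        selected.append(best_j)
    return selected
```

CP-ELM obtains the same kind of ranking far more cheaply by maintaining an upper triangular factor with Givens rotations instead of refitting the least-squares problem at every step.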

To improve on the original ELM, ELM was regularized using Tikhonov's method [27], yielding the regularized extreme learning machine (RELM) [28]. A sequential learning algorithm for RELM [29] was developed as well, and experimental studies [30], [31], [32], [33] demonstrated that RELM can overcome the ill-posed problem and control the learning-machine complexity to enhance generalization performance. Like the original ELM, however, the solution of RELM is dense. In contrast to the many efforts mentioned above on sparsifying the solution of ELM, much less attention has so far been paid to RELM. An improvement of OP-ELM was proposed to determine a compact network for RELM; it uses a cascade of two regularization penalties: first an L1 penalty that ranks the neurons of the hidden layer using least angle regression [34], followed by a Tikhonov (L2) penalty on the regression weights for numerical stability and efficient pruning of the neurons, hence the name TROP-ELM [35]. To obtain a low-complexity, sparse solution, a fast sparse approximation scheme for RELM (FSA-RELM) [36] was recently introduced; it begins with a null solution and gradually selects new hidden nodes according to certain criteria, repeating this procedure until a stopping criterion is met. FSA-RELM is thus a typical constructive algorithm.
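
In the common formulation of RELM, the output weights minimize $\|H\beta-d\|_2^2+\lambda\|\beta\|_2^2$ and hence satisfy $(H^{T}H+\lambda I)\beta=H^{T}d$. The following is a minimal sketch of this solve (the value of $\lambda$ and the solver choice are our own assumptions, not prescribed by the paper).

```python
import numpy as np

def relm_output_weights(H, d, lam=1e-3):
    """RELM-style output weights: solve (H^T H + lam * I) beta = H^T d."""
    L = H.shape[1]
    return np.linalg.solve(H.T @ H + lam * np.eye(L), H.T @ d)
```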

Hence, in this paper we study the sparsification of RELM by extending CP-ELM and DP-ELM to it. The main contributions are as follows:

  • (1) Inspired by the ideas of CP-ELM and DP-ELM, we extend them to RELM, yielding CP-RELM and DP-RELM, respectively. Under nearly the same prediction accuracy, CP-RELM and DP-RELM require fewer hidden nodes than CP-ELM and DP-ELM, respectively.

  • (2) During the sparsification of RELM, two schemes are adopted. Although the two schemes are completely equivalent from a theoretical viewpoint, their empirical behaviors differ to some extent.

  • (3) In CP-ELM and DP-ELM, the number of candidate hidden nodes must be kept small because the candidate hidden node output matrix is transformed into an upper triangular matrix by an orthogonal decomposition; with too many candidates, that matrix becomes singular or nearly singular. Thanks to Tikhonov regularization, CP-RELM and DP-RELM can be initialized with more candidate hidden nodes (even more than the number of training samples), so that better hidden nodes can be recruited to improve generalization and produce a more compact network, as the sketch following this list illustrates.

The rest of the paper is organized as follows. Section 2 introduces ELM and RELM, and Section 3 describes the ideas behind CP-ELM and DP-ELM. In Section 4, we extend these ideas to sparsify RELM with two schemes, yielding CP-RELM and DP-RELM. To confirm the effectiveness and feasibility of the proposed algorithms, experiments on eleven benchmark data sets, divided into two groups, are reported in Section 5. Finally, conclusions are drawn.

Section snippets

ELM and RELM

In this section, we first briefly describe the essence of ELM and then introduce RELM. As for the theoretical foundation of ELM, Huang and Babri [37] studied the learning performance of SLFNs on small data sets and showed that an SLFN with at most N hidden neurons can learn N distinct samples with zero error using any bounded nonlinear activation function. On the basis of this result, Huang et al. [1], [2], [3] proposed the ELM algorithm. The main concept behind ELM lies in the …

CP-ELM and DP-ELM

From the preliminary work in [2], it is known that the original ELM, with at most N hidden nodes, can approximate N distinct samples to any accuracy. In real-world applications, however, the number of hidden nodes L is usually smaller than the number of training samples N, i.e., L<N, and H has full column rank. Therefore, the following theorem is obtained:

Theorem 1

The minimizer of Eq. (4) is equivalent to the following minimizer:
$$\min_{\beta}\left\{\left\|R_{L\times L}\,\beta-\hat{d}_{L\times 1}\right\|_{2}^{2}\right\}$$
where
$$H=Q\begin{bmatrix}R_{L\times L}\\ 0_{(N-L)\times L}\end{bmatrix},\qquad Q^{T}d=\begin{bmatrix}\hat{d}_{L\times 1}\\ \check{d}_{(N-L)\times 1}\end{bmatrix}$$
with $\check{d}_{(N-L)\times 1}$ collecting the remaining $(N-L)$ entries of $Q^{T}d$.
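
Theorem 1 can be checked numerically with the thin QR factorization: writing $H=QR$ and $\hat d=Q^{T}d$, solving the $L\times L$ triangular system $R\beta=\hat d$ gives the same solution as the original least-squares problem. The sketch below is our own verification under the theorem's assumptions ($L<N$, full column rank); the sizes are arbitrary.

```python
import numpy as np

rng = np.random.default_rng(0)
N, L = 100, 10
H = rng.standard_normal((N, L))        # full-column-rank hidden output matrix, L < N
d = rng.standard_normal(N)

# least squares on the original problem: min ||H beta - d||^2
beta_ls, *_ = np.linalg.lstsq(H, d, rcond=None)

# equivalent reduced problem from the orthogonal transformation
Q, R = np.linalg.qr(H)                 # thin QR: Q is (N, L), R is (L, L) upper triangular
d_hat = Q.T @ d                        # corresponds to the first L rotated target entries
beta_qr = np.linalg.solve(R, d_hat)    # R beta = d_hat

print(np.allclose(beta_ls, beta_qr))   # True
```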

CP-RELM and DP-RELM

In order to apply the ideas behind CP-ELM and DP-ELM to sparsify RELM, some preliminary transformations of Eq. (7) are needed. Hence, the following theorems are introduced first.

Theorem 2

The minimizer of Eq. (7) is completely equivalent to the following minimizer:
$$\min_{\beta}\left\{\dot{J}=\left\|\dot{H}\beta-\dot{d}\right\|_{2}^{2}\right\}$$
where $\dot{H}=H^{T}H+\lambda I$ and $\dot{d}=H^{T}d$.

Proof

The optimal solution of Eq. (7) is found by solving
$$(H^{T}H+\lambda I)\beta=H^{T}d$$
Then, the solution of Eq. (32) in the least-squares sense can be obtained from the following optimization problem:
$$\min_{\beta}\left\{\left\|(H^{T}H+\lambda I)\beta-H^{T}d\right\|_{2}^{2}=\left\|\dot{H}\beta-\dot{d}\right\|_{2}^{2}\right\}$$
This …
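
The equivalence stated in Theorem 2 can be verified numerically: since $\dot H=H^{T}H+\lambda I$ is symmetric positive definite, the least-squares minimizer of $\|\dot H\beta-\dot d\|_2^2$ coincides with the solution of the RELM normal equations. A small check of our own (arbitrary sizes and $\lambda$, with $L$ allowed to exceed $N$):

```python
import numpy as np

rng = np.random.default_rng(0)
N, L, lam = 80, 120, 1e-2              # L may even exceed N here
H = rng.standard_normal((N, L))
d = rng.standard_normal(N)

H_dot = H.T @ H + lam * np.eye(L)      # \dot{H}
d_dot = H.T @ d                        # \dot{d}

beta_relm = np.linalg.solve(H_dot, d_dot)                 # RELM normal equations
beta_lsq, *_ = np.linalg.lstsq(H_dot, d_dot, rcond=None)  # least squares on the reduced problem

print(np.allclose(beta_relm, beta_lsq))  # True: the two minimizers coincide
```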

Experiments

To empirically show the usefulness of the proposed methods, regression experiments are performed on the 11 benchmark data sets specified in Table 1, which are divided into two groups. In our experiments, all inputs (attributes) are normalized into the range [−1, 1], while the outputs (targets) are normalized into [0, 1]. Two typical hidden node types are used as hidden-layer activation functions, namely the sigmoid $h(\mathbf{x})=1/(1+\exp\{\mathbf{x}^{T}\mathbf{a}_i\})$ and the RBF $h(\mathbf{x})=$ …
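
For reference, the preprocessing and the two hidden-node types mentioned above can be sketched as follows. This is our own illustration: the helper names are invented, and since the RBF definition is truncated in the snippet, the Gaussian form and its width parameter are assumptions.

```python
import numpy as np

def scale_to(x, lo, hi):
    """Column-wise linear rescaling, used to map inputs to [-1, 1] and targets to [0, 1]."""
    xmin, xmax = x.min(axis=0), x.max(axis=0)
    return lo + (hi - lo) * (x - xmin) / (xmax - xmin)

def sigmoid_hidden(X, A, b):
    """Sigmoid hidden nodes, h(x) = 1 / (1 + exp(-(x^T a_i + b_i))) (sign/bias convention assumed)."""
    return 1.0 / (1.0 + np.exp(-(X @ A + b)))

def rbf_hidden(X, C, gamma):
    """Gaussian RBF hidden nodes (assumed form): h(x) = exp(-gamma * ||x - c_i||^2)."""
    sq_dists = ((X[:, None, :] - C[None, :, :]) ** 2).sum(axis=2)
    return np.exp(-gamma * sq_dists)
```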

Conclusions

The extreme learning machine, as an emerging learning tool in the machine learning community, has attracted much attention in the past few years. Because of its simple formulation and high computational efficiency, ELM has rapidly found wide application in many fields. In theory, it has been shown that an ELM with N random hidden nodes can exactly learn N distinct samples. In practice, there are two potential risks in doing so. On one hand, N hidden nodes may make the network too complicated and thus …

Acknowledgments

This research was partially supported by the National Natural Science Foundation of China under Grant no. 51006052 and the NUST Outstanding Scholar Supporting Program. Moreover, the authors wish to thank the anonymous reviewers for their constructive comments and great help in the writing process, which improved the manuscript significantly.


References (44)

  • H.T. Huynh et al., Regularized online sequential learning algorithm for single-hidden layer feedforward neural networks, Pattern Recognit. Lett. (2011)
  • Y. Miche et al., TROP-ELM: a double-regularized ELM using LARS and Tikhonov regularization, Neurocomputing (2011)
  • X. Li et al., Fast sparse approximation of extreme learning machine, Neurocomputing (2014)
  • G.-B. Huang et al., Optimization method based extreme learning machine for classification, Neurocomputing (2010)
  • G.-B. Huang, Q.-Y. Zhu, C.-K. Siew, Extreme learning machine: a new learning scheme of feedforward neural networks, in: ...
  • G.-B. Huang et al., Universal approximation using incremental constructive feedforward networks with random hidden nodes, IEEE Trans. Neural Netw. (2006)
  • G. Feng et al., Evolutionary selection extreme learning machine optimization for regression, Soft Comput. (2012)
  • G.-B. Huang et al., Extreme learning machine for regression and multiclass classification, IEEE Trans. Syst. Man Cybern. Part B: Cybern. (2012)
  • M. Han, D. Li, A fast two stage density estimation based on extreme learning machine, in: Proceedings of the 2009 ...
  • D.E. Rumelhart et al., Learning representations by back-propagating errors, Nature (1986)
  • P.L. Bartlett, Sample complexity of pattern classification with neural networks: the size of the weights is more important than the size of the network, IEEE Trans. Inf. Theory (1998)
  • C. Cortes et al., Support-vector networks, Mach. Learn. (1995)

Yong-Ping Zhao received his B.S. degree in thermal energy and power engineering from Nanjing University of Aeronautics and Astronautics, Nanjing, China, in July 2004. He then worked toward the M.S. and Ph.D. degrees in kernel methods at Nanjing University of Aeronautics and Astronautics, receiving the Ph.D. degree in December 2009. Currently, he is an associate professor with the School of Mechanical Engineering, Nanjing University of Science & Technology. His research interests include machine learning and kernel methods.

Kang-Kang Wang was born in 1989. He received the B.Eng. degree from Heilongjiang University of Science and Technology, China, in 2012. He is currently pursuing the M.Eng. degree at Nanjing University of Science and Technology, China. His research interests include machine learning, pattern recognition, etc.

Ye-Bo Li received his B.S. degree in thermal energy and power engineering from Shenyang Aerospace University, Shenyang, China, in July 2008. From September 2008 to April 2011, he studied at Nanjing University of Aeronautics and Astronautics, China, majoring in Aerospace Propulsion Theory and Engineering, and received his M.S. degree. He is now pursuing the Ph.D. degree at Nanjing University of Aeronautics and Astronautics. His research interests include aero-engine control system design and the establishment of aero-engine mathematical models.
