Neurocomputing, Volume 174, Part A, 22 January 2016, Pages 42-49

Deep extreme learning machines: supervised autoencoding architecture for classification

https://doi.org/10.1016/j.neucom.2015.03.110

Abstract

We present a method for synthesising deep neural networks using Extreme Learning Machines (ELMs) as a stack of supervised autoencoders. We test the method using standard benchmark datasets for multi-class image classification (MNIST, CIFAR-10 and Google Streetview House Numbers (SVHN)), and show that the classification error rate can progressively improve with the inclusion of additional autoencoding ELM modules in a stack. Moreover, we found that the method can correctly classify up to 99.19% of MNIST test images, which improves on the best error rates reported for standard 3-layer ELMs or previous deep ELM approaches when applied to MNIST. The approach is simultaneously significantly faster to train to its best performance (on the order of 5 min on a four-core CPU for MNIST) than a single ELM with the same total number of hidden units as the deep ELM, hence offering the best of both worlds: lower error rates and fast implementation.

Introduction

In recent years several hardware platforms optimised for neural network implementation have been developed. These implementations range from massively parallel custom-built System-on-Chip (SoC) silicon microprocessor arrays (e.g. SpiNNaker [1]), to analog VLSI processors that directly emulate the ion channels of neurons as leakage currents in the CMOS subthreshold region (e.g. Neurogrid [2]). The emergence of these platforms has been accompanied by a parallel effort to develop algorithms that mimic the computational capability of the human brain, particularly in developing synthesised (engineered) neural networks. These algorithms are now utilised both for investigating brain function in computational neuroscience (for example, models of controlling eye position and working memory [3]), and for implementing computing systems in machine learning. In machine learning, an emerging algorithm is the Extreme Learning Machine (ELM) [4], [5], [6], which is known to be relatively fast to train in comparison with iterative training methods, and to perform with similar accuracy to Support Vector Machines (SVMs) [7]. The current paper is motivated by recent work that has aimed to produce neuromorphic implementations of ELM [8] and related methods [9], [10], based on hardware that simulates ‘spiking neurons’. See [11] for further discussion of neuromorphic implementations. One potential limitation of hardware implementations, or of implementations on resource-constrained platforms, is the number of hidden units available for concurrent activation. Our focus is on developing an ELM algorithm that reduces the number of hidden units that need to be concurrently activated, while offering even faster training times and maintaining good performance.

The neural network architecture of standard existing ELM approaches is a three-layer feedforward structure. The first layer is the input layer; the second, the hidden layer, is activated by weighted projections of the input to non-linear sigmoid neurons; and the third and final layer is the output, consisting of units with linear input–output characteristics (see Fig. 1). In ELM, the connection weights between the input and the hidden layer neurons are randomly specified and remain untrained [4], [5]. For example, the input layer connection weights may be uniformly distributed with values between −0.5 and +0.5. This is analogous to neurobiology, in the sense that a negative connection weight inhibits a neuron's activity, and a positive weight excites neuronal activity. After the input is projected to the hidden layer, each hidden unit's non-linear sigmoid function generates a response. Then, using training data, the connection weights between the hidden and the output layer are trained in a single pass by mathematical optimisation. Only this connection weight matrix is altered during training. It is calculated by a least-squares regression method such as the Moore–Penrose pseudoinverse [12].

The methodology used in the above approach can be summarised as follows (a minimal code sketch is given after the list):

  1. Using random and fixed weights, project an input layer to a hidden layer of sigmoidal units.

  2. Using training data, numerically solve for the output weights between the hidden layer and the output layer, by multiplying the pseudoinverse of the matrix of hidden-layer responses for all training data with the matrix of corresponding desired output responses.
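As a concrete illustration, the following minimal Python/NumPy sketch implements these two steps for a generic classification problem. The uniform weight range and the sigmoid non-linearity follow the description above; the function names, the default layer size and the unregularised pseudoinverse solve are our own illustrative assumptions.

```python
import numpy as np

def train_elm(X, T, n_hidden=500, seed=0):
    """Train a standard 3-layer ELM. X: (K, d) inputs, T: (K, c) one-hot targets."""
    rng = np.random.default_rng(seed)
    # Step 1: random, fixed input weights, e.g. uniform on [-0.5, +0.5]
    W_in = rng.uniform(-0.5, 0.5, size=(X.shape[1], n_hidden))
    H = 1.0 / (1.0 + np.exp(-(X @ W_in)))   # sigmoidal hidden-layer responses
    # Step 2: solve for the output weights in a single pass
    W_out = np.linalg.pinv(H) @ T           # Moore-Penrose pseudoinverse
    return W_in, W_out

def predict_elm(X, W_in, W_out):
    """Linear output units applied to the hidden-layer responses."""
    H = 1.0 / (1.0 + np.exp(-(X @ W_in)))
    return H @ W_out
```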

This class of methods has been referred to as Linear Solutions of Higher Dimensional Interlayer (LSHDI) networks [11]. It deviates significantly from classical artificial neural network training methods, in which the input weights are iteratively trained; here, only the output weights are computed, in a single batch. This property can significantly enhance the efficiency of training, since the full and final solution is obtained by mathematical optimisation of a convex function in one single step. LSHDI methods can also solve significant problems in computational neuroscience simulations of real neurobiological neurons [3]. Although widely accepted and very capable models exist at the single-neuron level to mimic neurobiology, until the emergence of LSHDI there had been no widely applicable method to synthesise (train) a network to solve multiple tasks [13]. This class of methods is now emerging as the core of a generic neural compiler for creating silicon neural systems [1].

One drawback of classical ELMs is that the number of neurons in the single hidden layer is typically very large, and hence training the network can be computationally impractical for a large dataset (the complexity of solving for the output weights is O(KM²), where K is the number of training points and M is the number of hidden units [14]). Classical ELMs also use batch training, meaning that the network is trained on the entire dataset at once, which usually requires large memory and processing power. In [15], [11], [16], on-line training methods (as opposed to batch training) have been proposed to overcome this, but due to the large number of neurons typically used in the single hidden layer, the training time still largely depends on the size of the network, retaining an O(M²) implementation complexity.
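For intuition about the O(M²) cost of such on-line alternatives, a generic recursive least-squares (RLS) update of the output weights is sketched below: each new training sample updates an M×M inverse-correlation matrix, so every step costs O(M²) regardless of how many samples have been seen. This is a textbook RLS sketch under our own naming and initialisation, not the specific algorithm of [15], [11] or [16].

```python
import numpy as np

def rls_init(n_hidden, n_outputs, delta=100.0):
    """P approximates (H^T H)^{-1}; W holds the output weights."""
    return delta * np.eye(n_hidden), np.zeros((n_hidden, n_outputs))

def rls_update(P, W, h, t):
    """One O(M^2) step. h: (M, 1) hidden response, t: (1, c) target row."""
    Ph = P @ h                          # (M, 1), costs O(M^2)
    k = Ph / (1.0 + h.T @ Ph)           # (M, 1) gain vector
    W = W + k @ (t - h.T @ W)           # correct the prediction error
    P = P - k @ Ph.T                    # rank-1 downdate, costs O(M^2)
    return P, W
```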

In this paper we introduce a different way to address the problem of large hidden layer sizes. Our approach takes inspiration from biology and from recent advances in deep learning architectures [17], [18], [19]. We show that by constructing a deep ELM network as a stack of supervised autoencoder ELM modules, and training it module by module, the network training time and memory usage can be significantly reduced, whilst simultaneously improving classification accuracy beyond what can be achieved using a single ELM with the same total number of hidden units.

There have been several previous approaches to multi-layered ELM networks. Two result in a deep network architecture similar to ours: (i) [20] uses unsupervised autoencoding of hidden-layer activations as a method for constructing a deep network; (ii) [21] introduces a ‘random shifts and kernelization’ method to define the input to each hidden layer in the network. Another relevant approach is that of [22], which splits the input variables amongst a cascade of multiple ELM modules, with modules beyond the first also receiving responses from the previous module. In the Discussion (Section 4) we describe how our approach fundamentally differs from these networks.

The advances made by our algorithm are a result of two key factors:

  1. Selection of the untrained input weights using our recently introduced weight-shaping method known as constrained receptive field ELM (RF-C-ELM) [14] (which builds upon the Constrained ELM (C-ELM) method of [23]), rather than selecting these weights from a random distribution; a sketch of this flavour of weight shaping is given after the list.

  2. Training each ELM module in the stack to both autoencode its input and classify it, and then feeding both the autoencoding and the classification vectors into the subsequent module.
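To give a flavour of constrained, receptive-field weight shaping, the sketch below draws each hidden unit's weight vector as the difference between two randomly chosen training samples from different classes (the C-ELM idea of [23]) and then masks it to a random square patch of the image. This is an illustration of the general idea under our own assumptions (patch size, masking and normalisation choices), not the exact RF-C-ELM recipe of [14].

```python
import numpy as np

def shaped_input_weights(X, y, img_side, n_hidden, rf_side=10, seed=0):
    """Illustrative constrained + receptive-field input weights.

    X: (K, img_side*img_side) training images, y: (K,) integer labels.
    """
    rng = np.random.default_rng(seed)
    W = np.zeros((X.shape[1], n_hidden))
    for j in range(n_hidden):
        # C-ELM-style constraint: difference of two samples from different classes
        a, b = rng.integers(0, X.shape[0], size=2)
        while y[a] == y[b]:
            b = rng.integers(0, X.shape[0])
        w = (X[a] - X[b]).reshape(img_side, img_side)
        # receptive field: keep only a random rf_side x rf_side patch
        r, c = rng.integers(0, img_side - rf_side + 1, size=2)
        mask = np.zeros_like(w)
        mask[r:r + rf_side, c:c + rf_side] = 1.0
        w = (w * mask).ravel()
        norm = np.linalg.norm(w)
        W[:, j] = w / norm if norm > 0 else w
    return W
```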

As we shall show, training each ELM module in the stack using both the training data and the classification results of the previous module leads to progressively improved classification of test data with each subsequent module. This enhancement occurs simultaneously with a reduction in the training order of complexity for the same total number of hidden units. Thus our method offers the ‘best of both worlds’: enhanced classification rates and reduced runtime complexity. A rough sketch of this stacking scheme is given below.
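To make the training loop concrete, the following sketch stacks the ELM of the earlier snippet: each module is trained to output both a reconstruction of its input and class scores, and the concatenation of those two outputs becomes the input to the next module. The module count and sizes, the use of plain concatenation, and the reuse of train_elm/predict_elm from the earlier sketch are our own assumptions about how to realise the scheme described above, not the authors' exact configuration.

```python
import numpy as np

def train_deep_elm(X, Y, n_modules=3, n_hidden=800, seed=0):
    """Stack supervised autoencoding ELM modules (illustrative configuration).

    X: (K, d) inputs, Y: (K, c) one-hot labels. Uses train_elm/predict_elm
    from the earlier sketch.
    """
    modules, Z = [], X
    for m in range(n_modules):
        # Each module's target is [its own input, the labels]: it is trained
        # to simultaneously autoencode Z and classify it.
        T = np.hstack([Z, Y])
        W_in, W_out = train_elm(Z, T, n_hidden=n_hidden, seed=seed + m)
        modules.append((W_in, W_out))
        # Both the autoencoding and classification vectors feed the next module.
        Z = predict_elm(Z, W_in, W_out)
    return modules

def predict_deep_elm(X, modules, n_classes):
    Z = X
    for W_in, W_out in modules:
        Z = predict_elm(Z, W_in, W_out)
    # class scores occupy the last n_classes columns of the final output
    return np.argmax(Z[:, -n_classes:], axis=1)
```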

Section snippets

Methodology

In this section, we introduce the methods that we use to construct our deep ELM network.

Experiments

We first describe three image classification tasks on which we tested our method, and then present results on each of these benchmarks.

Discussion

In previous work, a deep ELM structure that exploits autoencoding was proposed [20]. In that method, the initial step is to train an ELM using the training data itself as the target, without using any labels. The transpose of the trained autoencoding output weights then replaces the random input weights of the ELM. The hidden-layer activations are autoencoded in a similar fashion multiple times, before a final projection into a larger hidden layer that is trained as the classifier output.
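A minimal sketch of that unsupervised ELM-autoencoder step, as we read [20]: an ELM is trained with its own input as the target, and the transpose of the learned output weights is then used as the (now data-adapted) input weights producing the next representation. Function names and sizes are our own illustrative choices.

```python
import numpy as np

def elm_autoencoder_layer(Z, n_hidden, seed=0):
    """One unsupervised ELM-autoencoder step in the style of [20] (our reading)."""
    rng = np.random.default_rng(seed)
    W_rand = rng.uniform(-0.5, 0.5, size=(Z.shape[1], n_hidden))
    H = 1.0 / (1.0 + np.exp(-(Z @ W_rand)))
    W_ae = np.linalg.pinv(H) @ Z          # autoencoding output weights, (M, d)
    # the transpose replaces the random input weights: new representation of Z
    return 1.0 / (1.0 + np.exp(-(Z @ W_ae.T)))
```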


Acknowledgements

Mark D. McDonnell's contribution was supported by the Australian Research Council under ARC grant DP1093425 (including an Australian Research Fellowship).


References (32)

  • G.-B. Huang et al., Extreme learning machines: a survey, Int. J. Mach. Learn. Cybern. (2011)
  • E. Cambria et al., Extreme learning machines, IEEE Intell. Syst. (2013)
  • G.-B. Huang, An insight into extreme learning machines: random neurons, random features and kernels, Cognit. Comput. (2014)
  • F. Galluppi, S. Davies, S. Furber, T. Stewart, C. Eliasmith, Real time on-chip implementation of dynamical systems with...
  • S. Choudhary, S. Sloan, S. Fok, A. Neckar, E. Trautmann, P. Gao, T. Stewart, C. Eliasmith, K. Boahen, Silicon neurons...
  • R. Penrose, A generalized inverse for matrices, Math. Proc. Camb. Philos. Soc. 51 (1955)...

Migel D. Tissera completed his B.E. in electrical and mechatronics engineering at the University of South Australia in 2010. After completing a research internship in early 2011, he worked in industry as an electrical engineer, steadily gaining experience in areas such as mining, water, utilities and power generation. He is passionate about robotics and hardware engineering, and is currently a Ph.D. research student studying machine learning and biological neural computation.

Mark D. McDonnell received the B.E. and Ph.D. degrees in electronic engineering in 1998 and 2006 respectively, and a B.Sc. with First Class Honours in applied mathematics in 2001, all from The University of Adelaide, Australia. He is currently Associate Research Professor, and Principal Investigator of the Computational and Theoretical Neuroscience Laboratory, at the Institute for Telecommunications Research at the University of South Australia, which he joined in 2007. Prior to this, he was a Lecturer in the School of Electrical and Electronic Engineering at the University of Adelaide. He is a member of the editorial boards of PLOS ONE and Fluctuation and Noise Letters, and has served as a Guest Editor for Proceedings of the IEEE and Frontiers in Computational Neuroscience. McDonnell's research focuses on the use of computational and engineering methods to advance scientific knowledge about the influence of noise and random variability in brain signals and structures during neurobiological computation. His contributions to this area of computational neuroscience have been recognized by the award of a five-year Australian Research Fellowship from the Australian Research Council in 2010, and a South Australian Tall Poppy Award for Science in 2008, as well as numerous invited talks. McDonnell has published over 80 papers, including several review articles, and a book on stochastic resonance published by Cambridge University Press. He has served as Vice President and Secretary of the IEEE South Australia (SA) Section Joint Communications and Signal Processing Chapter, and co-founded Neuroeng: The Australian Association for Computational Neuroscientists and Neuroengineers.
