Neurocomputing

Volume 277, 14 February 2018, Pages 4-11

Multi-modal local receptive field extreme learning machine for object recognition

https://doi.org/10.1016/j.neucom.2017.04.077

Abstract

Learning rich representations efficiently plays an important role in multi-modal recognition tasks and is crucial for achieving high generalization performance. To address this problem, we propose an effective Multi-Modal Local Receptive Field Extreme Learning Machine (MM-LRF-ELM) structure that maintains ELM's advantage of training efficiency. In this structure, LRF-ELM is first applied to extract features from each modality separately. A shared layer is then constructed by combining the features from all modalities. Finally, an Extreme Learning Machine (ELM) is used as a supervised classifier to make the final decision. Experimental validation on the Washington RGB-D Object Dataset shows that the proposed multi-modal fusion method achieves better recognition performance.

Introduction

Object recognition is a challenging task in computer vision and is important for making robots useful in home environments. With the recent advent of depth cameras, an increasing amount of visual data contains not only color but also depth measurements. Compared to RGB data, which provides information about appearance and texture, depth data contains additional information about object shape and is invariant to lighting and color variations [1].

In recent years, various approaches have been proposed for RGB-D object recognition: methods based on hand-crafted features [2], [3], [4], and methods based on learned features [5], [6], [7], [8], [9], [10]. Moreover, classical neural network structures, such as convolutional neural networks (CNNs), have also been applied to object recognition [24], [25], [26] and have recently been shown to be remarkably successful for recognition on RGB images [23].

Although traditional gradient-based learning algorithms (such as the BP neural network) [11] have been widely used to train multilayer feedforward neural networks [21], [22], these algorithms are still relatively slow and easily get stuck in local minima [13]. Furthermore, the activation functions used in these gradient-based tuning methods must be differentiable.

In order to overcome the drawbacks of gradient-based methods, Huang et al. proposed an efficient training algorithm for the single-hidden-layer feedforward neural network (SLFN) called the Extreme Learning Machine (ELM) [12], [14]. It increases learning speed by randomly generating the input weights and hidden biases, while the output weights are determined analytically using the Moore–Penrose (MP) generalized inverse. Compared with traditional gradient-based learning algorithms, ELM not only learns much faster with higher generalization performance [27], [30], but also avoids many difficulties faced by gradient-based learning methods, such as stopping criteria, learning rates, learning epochs, and local minima. Moreover, a growing number of deep ELM learning algorithms have been proposed [33], [34] to capture relevant higher-level abstractions. However, ELM with local connections has not attracted much research attention yet. Ref. [15] has shown that the local receptive field based ELM (LRF-ELM) outperforms conventional deep learning solutions [16], [31], [32] in image processing and speech recognition.
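As a rough illustration of this training scheme, the following minimal NumPy sketch shows how an ELM can be trained without any iterative tuning. It is not the authors' implementation: the helper names elm_train/elm_predict, the sigmoid activation, and the Gaussian random initialization are all assumptions made here for clarity.

```python
import numpy as np

def elm_train(X, T, n_hidden, rng=None):
    """Hypothetical ELM trainer: random hidden parameters, analytic output weights."""
    if rng is None:
        rng = np.random.default_rng(0)
    W = rng.standard_normal((X.shape[1], n_hidden))   # random input weights (never tuned)
    b = rng.standard_normal(n_hidden)                 # random hidden biases (never tuned)
    H = 1.0 / (1.0 + np.exp(-(X @ W + b)))            # hidden-layer output matrix (sigmoid g)
    beta = np.linalg.pinv(H) @ T                      # output weights via Moore-Penrose inverse
    return W, b, beta

def elm_predict(X, W, b, beta):
    H = 1.0 / (1.0 + np.exp(-(X @ W + b)))
    return H @ beta
```

The only learned quantity is beta, obtained in closed form, which is what gives ELM its training-speed advantage over back-propagation.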

However, the aforementioned works do not address the multi-modal problem [28], [29]. Thus, in this paper, we extend LRF-ELM and propose a Multi-Modal LRF-ELM (MM-LRF-ELM) framework. The proposed MM-LRF-ELM is applied to the multi-modal learning task while maintaining its advantage of training efficiency. The contributions of this work are summarized as follows:

  • 1.

    We propose a multi-modal LRF-ELM framework to construct nonlinear representations from different information sources. An important merit of this method is that training time is greatly shortened and testing efficiency is greatly improved.

  • 2.

    We evaluate our multi-modal network architecture on the Washington RGB-D Object Dataset [4]. The results show that the proposed fusion method is rather promising.

The remainder of this paper is organized as follows: Section 2 introduces the related work, including the fundamental concepts and theories of ELM; Section 3 describes the proposed MM-LRF-ELM framework; Section 4 compares the performance of MM-LRF-ELM with single-modality baselines and other methods; and Section 5 concludes the paper.

Section snippets

Brief review for ELM

ELM was proposed by Huang et al. [12] (Fig. 1). Suppose we are training SLFNs with $L$ hidden neurons and activation function $g(x)$ to learn $N$ distinct samples $\{X, T\} = \{x_j, t_j\}_{j=1}^{N}$, where $x_j \in \mathbb{R}^n$ and $t_j \in \mathbb{R}^m$. In ELM, the input weights and hidden biases are randomly generated instead of tuned. By doing so, the nonlinear system is converted into the linear system
$$Y_j = \sum_{i=1}^{L} \beta_i g_i(x_j) = \sum_{i=1}^{L} \beta_i g(w_i^{T} x_j + b_i) = t_j, \quad j = 1, 2, \ldots, N,$$
where $Y_j \in \mathbb{R}^m$ is the output vector of the $j$th training sample, $w_i \in \mathbb{R}^n$ is the input weight vector
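The snippet above is cut off here; for readability, the same linear system is conventionally restated in matrix form in the ELM literature (a standard step, with $H$ denoting the hidden-layer output matrix; this is not specific to this paper):

$$H\beta = T, \qquad H = \begin{bmatrix} g(w_1^{T}x_1 + b_1) & \cdots & g(w_L^{T}x_1 + b_L) \\ \vdots & \ddots & \vdots \\ g(w_1^{T}x_N + b_1) & \cdots & g(w_L^{T}x_N + b_L) \end{bmatrix}, \qquad \hat{\beta} = H^{\dagger}T,$$

where $H^{\dagger}$ is the Moore–Penrose generalized inverse of $H$, so the output weights are obtained in a single analytic step rather than by iterative tuning.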

Model architecture

Our architecture, which is depicted in Fig. 2, employs the LRF-ELM as the learning unit to learn shallow and deep information. The multi-modal training architecture is structurally divided into three separate phases: unsupervised feature representation for each modality separately, feature fusion representation and supervised feature classification.
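The three phases can be sketched end to end as below. This is a greatly simplified illustration, not the authors' implementation: lrf_elm_features, rgb_images, depth_images, and labels_onehot are hypothetical names, the pooling collapses each feature map to a single value (the actual LRF-ELM uses square/square-root pooling over local windows with orthogonalized random filters), and the classifier reuses the elm_train/elm_predict sketch from the introduction.

```python
import numpy as np

def lrf_elm_features(images, n_filters=8, k=4, rng=None):
    """Hypothetical LRF-ELM-style extractor: random local filters + pooling (simplified)."""
    if rng is None:
        rng = np.random.default_rng(0)
    filters = rng.standard_normal((n_filters, k, k))    # random local receptive fields
    feats = []
    for img in images:                                   # img: 2-D array (one channel)
        row = []
        for f in filters:
            h, w = img.shape[0] - k + 1, img.shape[1] - k + 1
            fmap = np.empty((h, w))
            for i in range(h):                           # valid convolution with the random filter
                for j in range(w):
                    fmap[i, j] = np.sum(img[i:i + k, j:j + k] * f)
            row.append(np.sqrt(np.sum(fmap ** 2)))       # pooling, collapsed globally here
        feats.append(row)
    return np.asarray(feats)

# Phase 1: unsupervised feature representation for each modality separately
rgb_feats = lrf_elm_features(rgb_images)      # rgb_images, depth_images: assumed lists of 2-D arrays
depth_feats = lrf_elm_features(depth_images)

# Phase 2: feature fusion -- the shared layer combines the per-modality features
fused = np.concatenate([rgb_feats, depth_feats], axis=1)

# Phase 3: supervised classification with a standard ELM
W, b, beta = elm_train(fused, labels_onehot, n_hidden=200)   # labels_onehot: assumed one-hot targets
predictions = elm_predict(fused, W, b, beta).argmax(axis=1)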

As shown in Fig. 2, we perform feature learning to have representations of each modality (RGB and Depth) before they are mixed. Each modality is

Data set

The Washington RGB-D Object Dataset consists of 41,877 RGB-D images of household objects organized into 51 classes, with a total of 300 instances of these classes captured from three different viewpoint angles (Fig. 5). For the evaluation, every 5th frame is subsampled.
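The subsampling step amounts to the following one-liner (assuming the frames of each recorded sequence are available in a hypothetical video_frames dictionary):

```python
# Keep every 5th frame of each object video sequence, as in the evaluation protocol
subsampled = {seq_id: frames[::5] for seq_id, frames in video_frames.items()}
```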

Our experiments focused on category recognition and instance recognition. After subsampling every 5th frame from the videos, there were approximately 34,000 images for training and 6900 images for testing. Before

Conclusions

In this paper, we have proposed a novel multi-modal training scheme, MM-LRF-ELM, in which the information of each modality is learned and combined in an effective way without iterative fine-tuning. In this structure, MM-LRF-ELM takes full advantage of LRF-ELM to learn a high-level representation of the multi-modal data. Thus, the proposed method obtains more robust and better recognition performance.

Acknowledgments

This work was supported in part by the National Natural Science Foundation of China under grants U1613212, 61673238, 91420302, and 61327809, in part by the National High-Tech Research and Development Plan under grant 2015AA042306.


References (35)

  • L. Bo et al., Unsupervised feature learning for RGB-D based object recognition, in: Proceedings of the International Symposium on Experimental Robotics (ISER), 2012.
  • R. Socher et al., Convolutional-recursive deep learning for 3D object classification, in: Proceedings of Neural Information Processing Systems (NIPS), 2012.
  • J. Wang et al., Locality-constrained linear coding for image classification, in: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2010.
  • K. Yu et al., Learning image representations from the pixel level via hierarchical sparse coding, in: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2011.
  • D.E. Rumelhart et al., Parallel distributed processing, 1986.
  • G. Huang et al., Extreme learning machine: a new learning scheme of feedforward neural networks, in: Proceedings of the International Joint Conference on Neural Networks (IJCNN), 2004.
  • G. Huang et al., Extreme learning machine for regression and multi-class classification, IEEE Transactions on Systems, Man, and Cybernetics, Part B (Cybernetics), 2012.

    Huaping Liu is currently an associate professor with the Department of Computer Science and Technology, Tsinghua University, Beijing, China. His current research interests include robot perception and learning. He serves as an associate editor of some journals, including the IEEE ROBOTICS AND AUTOMATION LETTERS, Neurocomputing, the International Journal of Control, Automation and Systems.

    Fengxue Li graduated from the School of Information Sciences, Taiyuan University of Technology, in 2013. She is now a graduate student, and her interests are machine learning and its applications.

    Xinying Xu is an associate professor in the School of Information Sciences, Taiyuan University of Technology. His research interests are machine learning and its applications.

    Fuchun Sun is currently a full professor with the Department of Computer Science and Technology, Tsinghua University, Beijing, China. His current research interests include intelligent control and robotics. He was a recipient of the National Science Fund for Distinguished Young Scholars. He serves as an associate editor of a series of international journals, including the IEEE Transactions on Fuzzy Systems, Mechatronics, and Robotics and Autonomous Systems.
