Multi-modal local receptive field extreme learning machine for object recognition
Introduction
Object recognition is a challenging task in computer vision and important for making robots useful in home environments. With the recent advent of depth cameras, an increasing amount of visual data contains not only color but also depth measurements. Compared to RGB data, which provides information about appearance and texture, depth data contains additional information about object shape and is invariant to lighting and color variations [1].
In recent years, various approaches have been proposed for RGB-D object recognition: methods based on hand-crafted features [2], [3], [4] and methods based on learned features [5], [6], [7], [8], [9], [10]. Moreover, classical neural network architectures such as convolutional neural networks (CNNs) have also been applied to object recognition [24], [25], [26] and have recently been shown to be remarkably successful for recognition on RGB images [23].
Though traditional gradient-based learning algorithms such as back-propagation (BP) [11] have been widely used to train multilayer feedforward neural networks [21], [22], these gradient-based algorithms are still relatively slow in learning and easily get stuck in local minima [13]. Furthermore, the activation functions used in these gradient-based tuning methods must be differentiable.
In order to overcome the drawbacks of gradient-based methods, Huang et al. proposed an efficient training algorithm for the single-hidden-layer feedforward neural network (SLFN) called the Extreme Learning Machine (ELM) [12], [14]. It increases learning speed by randomly generating the input weights and hidden biases, while the output weights are determined using the Moore–Penrose (MP) generalized inverse. Compared with traditional gradient-based learning algorithms, ELM not only learns much faster with higher generalization performance [27], [30] but also avoids many difficulties faced by gradient-based methods, such as stopping criteria, learning rates, learning epochs, and local minima. Moreover, a growing number of deep ELM learning algorithms have been proposed [33], [34] to capture relevant higher-level abstractions. However, ELM with local connections has not yet attracted much research attention. Ref. [15] showed that the local receptive fields based ELM (LRF-ELM) outperforms conventional deep learning solutions [16], [31], [32] in image processing and speech recognition.
However, the aforementioned works do not address the multi-modal problem [28], [29]. Thus, in this paper, we extend LRF-ELM and propose a Multi-Modal LRF-ELM (MM-LRF-ELM) framework. The proposed MM-LRF-ELM is applied to multi-modal learning tasks while maintaining the training efficiency of LRF-ELM. The contributions of this work are summarized as follows:
- 1.
We propose a multi-modal LRF-ELM architecture that constructs nonlinear representations from different information sources. The important merit of this method is that training time is greatly shortened and testing efficiency is greatly improved.
- 2.
We evaluate our multi-modal network architecture on the Washington RGB-D Object Dataset [4]. The results show that the proposed fusion method is rather promising.
The remainder of this paper is organized as follows: Section 2 introduces the related works, including the fundamental concepts and theories of ELM; Section 3 describes the proposed MM-LRF-ELM framework; Section 4 compares the performance of MM-LRF-ELM with single modality and other methods; while Section 5 concludes this paper.
Brief review of ELM
ELM was proposed by Huang et al. [12] (Fig. 1). Suppose we are training an SLFN with $K$ hidden neurons and activation function $g(x)$ to learn $N$ distinct samples $(\mathbf{x}_j, \mathbf{t}_j)$, where $\mathbf{x}_j \in \mathbb{R}^n$ and $\mathbf{t}_j \in \mathbb{R}^m$. In ELM, the input weights and hidden biases are randomly generated instead of tuned. By doing so, the nonlinear system is converted into a linear one: the network output for the $j$th training sample is
$$\mathbf{o}_j = \sum_{i=1}^{K} \boldsymbol{\beta}_i \, g(\mathbf{w}_i \cdot \mathbf{x}_j + b_i), \quad j = 1, \ldots, N,$$
where $\mathbf{o}_j$ is the output vector of the $j$th training sample, $\mathbf{w}_i$ is the input weight vector connecting the input layer to the $i$th hidden neuron, $b_i$ is the bias of the $i$th hidden neuron, and $\boldsymbol{\beta}_i$ is the output weight vector connecting the $i$th hidden neuron to the output layer. Stacking the $N$ equations gives the linear system $\mathbf{H}\boldsymbol{\beta} = \mathbf{T}$, where $\mathbf{H}$ is the hidden-layer output matrix and $\mathbf{T}$ the target matrix, so the output weights are determined analytically as $\boldsymbol{\beta} = \mathbf{H}^{\dagger}\mathbf{T}$, with $\mathbf{H}^{\dagger}$ the Moore–Penrose (MP) generalized inverse of $\mathbf{H}$.
Model architecture
Our architecture, which is depicted in Fig. 2, employs LRF-ELM as the learning unit to learn shallow and deep information. The multi-modal training architecture is structurally divided into three separate phases: unsupervised feature representation for each modality separately, feature fusion representation, and supervised feature classification.
As shown in Fig. 2, we perform feature learning to obtain representations of each modality (RGB and depth) before they are mixed. Each modality is first processed by its own LRF-ELM to produce an unsupervised feature representation; the two representations are then fused and passed to the supervised classifier.
Data set
The Washington RGB-D Object Dataset consists of 41,877 RGB-D images of household objects organized into 51 different classes, with a total of 300 instances of these classes captured under three different viewpoint angles (Fig. 5). For the evaluation, every 5th frame is subsampled.
Our experiments focused on category recognition and instance recognition. After subsampling every 5th frame from the videos, roughly 34,000 images were available for training and 6,900 for testing.
Conclusions
In this paper, we have proposed a novel multi-modal training scheme, MM-LRF-ELM, in which the information of each modality is learned and combined effectively without iterative fine-tuning. In this structure, MM-LRF-ELM takes full advantage of LRF-ELM to learn high-level representations of the multi-modal data. Thus, the proposed method obtains more robust and better performance.
Acknowledgments
This work was supported in part by the National Natural Science Foundation of China under grants U1613212, 61673238, 91420302, and 61327809, in part by the National High-Tech Research and Development Plan under grant 2015AA042306.
References (35)
- G.-B. Huang et al., Extreme learning machine: theory and applications, Neurocomputing (2006).
- K. Hornik, Approximation capabilities of multilayer feedforward networks, Neural Networks (1991).
- M. Leshno et al., Multilayer feedforward networks with a nonpolynomial activation function can approximate any function, Neural Networks (1993).
- M.-B. Li et al., Fully complex extreme learning machine, Neurocomputing (2005).
- M. Quigley et al., High-accuracy 3D sensing for mobile manipulation: improving object detection and door opening, Proceedings of the IEEE Int. Conf. on Robotics & Automation (ICRA) (2009).
- L. Bo et al., Depth kernel descriptors for object recognition, Proceedings of the IEEE/RSJ Int. Conf. on Intelligent Robots and Systems (IROS) (2011).
- B. Browatzki et al., Going into depth: evaluating 2D and 3D cues for object classification on a new, large-scale object dataset, Proceedings of the Computer Vision Workshops (ICCV Workshops) (2011).
- K. Lai et al., A large-scale hierarchical multi-view RGB-D object dataset, Proceedings of the IEEE Int. Conf. on Robotics & Automation (ICRA) (2011).
- M. Blum et al., A learned feature descriptor for object recognition in RGB-D data, Proceedings of the IEEE Int. Conf. on Robotics & Automation (ICRA) (2012).
- L. Bo et al., Hierarchical matching pursuit for image classification: architecture and fast algorithms, Proceedings of the Neural Information Processing Systems (NIPS) (2011).
- L. Bo et al., Unsupervised feature learning for RGB-D based object recognition, Proceedings of the Int. Symposium on Experimental Robotics (ISER) (2012).
- R. Socher et al., Convolutional-recursive deep learning for 3D object classification, Proceedings of the Neural Information Processing Systems (NIPS) (2012).
- J. Wang et al., Locality-constrained linear coding for image classification, Proceedings of the Int. Conf. on Computer Vision and Pattern Recognition (CVPR) (2010).
- K. Yu et al., Learning image representations from the pixel level via hierarchical sparse coding, Proceedings of the Computer Vision and Pattern Recognition (CVPR) (2011).
- Parallel distributed processing, Encyclopedia of Database Systems.
- G.-B. Huang et al., Extreme learning machine: a new learning scheme of feedforward neural networks, Proceedings of the International Joint Conference on Neural Networks (IJCNN) (2004).
- G.-B. Huang et al., Extreme learning machine for regression and multi-class classification, IEEE Transactions on Systems, Man, and Cybernetics, Part B (2012).
Huaping Liu is currently an associate professor with the Department of Computer Science and Technology, Tsinghua University, Beijing, China. His current research interests include robot perception and learning. He serves as an associate editor of some journals, including the IEEE ROBOTICS AND AUTOMATION LETTERS, Neurocomputing, the International Journal of Control, Automation and Systems.
Fengxue Li graduated from the School of Information Sciences, Taiyuan University of Technology, in 2013. She is now a graduate student, and her interests are machine learning and its applications.
Xinying Xu is an associate professor in the School of Information Sciences, Taiyuan University of Technology. His research interests are machine learning and its applications.
Fuchun Sun is currently a full professor with the Department of Computer Science and Technology, Tsinghua University, Beijing, China. His current research interests include intelligent control and robotics. He was a recipient of the National Science Fund for Distinguished Young Scholars. He serves as an associate editor of a series of international journals, including the IEEE TRANSACTIONS ON FUZZY SYSTEMS, Mechatronics, Robotics, and Autonomous Systems.