Pattern Recognition Letters

Volume 73, 1 April 2016, Pages 1-6
A comparative study of data fusion for RGB-D based visual recognition

https://doi.org/10.1016/j.patrec.2015.12.006

Highlights

  • This study investigates key aspects of RGB-D based visual recognition, including data fusion schemes and classifiers.

  • This study conducts a comparative evaluation of how these choices affect recognition performance.

  • This study is the first work to explicitly address the fusion issue for RGB-D data with deep learning.

  • This study can serve as useful guidance for developing visual recognition systems in the related application fields.

Abstract

Data fusion from different modalities has been extensively studied for a better understanding of multimedia content. On one hand, the emergence of new devices and decreasing storage costs are leading to growing amounts of collected data. Although bigger data makes it easier to mine information, methods for big data analytics are not yet well investigated. On the other hand, new machine learning techniques, such as deep learning, have been shown to be one of the key elements in achieving state-of-the-art inference performance in a variety of applications. Therefore, some of the old questions in data fusion need to be addressed again in light of these changes: What is the most effective way to combine data from various modalities? Does the fusion method affect the performance obtained with different classifiers? To answer these questions, we present in this paper a comparative study evaluating early and late fusion schemes with several types of SVM and deep learning classifiers on two challenging RGB-D based visual recognition tasks: hand gesture recognition and generic object recognition. The findings of this study provide useful policy and practical guidance for the development of visual recognition systems.

Introduction

Multimodal fusion is an active research topic in multimedia analysis [1], [28]. For example, researchers have improved speech recognition performance by integrating visual features (lip reading) with conventional single-modality audio features (voice analysis), or, similarly, by combining different classifiers trained with data from several modalities [15]. Although data fusion has been extensively investigated for audio-visual applications, the availability of new sensory devices capable of capturing synchronized depth and color streams has brought new challenges. In particular, how exactly to fuse depth and color remains very much an open issue. Moreover, new machine learning techniques, such as deep learning, have been shown to be one of the key elements in achieving state-of-the-art inference performance in a variety of applications. These new devices and machine learning techniques nevertheless raise the same old questions: What is the most effective way to integrate heterogeneous information from multimodal sensors? Does the design of the fusion method depend on the corresponding application? Does the employed classification algorithm have an impact on the fusion method and the resulting accuracy? In this paper, we provide answers to the above questions.

In the literature, early fusion and late fusion are the two most popular fusion schemes. While early fusion approaches integrate data from different modalities before passing them to a classifier, late fusion approaches integrate, at the final stage, the responses obtained after learning an individual model for each feature descriptor. Although the use of fusion schemes is a common technique in audio-visual domains [6], [22], [25], works using RGB-D data [13], [18], [21], [30] are still developed in a unimodal fashion, lacking studies on how to effectively integrate the color and depth modalities [2], [3], [20], [30]. In addition, although deep learning methods have recently reported promising results when applied to various multimedia applications [11], [23], [27], [31], there is no explicit comparison between deep architectures and traditional classifiers to determine which classification paradigm is most suitable for visual recognition with RGB-D data. Typically, in RGB-D applications a depth image is used to better segment the object of interest, and features are then computed from the depth and RGB images to train a classifier [8], [9], [12]. In contrast, we focus on different levels of feature fusion and on deep learning classifiers, where the object itself is already localized and segmented from the image, and no pre-processing steps or other machine learning techniques are needed.

Therefore, in this work we conduct a comparative evaluation study of RGB-D visual recognition tasks by assessing the effectiveness of various settings, including different fusion schemes (early fusion vs. late fusion) and two state-of-the-art learning mechanisms (SVM vs. deep learning). To the best of our knowledge, this work is the first to explicitly address the fusion evaluation for RGB-D data with deep learning classifiers.

The rest of the paper is structured as follows. Sections 2, 3, and 4 give details about the adopted fusion methods, classifiers, and recognition tasks, respectively. Section 5 describes how experiments are carried out, and Section 6 draws the conclusions.

Fusion schemes

Given RGB-D data from a sensor, a typical operation is to extract features from the data and use the feature-based representations to learn a multi-class classifier, i.e., a discriminant function $f:\mathbb{R}^M \times \mathcal{C} \rightarrow \mathbb{R}$, where $\mathbb{R}^M$ is the observation space, $\mathcal{C}=\{1,\dots,C\}$ is the set of labels, and $C$ is the number of classes. A new unlabeled observation $x \in \mathbb{R}^M$ is classified with:

$$c^*(x) = \underset{c \in \mathcal{C}}{\arg\max}\; f(x; c)$$

In this study, the feature vector $x$ is composed of data from two different modalities, i.e., color and depth. We define $x$ …
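To make the two schemes concrete, the following is a minimal sketch (not the authors' implementation) of early fusion by concatenating the per-modality feature vectors and late fusion by combining the outputs of per-modality classifiers; the RBF-kernel SVMs and the score-averaging rule are illustrative assumptions.

```python
import numpy as np
from sklearn.svm import SVC

# Assumed inputs (shapes are illustrative):
#   X_rgb:   (n_samples, d_rgb)   precomputed color descriptors
#   X_depth: (n_samples, d_depth) precomputed depth descriptors
#   y:       (n_samples,)         class labels in {1, ..., C}

def early_fusion_train(X_rgb, X_depth, y):
    """Early fusion: concatenate modality features, then learn a single classifier."""
    X = np.hstack([X_rgb, X_depth])
    return SVC(kernel="rbf", probability=True).fit(X, y)

def late_fusion_train(X_rgb, X_depth, y):
    """Late fusion: learn one classifier per modality."""
    clf_rgb = SVC(kernel="rbf", probability=True).fit(X_rgb, y)
    clf_depth = SVC(kernel="rbf", probability=True).fit(X_depth, y)
    return clf_rgb, clf_depth

def late_fusion_predict(clf_rgb, clf_depth, X_rgb, X_depth):
    """Combine per-modality posteriors by averaging, then apply the argmax decision rule."""
    p = 0.5 * clf_rgb.predict_proba(X_rgb) + 0.5 * clf_depth.predict_proba(X_depth)
    return clf_rgb.classes_[np.argmax(p, axis=1)]
```

Other late-fusion rules (e.g., product, max, or a learned weighting of the per-modality scores) fit the same structure; only the combination step changes.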

Classifiers

The classifiers adopted in this study fall mainly into two different paradigms: Kernel Methods and Deep Learning. The Kernel Method paradigm has been successfully used over the past decades for computer vision tasks such as object recognition and detection. More recently, the Deep Learning paradigm, based on neural networks, has been demonstrated to be as powerful as Kernel Methods, or even better in some cases. Therefore, as a representative of the Kernel Method paradigm, the support vector machine (SVM) algorithm is …
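As a rough illustration of the two paradigms (not the architectures or kernels actually evaluated in the paper), the sketch below compares an RBF-kernel SVM with a small fully connected neural network on the same fused feature matrix; all hyperparameters are placeholder assumptions.

```python
from sklearn.svm import SVC
from sklearn.neural_network import MLPClassifier
from sklearn.model_selection import cross_val_score

def compare_paradigms(X, y):
    """Cross-validated accuracy of a kernel method vs. a simple neural network."""
    models = {
        "svm": SVC(kernel="rbf", C=1.0, gamma="scale"),       # Kernel Method representative
        "mlp": MLPClassifier(hidden_layer_sizes=(256, 128),   # compact stand-in for a deep model
                             max_iter=500),
    }
    return {name: cross_val_score(clf, X, y, cv=5).mean() for name, clf in models.items()}
```

A multilayer perceptron is used here only as a compact stand-in; deep models that learn features directly from raw pixels (e.g., convolutional networks) would replace the fixed descriptors with learned ones.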

Recognition tasks

Two recognition tasks (generic household objects and hand gestures) are chosen in this study because they both have publicly available RGB-D datasets with challenging data conditions, i.e., they contain a large amount of data and a large number of classes, as detailed below.

The LaRED dataset [10] is an RGB-D hand gesture dataset consisting of 27 basic gestures and two rotated counterparts for each, which makes a total of 81 different classes. The dataset is large: 243,000 tuples, …

Experiments

The experiments were performed to optimize the performance of the two visual recognition tasks on the corresponding RGB-D datasets, with the objectives to: (1) investigate which is the better way to combine RGB-D data, early or late fusion; (2) compare the widely used SVM against the newly popular deep learning classifiers; (3) determine whether descriptors or raw image pixels yield a more discriminative data representation; and (4) evaluate whether the above …
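One possible way to organize such a comparison is sketched below: a hypothetical grid over fusion scheme and classifier evaluated on a single held-out split. The split ratio, the scikit-learn stand-in models, and the accuracy metric are assumptions for illustration, not the protocol reported in the paper.

```python
from itertools import product
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score
from sklearn.svm import SVC
from sklearn.neural_network import MLPClassifier

def make_classifier(name):
    # Hypothetical stand-ins for the SVM and deep-learning models compared in the paper.
    if name == "svm":
        return SVC(kernel="rbf", probability=True)
    return MLPClassifier(hidden_layer_sizes=(256,), max_iter=500)

def run_grid(X_rgb, X_depth, y, test_size=0.3, seed=0):
    """Accuracy for every (fusion scheme, classifier) combination on one held-out split."""
    Xr_tr, Xr_te, Xd_tr, Xd_te, y_tr, y_te = train_test_split(
        X_rgb, X_depth, y, test_size=test_size, random_state=seed)
    results = {}
    for fusion, clf_name in product(["early", "late"], ["svm", "mlp"]):
        if fusion == "early":   # single model on concatenated features
            clf = make_classifier(clf_name).fit(np.hstack([Xr_tr, Xd_tr]), y_tr)
            y_pred = clf.predict(np.hstack([Xr_te, Xd_te]))
        else:                   # one model per modality, posteriors averaged
            clf_r = make_classifier(clf_name).fit(Xr_tr, y_tr)
            clf_d = make_classifier(clf_name).fit(Xd_tr, y_tr)
            p = 0.5 * clf_r.predict_proba(Xr_te) + 0.5 * clf_d.predict_proba(Xd_te)
            y_pred = clf_r.classes_[np.argmax(p, axis=1)]
        results[(fusion, clf_name)] = accuracy_score(y_te, y_pred)
    return results
```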

Conclusions

In this paper, two different fusion methods, early and late fusion, are compared and evaluated on RGB-D data. Moreover, to draw more general conclusions, two distinct datasets are considered. Furthermore, a comparison with several existing classification/fusion techniques is made to provide valuable information on which one should be chosen for RGB-D based visual recognition tasks. Based on the experimental results, three main conclusions can be drawn to answer the questions raised …

Acknowledgments

This work was supported by the Ministry of Science and Technology of Taiwan under Grants MOST-103-2221-E-001-007-MY2 and MOST-103-2221-E-011-105.

References

  • P.K. Atrey et al., Multimodal fusion for multimedia analysis: a survey, Multimed. Syst. (2010)
  • L. Bo et al., Object recognition with hierarchical kernel descriptors, Proceedings of IEEE Conference on Computer Vision and Pattern Recognition (2011)
  • L. Bo et al., Unsupervised feature learning for RGB-D based object recognition, Proceedings of Symposium on Experimental Robotics (2013)
  • M. Brown et al., Recognising panoramas, Proceedings of IEEE International Conference on Computer Vision (2003)
  • C.-C. Chang et al., LIBSVM: a library for support vector machines, ACM Trans. Intell. Syst. Technol. (2011)
  • G. Fanelli et al., Hough transform-based mouth localization for audio-visual speech recognition, Proceedings of British Machine Vision Conference (2009)
  • R. Girshick et al., Rich feature hierarchies for accurate object detection and semantic segmentation, Proceedings of IEEE Conference on Computer Vision and Pattern Recognition (2014)
  • S. Gupta et al., Learning rich features from RGB-D images for object detection and segmentation, Proceedings of European Conference on Computer Vision (2014)
  • N. Hoft et al., Fast semantic segmentation of RGB-D scenes with GPU-accelerated deep neural networks, Proceedings of KI 2014: Advances in Artificial Intelligence (2014)
  • Y.-S. Hsiao et al., LaRED: a large RGB-D extensible hand gesture dataset, Proceedings of ACM Multimedia Systems Conference (2014)
  • J. Huang et al., Audio-visual deep learning for noise robust speech recognition, Proceedings of IEEE International Conference on Acoustics, Speech and Signal Processing (2013)
  • A. Janoch et al., A category-level 3-D object dataset: putting the Kinect to work, Proceedings of IEEE International Conference on Computer Vision Workshops (2011)
  • H. Jiang et al., A linear approach to matching cuboids in RGBD images, Proceedings of IEEE Conference on Computer Vision and Pattern Recognition (2013)
  • K.I. Kim et al., Support vector machines for texture classification, IEEE Trans. Pattern Anal. Mach. Intell. (2002)
  • J. Kittler et al., On combining classifiers, IEEE Trans. Pattern Anal. Mach. Intell. (1998)
This paper has been recommended for acceptance by R. Davies.
