A comparative study of data fusion for RGB-D based visual recognition☆
Introduction
Multimodal fusion is an active research topic in multimedia analysis [1], [28]. For example, researchers have improved speech recognition performance by integrating visual features (lip reading) with conventional single-modality audio features (voice analysis); similarly, [15] combined different classifiers trained with data from several modalities. Although data fusion has been extensively investigated for audio-visual applications, the availability of new sensory devices capable of capturing synchronized depth and color streams has brought new challenges. In particular, how exactly to fuse depth and color remains a largely open issue. Moreover, new machine learning techniques, such as deep learning, have been shown to be a key element in achieving state-of-the-art inference performance in a variety of applications. However, these new devices and machine learning techniques still raise the same old questions: What is the most effective way to integrate heterogeneous information from multimodal sensors? Does the design of the fusion method depend on the corresponding application? Does the employed classification algorithm have an impact on the fusion method and the resulting accuracy? In this paper, we provide answers to the above questions.
In the literature, early fusion and late fusion are the two most popular fusion schemes. While early fusion approaches integrate data from different modalities before passing them to a classifier, late fusion approaches integrate, at a later stage, the responses obtained after learning a separate model for each descriptor. Although the employment of fusion schemes is a common technique in audio-visual domains [6], [22], [25], works using RGB-D data [13], [18], [21], [30] are still developed in a unimodal fashion, and studies on how to effectively integrate the color and depth modalities are lacking [2], [3], [20], [30]. In addition, although deep learning methods have recently reported promising results when applied to various multimedia applications [11], [23], [27], [31], there is no explicit comparison between deep architectures and traditional classifiers to establish which classification paradigm is most suitable for visual recognition with RGB-D data. Typically, in RGB-D applications the depth image is used to better segment the object of interest, and features are then computed from the depth and RGB images to train a classifier [8], [9], [12]. In contrast, we focus on different levels of feature fusion and on deep learning classifiers, where the object itself is already localized and segmented in the image, so no pre-processing steps or other machine learning techniques are needed.
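To make the distinction concrete, the minimal sketch below (Python with NumPy and scikit-learn; the variable names X_rgb and X_depth, the RBF kernel, and the product combination rule are illustrative assumptions rather than details taken from the paper) contrasts the two schemes:

    import numpy as np
    from sklearn.svm import SVC

    # X_rgb: (n, d_rgb) color descriptors, X_depth: (n, d_depth) depth
    # descriptors, y: (n,) class labels; all assumed precomputed.

    def early_fusion_fit_predict(X_rgb, X_depth, y, X_rgb_t, X_depth_t):
        # Early fusion: concatenate the modality features, train ONE classifier.
        clf = SVC(kernel="rbf", probability=True)
        clf.fit(np.hstack([X_rgb, X_depth]), y)
        return clf.predict(np.hstack([X_rgb_t, X_depth_t]))

    def late_fusion_fit_predict(X_rgb, X_depth, y, X_rgb_t, X_depth_t):
        # Late fusion: train one classifier PER modality, then combine their
        # per-class posteriors (here an unweighted product) and take the argmax.
        clf_rgb = SVC(kernel="rbf", probability=True).fit(X_rgb, y)
        clf_dep = SVC(kernel="rbf", probability=True).fit(X_depth, y)
        scores = clf_rgb.predict_proba(X_rgb_t) * clf_dep.predict_proba(X_depth_t)
        return clf_rgb.classes_[np.argmax(scores, axis=1)]

The product combination shown here is only one option for the late-fusion stage; weighted sums, max rules, or a meta-classifier stacked on the per-modality scores are common alternatives.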
Therefore, in this work we conduct a comparative evaluation of RGB-D visual recognition tasks by assessing the effectiveness of various settings, which include different fusion schemes (i.e., early vs. late fusion) and two state-of-the-art learning mechanisms (i.e., SVM vs. deep learning). To the best of our knowledge, this work is the first to explicitly address fusion evaluation for RGB-D data with deep learning classifiers.
The rest of the paper is structured as follows. Sections 2, 3, and 4 give details about the adopted fusion methods, classifiers, and recognition tasks, respectively. Section 5 describes how the experiments are carried out, and Section 6 draws the conclusions.
Section snippets
Fusion schemes
Given RGB-D data from a sensor, a typical operation is to extract features from the data and use the feature-based representations to learn a multi-class classifier, i.e., a discriminant function $f: \mathcal{X} \times \mathcal{Y} \rightarrow \mathbb{R}$, where $\mathcal{X}$ is the observation space, $\mathcal{Y} = \{1, \ldots, C\}$ is the set of labels, and $C$ is the number of classes. A new unlabeled observation $x \in \mathcal{X}$ is classified with:

$\hat{y} = \arg\max_{y \in \mathcal{Y}} f(x, y)$
In this study, the feature vector $x$ is composed of data from two different modalities, i.e., color and depth. We define $x$ …
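A minimal NumPy illustration of this decision rule (the scoring function f and the two per-modality vectors are placeholders, not the paper's actual descriptors):

    import numpy as np

    def classify(f, x_rgb, x_depth):
        # Early-fused observation x = [x_rgb ; x_depth]; f maps x to one
        # discriminant score per class, and the predicted label is the argmax.
        x = np.concatenate([x_rgb, x_depth])
        scores = f(x)              # shape (C,): one score per class
        return int(np.argmax(scores))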
Classifiers
The classifiers adopted in this study belong to two different paradigms: Kernel Method and Deep Learning. The Kernel Method paradigm has been used successfully over the past decades for computer vision tasks such as object recognition and detection. More recently, the Deep Learning paradigm, based on neural networks, has been demonstrated to be as powerful as the Kernel Method, or even better in some cases. Therefore, as a representative of the Kernel Method, the support vector machine (SVM) algorithm is …
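As a sketch of the Deep Learning side (PyTorch; the layer sizes, the 4-channel RGB+depth input, and the 81-class output are assumptions made for illustration and do not reproduce the paper's actual architecture):

    import torch.nn as nn

    class SmallRGBDNet(nn.Module):
        # A minimal CNN that treats depth as a fourth input channel
        # (pixel-level early fusion); assumes 64x64 input images.
        def __init__(self, in_channels=4, n_classes=81):
            super().__init__()
            self.features = nn.Sequential(
                nn.Conv2d(in_channels, 32, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
                nn.Conv2d(32, 64, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
            )
            self.classifier = nn.Linear(64 * 16 * 16, n_classes)

        def forward(self, x):
            return self.classifier(self.features(x).flatten(1))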
Recognition tasks
Two recognition tasks (generic household objects and hand gestures) are chosen in this study because both have publicly available RGB-D datasets with challenging data conditions, i.e., a large amount of data and a large number of classes, as detailed below.
The LaRED dataset [10] is an RGB-D hand gesture dataset consisting of 27 basis gestures plus two rotated counterparts of each, making a total of 81 different classes. This dataset is large: 243,000 tuples, …
Experiments
The experiments were performed to optimize the performance of two visual recognition tasks on the corresponding RGB-D datasets, with the objectives to: (1) investigate which is the better way to combine RGB-D data, early or late fusion; (2) compare the widely used SVM against the newly popular deep learning classifiers; (3) determine whether descriptors or raw image pixels provide the more discriminative data representation; and (4) evaluate whether the above …
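Objectives (1)-(3) can be read as a small factorial sweep over three experimental axes. The sketch below enumerates the implied configurations (the axis names and the evaluate stub are hypothetical; they merely organize the comparison):

    from itertools import product

    def evaluate(fusion, classifier, representation):
        # Placeholder: train and test one configuration, return its accuracy.
        ...

    results = {}
    for cfg in product(["early", "late"],            # fusion scheme
                       ["svm", "deep"],              # classifier paradigm
                       ["descriptors", "pixels"]):   # data representation
        results[cfg] = evaluate(*cfg)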
Conclusions
This paper evaluates a comparison between two different fusion methods, early and late fusion, using RGB-D data. Moreover, to draw more general conclusions, two distinct datasets are considered. Furthermore, a comparison with several existing classification/fusion techniques is made to provide valuable guidance on which should be chosen for RGB-D based visual recognition tasks. Based on the experimental results, three main conclusions can be drawn to answer the questions raised …
Acknowledgments
This work was supported by the Ministry of Science and Technology of Taiwan under Grants MOST-103-2221-E-001-007-MY2 and MOST-103-2221-E-011-105.
References (31)
- et al., Multimodal fusion for multimedia analysis: a survey, Multimed. Syst. (2010)
- et al., Object recognition with hierarchical kernel descriptors, Proceedings of IEEE Conference on Computer Vision and Pattern Recognition (2011)
- et al., Unsupervised feature learning for RGB-D based object recognition, Proceedings of Symposium on Experimental Robotics (2013)
- et al., Recognising panoramas, Proceedings of IEEE International Conference on Computer Vision (2003)
- et al., LIBSVM: a library for support vector machines, ACM Trans. Intell. Syst. Technol. (2011)
- et al., Hough transform-based mouth localization for audio-visual speech recognition, Proceedings of British Machine Vision Conference (2009)
- et al., Rich feature hierarchies for accurate object detection and semantic segmentation, Proceedings of IEEE Conference on Computer Vision and Pattern Recognition (2014)
- et al., Learning rich features from RGB-D images for object detection and segmentation, Proceedings of European Conference on Computer Vision (2014)
- et al., Fast semantic segmentation of RGB-D scenes with GPU-accelerated deep neural networks, Proceedings of KI 2014: Advances in Artificial Intelligence (2014)
- et al., LaRED: a large RGB-D extensible hand gesture dataset, Proceedings of ACM Multimedia Systems Conference (2014)
- Audio-visual deep learning for noise robust speech recognition, Proceedings of IEEE International Conference on Acoustics, Speech and Signal Processing
- A category-level 3-D object dataset: putting the Kinect to work, Proceedings of IEEE International Conference on Computer Vision Workshops
- A linear approach to matching cuboids in RGBD images, Proceedings of IEEE Conference on Computer Vision and Pattern Recognition
- Support vector machines for texture classification, IEEE Trans. Pattern Anal. Mach. Intell.
- On combining classifiers, IEEE Trans. Pattern Anal. Mach. Intell.
☆ This paper has been recommended for acceptance by R. Davies.