Pattern Recognition

Volume 86, February 2019, Pages 376-385

Multi-modal fusion network with multi-scale multi-path and cross-modal interactions for RGB-D salient object detection

https://doi.org/10.1016/j.patcog.2018.08.007

Highlights

  • Using CNNs to fuse RGB and depth data with only a single path is not sufficient.

  • Both global reasoning and local capturing are important for saliency detection.

  • Bottom-up cross-modal interactions are also beneficial for learning complements.

Abstract

Paired RGB and depth images are becoming popular multi-modal data in computer vision tasks. Traditional methods based on Convolutional Neural Networks (CNNs) typically fuse RGB and depth by combining their deep representations at a late stage along only one path, which can be ambiguous and insufficient for fusing large amounts of cross-modal data. To address this issue, we propose a novel multi-scale multi-path fusion network with cross-modal interactions (MMCI), in which the traditional two-stream architecture with a single fusion path is advanced by diversifying the fusion path into a global reasoning path and a local capturing path, and by introducing cross-modal interactions in multiple layers. Compared to traditional two-stream architectures, the MMCI net supplies more adaptive and flexible fusion flows, thus easing the optimization and enabling sufficient and efficient fusion. Concurrently, the MMCI net is equipped with multi-scale perception ability (i.e., simultaneous global and local contextual reasoning). We take RGB-D saliency detection as an example task. Extensive experiments on three benchmark datasets show the improvement of the proposed MMCI net over other state-of-the-art methods.

Introduction

Recent years have witnessed the increasing prevalence of RGB-D data in a wide range of computer vision [1] and robotic vision systems [2]. Compared to RGB data, which supply detailed appearance and texture, depth data additionally present clear object shapes and spatial layouts. Compared to RGB sensors, depth sensors are more robust to lighting changes and color variations. Hence, RGB and depth data are complementary in terms of both data distributions and applicable scenes. Consequently, how to fuse RGB and depth information sufficiently has been a fundamental problem in dealing with RGB-D data.

Previous works on RGB-D data can be broadly categorized into two groups: (1) designing handcrafted features from RGB-D data with domain-specific knowledge [3], [4], [5], [6], [7]; (2) operating on RGB and depth data separately and then fusing the decisions. For the first solution, the handcrafting process relies heavily on domain-specific knowledge, making the features hard to generalize readily to other tasks. Moreover, handcrafted features are deficient in high-level reasoning, which is important for scene understanding. Unsupervised learning methods, including sparse coding [8] and auto-encoders [9], have further been introduced to address the drawbacks of handcrafted RGB-D features. Nonetheless, these methods, limited by their shallow architectures, still fall short of learning high-level representations and achieving satisfactory generalization.

Recently, a growing number of works have resorted to Convolutional Neural Networks (CNNs) [10] to process RGB-D data, owing to the powerful ability of CNNs to extract high-level representations and model complex correlations. Motivated by these advantages, a wide range of CNNs based on different architectures have been introduced [11], [12], [13], [14], [15], [16], [17]. Even so, how to design the fusion architecture remains under-studied. Related works mainly follow three lines: 1) RGB and depth data are integrated as joint inputs to a CNN (noted as ‘Input fusion’ in Fig. 1(a)); 2) RGB and depth data are fed into separate streams and their low-level or high-level representations are combined as joint representations for further decision (noted as ‘Early fusion’ and ‘Late fusion’ in Fig. 1(b) and (c), respectively); 3) each stream is run independently and their decisions are fused. Although some recent works consider the relationships between RGB and depth data (e.g., independence and consistency) and achieve inspiring performance [11], [15], a common limitation of these networks is that the fusion path for RGB and depth data is typically onefold, which, in our opinion, is insufficient to integrate all the information from RGB and depth. An ideal RGB-D saliency detection system is expected to combine multi-scale cross-modal complements for joint global contextual reasoning and local spatial capturing. Fulfilling these objectives calls for multiple fusion paths to avoid conflicting optimization between sub-tasks; otherwise, the sub-tasks are unlikely to be jointly and sufficiently optimized. Therefore, we argue that a well-engineered multi-modal fusion architecture should be equipped with multiple fusion paths to reduce fusion ambiguity and improve fusion sufficiency.
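
To make these single-path fusion schemes concrete, the following minimal PyTorch sketch (our own illustration, not the authors' code; the backbone and head modules are hypothetical placeholders) contrasts input fusion and late fusion, both of which funnel all cross-modal information through one fusion path:

```python
# Minimal sketch of two single-path fusion schemes (illustrative only).
import torch
import torch.nn as nn


class InputFusion(nn.Module):
    """RGB and depth concatenated at the input of one shared CNN (cf. Fig. 1(a))."""
    def __init__(self, backbone: nn.Module):
        super().__init__()
        self.backbone = backbone  # assumed to accept 4-channel (RGB + D) input

    def forward(self, rgb, depth):
        return self.backbone(torch.cat([rgb, depth], dim=1))


class LateFusion(nn.Module):
    """Two separate streams whose deep features meet once, along one path (cf. Fig. 1(c))."""
    def __init__(self, rgb_stream: nn.Module, depth_stream: nn.Module, head: nn.Module):
        super().__init__()
        self.rgb_stream = rgb_stream
        self.depth_stream = depth_stream
        self.head = head  # the single fusion path discussed above

    def forward(self, rgb, depth):
        f_rgb = self.rgb_stream(rgb)
        f_d = self.depth_stream(depth)
        return self.head(torch.cat([f_rgb, f_d], dim=1))
```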

Our argument is supported by recent advances [18], [19], [20], [21] in designing basic CNN architectures, which reveal that the philosophy of CNN design has shifted from merely increasing the depth of plain nets [18], [19] to additionally enriching the connection paths [20], [21], [22]. To ease gradient-based optimization, the authors of [20] introduce gating units that allow information to flow across multiple layers unimpeded. He et al. [21] propose shortcut connections that reformulate learning desired unreferenced mappings as approximating residual functions. The identity shortcut in ResNet [21] can be viewed as a way of facilitating information transmission by enriching the flow paths. More recently, Xie et al. [23] summarize the split-transform-merge strategy of [20], [21], [22] as a new dimension, termed “cardinality”, alongside depth and width. The cardinality is the size of the set of transformations realized through multiple flow paths. In [23], increasing the cardinality is shown experimentally to be a more effective and efficient way to boost performance than going deeper or wider.
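
For reference, a generic identity-shortcut block in the spirit of ResNet [21] can be sketched as follows (a textbook formulation used for illustration, not a component of the MMCI net):

```python
import torch.nn as nn


class ResidualBlock(nn.Module):
    """Stacked layers approximate a residual; the identity shortcut carries x unchanged."""
    def __init__(self, channels: int):
        super().__init__()
        self.body = nn.Sequential(
            nn.Conv2d(channels, channels, kernel_size=3, padding=1),
            nn.ReLU(inplace=True),
            nn.Conv2d(channels, channels, kernel_size=3, padding=1),
        )
        self.relu = nn.ReLU(inplace=True)

    def forward(self, x):
        # Output = identity shortcut + learned residual function.
        return self.relu(x + self.body(x))
```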

Motivated by the philosophy and success of [20], [21], [22], [23], we believe that multi-modal fusion can benefit substantially, in terms of optimization efficiency and fusion sufficiency, from introducing more paths into the fusion network, yet few multi-modal fusion networks take this into account. Given this limitation, our further question is how to design a multi-path multi-modal fusion network. We observe that human beings perceive and understand scenes in an integrated way [24], [25], i.e., locating target objects from a global perspective and capturing fine details with a local view. Similarly, in various robotic vision and computer vision tasks, both global understanding and local capturing are typically indispensable. For example, an ideal grasper is expected to identify the target object (e.g., a cup) and meanwhile highlight the specific subpart for grasping (e.g., the handgrip). Likewise, the salient object detection task [7], [26], [27], [28], which aims at highlighting the object that attracts human attention most, needs global reasoning to judge which object is the most salient and a local perspective to obtain precise object boundaries.

Based on the significance of introducing multiple paths and incorporating global and local perspectives jointly, we tailor a multi-path, multi-scale, multi-modal fusion network (MMCI net, shown in Fig. 2(b)), in which the fusion path is diversified into a global contextual reasoning path and a local capturing path. The proposed MMCI net eases the joint optimization process and simultaneously endows the multi-modal fusion network with multi-scale perception. First, the network stream for each modality, which embeds a global reasoning branch and a local capturing branch, is trained separately with the same architecture shown in Fig. 2(a). Then we connect their global branches and their local branches respectively, and the predictions of the combined global and local paths are summed to form the final prediction.
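
The overall fusion flow can be summarized with the following schematic sketch (module names, feature shapes and the concatenation-based fusion heads are our assumptions; the exact layer configuration follows Fig. 2):

```python
# Schematic sketch of the multi-path fusion (illustrative; not the released MMCI code).
import torch
import torch.nn as nn


class MMCISketch(nn.Module):
    def __init__(self, rgb_stream, depth_stream, global_head, local_head):
        super().__init__()
        self.rgb_stream = rgb_stream      # assumed to return (global_feat, local_feat) for RGB
        self.depth_stream = depth_stream  # assumed to return (global_feat, local_feat) for depth
        self.global_head = global_head    # fuses the two global branches
        self.local_head = local_head      # fuses the two local branches

    def forward(self, rgb, depth):
        rgb_g, rgb_l = self.rgb_stream(rgb)
        d_g, d_l = self.depth_stream(depth)
        pred_global = self.global_head(torch.cat([rgb_g, d_g], dim=1))  # global reasoning path
        pred_local = self.local_head(torch.cat([rgb_l, d_l], dim=1))    # local capturing path
        return pred_global + pred_local   # predictions of the two paths are summed
```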

Another insight drawn from this work is that the complementary information between RGB and depth data exists at all levels, from low-level representations to high-level contexts. To this end, we add cross-modal interactions from the depth stream to the RGB stream in the shallower layers to further encourage cross-modal combination. Without these interactions, the RGB and depth streams would be learnt separately and their complements could not be fully exploited during feature extraction; as a result, the cross-modal complementarity in shallow layers would be unlikely to be sufficiently explored. In addition to the main late fusion stage, the cross-modal interactions enable the learning of cross-modal complements in the bottom-up process, allowing the exploration of more discriminative multi-modal features at different feature levels. Besides, the cross-modal interactions introduce additional back-propagation gradients [21] from the RGB stream to the depth stream, which further encourage the depth stream to learn complementary features.
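
A single depth-to-RGB interaction at one shallow layer might look as follows (the 1x1 convolution and elementwise addition are assumptions made for illustration, and the two feature maps are assumed to share the same spatial resolution):

```python
import torch.nn as nn


class CrossModalInteraction(nn.Module):
    """Injects shallow depth features into the RGB stream at the same resolution."""
    def __init__(self, depth_channels: int, rgb_channels: int):
        super().__init__()
        self.transform = nn.Conv2d(depth_channels, rgb_channels, kernel_size=1)

    def forward(self, rgb_feat, depth_feat):
        # The addition lets depth complements enter the RGB bottom-up pass, and
        # back-propagated gradients through it also reach the depth stream.
        return rgb_feat + self.transform(depth_feat)
```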

In this work, we use the salient object detection task to exemplify and verify our proposed multi-modal fusion strategies. In summary, the contributions of this work are three-fold:

  • (1)

    We propose a multi-path multi-modal fusion network. The diversified fusion paths reduce fusion ambiguity, ease the optimization process and afford better fusion adaptability than previous fusion networks that rely on a single fusion path.

  • (2)

    The MMCI net is endowed with multi-scale contextual reasoning ability, incorporating global reasoning and local capturing simultaneously in an end-to-end architecture. The cross-modal interactions not only provide additional gradients that encourage the learning of the depth stream, but also allow the exploration of cross-modal complements across low-level representations and high-level contexts.

  • (3)

    Extensive evaluations on three public datasets show substantial and consistent improvements of our method over the state of the art.

Section snippets

Related work

Although various models have been devised for RGB saliency detection, works on RGB-D saliency detection remain comparatively limited. Most previous works on RGB-D saliency detection fall into three modes: input fusion, feature fusion and result fusion. Methods based on input fusion concatenate the RGB-D pair by directly regarding the depth image as an undifferentiated extra channel [29] or setting constant weights on the RGB and depth channels [30], [31]. The joint inputs are then followed by

The proposed method

Considering the gap between RGB and depth data in terms of distribution and structure, as well as the scarcity of existing RGB-D training samples, we adopt a stage-wise training strategy: the network for each modality, comprising a global understanding branch and a local capturing branch, is trained separately and then combined into the fusion net for joint training. As shown in Fig. 2, we firstly train the RGB-induced saliency detection network (R_SalNet) and then train the
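
The stage-wise schedule, as far as it is described in this snippet, can be outlined as follows (function and model names are placeholders, not the authors' released training script):

```python
def train_stagewise(r_salnet, d_salnet, build_fusion_net, loader, train_one_model):
    """Hedged outline: per-modality pre-training followed by joint fine-tuning."""
    train_one_model(r_salnet, loader)                # stage 1: RGB-induced network (R_SalNet)
    train_one_model(d_salnet, loader)                # stage 2: depth-induced network
    mmci_net = build_fusion_net(r_salnet, d_salnet)  # combine the two streams into the fusion net
    train_one_model(mmci_net, loader)                # stage 3: joint training of the fusion net
    return mmci_net
```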

Dataset

To evaluate the effectiveness of our network, we perform comprehensive experiments on three datasets, NLPR [29], NJUD [4] and STEREO [39], which consist of 1000, 2003 and 797 paired RGB-D images and corresponding ground truth saliency masks collected from a large range of indoor and outdoor scenes. We sample 650 image pairs from the NLPR dataset and 1400 image pairs from the NJUD dataset randomly and combine them as the training dataset. We also choose 50 image pairs from the NLPR dataset and
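
The training-set construction stated above can be reproduced in a few lines (the random seed and list-based data structures are our assumptions; only the fully stated 650 + 1400 sampling is sketched):

```python
import random


def build_training_set(nlpr_pairs, njud_pairs, seed=0):
    """Randomly sample 650 NLPR and 1400 NJUD RGB-D pairs and merge them."""
    rng = random.Random(seed)
    train = rng.sample(nlpr_pairs, 650) + rng.sample(njud_pairs, 1400)
    rng.shuffle(train)
    return train
```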

Conclusion

In this paper, we propose to utilize CNNs for RGB-D saliency detection. We improve the traditional two-stream architecture by diversifying the multi-modal fusion paths and introducing cross-modal interactions in multiple layers. The proposed strategies reduce fusion ambiguity, ease the multi-modal fusion optimization and incorporate global reasoning and local capturing together, thus allowing more adaptive and sufficient fusion and yielding encouraging accuracy gains. More importantly, the

Acknowledgements

This work was supported by the Research Grants Council of Hong Kong (Project No. CityU 11205015 and CityU 11255716).

Hao Chen received his bachelor's degree from the School of Control Science and Engineering, Shandong University, in 2013 and his master's degree from the School of Automation, Northwestern Polytechnical University, in 2016. He is now a PhD candidate at the City University of Hong Kong. His research interests involve deep learning, computer vision and robotics.

References (49)

  • Q.V. Le, Building high-level features using large scale unsupervised learning
  • A. Krizhevsky et al., ImageNet classification with deep convolutional neural networks, Adv. Neural Inf. Process. Syst. (2012)
  • A. Wang et al., MMSS: multi-modal sharable and specific feature learning for RGB-D object recognition
  • M. Schwarz et al., RGB-D object recognition and pose estimation based on pre-trained convolutional neural network features
  • A. Eitel et al., Multimodal deep learning for robust RGB-D object recognition
  • S. Gupta et al., Learning rich features from RGB-D images for object detection and segmentation
  • H. Zhu et al., Discriminative multi-modal feature fusion for RGBD indoor scene recognition
  • K. Simonyan, A. Zisserman, Very deep convolutional networks for large-scale image recognition, arXiv:1409.1556, ...
  • C. Szegedy et al., Going deeper with convolutions
  • R.K. Srivastava, K. Greff, J. Schmidhuber, Highway networks, arXiv:1505.00387, ...
  • K. He et al., Deep residual learning for image recognition
  • C. Szegedy et al., Rethinking the inception architecture for computer vision
  • S. Xie, R. Girshick, P. Dollár, Z. Tu, K. He, Aggregated residual transformations for deep neural networks, ...
  • J. Allman et al., Stimulus specific responses from beyond the classical receptive field: neurophysiological mechanisms for local-global comparisons in visual neurons, Ann. Rev. Neurosci. (1985)

Dr. Youfu Li obtained his BSc and MSc degrees in Electrical Engineering from Harbin Institute of Technology (HIT), China. He obtained his PhD degree in robotics from the Department of Engineering Science, Oxford University, UK, in 1993. From 1993 to 1995 he worked as a research staff member in the Department of Computer Science at the University of Wales, Aberystwyth, UK. He joined City University of Hong Kong in 1995. Dr. Li's research interests include robot vision, visual tracking, robot sensing and sensor-based control, mechatronics and automation. In these areas, he has published over a hundred papers in refereed international journals. He is an associate editor of IEEE Transactions on Automation Science and Engineering (T-ASE).

Dan Su was born in Zhengjiang, China, in August 1989. He received the B.S. degree from Hunan University in 2011 and the master's degree from Beihang University, Beijing, China, in 2014. He is now pursuing a Ph.D. degree at City University of Hong Kong. His research interests involve the precise control of magnetic bearings, robot control systems, computer vision and human-machine interfaces.
