A central multimodal fusion framework for outdoor scene image segmentation

  • Special issue 1177: Advances in Deep Learning for Multimodal Fusion and Alignment
  • Published in Multimedia Tools and Applications

Abstract

Robust multimodal fusion is one of the challenging research problems in semantic scene understanding. In real-world applications, a fusion system can overcome the drawbacks of individual sensors by exploiting the different feature representations and statistical properties of multiple modalities (e.g., RGB-depth cameras, multispectral cameras). In this paper, we propose a novel central multimodal fusion framework for semantic image segmentation of road scenes, which aims to effectively learn joint feature representations and to optimally combine deep neural networks with statistical priors. More specifically, the proposed framework automatically generates a central branch by sequentially mapping multimodal features, both low-level and high-level, into a common space. In addition, to reduce model uncertainty, we employ statistical fusion to compute the final prediction, which yields a significant performance improvement. We conduct extensive experiments on several outdoor scene datasets; both qualitative and quantitative results demonstrate that our central fusion framework achieves competitive performance against existing multimodal fusion methods.
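The two ideas sketched in the abstract — a central branch that progressively merges per-modality features in a common space, and a statistical fusion step over branch predictions — can be illustrated with a short sketch. The PyTorch code below is a minimal illustration under our own assumptions: the names (CentralFusionBlock, statistical_fusion), the 1×1-convolution projections, and the fixed prior weights are hypothetical stand-ins, not the authors' implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class CentralFusionBlock(nn.Module):
    """One stage of a hypothetical central branch: project the RGB feature,
    the auxiliary-modality feature, and the previous central feature into a
    common space with 1x1 convolutions, then sum and activate."""
    def __init__(self, channels: int):
        super().__init__()
        self.proj_rgb = nn.Conv2d(channels, channels, kernel_size=1)
        self.proj_aux = nn.Conv2d(channels, channels, kernel_size=1)
        self.proj_ctr = nn.Conv2d(channels, channels, kernel_size=1)

    def forward(self, f_rgb, f_aux, f_ctr):
        # The relative contribution of each stream is learned through
        # the projection weights rather than fixed by hand.
        return F.relu(self.proj_rgb(f_rgb)
                      + self.proj_aux(f_aux)
                      + self.proj_ctr(f_ctr))

def statistical_fusion(logits, priors):
    """Fuse per-branch class-probability maps with fixed prior weights
    (an assumed form of the paper's 'statistical fusion');
    returns per-pixel class labels."""
    probs = [w * F.softmax(l, dim=1) for l, w in zip(logits, priors)]
    return torch.stack(probs).sum(dim=0).argmax(dim=1)

# Toy usage: one fusion stage over 64-channel features from two encoders.
block = CentralFusionBlock(channels=64)
f_rgb = torch.randn(2, 64, 32, 32)            # RGB encoder feature
f_aux = torch.randn(2, 64, 32, 32)            # e.g., depth encoder feature
f_ctr = block(f_rgb, f_aux, torch.zeros_like(f_rgb))  # central branch starts empty

# Toy usage: fuse three 19-class prediction maps (Cityscapes-sized label set).
heads = [torch.randn(2, 19, 32, 32) for _ in range(3)]
labels = statistical_fusion(heads, priors=[0.5, 0.25, 0.25])
print(f_ctr.shape, labels.shape)  # torch.Size([2, 64, 32, 32]) torch.Size([2, 32, 32])
```

In the full pipeline described by the abstract, a block of this kind would be applied at every encoder stage, from low-level to high-level features; the fixed priors above merely stand in for whatever statistical weighting the framework derives from the data.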



Acknowledgements

This research was supported by the French Agence Nationale de la Recherche (ANR) under grant ANR-17-CE22-0011 (project ICUB). We gratefully acknowledge the support of NVIDIA Corporation with the donation of the GPUs used in this research.

Author information

Corresponding author

Correspondence to Yifei Zhang.

Additional information

Publisher’s note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.


Cite this article

Zhang, Y., Morel, O., Seulin, R. et al. A central multimodal fusion framework for outdoor scene image segmentation. Multimed Tools Appl 81, 12047–12060 (2022). https://doi.org/10.1007/s11042-020-10357-y
