A central multimodal fusion framework for outdoor scene image segmentation

Zhang, Yifei; Morel, Olivier; Seulin, Ralph; Mériaudeau, Fabrice; Sidibé, Désiré

doi:10.1007/s11042-020-10357-y

A central multimodal fusion framework for outdoor scene image segmentation

1177: Advances in Deep Learning for Multimodal Fusion and Alignment
Published: 17 February 2021

Volume 81, pages 12047–12060, (2022)
Cite this article

Multimedia Tools and Applications Aims and scope Submit manuscript

Yifei Zhang ORCID: orcid.org/0000-0002-1950-137X¹,
Olivier Morel¹,
Ralph Seulin¹,
Fabrice Mériaudeau¹ &
…
Désiré Sidibé²

713 Accesses
5 Citations
1 Altmetric
Explore all metrics

Abstract

Robust multimodal fusion is one of the challenging research problems in semantic scene understanding. In real-world applications, the fusion system can overcome the drawbacks of individual sensors by taking different feature representations and statistical properties of multiple modalities (e.g., RGB-depth cameras, multispectral cameras). In this paper, we propose a novel central multimodal fusion framework for semantic image segmentation of road scenes, aiming to effectively learn joint feature representations and optimally combine deep neural networks with statistical priors. More specifically, the proposed fusion framework can automatically generate a central branch by sequentially mapping multimodal features into a common space, including both low-level and high-level features. Besides, in order to reduce the model uncertainty, we employ statistical fusion to compute the final prediction, which leads to significant performance improvement. We conduct extensive experiments on various outdoor scene datasets. Both qualitative and quantitative experiments demonstrate that our central fusion framework achieves competitive performance against existing multimodal fusion methods.

This is a preview of subscription content, log in via an institution to check access.

Access this article

Log in via an institution

Price excludes VAT (USA)
Tax calculation will be finalised during checkout.

Instant access to the full article PDF.

Institutional subscriptions

Deep Multispectral Semantic Scene Understanding of Forested Environments Using Multimodal Fusion

Self-Supervised Model Adaptation for Multimodal Semantic Segmentation

Article 08 July 2019

MFVNet: a deep adaptive fusion network with multiple field-of-views for remote sensing image semantic segmentation

Article 27 March 2023

References

Alghamdi A, Hammad M, Ugail H, Abdel-Raheem A, Muhammad K, Khalifa H, Abd El-Latif A (2020) Detection of myocardial infarction based on novel deep transfer learning methods for urban healthcare in smart cities. Multimedia Tools Appl https://doi.org/10.1007/s11042-020-08769-x
Arevalo J, Solorio T, Montes-y Gómez M, González FA (2017) Gated multimodal units for information fusion. arXiv:1702.01992
Atrey PK, Hossain MA, El Saddik A, Kankanhalli MS (2010) Multimodal fusion for multimedia analysis: a survey. Multimedia Sys 16(6):345–379
Article Google Scholar
Badrinarayanan V, Kendall A, Cipolla R (2017) Segnet: a deep convolutional encoder-decoder architecture for image segmentation. IEEE Trans Patt Anal Mach Intell 39(12):2481–2495
Article Google Scholar
Ben-Younes H, Cadene R, Cord M, Thome N (2017) Mutan: Multimodal tucker fusion for visual question answering. In: Proceedings of the IEEE international conference on computer vision, pp 2612–2620
Blanchon M, Morel O, Zhang Y, Seulin R, Crombez N, Sidibé D (2019) Outdoor scenes pixel-wise semantic segmentation using polarimetry and fully convolutional network. In: 14th international conference on computer vision theory and applications (VISAPP 2019). Prague, Czech Republic. https://hal-univ-bourgogne.archives-ouvertes.fr/hal-02024107
Blum H, Gawel A, Siegwart R, Cadena C (2018) Modular sensor fusion for semantic segmentation. In: 2018 IEEE/RSJ International conference on intelligent robots and systems (IROS). IEEE, pp 3670–3677
Chen LC, Zhu Y, Papandreou G, Schroff F, Adam H (2018) Encoder-decoder with atrous separable convolution for semantic image segmentation. arXiv:1802.02611
Cordts M, Omran M, Ramos S, Rehfeld T, Enzweiler M, Benenson R, Franke U, Roth S, Schiele B (2016) The cityscapes dataset for semantic urban scene understanding. In: Proceedings of the IEEE conference on computer vision and pattern recognition (CVPR)
Deng J, Dong W, Socher R, Li LJ, Li FF (2009) Imagenet: a large-scale hierarchical image database. In: IEEE conference on computer vision & pattern recognition
Deng L, Yang M, Li T, He Y, Wang C (2019) Rfbnet: deep multimodal networks with residual fusion blocks for rgb-d semantic segmentation. arXiv:1907.00135
Geiger A, Lenz P, Stiller C, Urtasun R (2013) Vision meets robotics: the kitti dataset. Int J Robot Res (IJRR)
Gupta S, Girshick R, Arbeláez P, Malik J (2014) Learning rich features from rgb-d images for object detection and segmentation. In: European conference on computer vision. Springer, pp 345–360
Harchanko JS, Chenault DB (2005) Water-surface object detection and classification using imaging polarimetry. In: Polarization science and remote sensing II, vol 5888. International Society for Optics and Photonics, p 588815
Hazirbas C, Ma L, Domokos C, Cremers D (2016) Fusenet: incorporating depth into semantic segmentation via fusion-based cnn architecture. In: Asian conference on computer vision. Springer, pp 213–228
Jacobs RA, Jordan MI, Nowlan SJ, Hinton GE, et al. (1991) Adaptive mixtures of local experts. Neural Comput 3(1):79–87
Article Google Scholar
Jiang J, Zheng L, Luo F, Zhang Z (2018) Rednet: residual encoder-decoder network for indoor rgb-d semantic segmentation. arXiv:1806.01054
Kaymak Ç, Uçar A (2019) A brief survey and an application of semantic image segmentation for autonomous driving. In: Handbook of deep learning applications. Springer, pp 161–200
Lahat D, Adali T, Jutten C (2015) Multimodal data fusion: an overview of methods, challenges, and prospects. Proc IEEE 103(9):1449–1477
Article Google Scholar
Li Y, Zhang J, Cheng Y, Huang K, Tan T (2017) Semantics-guided multi-level rgb-d feature fusion for indoor semantic segmentation. In: 2017 IEEE International conference on image processing (ICIP). IEEE, pp 1262–1266
Li Z, Gan Y, Liang X, Yu Y, Cheng H, Lin L (2016) Lstm-cf: Unifying context modeling and fusion with lstms for rgb-d scene labeling. In: European conference on computer vision. Springer, pp 541–557
Lin M, Chen Q, Yan S (2013) Network in network. arXiv:1312.4400
Long J, Shelhamer E, Darrell T (2015) Fully convolutional networks for semantic segmentation. In: Proceedings of the IEEE conference on computer vision and pattern recognition, pp 3431–3440
Lu X, Wang W, Ma C, Shen J, Shao L, Porikli F (2019) See more know more: unsupervised video object segmentation with co-attention siamese networks
Oberweger M, Wohlhart P, Lepetit V (2015) Hands deep in deep learning for hand pose estimation. arXiv:1502.06807
Park SJ, Hong KS, Lee S (2017) Rdfnet: Rgb-d multi-level residual feature fusion for indoor semantic segmentation. In: Proceedings of the IEEE international conference on computer vision, pp 4980–4989
Paszke A, Chaurasia A, Kim S, Culurciello E (2016) Enet: a deep neural network architecture for real-time semantic segmentation. arXiv:1606.02147
Paszke A, Gross S, Chintala S, Chanan G, Yang E, DeVito Z, Lin Z, Desmaison A, Antiga L, Lerer A (2017) Automatic differentiation in pytorch
Ramachandram D, Taylor GW (2017) Deep multimodal learning: a survey on recent advances and trends. IEEE Signal Proc Mag 34(6):96–108
Article Google Scholar
Ronneberger O, Fischer P, Brox T (2015) U-net: convolutional networks for biomedical image segmentation. In: International conference on medical image computing and computer-assisted intervention. Springer, pp 234–241
Simonyan K, Zisserman A (2014) Very deep convolutional networks for large-scale image recognition. Computer Ence
Valada A, Mohan R, Burgard W (2018) Self-supervised model adaptation for multimodal semantic segmentation. arXiv:1808.03833
Valada A, Oliveira G, Brox T, Burgard W (2016) Deep multispectral semantic scene understanding of forested environments using multimodal fusion. http://ais.informatik.uni-freiburg.de/publications/papers/valada16iser.pdf
Valada A, Vertens J, Dhall A, Burgard W (2017) Adapnet: adaptive semantic segmentation in adverse environmental conditions. In: Proceedings of the IEEE international conference on robotics and automation (ICRA). IEEE, pp 4644–4651
Vielzeuf V, Lechervy A, Pateux S, Jurie F (2018) Centralnet: a multilayer approach for multimodal fusion. In: Proceedings of the European conference on computer vision (ECCV), pp 0–0
Wang W, Lu X, Shen J, Crandall D, Shao L (2019) Zero-shot video object segmentation via attentive graph neural networks
Yu F, Koltun V (2015) Multi-scale context aggregation by dilated convolutions. arXiv:1511.07122
Zhang Y, Morel O, Blanchon M, Seulin R, Rastgoo M, Sidibé D (2019) Exploration of deep learning-based multimodal fusion for semantic road scene segmentation. In: VISAPP 2019 14Th international conference on computer vision theory and applications

Download references

Acknowledgements

This research was supported by the French Agence Nationale de la Recherche (ANR), under the ANR-17-CE22-0011 (project ICUB). We gratefully acknowledge the support of NVIDIA Corporation with the donation of GPUs used in this research.

Author information

Authors and Affiliations

VIBOT ERL CNRS 6000, ImViA, Université de Bourgogne Franche-Comté, 71200, Le creusot, France
Yifei Zhang, Olivier Morel, Ralph Seulin & Fabrice Mériaudeau
Université Paris-Saclay, Univ Evry, IBISC, 91020, Evry, France
Désiré Sidibé

Authors

Yifei Zhang
View author publications
You can also search for this author in PubMed Google Scholar
Olivier Morel
View author publications
You can also search for this author in PubMed Google Scholar
Ralph Seulin
View author publications
You can also search for this author in PubMed Google Scholar
Fabrice Mériaudeau
View author publications
You can also search for this author in PubMed Google Scholar
Désiré Sidibé
View author publications
You can also search for this author in PubMed Google Scholar

Corresponding author

Correspondence to Yifei Zhang.

Additional information

Publisher’s note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Rights and permissions

Reprints and permissions

About this article

Cite this article

Zhang, Y., Morel, O., Seulin, R. et al. A central multimodal fusion framework for outdoor scene image segmentation. Multimed Tools Appl 81, 12047–12060 (2022). https://doi.org/10.1007/s11042-020-10357-y

Download citation

Received: 14 February 2020
Revised: 22 September 2020
Accepted: 22 December 2020
Published: 17 February 2021
Issue Date: April 2022
DOI: https://doi.org/10.1007/s11042-020-10357-y

Keywords

We’re sorry, something doesn't seem to be working properly.

Please try refreshing the page. If that doesn't work, please contact support so we can address the problem.

Access this article

Log in via an institution

Price excludes VAT (USA)
Tax calculation will be finalised during checkout.

Instant access to the full article PDF.

Institutional subscriptions

A central multimodal fusion framework for outdoor scene image segmentation

Abstract

Access this article

Similar content being viewed by others

Deep Multispectral Semantic Scene Understanding of Forested Environments Using Multimodal Fusion

Self-Supervised Model Adaptation for Multimodal Semantic Segmentation

MFVNet: a deep adaptive fusion network with multiple field-of-views for remote sensing image semantic segmentation

References

Acknowledgements

Author information

Authors and Affiliations

Corresponding author

Additional information

Publisher’s note

Rights and permissions

About this article

Cite this article

Keywords

Navigation

A central multimodal fusion framework for outdoor scene image segmentation

Abstract

Access this article

Similar content being viewed by others

Deep Multispectral Semantic Scene Understanding of Forested Environments Using Multimodal Fusion

Self-Supervised Model Adaptation for Multimodal Semantic Segmentation

MFVNet: a deep adaptive fusion network with multiple field-of-views for remote sensing image semantic segmentation

References

Acknowledgements

Author information

Authors and Affiliations

Corresponding author

Additional information

Publisher’s note

Rights and permissions

About this article

Cite this article

Share this article

Keywords

Search

Navigation