Abstract
Semantic segmentation is a fundamental task in computer vision. However, most networks are designed for RGB inputs, whose quality can degrade significantly under low illumination or bad weather conditions. Recent works have achieved promising results by feeding networks an RGB image together with a corresponding registered thermal image. However, how and when to fuse the features of the RGB and thermal modalities remains challenging. In this paper, we propose a Multi-Modal Multi-Stage Network (MMNet) for RGB-T image semantic segmentation. MMNet consists of two stages: Stage 1 extracts the features of each modality separately to avoid cross-modal feature conflicts, and Stage 2 fuses the representations from the first stage and gradually refines the details. Specifically, Stage 1 has two encoder-decoder sub-networks, while Stage 2 has one. As a semantic gap exists between encoders and decoders, we propose an Efficient Feature Enhancement Module (EFEM) to bridge the encoder and the decoder. Moreover, we deploy a lightweight Mini Refinement Block (MRB) as the encoder of Stage 2 to perform fusion and refinement efficiently. The experimental results demonstrate that our network achieves improved performance while being efficient in terms of parameters and FLOPs.
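The two-stage flow described above can be sketched as follows. This is only a structural illustration with identity placeholders standing in for the paper's encoder-decoder sub-networks, EFEM bridges, and MRB encoder; the function names and the additive fusion are assumptions for illustration, not the paper's actual operators.

```python
import numpy as np

def encode_decode(x):
    # Placeholder for a Stage-1 encoder-decoder sub-network
    # (the paper bridges its encoder and decoder with EFEM).
    return x

def fuse_and_refine(rgb_feat, thermal_feat):
    # Stage 2: fuse the two modality representations, then refine.
    # Additive fusion is a stand-in; the paper uses an MRB-based encoder.
    fused = rgb_feat + thermal_feat
    return encode_decode(fused)

rgb = np.random.rand(3, 480, 640)       # RGB image (C, H, W)
thermal = np.random.rand(1, 480, 640)   # registered thermal image

# Stage 1: each modality is processed by its own sub-network,
# avoiding cross-modal feature conflicts.
rgb_feat = encode_decode(rgb)
thermal_feat = encode_decode(np.repeat(thermal, 3, axis=0))  # match channels

# Stage 2: fuse the Stage-1 representations and refine the details.
out = fuse_and_refine(rgb_feat, thermal_feat)
print(out.shape)  # (3, 480, 640)
```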
Notes
The symbol CB(∗,k,s) denotes a convolutional layer with k × k kernel and stride s, followed by a batchnorm layer. The same symbol is used in the rest of the paper.
The symbol DCB(∗,k,s) denotes a depth-wise separable convolutional layer with k × k kernel and stride s, followed by a batchnorm layer. The same symbol is used in the rest of the paper.
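The parameter savings that motivate using DCB over CB can be checked with a quick count. The sketch below counts weights for a k × k standard convolution versus a depthwise separable one (depthwise k × k plus pointwise 1 × 1), each followed by batchnorm; bias terms are omitted, as convolutions followed by batchnorm typically drop them. The channel sizes are illustrative, not the paper's configuration.

```python
def cb_params(c_in, c_out, k):
    # CB(*, k, s): standard k x k convolution + batchnorm.
    conv = c_in * c_out * k * k   # one k x k filter per (in, out) channel pair
    bn = 2 * c_out                # batchnorm scale and shift
    return conv + bn

def dcb_params(c_in, c_out, k):
    # DCB(*, k, s): depthwise separable convolution + batchnorm.
    depthwise = c_in * k * k      # one k x k filter per input channel
    pointwise = c_in * c_out      # 1 x 1 convolution mixes channels
    bn = 2 * c_out
    return depthwise + pointwise + bn

print(cb_params(64, 64, 3))   # 36992
print(dcb_params(64, 64, 3))  # 4800
```

For 64-to-64 channels with a 3 × 3 kernel, the depthwise separable block uses roughly 7.7× fewer parameters, which is why it suits a lightweight encoder.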
Acknowledgements
This work is supported by the National Natural Science Foundation of China under Grants No. 61973122 and 61973120.
Cite this article
Lan, X., Gu, X. & Gu, X. MMNet: Multi-modal multi-stage network for RGB-T image semantic segmentation. Appl Intell 52, 5817–5829 (2022). https://doi.org/10.1007/s10489-021-02687-7