
MMNet: Multi-modal multi-stage network for RGB-T image semantic segmentation

Published in Applied Intelligence

Abstract

Semantic segmentation is a fundamental task in computer vision. However, most networks are designed for RGB inputs, whose quality can degrade severely under low illumination or bad weather. Recent works have achieved promising results by feeding networks an RGB image together with a corresponding registered thermal image. However, how and when to fuse the features of the RGB and thermal modalities remain challenging. In this paper, we propose a Multi-Modal Multi-Stage Network (MMNet) for RGB-T image semantic segmentation. MMNet consists of two stages. Stage 1 extracts features of the two modalities separately to avoid cross-modal feature conflicts. Stage 2 fuses the representations from the first stage and gradually refines the details. Specifically, Stage 1 has two encoder-decoder sub-networks while Stage 2 has one. As a semantic gap exists between encoders and decoders, we propose an Efficient Feature Enhancement Module (EFEM) to bridge each encoder with its decoder. Moreover, we deploy a lightweight Mini Refinement Block (MRB) as the encoder of Stage 2 to perform the fusion and refinement efficiently. Experimental results demonstrate that our network achieves improved performance while remaining efficient in terms of parameters and FLOPs.
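The two-stage data flow described in the abstract can be sketched as follows. This is a minimal illustration only: the encoder, decoder, and fusion functions below are hypothetical stand-ins (the paper's actual layers, EFEM, and MRB are not reproduced here), and the fusion is shown as simple channel concatenation.

```python
import numpy as np

def encode(x):
    # Stand-in encoder: halve spatial resolution, double channel count.
    return x[:, ::2, ::2].repeat(2, axis=0)

def decode(f):
    # Stand-in decoder: restore spatial resolution, halve channel count.
    c = f.shape[0] // 2
    return f[:c].repeat(2, axis=1).repeat(2, axis=2)

def mmnet_forward(rgb, thermal):
    # Stage 1: two modality-specific encoder-decoder streams,
    # kept separate to avoid cross-modal feature conflicts.
    rgb_out = decode(encode(rgb))
    th_out = decode(encode(thermal))
    # Stage 2: fuse the stage-1 representations (here by simple
    # concatenation) and refine with a third encoder-decoder.
    fused = np.concatenate([rgb_out, th_out], axis=0)
    return decode(encode(fused))

rgb = np.zeros((3, 64, 64))      # RGB image, CHW layout
thermal = np.zeros((1, 64, 64))  # registered thermal image, CHW layout
out = mmnet_forward(rgb, thermal)
```

The point of the sketch is the topology: stage 1 never mixes the modalities, and all cross-modal interaction is deferred to the stage-2 encoder-decoder.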


Notes

  1. The symbol CB(∗,k,s) denotes a convolutional layer with a k × k kernel and stride s, followed by a batch normalization layer. The same notation is used in the rest of the paper.

  2. The symbol DCB(∗,k,s) denotes a depth-wise separable convolutional layer with a k × k kernel and stride s, followed by a batch normalization layer. The same notation is used in the rest of the paper.


Acknowledgements

This work was supported by the National Natural Science Foundation of China under Grants No. 61973122 and 61973120.

Author information


Corresponding author

Correspondence to Xiaojing Gu.

Additional information

Publisher’s note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.


About this article


Cite this article

Lan, X., Gu, X. & Gu, X. MMNet: Multi-modal multi-stage network for RGB-T image semantic segmentation. Appl Intell 52, 5817–5829 (2022). https://doi.org/10.1007/s10489-021-02687-7

