Skip to main content
Log in

DHFNet: dual-decoding hierarchical fusion network for RGB-thermal semantic segmentation

  • Original article
  • Published:
The Visual Computer Aims and scope Submit manuscript

Abstract

Recently, red-green-blue (RGB) and thermal (RGB-T) data have attracted considerable interest for semantic segmentation because they provide robust imaging under the complex lighting conditions of urban roads. Most existing RGB-T semantic segmentation methods adopt an encoder-decoder structure, and repeated upsampling causes semantic information loss during decoding. Moreover, using simple cross-modality fusion neither completely mines complementary information from different modalities nor removes noise from the extracted features. To address these problems, we developed a dual-decoding hierarchical fusion network (DHFNet) to extract RGB and thermal information for RGB-T Semantic Segmentation. DHFNet uses a novel two-layer decoder and implements boundary refinement and boundary-guided foreground/background enhancement modules. The modules process features from different levels to achieve the global guidance and local refinement of the segmentation prediction. In addition, an adaptive attention-filtering fusion module filters and extracts complementary information from the RGB and thermal modalities. Further, we introduce a graph convolutional network and an atrous spatial pyramid pooling module to obtain multiscale features and deepen the extracted semantic information. Experimental results on two benchmark datasets showed that the proposed DHFNet performed well relative to state-of-the-art semantic segmentation methods in terms of different evaluation metrics.

This is a preview of subscription content, log in via an institution to check access.

Access this article

Price excludes VAT (USA)
Tax calculation will be finalised during checkout.

Instant access to the full article PDF.

Fig. 1
Fig. 2
Fig. 3
Fig. 4
Fig. 5
Fig. 6
Fig. 7
Fig. 8
Fig. 9

Similar content being viewed by others

Data Availability

The data that support the findings of this study are openly available in https://www.mi.t.u-tokyo.ac.jp/static/projects/mil_multispectral/https://github.com/ShreyasSkandanS/pst900_thermal_rgb.

References

  1. Wu, J., Zhou, W., Qian, X., Lei, J., Yu, L., Luo, T.: MENet: Lightweight multimodality enhancement network for detecting salient objects in RGB-thermal images. Neurocomputing 527, 119–129 (2023)

    Article  Google Scholar 

  2. Zhou, W., Yue, Y., Fang, M., Qian, X., Yang, R., Yu, L.: BCINet: Bilateral cross-modal interaction network for indoor scene understanding in RGB-D images. Inf. Fusion 94, 32–42 (2023)

    Article  Google Scholar 

  3. Zhou, W., Lv, Y., Lei, J., Yu, L.: Global and local-contrast guides content-aware fusion for RGB-D saliency prediction. IEEE Trans. Syst. Man Cybern. Syst. 51(6), 3641–3649 (2019)

    Article  Google Scholar 

  4. Xu, G., Zhou, W., Qian, X., Ye, L., Lei, J., Yu, L.: CCFNet: Cross-complementary fusion network for RGB-D scene parsing of clothing images. J. Vis. Commun. Image Represent. 90, 103727 (2023)

    Article  Google Scholar 

  5. Zhou, W., Hong, J.: FHENet: Lightweight feature hierarchical exploration network for real-time rail surface defect inspection in RGB-D images. IEEE Trans. Instrum. Meas. (2023). https://doi.org/10.1109/TIM.2023.3237830

    Article  Google Scholar 

  6. Wu, J., Zhou, W., Qian, X., Lei, J., Yu, L., Luo, T.: MFENet: Multitype fusion and enhancement network for detecting salient objects in RGB-T images. Digital Signal Process. 133, 103827 (2023)

  7. Zhou, W., Liu, C., Lei, J., Yu, L., Luo, T.: HFNet: Hierarchical feedback network with multilevel atrous spatial pyramid pooling for RGB-D saliency detection. Neurocomputing 490, 347–357 (2022)

    Article  Google Scholar 

  8. Jin, J., Zhou, W., Yang, R., Ye L., Yu L.: Edge detection guide network for semantic segmentation of remote-sensing images. IEEE Geosci. Remote Sens. Lett. https://doi.org/10.1109/LGRS.2023.3234257

  9. Zhou, W., Yang, E., Lei J., Yu, L.: FRNet: Feature reconstruction network for RGB-D indoor scene parsing. IEEE J. Sel. Topics Signal Process. 16(4), 677–687 (2022)

  10. Zhou, W., Liu, C., Lei, J., Yu, L.: RLLNet: a lightweight remaking learning network for saliency redetection on RGB-D images. Sci. China Inf. Sci. 65(6), 160107 (2022)

    Article  Google Scholar 

  11. Zhou, W., Guo, Q., Lei, J., Yu, L., Hwang, J.-N.: IRFR-Net: Interactive recursive feature-reshaping network for detecting salient objects in RGB-D images. IEEE Trans. Neural Netw. Learn. Syst. https://doi.org/10.1109/TNNLS.2021.3105484

  12. Zhou, W., Yu, L., Zhou, Y., Qiu, W., Wu, M., Luo, T.,: Local and global feature learning for blind quality evaluation of screen content and natural scene images. IEEE Trans. Image Process. 27(5), 2086–2095 (2018)

  13. Long, J., Shelhamer, E., Darrell, T.: Fully convolutional networks for semantic segmentation. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 3431–3440 (2015)

  14. Simonyan, K., Zisserman, A.: Very deep convolutional networks for large-scale image recognition. (2014) arXiv preprint arXiv:1409.1556

  15. Szegedy, C., Liu, W., Jia, Y., Sermanet, P., Reed, S., Anguelov, D., Erhan, D., Vanhoucke, V., Rabinovich, A.: Going deeper with convolutions. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 1–9 (2015)

  16. Hazirbas, C., Ma, L., Domokos, C., Cremers, D.: Fusenet: Incorporating depth into semantic segmentation via fusion-based CNN architecture. In: Asian Conference on Computer Vision (ACCV), pp. 213–228 (2016)

  17. Yu, C., Wang, J., Peng, C., Gao, C., Yu, G., Sang, N.: Bisenet: Bilateral segmentation network for real-time semantic segmentation. In: Proceedings of the European Conference on Computer Vision, pp. 325–341 (2018)

  18. He, K., Zhang, X., Ren, S., Sun, J.: Deep residual learning for image recognition. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 770–778 (2016)

  19. Szegedy, C., Vanhoucke, V., Ioffe, S., Shlens, J., Wojna, Z.: Rethinking the inception architecture for computer vision. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 2818–2826 (2016)

  20. Huang, G., Liu, Z., Van Der Maaten, L., Weinberger, K.Q.: Densely connected convolutional networks. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 4700–4708 (2017)

  21. Huang, Z., Wang, X., Huang, L., Huang, C., Wei, Y., Liu, W.: Ccnet: Criss-cross attention for semantic segmentation. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 603–612 (2019)

  22. He, J., Deng, Z., Zhou, L., Wang, Y., Qiao, Y.: Adaptive pyramid context network for semantic segmentation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 7519–7528 (2019)

  23. Hu, X., Yang, K., Fei, L., Wang, K.: Acnet: Attention based network to exploit complementary features for rgbd semantic segmentation. In: 2019 IEEE International Conference on Image Processing, pp. 1440-1444 (2019)

  24. Dosovitskiy, A., Beyer, L., Kolesnikov, A., Weissenborn, D., Zhai, X., Unterthiner, T., Dehghani, M., Minderer, M., Heigold, G., Gelly, S., Uszkoreit, J., Houlsby, N.: An image is worth 16x16 words: transformers for image recognition at scale. (2020) arXiv preprint arXiv:2010.11929

  25. Liu, Z., Mao, H., Wu, C.Y., Feichtenhofer, C., Darrell, T., Xie, S.: A ConvNet for the 2020s. (2022) arXiv preprint arXiv:2201.03545

  26. Liu, Z., Lin, Y., Cao, Y., Hu, H., Wei, Y., Zhang, Z., Guo, B.: Swin transformer: hierarchical vision transformer using shifted windows. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 10012–10022 (2021)

  27. Rizzoli, G., Barbato, F., Zanuttigh, P.: Multimodal semantic segmentation in autonomous driving: a review of current approaches and future perspectives. Technologies 10(4), 90 (2022)

    Article  Google Scholar 

  28. Liu, H., Zhang, J., Yang, K., Hu, X., Stiefelhagen, R.: CMX: Cross-modal fusion for RGB-X semantic segmentation with transformers (2022) arXiv preprnt arXiv:2203.04838

  29. Cui, Y., Yan, L., Cao, Z., Liu, D.: Tf-blender: temporal feature blender for video object detection. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 8138–8147 (2021)

  30. Gong, T., Zhou, W., Qian, X., Lei, J., Yu, L.: Global contextually guided lightweight network for RGB-thermal urban scene understanding. Eng. Appl. Artif. Intell. 117, 105510 (2023)

    Article  Google Scholar 

  31. Pohlen, T., Hermans, A., Mathias, M., Leibe, B.: Full-resolution residual networks for semantic segmentation in street scenes. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 4151–4160 (2017)

  32. Sun, Y., Zuo, W., Yun, P., Wang, H., Liu, M.: FuseSeg: semantic segmentation of urban scenes based on RGB and thermal data fusion. IEEE Trans. Autom. Sci. Eng. 18(3), 1000–1011 (2020)

    Article  Google Scholar 

  33. Zhang, Q., Zhao, S., Luo, Y., Zhang, D., Huang, N., Han, J.: ABMDRNet: adaptive-weighted Bi-directional modality difference reduction network for RGB-T semantic segmentation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 2633–2642 (2021)

  34. Ha, Q., Watanabe, K., Karasawa, T., Ushiku, Y., Harada, T.: MFNet: towards real-time semantic segmentation for autonomous vehicles with multi-spectral scenes. In: 2017 IEEE/RSJ International Conference on Intelligent Robots and Systems, pp. 5108-5115 (2017)

  35. Sun, Y., Zuo, W., Liu, M.: Rtfnet: Rgb-thermal fusion network for semantic segmentation of urban scenes. IEEE Robot. Autom. Lett. 4(3), 2576–2583 (2019)

    Article  Google Scholar 

  36. Yi, S., Li, J., Liu, X., Yuan, X.: CCAFFMNet: dual-spectral semantic segmentation network with channel-coordinate attention feature fusion module. Neurocomputing 482, 236–251 (2022)

    Article  Google Scholar 

  37. Yan, L., Wang, Q., Cui, Y., Feng, F., Quan, X., Zhang, X., Liu, D.: GL-RG: Global-local representation granularity for video captioning (2022) arXiv preprint arXiv:2205.10706

  38. Zhou, W., Guo, Q., Lei, J., Yu, L., Hwang, J.-N.: ECFFNet: Effective and consistent feature fusion network for RGB-T salient object detection. IEEE Trans. Circuits Syst. Video Technol. 32(3), 1224–1235 (2022)

    Article  Google Scholar 

  39. Woo, S., Park, J., Lee, J.Y., Kweon, I.S.: Cba: convolutional block attention module. In: Proceedings of the European Conference on Computer Vision, pp. 3–19 (2018)

  40. Hu, J., Shen, L., Sun, G.: Squeeze-and-excitation networks. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 7132–7141 (2018)

  41. Chen, L.C., Papandreou, G., Kokkinos, I., Murphy, K., Yuille, A.L.: Deeplab: semantic image segmentation with deep convolutional nets, atrous convolution, and fully connected crfs. IEEE Trans. Pattern Anal. Mach. Intell. 40(4), 834–848 (2017)

    Article  Google Scholar 

  42. Chen, L.C., Papandreou, G., Schroff, F., Adam, H.: Rethinking atrous convolution for semantic image segmentation. (2017) arXiv preprint arXiv:1706.05587

  43. Zhang, L., Li, X., Arnab, A., Yang, K., Tong, Y., Torr, P.H.: Dual graph convolutional network for semantic segmentation (2019) arXiv preprint arXiv:1909.06121

  44. Li, G., Liu, Z., Zeng, D., Lin, W., Ling, H.: Adjacent context coordination network for salient object detection in optical remote sensing images. IEEE Trans. Cybern. (2022). https://doi.org/10.1109/TCYB.2022.3162945

    Article  Google Scholar 

  45. Bochkovskiy, A., Wang, C.Y., Liao, H.Y.M.: Yolov4: Optimal speed and accuracy of object detection (2020) arXiv preprint arXiv:2004.10934

  46. Ronneberger, O., Fischer, P., Brox, T.: U-net: convolutional networks for biomedical image segmentation. In: International Conference on Medical image computing and computer-assisted intervention, pp. 234–241 (2015)

  47. Zhou, W., Liu, J., Lei, J., Yu, L., Hwang, J.-N.: Gmnet: gradedfeature multilabel-learning network for rgb-thermal urban scene semantic segmentation. IEEE Trans. Image Process. 7790–7802 (2021)

  48. Shivakumar, S.S., Rodrigues, N., Zhou, A., Miller, I.D., Kumar, V., Taylor, C.J.: Pst900: Rgb-thermal calibration, dataset and segmentation network. In: 2020 IEEE International Conference on Robotics and Automation, pp. 9441-9447 (2020)

  49. Pohlen, T., Hermans, A., Mathias, M., Leibe, B.: Full-resolution residual networks for semantic segmentation in street scenes. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 4151-4160 (2017)

  50. Zhou, W., Jin, J., Lei, J., Yu, L.: CIMFNet: Cross-layer interaction and multiscale fusion network for semantic segmentation of high-resolution remote sensing images. IEEE J. Sel. Topics Signal Process. 16(4), 666–676 (2022)

    Article  Google Scholar 

  51. Yu, C., Wang, J., Peng, C., Gao, C., Yu, G., Sang, N.: Learning a discriminative feature network for semantic segmentation. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 1857–1866 (2018)

  52. Liu, J., He, J., Zhang, J., Ren, J.S., Li, H.: Efficientfcn: holistically-guided decoding for semantic segmentation. Eur. Conf. Comput. Vis. 1–17 (2020)

Download references

Funding

This work was supported by National Natural Science Foundation of China (61502429, 61672337, 61972357); the Zhejiang Provincial Natural Science Foundation of China (LY18F020012, LY17F020011); and Zhejiang Key R & D Program (2019C03135).

Author information

Authors and Affiliations

Authors

Corresponding author

Correspondence to Wujie Zhou.

Ethics declarations

Conflict of interest

The authors declare that they have no conflict of interest.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Rights and permissions

Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.

Reprints and permissions

About this article

Check for updates. Verify currency and authenticity via CrossMark

Cite this article

Cai, Y., Zhou, W., Zhang, L. et al. DHFNet: dual-decoding hierarchical fusion network for RGB-thermal semantic segmentation. Vis Comput 40, 169–179 (2024). https://doi.org/10.1007/s00371-023-02773-6

Download citation

  • Accepted:

  • Published:

  • Issue Date:

  • DOI: https://doi.org/10.1007/s00371-023-02773-6

Keywords

Navigation