Joint pyramid attention network for real-time semantic segmentation of urban scenes

Applied Intelligence

Abstract

Semantic segmentation is an active research topic in computer vision and a fundamental technique for image understanding and analysis. However, most current semantic segmentation networks focus only on segmentation accuracy and ignore the requirements for high processing speed and low computational complexity in mobile terminal fields such as autonomous driving systems, drone applications, and fingerprint recognition systems. Because of their high computational cost, these networks struggle to meet actual industrial needs. To address this problem, we propose a joint pyramid attention network (JPANet) for real-time semantic segmentation. First, we propose a joint feature pyramid (JFP) module, which combines multiple network stages to learn multi-scale feature representations with strong semantic information, thereby improving pixel classification performance. Second, we build a spatial detail extraction (SDE) module to capture multi-level local features from the shallow network layers and compensate for the geometric information lost in the down-sampling stage. Finally, we design a bilateral feature fusion (BFF) module, which integrates spatial information and semantic information through a hybrid attention mechanism along the spatial and channel dimensions, making full use of the correspondence between high-level and low-level features. We conducted a series of experiments on two challenging urban road scene datasets, Cityscapes and CamVid. On the Cityscapes dataset, for 512 × 1024 high-resolution images, our method achieves 71.62% mean Intersection over Union (mIoU) at 109.9 frames per second (FPS) on a single 1080Ti GPU.
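The abstract reports accuracy as mean Intersection over Union (mIoU). As a rough illustration of how that metric is commonly computed, the sketch below accumulates a confusion matrix over flattened label maps and averages the per-class IoU. It is not taken from the paper: the function names, the toy labels, and the ignore label of 255 (a convention often used with Cityscapes annotations) are assumptions made for this example.

```python
def confusion_matrix(pred, target, num_classes, ignore_index=255):
    """Count pixels per (ground truth, prediction) pair.

    Rows index the ground-truth class, columns the predicted class.
    Pixels whose ground truth equals ignore_index (assumed 255) are skipped.
    """
    matrix = [[0] * num_classes for _ in range(num_classes)]
    for p, t in zip(pred, target):
        if t == ignore_index:
            continue
        matrix[t][p] += 1
    return matrix


def mean_iou(matrix):
    """Average IoU = TP / (TP + FP + FN) over classes that occur at least once."""
    n = len(matrix)
    ious = []
    for c in range(n):
        tp = matrix[c][c]
        fn = sum(matrix[c]) - tp                        # class c pixels missed
        fp = sum(matrix[r][c] for r in range(n)) - tp   # other pixels labelled c
        denom = tp + fp + fn
        if denom > 0:
            ious.append(tp / denom)
    return sum(ious) / len(ious) if ious else 0.0


# Hypothetical flattened label maps with three classes and one ignored pixel.
target = [0, 0, 1, 1, 2, 2, 255]
pred = [0, 1, 1, 1, 2, 0, 2]
miou = mean_iou(confusion_matrix(pred, target, num_classes=3))
print(f"mIoU = {miou:.4f}")
```

For the speed figure, 109.9 FPS corresponds to roughly 1000 / 109.9 ≈ 9.1 ms of processing time per 512 × 1024 frame.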

Author information

Corresponding author

Correspondence to Liyuan Jing.

Additional information

Publisher’s note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

This work was supported by the National Natural Science Foundation of China (under Grant 52076044) and the key project of the Natural Science Foundation of Chongqing, China (under Grant cstc2017jcyjBX0037).

About this article

Cite this article

Hu, X., Jing, L. & Sehar, U. Joint pyramid attention network for real-time semantic segmentation of urban scenes. Appl Intell 52, 580–594 (2022). https://doi.org/10.1007/s10489-021-02446-8

  • DOI: https://doi.org/10.1007/s10489-021-02446-8
