research-article

MCFNet: Multi-Attentional Class Feature Augmentation Network for Real-Time Scene Parsing

Published: 08 March 2024

Abstract

For real-time scene parsing, capturing multi-scale semantic features and fusing them effectively is crucial. However, many existing solutions overlook strip-shaped objects such as poles and traffic lights, and are too computationally expensive to meet strict real-time requirements. This article presents a novel model, the Multi-Attention Class Feature Augmentation Network (MCFNet), to address these challenges. MCFNet is designed to capture long-range dependencies across different scales at low computational cost and to perform a weighted fusion of feature maps. It features the BAM (Strip Matrix Based Attention Module) for extracting strip objects in images. The BAM replaces the square matrices used in conventional self-attention with strip matrices, which lets it focus more on strip objects while reducing computation. Additionally, MCFNet has a parallel branch that focuses on global information based on self-attention to avoid wasting computation. The two branches are merged to enhance the performance of traditional self-attention modules. Experimental results on two mainstream datasets demonstrate the effectiveness of MCFNet. On the CamVid and Cityscapes test sets, MCFNet achieves 207.5 FPS at 73.5% mIoU and 136.1 FPS at 71.63% mIoU, respectively. The experiments show that MCFNet outperforms other models on the CamVid dataset and can significantly improve the performance of real-time scene parsing.
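The accuracy figures quoted in the abstract are mean intersection-over-union (mIoU) scores. As a minimal sketch of how such a score is commonly computed, and not the authors' evaluation code, the following pure-Python function averages per-class IoU over flattened label sequences. The function name `mean_iou`, the `ignore_index` value of 255, and the toy labels are illustrative assumptions.

```python
def mean_iou(pred, target, num_classes, ignore_index=255):
    """Mean intersection-over-union over two flat sequences of class labels.

    Pixels whose ground-truth label equals ignore_index are skipped
    (255 is an assumed convention here, not taken from the article).
    """
    inter = [0] * num_classes  # pixels where prediction and truth agree on class c
    union = [0] * num_classes  # pixels predicted as c or labelled as c
    for p, t in zip(pred, target):
        if t == ignore_index:
            continue
        if p == t:
            inter[t] += 1
            union[t] += 1
        else:
            union[t] += 1              # pixel belongs to the ground-truth set of t
            if 0 <= p < num_classes:
                union[p] += 1          # and to the predicted set of p
    # Average only over classes that actually occur in prediction or truth.
    ious = [i / u for i, u in zip(inter, union) if u > 0]
    return sum(ious) / len(ious) if ious else 0.0


# Toy example: six pixels, three classes, last pixel ignored.
pred = [0, 0, 1, 1, 2, 2]
target = [0, 1, 1, 1, 2, 255]
score = mean_iou(pred, target, num_classes=3)
print(f"mIoU = {score:.4f}")
```

In the toy example the per-class IoUs are 1/2, 2/3 and 1, so the mean is their average. Benchmark toolkits usually accumulate a confusion matrix over the whole test set before taking the per-class ratio, which gives the same result as this per-pixel loop when it is run over all pixels at once.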

    Published In

    ACM Transactions on Multimedia Computing, Communications, and Applications, Volume 20, Issue 6
    June 2024
    715 pages
    EISSN: 1551-6865
    DOI: 10.1145/3613638
    Editor: Abdulmotaleb El Saddik

    Publisher

    Association for Computing Machinery

    New York, NY, United States

    Publication History

    Published: 08 March 2024
    Online AM: 15 January 2024
    Accepted: 17 December 2023
    Revised: 11 October 2023
    Received: 22 February 2023
    Published in TOMM Volume 20, Issue 6

    Author Tags

    1. Computer vision
    2. CNN
    3. real-time semantic segmentation
    4. attention mechanism

    Qualifiers

    • Research-article

    Funding Sources

    • Key Project of NSFC
    • Program for Innovative Research Team in University of Liaoning Province
    • Support Plan for Key Field Innovation Team of Dalian
    • Support Plan for Leading Innovation Team of Dalian University
    • Science and Technology Innovation Fund of Dalian
    • 111 Project

    Article Metrics

    • Downloads (last 12 months): 143
    • Downloads (last 6 weeks): 10
    Reflects downloads up to 14 Feb 2025

    Cited By

    • (2024) Network Information Security Monitoring Under Artificial Intelligence Environment. International Journal of Information Security and Privacy 18:1 (1-25). DOI: 10.4018/IJISP.345038. Online publication date: 21-Jun-2024
    • (2024) Heterogeneous Fusion and Integrity Learning Network for RGB-D Salient Object Detection. ACM Transactions on Multimedia Computing, Communications, and Applications 20:7 (1-24). DOI: 10.1145/3656476. Online publication date: 15-May-2024
    • (2024) MultiRider: Enabling Multi-Tag Concurrent OFDM Backscatter by Taming In-band Interference. Proceedings of the 22nd Annual International Conference on Mobile Systems, Applications and Services (292-303). DOI: 10.1145/3643832.3661862. Online publication date: 3-Jun-2024
    • (2024) Driver intention prediction based on multi-dimensional cross-modality information interaction. Multimedia Systems 30:2. DOI: 10.1007/s00530-024-01282-3. Online publication date: 15-Mar-2024
