DOI: 10.1145/3653804.3656278

LAtt-Yolov8-seg: Video Real-time Instance Segmentation for Urban Street Scenes Based on Focused Linear Attention Mechanism

Published: 01 June 2024

Abstract

Instance segmentation models with complex architectures and large parameter counts have recently achieved impressive precision. From a practical standpoint, however, balancing precision with speed is more desirable, and real-time instance segmentation in complex urban street scenes faces challenges of both efficiency and quality. In this work, we propose LAtt-Yolov8-seg, a model based on YOLOv8-seg. Its pivotal advancement is a mechanism called Focused Linear Attention, which reduces the computational complexity of conventional attention while maintaining representational capacity. The mechanism first applies a focusing function that adjusts the directions of query and key features, pulling similar features together and pushing dissimilar features apart, thereby mimicking the distribution of Softmax attention. Second, depthwise convolutions restore the rank of the linear attention matrix, improving feature diversity. On the Cityscapes dataset, LAtt-Yolov8-seg achieves the best balance between real-time performance and segmentation quality among the convolutional and transformer models compared. This work provides an effective and practical instance segmentation solution for resource-constrained real-world applications.
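The two components described above, a focusing function that reshapes query and key directions and a depthwise convolution that restores the rank of the linear attention map, can be sketched in a few lines. The following is a minimal NumPy illustration under assumed details: the power `p`, the ReLU feature map, and the 3-tap depthwise kernel are illustrative placeholders, not the paper's exact configuration.

```python
import numpy as np

def focusing(x, p=3, eps=1e-6):
    # Focusing function (notation assumed): rescale x**p back to the
    # original norm, so the direction change pulls similar features
    # together while the feature magnitude is preserved.
    x = np.maximum(x, 0) + eps            # non-negative feature map (ReLU)
    xp = x ** p
    norm = np.linalg.norm(x, axis=-1, keepdims=True)
    norm_p = np.linalg.norm(xp, axis=-1, keepdims=True)
    return xp * norm / (norm_p + eps)

def depthwise_conv1d(v, kernel):
    # Toy depthwise convolution over the token axis, one shared kernel
    # per channel; stands in for the paper's depthwise conv on V.
    n, d = v.shape
    k = len(kernel)
    pad = k // 2
    vp = np.pad(v, ((pad, pad), (0, 0)))
    out = np.zeros_like(v)
    for i in range(n):
        out[i] = (vp[i:i + k] * kernel[:, None]).sum(axis=0)
    return out

def focused_linear_attention(q, k_, v, p=3):
    qf, kf = focusing(q, p), focusing(k_, p)
    kv = kf.T @ v                          # (d, d): computed once for all queries
    z = qf @ kf.sum(axis=0)                # per-query normalizer
    out = (qf @ kv) / (z[:, None] + 1e-6)
    # Rank-restoration term: depthwise conv on V (placeholder kernel).
    return out + depthwise_conv1d(v, np.array([0.25, 0.5, 0.25]))
```

Because `kf.T @ v` is computed once and reused for every query, the cost is linear in the number of tokens rather than quadratic, which is what makes this family of mechanisms attractive for real-time segmentation on high-resolution street-scene inputs.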


Published In

CVDL '24: Proceedings of the International Conference on Computer Vision and Deep Learning
January 2024, 506 pages
ISBN: 9798400718199
DOI: 10.1145/3653804
Publisher: Association for Computing Machinery, New York, NY, United States


Qualifiers

  • Research-article
  • Research
  • Refereed limited

Funding Sources

  • the National Key R&D Program of China
