DOI: 10.1145/3653804.3656278

LAtt-Yolov8-seg: Video Real-time Instance Segmentation for Urban Street Scenes Based on Focused Linear Attention Mechanism

Published: 01 June 2024

Abstract

Instance segmentation models with complex architectures and large parameter counts have recently achieved impressive precision. From a practical standpoint, however, balancing precision with speed is more desirable, and real-time instance segmentation in complex urban street scenes faces challenges of both efficiency and quality. In this work, we propose LAtt-Yolov8-seg, a model based on YOLOv8-seg. Its pivotal advancement is a mechanism called Focused Linear Attention, which reduces the computational complexity of conventional attention while maintaining representational capacity. The mechanism first applies a focusing function that adjusts the directions of query and key features, pulling similar features together and pushing dissimilar features apart, thereby mimicking the distribution of Softmax attention. Second, depthwise convolutions restore the rank of the linear attention matrix, improving feature diversity. On the Cityscapes dataset, LAtt-Yolov8-seg achieves the best balance between real-time performance and segmentation quality among the convolutional and transformer models compared. This work provides an effective and practical instance segmentation solution for resource-constrained real-world applications.
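The two components described above, a focusing function that reshapes query and key directions and a depthwise convolution that restores the rank of the linear attention map, can be sketched in a few lines. The following is a minimal NumPy illustration under assumed details: the power `p`, the ReLU feature map, and the 3-tap depthwise kernel are illustrative placeholders, not the paper's exact configuration.

```python
import numpy as np

def focusing(x, p=3, eps=1e-6):
    # Focusing function (notation assumed): rescale x**p back to the
    # original norm, so the direction change pulls similar features
    # together while the feature magnitude is preserved.
    x = np.maximum(x, 0) + eps            # non-negative feature map (ReLU)
    xp = x ** p
    norm = np.linalg.norm(x, axis=-1, keepdims=True)
    norm_p = np.linalg.norm(xp, axis=-1, keepdims=True)
    return xp * norm / (norm_p + eps)

def depthwise_conv1d(v, kernel):
    # Toy depthwise convolution over the token axis, one shared kernel
    # per channel; stands in for the paper's depthwise conv on V.
    n, d = v.shape
    k = len(kernel)
    pad = k // 2
    vp = np.pad(v, ((pad, pad), (0, 0)))
    out = np.zeros_like(v)
    for i in range(n):
        out[i] = (vp[i:i + k] * kernel[:, None]).sum(axis=0)
    return out

def focused_linear_attention(q, k_, v, p=3):
    qf, kf = focusing(q, p), focusing(k_, p)
    kv = kf.T @ v                          # (d, d): computed once for all queries
    z = qf @ kf.sum(axis=0)                # per-query normalizer
    out = (qf @ kv) / (z[:, None] + 1e-6)
    # Rank-restoration term: depthwise conv on V (placeholder kernel).
    return out + depthwise_conv1d(v, np.array([0.25, 0.5, 0.25]))
```

Because `kf.T @ v` is computed once and reused for every query, the cost is linear in the number of tokens rather than quadratic, which is what makes this family of mechanisms attractive for real-time segmentation on high-resolution street-scene inputs.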


Published In

CVDL '24: Proceedings of the International Conference on Computer Vision and Deep Learning
January 2024, 506 pages
ISBN: 9798400718199
DOI: 10.1145/3653804
Publisher: Association for Computing Machinery, New York, NY, United States


Qualifiers

  • Research-article
  • Research
  • Refereed limited

Funding Sources

  • the National Key R&D Program of China
