Abstract
Crowded object detection in heavy-traffic environments remains a challenging task in autonomous driving and robotics, because the dense gathering of vehicles and pedestrians inevitably brings heavy occlusion. It is difficult to distinguish highly overlapped objects and to predict their bounding boxes accurately, especially for small objects far down the road. To address this challenge, this paper proposes an improved YOLOv5s network that integrates a multi-scale feature fusion module with an attention mechanism for crowded road object detection. Specifically, to enhance the multi-scale representation of semantic features and to model object scale variation flexibly, we introduce an attention-guided pyramid feature fusion strategy into the YOLOv5s backbone network. We then design a C3CA module by embedding coordinate attention (CA) into the concentrated-comprehensive convolution (C3) module of the original YOLOv5s, which boosts the ability to extract distinguishing features from overlapped objects. In addition, we add implicit detection heads (IDHs) to the detection head of the original YOLOv5s, which help the network learn implicit knowledge and improve detection accuracy. Finally, simplified optimal transport assignment (SimOTA) and a bounding box regression loss with a dynamic focusing mechanism are used to improve the detector's overall performance. Extensive experiments on the public BDD100K dataset and our self-built crowded road object dataset (XMRD) demonstrate the superiority of our model in crowded road scenarios. Our model achieves a mean average precision (mAP) of 71.2% on BDD100K and 88.2% on XMRD, an improvement of +3% over existing state-of-the-art models.
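To illustrate the coordinate attention idea behind the C3CA module described above, the following is a minimal NumPy sketch. It keeps only the core mechanism of CA (Hou et al., CVPR 2021): average-pooling the feature map along each spatial axis separately, gating each direction with a sigmoid, and reweighting the features. The shared 1×1 convolutions and channel reduction of the full CA block are omitted; `w_h` and `w_w` are hypothetical per-channel gate weights standing in for those learned transforms, not part of the paper's implementation.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def coordinate_attention(x, w_h, w_w):
    """Simplified coordinate attention over a feature map x of shape (C, H, W).

    w_h, w_w: hypothetical per-channel gate weights of shape (C,), standing in
    for the learned 1x1 convolutions of the full CA module.
    """
    # Pool along the width axis: one descriptor per row, shape (C, H).
    pool_h = x.mean(axis=2)
    # Pool along the height axis: one descriptor per column, shape (C, W).
    pool_w = x.mean(axis=1)
    # Direction-aware attention maps via a sigmoid gate.
    attn_h = sigmoid(w_h[:, None] * pool_h)   # (C, H)
    attn_w = sigmoid(w_w[:, None] * pool_w)   # (C, W)
    # Reweight the features with both directional attention maps (broadcasting).
    return x * attn_h[:, :, None] * attn_w[:, None, :]

# Example: with zero gate weights, both attention maps are sigmoid(0) = 0.5,
# so every feature value is scaled by 0.25.
out = coordinate_attention(np.ones((2, 3, 4)), np.zeros(2), np.zeros(2))
print(out.shape)  # (2, 3, 4)
```

Because the two attention maps factorize over rows and columns, the module can localize objects along each axis independently, which is what makes it attractive for separating overlapped objects in crowded scenes.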
Data availability
The data that support the findings of this study are available from the corresponding author upon reasonable request.
Acknowledgements
This work is supported in part by the Joint Funds of the Zhejiang Provincial Natural Science Foundation (LTY22F020001) and the National Key R&D Program of China (2017YFE0118200).
Additional information
Publisher's Note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Rights and permissions
Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.
About this article
Cite this article
Wu, J., Dai, G., Zhou, W. et al. Multi-scale feature fusion with attention mechanism for crowded road object detection. J Real-Time Image Proc 21, 29 (2024). https://doi.org/10.1007/s11554-023-01409-1