DOI: 10.1145/3579895.3579914
ICNCC '22 Conference Proceedings · research-article

Improved YOLOv4 Based on Transformer and Feature Fusion Module

Published: 4 April 2023

ABSTRACT

Mainstream object detection models such as R-CNN, YOLO, and SSD focus heavily on network depth while neglecting the fusion of deep semantic features with shallow features. To address this, this paper proposes a new network model based on a transformer and a feature fusion module, named YOLOv4-RFEM and FGFM (YOLOv4-RF). YOLOv4-RF makes better use of multi-scale feature maps and fuses them more effectively. First, because the spatial pyramid pooling layer used in YOLOv4 integrates global information poorly, we propose a Receptive Field Enhancement Module (RFEM) based on the idea of the transformer. RFEM handles global information better and enhances the feature map, yielding higher detection accuracy. Second, a Feature-Guided Fusion Module (FGFM) is designed to address the inadequate fusion between feature maps of different scales: by introducing channel attention, the feature maps are fully fused and network performance is enhanced. Our experiments on the PASCAL VOC 2007+2012 dataset achieve a mAP of 91.55%. The experimental results show that the model improves object detection accuracy.
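The abstract does not give the internals of FGFM, but the mechanism it names, channel attention guiding the fusion of two feature maps, can be illustrated with a minimal NumPy sketch of SE-style attention (squeeze, excite, rescale) applied to one map before an element-wise fusion. All names (`channel_attention`, `guided_fusion`), shapes, and the choice of additive fusion are illustrative assumptions, not the paper's implementation.

```python
import numpy as np

def channel_attention(x, w1, w2):
    """SE-style channel attention: squeeze (global average pool),
    excite (two FC layers with ReLU/sigmoid), and rescale channels.
    x: feature map of shape (C, H, W); w1: (C//r, C); w2: (C, C//r)."""
    squeeze = x.mean(axis=(1, 2))                    # global average pool -> (C,)
    hidden = np.maximum(w1 @ squeeze, 0.0)           # FC + ReLU -> (C // r,)
    weights = 1.0 / (1.0 + np.exp(-(w2 @ hidden)))   # FC + sigmoid -> (C,) in (0, 1)
    return x * weights[:, None, None]                # per-channel rescaling

def guided_fusion(deep, shallow, w1, w2):
    """Fuse an (already resized) deep feature map with a shallow one,
    reweighting the deep map per channel before the element-wise add."""
    return channel_attention(deep, w1, w2) + shallow

# Illustrative shapes: 8 channels, reduction ratio 2, 4x4 spatial grid.
rng = np.random.default_rng(0)
C, r, H, W = 8, 2, 4, 4
w1 = rng.standard_normal((C // r, C)) * 0.1
w2 = rng.standard_normal((C, C // r)) * 0.1
deep = rng.standard_normal((C, H, W))
shallow = rng.standard_normal((C, H, W))
fused = guided_fusion(deep, shallow, w1, w2)   # shape (8, 4, 4)
```

Because the sigmoid gates lie in (0, 1), the attended deep map is never amplified, only selectively suppressed per channel, which is the usual SE behavior.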

References

  1. Ross Girshick, Jeff Donahue, Trevor Darrell, and Jitendra Malik. 2014. Rich Feature Hierarchies for Accurate Object Detection and Semantic Segmentation. In 2014 IEEE Conference on Computer Vision and Pattern Recognition (CVPR '14). IEEE, Columbus, OH, USA, 580-587. https://doi.org/10.1109/CVPR.2014.81
  2. Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. 2015. Spatial Pyramid Pooling in Deep Convolutional Networks for Visual Recognition. IEEE Transactions on Pattern Analysis and Machine Intelligence 37, 9 (September 2015), 1904-1916. https://doi.org/10.1109/TPAMI.2015.2389824
  3. Ross Girshick. 2015. Fast R-CNN. In 2015 IEEE International Conference on Computer Vision (ICCV '15). IEEE, Santiago, Chile, 1440-1448. https://doi.org/10.1109/ICCV.2015.169
  4. Shaoqing Ren, Kaiming He, Ross Girshick, and Jian Sun. 2017. Faster R-CNN: Towards Real-Time Object Detection with Region Proposal Networks. IEEE Transactions on Pattern Analysis and Machine Intelligence 39, 6 (June 2017), 1137-1149. https://doi.org/10.1109/TPAMI.2016.2577031
  5. Kaiming He, Georgia Gkioxari, Piotr Dollár, and Ross Girshick. 2017. Mask R-CNN. In 2017 IEEE International Conference on Computer Vision (ICCV '17). IEEE, Venice, Italy, 2980-2988. https://doi.org/10.1109/ICCV.2017.322
  6. Joseph Redmon, Santosh Divvala, Ross Girshick, and Ali Farhadi. 2016. You Only Look Once: Unified, Real-Time Object Detection. In 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR '16). IEEE, Las Vegas, NV, USA, 779-788. https://doi.org/10.1109/CVPR.2016.91
  7. Wei Liu, Dragomir Anguelov, Dumitru Erhan, Christian Szegedy, Scott Reed, Cheng-Yang Fu, and Alexander C. Berg. 2016. SSD: Single Shot MultiBox Detector. In European Conference on Computer Vision (ECCV '16). Springer, Cham, Amsterdam, Netherlands, 21-37. https://doi.org/10.1007/978-3-319-46448-0_2
  8. Andrew G. Howard, Menglong Zhu, Bo Chen, Dmitry Kalenichenko, Weijun Wang, Tobias Weyand, Marco Andreetto, and Hartwig Adam. 2017. MobileNets: Efficient Convolutional Neural Networks for Mobile Vision Applications. arXiv preprint arXiv:1704.04861 (April 2017). https://doi.org/10.48550/arXiv.1704.04861
  9. Mark Sandler, Andrew Howard, Menglong Zhu, Andrey Zhmoginov, and Liang-Chieh Chen. 2018. MobileNetV2: Inverted Residuals and Linear Bottlenecks. In 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR '18). IEEE, Salt Lake City, UT, USA, 4510-4520.
  10. Andrew Howard, Mark Sandler, Bo Chen, Weijun Wang, Liang-Chieh Chen, Mingxing Tan, Grace Chu, Vijay Vasudevan, and Yukun Zhu. 2019. Searching for MobileNetV3. In 2019 IEEE/CVF International Conference on Computer Vision (ICCV '19). IEEE, Seoul, Korea (South), 1314-1324.
  11. François Chollet. 2017. Xception: Deep Learning with Depthwise Separable Convolutions. In 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR '17). IEEE, Honolulu, HI, USA, 1800-1807.
  12. Joseph Redmon and Ali Farhadi. 2017. YOLO9000: Better, Faster, Stronger. In 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR '17). IEEE, Honolulu, HI, USA, 6517-6525.
  13. Joseph Redmon and Ali Farhadi. 2018. YOLOv3: An Incremental Improvement. arXiv preprint arXiv:1804.02767 (April 2018). https://doi.org/10.48550/arXiv.1804.02767
  14. Cui Gao, Qiang Cai, and Shaofeng Ming. 2020. YOLOv4 Object Detection Algorithm with Efficient Channel Attention Mechanism. In 2020 5th International Conference on Mechanical, Control and Computer Engineering (ICMCCE '20). IEEE, Harbin, China, 1764-1770.
  15. Chien-Yao Wang, Hong-Yuan Mark Liao, Yueh-Hua Wu, Ping-Yang Chen, Jun-Wei Hsieh, and I-Hau Yeh. 2020. CSPNet: A New Backbone that can Enhance Learning Capability of CNN. In 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops (CVPRW '20). IEEE, Seattle, WA, USA, 1571-1580.
  16. Tsung-Yi Lin, Piotr Dollár, Ross Girshick, Kaiming He, Bharath Hariharan, and Serge Belongie. 2017. Feature Pyramid Networks for Object Detection. In 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR '17). IEEE, Honolulu, HI, USA, 936-944.
  17. Zhenwei Yu, Yonggang Shen, and Chenkai Shen. 2020. A Real-Time Detection Approach for Bridge Cracks Based on YOLOv4-FPM. Automation in Construction (December 2020). https://doi.org/10.1016/j.autcon.2020.103514
  18. Li Tan, Xinyue Lv, Xiaofeng Lian, and Ge Wang. 2021. YOLOv4_Drone: UAV Image Target Detection Based on an Improved YOLOv4 Algorithm. Computers and Electrical Engineering (June 2021). https://doi.org/10.1016/j.compeleceng.2021.107261
  19. Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Lukasz Kaiser, and Illia Polosukhin. 2017. Attention Is All You Need. In Advances in Neural Information Processing Systems 30 (December 2017). https://doi.org/10.48550/arXiv.1706.03762
  20. Alexey Dosovitskiy, Lucas Beyer, Alexander Kolesnikov, Dirk Weissenborn, Xiaohua Zhai, Thomas Unterthiner, Mostafa Dehghani, Matthias Minderer, Georg Heigold, Sylvain Gelly, Jakob Uszkoreit, and Neil Houlsby. 2021. An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale. arXiv preprint arXiv:2010.11929 (June 2021). https://doi.org/10.48550/arXiv.2010.11929
  21. Jie Hu, Li Shen, Samuel Albanie, Gang Sun, and Enhua Wu. 2019. Squeeze-and-Excitation Networks. IEEE Transactions on Pattern Analysis and Machine Intelligence 42, 8 (April 2019), 2011-2023.
  22. Sanghyun Woo, Jongchan Park, Joon-Young Lee, and In So Kweon. 2018. CBAM: Convolutional Block Attention Module. In Proceedings of the European Conference on Computer Vision (ECCV '18). Springer, 3-19. https://doi.org/10.1007/978-3-030-01234-2_1

Published in

ICNCC '22: Proceedings of the 2022 11th International Conference on Networks, Communication and Computing
December 2022, 365 pages
ISBN: 978-1-4503-9803-9
DOI: 10.1145/3579895
      Copyright © 2022 ACM

      Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than the author(s) must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected].

Publisher

Association for Computing Machinery, New York, NY, United States

Qualifiers: research-article, refereed limited
