ABSTRACT
To address the problem that mainstream object detection models such as R-CNN, YOLO, and SSD focus on network depth while neglecting the fusion of deep semantic features with shallow features, this paper proposes a new network model based on a transformer and a feature fusion module, named YOLOv4-RFEM+FGFM (YOLOv4-RF for short). YOLOv4-RF makes better use of multi-scale feature maps and their fusion. First, because the spatial pyramid pooling layer used in YOLOv4 is deficient at integrating global information, we propose a Receptive Field Enhancement Module (RFEM) based on the transformer idea. RFEM handles global information better and enhances the feature map, achieving higher detection accuracy. Second, a Feature-Guided Fusion Module (FGFM) is designed to address the inadequate fusion between feature maps at different scales: by introducing channel attention, the feature maps are fully fused, enhancing network performance. Our experiments on the PASCAL VOC2007+2012 dataset achieve a mAP of 91.55%. The experimental results show that the model improves the accuracy of object detection.
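The two ideas in the abstract can be sketched at a high level. The NumPy code below is a minimal, illustrative sketch under stated assumptions, not the authors' implementation: a single-head self-attention over a flattened feature map (the transformer mechanism that underlies modules like RFEM, giving every spatial position a global receptive field) and an SE-style channel-attention gate that re-weights one feature map's channels before fusing it with another (the channel-attention idea behind FGFM). The single-head simplification, the additive fusion, and all shapes and weight names are assumptions for illustration.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def self_attention(fmap, Wq, Wk, Wv):
    """Single-head self-attention over a (C, H, W) feature map.

    Flattens spatial positions into tokens so every position attends to
    every other position -- the global-context mechanism behind RFEM.
    """
    C, H, W = fmap.shape
    tokens = fmap.reshape(C, H * W).T                # (HW, C): one token per position
    Q, K, V = tokens @ Wq, tokens @ Wk, tokens @ Wv  # linear projections
    attn = softmax(Q @ K.T / np.sqrt(K.shape[1]))    # (HW, HW) global affinities
    out = attn @ V                                   # each position mixes all positions
    return out.T.reshape(-1, H, W)

def channel_attention_fuse(deep, shallow, W1, W2):
    """SE-style channel gate: re-weight the shallow map's channels using
    its global statistics before adding it to the deep map (FGFM idea)."""
    # Squeeze: global average pool -> one descriptor per channel
    z = shallow.mean(axis=(1, 2))                                # (C,)
    # Excite: two-layer bottleneck + sigmoid -> per-channel gate in (0, 1)
    gate = 1.0 / (1.0 + np.exp(-(np.maximum(z @ W1, 0) @ W2)))   # (C,)
    # Fuse: gated shallow features added to deep features
    return deep + gate[:, None, None] * shallow
```

The softmax attention map has shape (HW, HW), so this sketch is quadratic in the number of spatial positions, which is why such transformer-style modules are typically applied to the smallest (deepest) feature map, as the SPP layer in YOLOv4 also is.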
Index Terms
- Improved YOLOv4 Based on Transformer and Feature Fusion Module