ABSTRACT
To address the problem that mainstream object detection models such as R-CNN, YOLO, and SSD focus on network depth while neglecting the fusion of deep semantic features with shallow features, this paper proposes a new network model based on a transformer and a feature fusion module, named YOLOv4-RFEM+FGFM (YOLOv4-RF for short). YOLOv4-RF makes better use of multi-scale feature maps and their fusion. First, because the spatial pyramid pooling layer used in YOLOv4 is deficient at integrating global information, we propose a Receptive Field Enhancement Module (RFEM) based on the transformer idea. RFEM handles global information better and enhances the feature map, achieving higher detection accuracy. Second, a Feature-Guided Fusion Module (FGFM) is designed to address the inadequate fusion between feature maps at different scales: by introducing channel attention, the feature maps are fully fused, enhancing network performance. Our experiments on the PASCAL VOC2007+2012 dataset achieve a mAP of 91.55%. The experimental results show that the model improves the accuracy of object detection.
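The two ideas in the abstract can be sketched at a high level. The NumPy code below is a minimal, illustrative sketch under stated assumptions, not the authors' implementation: a single-head self-attention over a flattened feature map (the transformer mechanism that underlies modules like RFEM, giving every spatial position a global receptive field) and an SE-style channel-attention gate that re-weights one feature map's channels before fusing it with another (the channel-attention idea behind FGFM). The single-head simplification, the additive fusion, and all shapes and weight names are assumptions for illustration.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def self_attention(fmap, Wq, Wk, Wv):
    """Single-head self-attention over a (C, H, W) feature map.

    Flattens spatial positions into tokens so every position attends to
    every other position -- the global-context mechanism behind RFEM.
    """
    C, H, W = fmap.shape
    tokens = fmap.reshape(C, H * W).T                # (HW, C): one token per position
    Q, K, V = tokens @ Wq, tokens @ Wk, tokens @ Wv  # linear projections
    attn = softmax(Q @ K.T / np.sqrt(K.shape[1]))    # (HW, HW) global affinities
    out = attn @ V                                   # each position mixes all positions
    return out.T.reshape(-1, H, W)

def channel_attention_fuse(deep, shallow, W1, W2):
    """SE-style channel gate: re-weight the shallow map's channels using
    its global statistics before adding it to the deep map (FGFM idea)."""
    # Squeeze: global average pool -> one descriptor per channel
    z = shallow.mean(axis=(1, 2))                                # (C,)
    # Excite: two-layer bottleneck + sigmoid -> per-channel gate in (0, 1)
    gate = 1.0 / (1.0 + np.exp(-(np.maximum(z @ W1, 0) @ W2)))   # (C,)
    # Fuse: gated shallow features added to deep features
    return deep + gate[:, None, None] * shallow
```

The softmax attention map has shape (HW, HW), so this sketch is quadratic in the number of spatial positions, which is why such transformer-style modules are typically applied to the smallest (deepest) feature map, as the SPP layer in YOLOv4 also is.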
Index Terms
- Improved YOLOv4 Based on Transformer and Feature Fusion Module