Abstract
Object detection is a core task in computer vision. Non-Maximum Suppression (NMS), which suppresses redundant predictions, is widely used in convolution-based detectors. However, the sequential nature of NMS prevents parallel execution and thus limits inference speed, and the recall of NMS-dependent detectors also degrades in scenes with dense, heavily overlapping objects. In this paper, we propose a real-time, end-to-end detector based on YOLOF (You Only Look One-level Feature). The proposed methods introduce no additional parameters or attention mechanisms, making them practical for real-time applications. Specifically, we propose a stop-gradient strategy that trains only a subset of the parameters to address the weak supervision in one-to-one label assignment. We also present auxiliary losses to strengthen the supervision of negative samples during training, and use semantic anchor optimization to suppress the other anchors at the same location. These techniques allow the improved YOLOF to discard NMS within a 1 mAP gap and achieve faster inference. Our YOLOF-CSP-D53-DC5 achieves 42.7 mAP, only 0.5 mAP lower than the original version. Additionally, our YOLOF-R50 achieves 37.1 mAP at 38 FPS, exceeding state-of-the-art networks by more than 1.5 times in inference speed.
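The sequential bottleneck the abstract refers to can be seen in a minimal NumPy sketch of classic greedy NMS (an illustration of the standard algorithm, not the paper's implementation; box layout and the 0.5 IoU threshold are assumptions): each iteration's suppression decision depends on which box the previous iteration kept, so the loop cannot be parallelized.

```python
import numpy as np

def nms(boxes, scores, iou_thresh=0.5):
    """Greedy NMS: repeatedly keep the highest-scoring box and
    suppress all remaining boxes that overlap it too much.
    boxes: (N, 4) array of (x1, y1, x2, y2); scores: (N,)."""
    x1, y1, x2, y2 = boxes.T
    areas = (x2 - x1) * (y2 - y1)
    order = scores.argsort()[::-1]  # indices sorted by descending score
    keep = []
    while order.size > 0:
        i = order[0]
        keep.append(int(i))
        # IoU of the kept box with every remaining candidate
        xx1 = np.maximum(x1[i], x1[order[1:]])
        yy1 = np.maximum(y1[i], y1[order[1:]])
        xx2 = np.minimum(x2[i], x2[order[1:]])
        yy2 = np.minimum(y2[i], y2[order[1:]])
        inter = np.clip(xx2 - xx1, 0, None) * np.clip(yy2 - yy1, 0, None)
        iou = inter / (areas[i] + areas[order[1:]] - inter)
        # the next iteration only sees survivors of this one: a
        # data dependency that forces sequential execution
        order = order[1:][iou <= iou_thresh]
    return keep
```

An end-to-end detector with one-to-one label assignment sidesteps this loop entirely, since each object ideally receives exactly one high-scoring prediction.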
Notes
- 1.
For simplicity, we use \(\pi_{oto}\) and \(\pi_{otm}\) to denote one-to-one and one-to-many label assignments, respectively. \(\mathrm{YOLOF}_{nms}\) and \(\mathrm{YOLOF}_{end}\) represent the NMS-dependent and NMS-independent YOLOF, respectively.
- 2.
For clarity, we omit the image index \(k\).
Acknowledgements
This work was partially supported by the Guangdong Artificial Intelligence and Digital Economy Laboratory (Guangzhou).
Copyright information
© 2024 The Author(s), under exclusive license to Springer Nature Singapore Pte Ltd.
Cite this paper
Xi, X., Huang, Y., Wu, W., Luo, R. (2024). End-to-End Object Detection with YOLOF. In: Huang, DS., Zhang, C., Zhang, Q. (eds) Advanced Intelligent Computing Technology and Applications. ICIC 2024. Lecture Notes in Computer Science, vol 14868. Springer, Singapore. https://doi.org/10.1007/978-981-97-5600-1_9
Publisher Name: Springer, Singapore
Print ISBN: 978-981-97-5599-8
Online ISBN: 978-981-97-5600-1