
YOLOv9: Learning What You Want to Learn Using Programmable Gradient Information

  • Conference paper
  • In: Computer Vision – ECCV 2024 (ECCV 2024)

Part of the book series: Lecture Notes in Computer Science (LNCS, volume 15089)

Abstract

Today’s deep learning methods focus on designing objective functions that make predictions as close as possible to the target, together with an appropriate neural network architecture. Existing methods, however, ignore the fact that a large amount of information is lost when input data undergoes layer-by-layer feature transformation. This paper delves into the important issues of the information bottleneck and reversible functions. We propose the concept of programmable gradient information (PGI) to cope with the various changes required by deep networks to achieve multiple objectives. PGI can provide complete input information for the target task when calculating the objective function, so that reliable gradient information is obtained to update the network parameters. In addition, we design a lightweight network architecture, the Generalized Efficient Layer Aggregation Network (GELAN), and show that PGI yields superior results on lightweight models. We verify the proposed GELAN and PGI on the MS COCO object detection dataset. The results show that GELAN, using only conventional convolution operators, achieves better parameter utilization than state-of-the-art methods built on depth-wise convolution. PGI can be used for a variety of models, from lightweight to large: because it recovers complete information, train-from-scratch models can achieve better results than state-of-the-art models pre-trained on large datasets; the comparison is shown in Fig. 1. The source code is released at https://github.com/WongKinYiu/yolov9.
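The abstract describes GELAN only at a high level. As a rough illustration of the design it names, the following is a minimal sketch of a generalized ELAN-style aggregation block built from plain convolutions, written against the PyTorch API. The names GELANBlock and ConvUnit, the SiLU activation, and all channel choices are illustrative assumptions, not the authors' implementation; the released code at the repository above is authoritative.

import torch
import torch.nn as nn


class ConvUnit(nn.Module):
    # A conventional 3x3 convolution block (no depth-wise convolution),
    # matching the abstract's claim that GELAN uses only plain conv operators.
    def __init__(self, c_in, c_out):
        super().__init__()
        self.conv = nn.Conv2d(c_in, c_out, kernel_size=3, padding=1, bias=False)
        self.bn = nn.BatchNorm2d(c_out)
        self.act = nn.SiLU()  # activation choice is an assumption

    def forward(self, x):
        return self.act(self.bn(self.conv(x)))


class GELANBlock(nn.Module):
    # Split the input features, push one part through stacked units, and
    # aggregate every intermediate output with a 1x1 transition convolution
    # (the ELAN/CSPNet-style aggregation pattern that GELAN generalizes).
    def __init__(self, c_in, c_out, n_units=2):
        super().__init__()
        c_mid = c_in // 2
        self.split = nn.Conv2d(c_in, 2 * c_mid, kernel_size=1, bias=False)
        self.units = nn.ModuleList([ConvUnit(c_mid, c_mid) for _ in range(n_units)])
        self.transition = nn.Conv2d((2 + n_units) * c_mid, c_out, kernel_size=1, bias=False)

    def forward(self, x):
        y1, y2 = self.split(x).chunk(2, dim=1)
        outs = [y1, y2]
        for unit in self.units:
            outs.append(unit(outs[-1]))  # each unit feeds on the previous output
        return self.transition(torch.cat(outs, dim=1))


x = torch.randn(1, 64, 56, 56)
print(GELANBlock(64, 128)(x).shape)  # expected: torch.Size([1, 128, 56, 56])

The point this sketch mirrors is that every intermediate branch is concatenated before the transition convolution, so gradients reach each computational unit along multiple paths while using only conventional convolution operators.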

Acknowledgement

The authors wish to thank the National Center for High-performance Computing (NCHC) for providing computational and storage resources.

Author information

Correspondence to Chien-Yao Wang.

Electronic Supplementary Material

Below is the link to the electronic supplementary material.

Supplementary material 1 (PDF 8224 KB)

Copyright information

© 2025 The Author(s), under exclusive license to Springer Nature Switzerland AG

About this paper

Cite this paper

Wang, CY., Yeh, IH., Mark Liao, HY. (2025). YOLOv9: Learning What You Want to Learn Using Programmable Gradient Information. In: Leonardis, A., Ricci, E., Roth, S., Russakovsky, O., Sattler, T., Varol, G. (eds) Computer Vision – ECCV 2024. ECCV 2024. Lecture Notes in Computer Science, vol 15089. Springer, Cham. https://doi.org/10.1007/978-3-031-72751-1_1

  • DOI: https://doi.org/10.1007/978-3-031-72751-1_1

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-031-72750-4

  • Online ISBN: 978-3-031-72751-1

  • eBook Packages: Computer Science, Computer Science (R0)
