Do-DETR: enhancing DETR training convergence with integrated denoising and RoI mechanism

  • Regular Paper
  • Published in Multimedia Systems

Abstract

As the pioneering work in Transformer-based object detection, DETR has attracted widespread attention and sparked a research trend since its inception. Its global attention mechanism is architecturally novel, but it takes a very long time to optimize before reaching good performance. To address this issue, we introduce DO-DETR. Specifically, in addition to the Hungarian loss, we build a denoising module that feeds noised ground-truth (GT) bounding boxes into the decoder and trains the model to reconstruct the original boxes. This greatly reduces the difficulty of bipartite matching and thereby accelerates convergence. In the decoder, we design a multi-layer recurrent processing structure based on RoIs, which helps the attention of DETR focus gradually and more accurately on foreground objects. At each processing stage, visual features are extracted as glimpse features from enlarged bounding-box regions of the RoIs produced by the preceding stage. These glimpse features are then modeled together with the attention outputs of the previous stage, alleviating the difficulty of global attention modeling. With a ResNet-50 backbone, DO-DETR reaches 43.6 AP on the MSCOCO dataset in only 16 epochs, a result that vanilla DETR needs 500 epochs to achieve and Deformable DETR needs 50 epochs to match. DO-DETR therefore improves convergence efficiency over Deformable DETR by 68%.
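To make the two ingredients concrete, the sketch below illustrates them in PyTorch under stated assumptions: add_box_noise jitters ground-truth boxes (random center shift plus width/height rescaling), in the spirit of query-denoising training, and glimpse_features pools features from enlarged RoIs with torchvision.ops.roi_align. The function names, the (cx, cy, w, h) box format, the noise ratios, and the 1.5x enlargement factor are illustrative assumptions, not the authors' implementation.

```python
import torch
from torchvision.ops import roi_align


def add_box_noise(gt_boxes, shift_ratio=0.4, scale_ratio=0.4):
    """Jitter GT boxes given as (cx, cy, w, h), normalized to [0, 1]."""
    cx, cy, w, h = gt_boxes.unbind(-1)
    # Shift each center by up to +/- shift_ratio of the box size.
    cx = cx + (torch.rand_like(cx) * 2 - 1) * shift_ratio * w
    cy = cy + (torch.rand_like(cy) * 2 - 1) * shift_ratio * h
    # Rescale width and height within [1 - scale_ratio, 1 + scale_ratio].
    w = w * (1 + (torch.rand_like(w) * 2 - 1) * scale_ratio)
    h = h * (1 + (torch.rand_like(h) * 2 - 1) * scale_ratio)
    return torch.stack([cx, cy, w, h], dim=-1).clamp(0.0, 1.0)


def glimpse_features(feature_map, boxes_xyxy, expand=1.5, out_size=7):
    """Pool glimpse features from enlarged RoIs around the previous stage's boxes.

    feature_map: (1, C, H, W) backbone feature map, pixel coordinates.
    boxes_xyxy:  (K, 4) boxes (x1, y1, x2, y2) predicted by the previous
                 decoder stage, also in pixel coordinates.
    """
    x1, y1, x2, y2 = boxes_xyxy.unbind(-1)
    cx, cy = (x1 + x2) / 2, (y1 + y2) / 2
    w, h = (x2 - x1) * expand, (y2 - y1) * expand  # enlarge each RoI
    enlarged = torch.stack([cx - w / 2, cy - h / 2, cx + w / 2, cy + h / 2], -1)
    # Prepend the batch index expected by roi_align's (K, 5) box format.
    rois = torch.cat([torch.zeros_like(x1)[:, None], enlarged], dim=-1)
    return roi_align(feature_map, rois, output_size=out_size, spatial_scale=1.0)
```

During training, boxes noised this way would serve as extra decoder inputs whose target is the un-noised box, while the pooled glimpse tensor would be fused with the previous stage's attention output, as described in the abstract.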


Data Availability

The data underlying this article are available in the COCO dataset repository and can be accessed at http://cocodataset.org. The experimental results presented in this paper are based on the COCO dataset. No additional datasets were generated during the current study. The code and data analysis scripts used to process the COCO dataset and generate the experimental results are available from the corresponding author on reasonable request.

Code availability

Some or all of the code used during the study is available on request from the corresponding author.


Funding

This work was supported by the National Natural Science Foundation of China (No. 61673396) and the Natural Science Foundation of Shandong Province (No. ZR2022MF260).

Author information


Contributions

HL and YL took part in conceptualization; HL and YL were involved in methodology; YL and QZ were responsible for software; HL, YL, QZ, and MS carried out formal analysis; YL wrote and prepared the original draft; HL and MS took part in writing (review and editing), funding acquisition, and supervision; HL, QZ, and MS contributed to resources.

Corresponding author

Correspondence to Yu Li.

Ethics declarations

Conflict of interest

The authors declare no conflict of interest.

Ethical approval

Not applicable.

Consent to participate

Not applicable.

Additional information

Communicated by Bing-kun Bao.

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Rights and permissions

Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.

Reprints and permissions

About this article


Cite this article

Liang, H., Li, Y., Zhang, Q. et al. Do-DETR: enhancing DETR training convergence with integrated denoising and RoI mechanism. Multimedia Systems 31, 171 (2025). https://doi.org/10.1007/s00530-025-01761-1

  • Received:

  • Accepted:

  • Published:

  • DOI: https://doi.org/10.1007/s00530-025-01761-1
