Abstract
As the pioneering work in Transformer-based object detection, DETR has attracted widespread attention and sparked a research trend since its inception. Its global attention mechanism is architecturally novel, but it converges very slowly and requires long training schedules to reach good performance. To address this issue, we introduce DO-DETR. Specifically, in addition to the Hungarian loss, we build a denoising module that feeds noised ground-truth (GT) bounding boxes into the decoder and trains the model to reconstruct the original boxes. This auxiliary task substantially reduces the difficulty of bipartite matching and thereby accelerates convergence. In the decoder, we design an RoI-based multi-layer recurrent processing structure that helps DETR's attention focus gradually and more accurately on foreground objects. At each processing stage, glimpse features are extracted from enlarged bounding-box regions around the RoIs predicted by the preceding stage; these glimpse features are then modeled jointly with the attention outputs of the previous stage, alleviating the difficulty of global attention modeling. With a ResNet-50 backbone, DO-DETR reaches 43.6 AP on MSCOCO in just 16 epochs, a score that vanilla DETR needs 500 epochs to match and Deformable DETR needs 50 epochs to approach; DO-DETR thus improves the convergence efficiency of Deformable DETR by 68%.
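To make the two mechanisms in the abstract concrete, below is a minimal, self-contained Python sketch of (a) noising a GT box for the denoising branch and (b) enlarging a predicted box into a glimpse region for the next decoder stage. The function names, the normalized (cx, cy, w, h) box convention, and the specific jitter magnitudes are illustrative assumptions, not the paper's exact implementation.

    import random

    def noise_box(box, center_jitter=0.1, scale_jitter=0.2):
        """Perturb one GT box (cx, cy, w, h), all values normalized to [0, 1].

        The denoising branch feeds such noised boxes into the decoder and
        trains it to reconstruct the clean box; these queries bypass the
        unstable bipartite matching. Jitter magnitudes are assumed values.
        """
        cx, cy, w, h = box
        cx += w * random.uniform(-center_jitter, center_jitter)  # shift center
        cy += h * random.uniform(-center_jitter, center_jitter)
        w *= 1.0 + random.uniform(-scale_jitter, scale_jitter)   # rescale size
        h *= 1.0 + random.uniform(-scale_jitter, scale_jitter)
        clamp = lambda v: min(max(v, 0.0), 1.0)                  # keep in [0, 1]
        return (clamp(cx), clamp(cy), clamp(w), clamp(h))

    def glimpse_region(box, expand=1.5):
        """Enlarge a box predicted at stage t; stage t+1 pools its glimpse
        features from this larger region (RoIAlign-style) and fuses them with
        the previous stage's attention output. The expand factor is assumed."""
        cx, cy, w, h = box
        return (cx, cy, min(w * expand, 1.0), min(h * expand, 1.0))

    gt = (0.50, 0.40, 0.20, 0.30)
    print(noise_box(gt))       # noised query for the denoising branch
    print(glimpse_region(gt))  # enlarged RoI for the next decoder stage

In this reading, the reconstruction loss on the noised queries sidesteps Hungarian matching entirely, which is where the abstract's claimed simplification of bipartite matching, and hence the convergence speed-up, comes from.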
Data availability
The data underlying this article are available in the COCO dataset repository and can be accessed at http://cocodataset.org. The experimental results presented in this paper are based on the COCO dataset. No additional datasets were generated during the current study. The code and data analysis scripts used to process the COCO dataset and generate the experimental results are available from the corresponding author on reasonable request.
Code availability
Some or all of the code used during the study is available on request from the corresponding author.
References
He, K., Zhang, X., Ren, S., Sun, J.: Deep residual learning for image recognition. In: Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 770–778 (2016)
Hu, J., Shen, L., Sun, G.: Squeeze-and-excitation networks. In: Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 7132–7141 (2018)
Liu, Z., Lin, Y., Cao, Y., Hu, H., Wei, Y., Zhang, Z., Lin, S., Guo, B.: Swin transformer: Hierarchical vision transformer using shifted windows. In: Proceedings of the IEEE/CVF international conference on computer vision, pp. 10012–10022 (2021)
Simonyan, K., Zisserman, A.: Very deep convolutional networks for large-scale image recognition. arXiv preprint arXiv:1409.1556 (2014)
Zhang, H., Wu, C., Zhang, Z., Zhu, Y., Lin, H., Zhang, Z., Sun, Y., He, T., Mueller, J., Manmatha, R. et al.: Resnest: Split-attention networks. In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pp. 2736–2746 (2022)
Umirzakova, S., Abdullaev, M., Mardieva, S., Latipova, N., Muksimova, S.: Simplified knowledge distillation for deep neural networks bridging the performance gap with a novel teacher-student architecture. Electronics 13(22), 4530 (2024)
Chen, C., Chen, Z., Zhang, J., Tao, D.: Sasa: Semantics-augmented set abstraction for point-based 3d object detection. In: Proceedings of the AAAI conference on artificial intelligence 36(1), 221–229 (2022)
Lin, T.-Y., Dollár, P., Girshick, R., He, K., Hariharan, B., Belongie, S.: Feature pyramid networks for object detection. In: Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 2117–2125 (2017)
Ren, S., He, K., Girshick, R., Sun, J.: Faster r-cnn: Towards real-time object detection with region proposal networks. Adv. Neural Inf. Process. Syst. 28 (2015)
Wang, T., Zhu, X., Pang, J., Lin, D.: Fcos3d: Fully convolutional one-stage monocular 3d object detection. In: Proceedings of the IEEE/CVF international conference on computer vision, pp. 913–922 (2021)
Carion, N., Massa, F., Synnaeve, G., Usunier, N., Kirillov, A., Zagoruyko, S.: End-to-end object detection with transformers. In: European conference on computer vision, pp. 213–229. Springer (2020)
Lin, T.-Y., Maire, M., Belongie, S., Hays, J., Perona, P., Ramanan, D., Dollár, P., Zitnick, C. L.: Microsoft coco: common objects in context. In: Computer Vision–ECCV 2014: 13th European Conference, Zurich, Switzerland, September 6-12, 2014, Proceedings, Part V 13, pp. 740–755. Springer (2014)
Wang, Y., Zhang, X., Yang, T., Sun, J.: Anchor detr: query design for transformer-based detector. In: Proceedings of the AAAI conference on artificial intelligence, pp. 2567–2575 (2022)
Zhu, X., Su, W., Lu, L., Li, B., Wang, X., Dai, J.: Deformable detr: Deformable transformers for end-to-end object detection. arXiv:2010.04159 (2020)
Sun, Z., Cao, S., Yang, Y., Kitani, K. M.: Rethinking transformer-based set prediction for object detection. In: Proceedings of the IEEE/CVF international conference on computer vision, pp. 3611–3620 (2021)
Liu, S., Li, F., Zhang, H., Yang, X., Qi, X., Su, H., Zhu, J., Zhang, L.: Dab-detr: Dynamic anchor boxes are better queries for detr. arXiv:2201.12329 (2022)
Dai, X., Chen, Y., Yang, J., Zhang, P., Yuan, L., Zhang, L.: Dynamic detr: End-to-end object detection with dynamic attention. In: Proceedings of the IEEE/CVF international conference on computer vision, pp. 2988–2997 (2021)
Li, Z., Hu, J., Wu, K., Miao, J., Wu, J.: Adjacent-atrous mechanism for expanding global receptive fields: an end-to-end network for multi-attribute scene analysis in remote sensing imagery. IEEE Trans. Geosci. Remote Sens. (2024)
Li, Z., Hu, J., Wu, K., Miao, J., Zhao, Z., Wu, J.: Local feature acquisition and global context understanding network for very high-resolution land cover classification. Sci. Rep. 14(1), 12597 (2024)
Li, Z., Hu, J., Wu, K., Miao, J., Wu, J.: Comprehensive attribute difference attention network for remote sensing image semantic understanding. IEEE Trans. Geosci. Remote Sens. (2024)
Meng, D., Chen, X., Fan, Z., Zeng, G., Li, H., Yuan, Y., Sun, L., Wang, J.: Conditional detr for fast training convergence. In: Proceedings of the IEEE/CVF international conference on computer vision, pp. 3651–3660 (2021)
Krizhevsky, A., Sutskever, I., Hinton, G.E.: Imagenet classification with deep convolutional neural networks. Adv. Neural Inf. Process. Syst. 25 (2012)
Girshick, R., Donahue, J., Darrell, T., Malik, J.: Region-based convolutional networks for accurate object detection and segmentation. IEEE Trans. Pattern Anal. Mach. Intell. 38(1), 142–158 (2015)
Talreja, J., Aramvith, S., Onoye, T.: Dans: deep attention network for single image super-resolution. IEEE Access 11, 84379–84397 (2023)
Talreja, J., Aramvith, S., Onoye, T.: Dhtcun: deep hybrid transformer cnn u network for single-image super-resolution. IEEE Access (2024)
Talreja, J., Aramvith, S., Onoye, T.: Xtnsr: Xception-based transformer network for single image super resolution. Compl. Intell. Syst. 11(2), 162 (2025)
Redmon, J., Divvala, S., Girshick, R., Farhadi, A.: You only look once: Unified, real-time object detection. In: Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 779–788 (2016)
Liu, W., Anguelov, D., Erhan, D., Szegedy, C., Reed, S., Fu, C.-Y., Berg, A. C.: Ssd: Single shot multibox detector. In: Computer Vision–ECCV 2016: 14th European Conference, Amsterdam, The Netherlands, October 11–14, 2016, Proceedings, Part I 14, pp. 21–37. Springer (2016)
Girshick, R.: Fast r-cnn. In: Proceedings of the IEEE international conference on computer vision, pp. 1440–1448 (2015)
Stewart, R., Andriluka, M., Ng, A. Y.: End-to-end people detection in crowded scenes. In: Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 2325–2333 (2016)
Salvador, A., Bellver, M., Campos, V., Baradad, M., Marques, F., Torres, J., Giro-i Nieto, X.: Recurrent neural networks for semantic instance segmentation. arXiv:1712.00617 (2017)
Ren, M., Zemel, R. S.: End-to-end instance segmentation with recurrent attention. In: Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 6656–6664 (2017)
Zhang, X., Wan, F., Liu, C., Ji, R., Ye, Q.: Freeanchor: Learning to match anchors for visual object detection. Adv. Neural Inf. Process. Syst. 32 (2019)
Rezatofighi, H., Tsoi, N., Gwak, J., Sadeghian, A., Reid, I., Savarese, S.: Generalized intersection over union: a metric and a loss for bounding box regression. In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pp. 658–666 (2019)
Li, F., Zhang, H., Liu, S., Guo, J., Ni, L. M., Zhang, L.: Dn-detr: accelerate detr training by introducing query denoising. In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pp. 13619–13627 (2022)
He, K., Gkioxari, G., Dollár, P., Girshick, R.: Mask r-cnn. In: Proceedings of the IEEE international conference on computer vision, pp. 2961–2969 (2017)
Yao, Z., Ai, J., Li, B., Zhang, C.: Efficient detr: improving end-to-end object detector with dense prior. arXiv:2104.01318 (2021)
Gao, Z., Wang, L., Han, B., Guo, S.: Adamixer: A fast-converging query-based object detector. In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pp. 5364–5373 (2022)
Zhang, G., Luo, Z., Huang, J., Lu, S., Xing, E. P.: Semantic-aligned matching for enhanced detr convergence and multi-scale feature fusion. Int. J. Comput. Vis., pp. 1–20 (2024)
Cai, Z., Liu, S., Wang, G., Ge, Z., Zhang, X., Huang, D.: Align-detr: Improving detr with simple iou-aware bce loss. arXiv:2304.07527 (2023)
Zhang, H., Li, F., Liu, S., Zhang, L., Su, H., Zhu, J., Ni, L. M., Shum, H.-Y.: Dino: Detr with improved denoising anchor boxes for end-to-end object detection. arXiv:2203.03605 (2022)
Zheng, D., Dong, W., Hu, H., Chen, X., Wang, Y.: Less is more: Focus attention for efficient detr. In: Proceedings of the IEEE/CVF international conference on computer vision, pp. 6674–6683 (2023)
Gao, P., Zheng, M., Wang, X., Dai, J., Li, H.: Fast convergence of detr with spatially modulated co-attention. In: Proceedings of the IEEE/CVF international conference on computer vision, pp. 3621–3630 (2021)
Zhao, Y., Lv, W., Xu, S., Wei, J., Wang, G., Dang, Q., Liu, Y., Chen, J.: Detrs beat yolos on real-time object detection. In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pp. 16965–16974 (2024)
Xie, S., Girshick, R., Dollár, P., Tu, Z., He, K.: Aggregated residual transformations for deep neural networks. In: Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 1492–1500 (2017)
Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A. N., Kaiser, Ł., Polosukhin, I.: Attention is all you need. Adv. Neural Inf. Process. Syst. 30 (2017)
Lin, T.-Y., Goyal, P., Girshick, R., He, K., Dollár, P.: Focal loss for dense object detection. In: Proceedings of the IEEE international conference on computer vision, pp. 2980–2988 (2017)
Funding
This work was supported by the National Natural Science Foundation of China (No. 61673396) and the Natural Science Foundation of Shandong Province (No. ZR2022MF260).
Author information
Contributions
HL and YL took part in conceptualization; HL and YL were involved in methodology; YL and QZ were responsible for software; HL, YL, QZ, and MS carried out formal analysis; YL wrote and prepared the original draft; HL and MS took part in writing, review and editing, funding acquisition, and supervision; HL, QZ, and MS contributed to resources.
Ethics declarations
Conflict of interest
The authors declare no conflict of interest.
Ethical approval
Not applicable.
Consent to participate
Not applicable.
Additional information
Communicated by Bing-kun Bao.
Publisher's Note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Rights and permissions
Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.
About this article
Cite this article
Liang, H., Li, Y., Zhang, Q. et al. Do-DETR: enhancing DETR training convergence with integrated denoising and RoI mechanism. Multimedia Systems 31, 171 (2025). https://doi.org/10.1007/s00530-025-01761-1