Abstract
As the pioneering work in Transformer-based object detection, DETR has attracted widespread attention and sparked a research trend since its inception. Its global attention mechanism is architecturally novel, but it converges very slowly and requires long training schedules to reach good performance. To address this issue, we introduce DO-DETR. Specifically, in addition to the Hungarian loss, we build a denoising module that feeds noised ground-truth (GT) bounding boxes into the decoder and trains the model to reconstruct the original boxes. This auxiliary task substantially reduces the difficulty of bipartite matching and thereby accelerates convergence. In the decoder, we design an RoI-based multi-layer recurrent processing structure that helps DETR's attention focus gradually and more accurately on foreground objects. At each processing stage, glimpse features are extracted from enlarged bounding-box regions around the RoIs predicted by the preceding stage; these glimpse features are then modeled jointly with the attention outputs of the previous stage, alleviating the difficulty of global attention modeling. With a ResNet-50 backbone, DO-DETR reaches 43.6 AP on MSCOCO in just 16 epochs, a score that vanilla DETR needs 500 epochs to match and Deformable DETR needs 50 epochs to approach; DO-DETR thus improves the convergence efficiency of Deformable DETR by 68%.
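To make the two mechanisms in the abstract concrete, below is a minimal, self-contained Python sketch of (a) noising a GT box for the denoising branch and (b) enlarging a predicted box into a glimpse region for the next decoder stage. The function names, the normalized (cx, cy, w, h) box convention, and the specific jitter magnitudes are illustrative assumptions, not the paper's exact implementation.

    import random

    def noise_box(box, center_jitter=0.1, scale_jitter=0.2):
        """Perturb one GT box (cx, cy, w, h), all values normalized to [0, 1].

        The denoising branch feeds such noised boxes into the decoder and
        trains it to reconstruct the clean box; these queries bypass the
        unstable bipartite matching. Jitter magnitudes are assumed values.
        """
        cx, cy, w, h = box
        cx += w * random.uniform(-center_jitter, center_jitter)  # shift center
        cy += h * random.uniform(-center_jitter, center_jitter)
        w *= 1.0 + random.uniform(-scale_jitter, scale_jitter)   # rescale size
        h *= 1.0 + random.uniform(-scale_jitter, scale_jitter)
        clamp = lambda v: min(max(v, 0.0), 1.0)                  # keep in [0, 1]
        return (clamp(cx), clamp(cy), clamp(w), clamp(h))

    def glimpse_region(box, expand=1.5):
        """Enlarge a box predicted at stage t; stage t+1 pools its glimpse
        features from this larger region (RoIAlign-style) and fuses them with
        the previous stage's attention output. The expand factor is assumed."""
        cx, cy, w, h = box
        return (cx, cy, min(w * expand, 1.0), min(h * expand, 1.0))

    gt = (0.50, 0.40, 0.20, 0.30)
    print(noise_box(gt))       # noised query for the denoising branch
    print(glimpse_region(gt))  # enlarged RoI for the next decoder stage

In this reading, the reconstruction loss on the noised queries sidesteps Hungarian matching entirely, which is where the abstract's claimed simplification of bipartite matching, and hence the convergence speed-up, comes from.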
Data availability
The data underlying this article are available in the COCO dataset repository and can be accessed at http://cocodataset.org. The experimental results presented in this paper are based on the COCO dataset. No additional datasets were generated during the current study. The code and data analysis scripts used to process the COCO dataset and generate the experimental results are available from the corresponding author on reasonable request.
Code availability
Some or all of the code used during the study is available on request from the corresponding author.
References
He, K., Zhang, X., Ren, S., Sun, J.: Deep residual learning for image recognition. In: Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 770–778 (2016)
Hu, J., Shen, L., Sun, G.: Squeeze-and-excitation networks. In: Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 7132–7141 (2018)
Liu, Z., Lin, Y., Cao, Y., Hu, H., Wei, Y., Zhang, Z., Lin, S., Guo, B.: Swin transformer: Hierarchical vision transformer using shifted windows. In: Proceedings of the IEEE/CVF international conference on computer vision, pp. 10012–10022 (2021)
Simonyan, K., Zisserman, A.: Very deep convolutional networks for large-scale image recognition. arXiv preprint arXiv:1409.1556 (2014)
Zhang, H., Wu, C., Zhang, Z., Zhu, Y., Lin, H., Zhang, Z., Sun, Y., He, T., Mueller, J., Manmatha, R. et al.: Resnest: Split-attention networks. In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pp. 2736–2746 (2022)
Umirzakova, S., Abdullaev, M., Mardieva, S., Latipova, N., Muksimova, S.: Simplified knowledge distillation for deep neural networks bridging the performance gap with a novel teacher-student architecture. Electronics 13(22), 4530 (2024)
Chen, C., Chen, Z., Zhang, J., Tao, D.: Sasa: Semantics-augmented set abstraction for point-based 3d object detection. In: Proceedings of the AAAI conference on artificial intelligence 36(1), 221–229 (2022)
Lin, T.-Y., Dollár, P., Girshick, R., He, K., Hariharan, B., Belongie, S.: Feature pyramid networks for object detection. In: Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 2117–2125 (2017)
Ren, S., He, K., Girshick, R., Sun, J.: Faster r-cnn: Towards real-time object detection with region proposal networks. Adv. Neural Inf. Process. Syst. 28 (2015)
Wang, T., Zhu, X., Pang, J., Lin, D.: Fcos3d: Fully convolutional one-stage monocular 3d object detection. In: Proceedings of the IEEE/CVF international conference on computer vision, pp. 913–922 (2021)
Carion, N., Massa, F., Synnaeve, G., Usunier, N., Kirillov, A., Zagoruyko, S.: End-to-end object detection with transformers. In: European conference on computer vision, pp. 213–229. Springer (2020)
Lin, T.-Y., Maire, M., Belongie, S., Hays, J., Perona, P., Ramanan, D., Dollár, P., Zitnick, C. L.: Microsoft coco: common objects in context. In: Computer Vision–ECCV 2014: 13th European Conference, Zurich, Switzerland, September 6-12, 2014, Proceedings, Part V 13, pp. 740–755. Springer (2014)
Wang, Y., Zhang, X., Yang, T., Sun, J.: Anchor detr: query design for transformer-based detector. In: Proceedings of the AAAI conference on artificial intelligence, pp. 2567–2575 (2022)
Zhu, X., Su, W., Lu, L., Li, B., Wang, X., Dai, J.: Deformable detr: Deformable transformers for end-to-end object detection. arXiv:2010.04159 (2020)
Sun, Z., Cao, S., Yang, Y., Kitani, K. M.: Rethinking transformer-based set prediction for object detection. In: Proceedings of the IEEE/CVF international conference on computer vision, pp. 3611–3620 (2021)
Liu, S., Li, F., Zhang, H., Yang, X., Qi, X., Su, H., Zhu, J., Zhang, L.: Dab-detr: Dynamic anchor boxes are better queries for detr. arXiv:2201.12329 (2022)
Dai, X., Chen, Y., Yang, J., Zhang, P., Yuan, L., Zhang, L.: Dynamic detr: End-to-end object detection with dynamic attention. In: Proceedings of the IEEE/CVF international conference on computer vision, pp. 2988–2997 (2021)
Li, Z., Hu, J., Wu, K., Miao, J., Wu, J.: Adjacent-atrous mechanism for expanding global receptive fields: an end-to-end network for multi-attribute scene analysis in remote sensing imagery. IEEE Trans. Geosci. Remote Sens. (2024)
Li, Z., Hu, J., Wu, K., Miao, J., Zhao, Z., Wu, J.: Local feature acquisition and global context understanding network for very high-resolution land cover classification. Sci. Rep. 14(1), 12597 (2024)
Li, Z., Hu, J., Wu, K., Miao, J., Wu, J.: Comprehensive attribute difference attention network for remote sensing image semantic understanding. IEEE Trans. Geosci. Remote Sens. (2024)
Meng, D., Chen, X., Fan, Z., Zeng, G., Li, H., Yuan, Y., Sun, L., Wang, J.: Conditional detr for fast training convergence. In: Proceedings of the IEEE/CVF international conference on computer vision, pp. 3651–3660 (2021)
Krizhevsky, A., Sutskever, I., Hinton, G.E.: Imagenet classification with deep convolutional neural networks. Adv. Neural Inf. Process. Syst. 25 (2012)
Girshick, R., Donahue, J., Darrell, T., Malik, J.: Region-based convolutional networks for accurate object detection and segmentation. IEEE Trans. Pattern Anal. Mach. Intell. 38(1), 142–158 (2015)
Talreja, J., Aramvith, S., Onoye, T.: Dans: deep attention network for single image super-resolution. IEEE Access 11, 84379–84397 (2023)
Talreja, J., Aramvith, S., Onoye, T.: Dhtcun: deep hybrid transformer cnn u network for single-image super-resolution. IEEE Access (2024)
Talreja, J., Aramvith, S., Onoye, T.: Xtnsr: Xception-based transformer network for single image super resolution. Compl. Intell. Syst. 11(2), 162 (2025)
Redmon, J., Divvala, S., Girshick, R., Farhadi, A.: You only look once: Unified, real-time object detection. In: Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 779–788 (2016)
Liu, W., Anguelov, D., Erhan, D., Szegedy, C., Reed, S., Fu, C.-Y., Berg, A. C.: Ssd: Single shot multibox detector. In: Computer Vision–ECCV 2016: 14th European Conference, Amsterdam, The Netherlands, October 11–14, 2016, Proceedings, Part I 14, pp. 21–37. Springer (2016)
Girshick, R.: Fast r-cnn. In: Proceedings of the IEEE international conference on computer vision, pp. 1440–1448 (2015)
Stewart, R., Andriluka, M., Ng, A. Y.: End-to-end people detection in crowded scenes. In: Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 2325–2333 (2016)
Salvador, A., Bellver, M., Campos, V., Baradad, M., Marques, F., Torres, J., Giro-i Nieto, X.: Recurrent neural networks for semantic instance segmentation. arXiv:1712.00617 (2017)
Ren, M., Zemel, R. S.: End-to-end instance segmentation with recurrent attention. In: Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 6656–6664 (2017)
Zhang, X., Wan, F., Liu, C., Ji, R., Ye, Q.: Freeanchor: Learning to match anchors for visual object detection. Adv. Neural Inf. Process. Syst. 32 (2019)
Rezatofighi, H., Tsoi, N., Gwak, J., Sadeghian, A., Reid, I., Savarese, S.: Generalized intersection over union: a metric and a loss for bounding box regression. In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pp. 658–666 (2019)
Li, F., Zhang, H., Liu, S., Guo, J., Ni, L. M., Zhang, L.: Dn-detr: accelerate detr training by introducing query denoising. In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pp. 13619–13627 (2022)
He, K., Gkioxari, G., Dollár, P., Girshick, R.: Mask r-cnn. In: Proceedings of the IEEE international conference on computer vision, pp. 2961–2969 (2017)
Yao, Z., Ai, J., Li, B., Zhang, C.: Efficient detr: improving end-to-end object detector with dense prior. arXiv:2104.01318 (2021)
Gao, Z., Wang, L., Han, B., Guo, S.: Adamixer: A fast-converging query-based object detector. In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pp. 5364–5373 (2022)
Zhang, G., Luo, Z., Huang, J., Lu, S., Xing, E. P.: Semantic-aligned matching for enhanced detr convergence and multi-scale feature fusion. Int. J. Comput. Vis., pp. 1–20 (2024)
Cai, Z., Liu, S., Wang, G., Ge, Z., Zhang, X., Huang, D.: Align-detr: Improving detr with simple iou-aware bce loss. arXiv:2304.07527 (2023)
Zhang, H., Li, F., Liu, S., Zhang, L., Su, H., Zhu, J., Ni, L. M., Shum, H.-Y.: Dino: Detr with improved denoising anchor boxes for end-to-end object detection. arXiv:2203.03605 (2022)
Zheng, D., Dong, W., Hu, H., Chen, X., Wang, Y.: Less is more: Focus attention for efficient detr. In: Proceedings of the IEEE/CVF international conference on computer vision, pp. 6674–6683 (2023)
Gao, P., Zheng, M., Wang, X., Dai, J., Li, H.: Fast convergence of detr with spatially modulated co-attention. In: Proceedings of the IEEE/CVF international conference on computer vision, pp. 3621–3630 (2021)
Zhao, Y., Lv, W., Xu, S., Wei, J., Wang, G., Dang, Q., Liu, Y., Chen, J.: Detrs beat yolos on real-time object detection. In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pp. 16965–16974 (2024)
Xie, S., Girshick, R., Dollár, P., Tu, Z., He, K.: Aggregated residual transformations for deep neural networks. In: Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 1492–1500 (2017)
Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A. N., Kaiser, Ł., Polosukhin, I.: Attention is all you need. Adv. Neural Inf. Process. Syst. 30 (2017)
Lin, T.-Y., Goyal, P., Girshick, R., He, K., Dollár, P.: Focal loss for dense object detection. In: Proceedings of the IEEE international conference on computer vision, pp. 2980–2988 (2017)
Funding
This work was supported by the National Natural Science Foundation of China (No. 61673396) and the Natural Science Foundation of Shandong Province (No. ZR2022MF260).
Author information
Contributions
HL and YL took part in conceptualization; HL and YL were involved in methodology; YL and QZ were responsible for software; HL, YL, QZ, and MS carried out formal analysis; YL wrote and prepared the original draft; HL and MS took part in writing, review and editing, funding acquisition, and supervision; HL, QZ, and MS contributed to resources.
Ethics declarations
Conflict of interest
The authors declare no conflict of interest.
Ethical approval
Not applicable.
Consent to participate
Not applicable.
Additional information
Communicated by Bing-kun Bao.
Publisher's Note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Rights and permissions
Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.
About this article
Cite this article
Liang, H., Li, Y., Zhang, Q. et al. Do-DETR: enhancing DETR training convergence with integrated denoising and RoI mechanism. Multimedia Systems 31, 171 (2025). https://doi.org/10.1007/s00530-025-01761-1