Abstract
The recently proposed DETR successfully applied the Transformer to object detection and achieved impressive results. However, its learned object queries often explore the entire image to match the corresponding regions, resulting in slow convergence. In addition, DETR uses only single-scale features from the final stage of the backbone network, which leads to poor performance on small objects. To address these issues, we propose an effective training strategy for improving the DETR framework, named PMG-DETR, built on Position-sensitive Multi-scale attention and Grouped queries. First, to better fuse multi-scale features, we propose a position-sensitive multi-scale attention: by incorporating a spatial sampling strategy into deformable attention, we further improve small object detection. Second, we extend the attention mechanism with a novel positional encoding scheme. Finally, we propose a grouping strategy for object queries, where queries are grouped at the decoder side to cover regions of interest more precisely and to accelerate convergence. Extensive experiments on the COCO dataset show that PMG-DETR outperforms DETR, e.g., reaching 47.8% AP with a ResNet-50 backbone trained for 50 epochs. Ablation studies on COCO validate the effectiveness of each proposed component.
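The grouping strategy mentioned above restricts how decoder object queries interact: each query attends only to the other queries in its own group. The sketch below is a minimal illustration of that general idea, not the authors' implementation; the function names, the equal-size split into groups, and the plain softmax attention are our assumptions.

```python
import numpy as np

def grouped_self_attention_mask(num_queries, num_groups):
    """Block-diagonal boolean mask: True where query i may attend to query j.

    Queries are split into equal-size contiguous groups (an assumption for
    illustration); each query attends only within its own group.
    """
    assert num_queries % num_groups == 0, "queries must divide evenly into groups"
    size = num_queries // num_groups
    mask = np.full((num_queries, num_queries), False)
    for g in range(num_groups):
        s = g * size
        mask[s:s + size, s:s + size] = True
    return mask

def grouped_attention(q, k, v, num_groups):
    """Softmax self-attention restricted by the group mask."""
    n, d = q.shape
    mask = grouped_self_attention_mask(n, num_groups)
    scores = q @ k.T / np.sqrt(d)
    scores = np.where(mask, scores, -np.inf)       # forbid cross-group attention
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ v
```

Because the mask is block-diagonal, the output for a query depends only on the value vectors of its own group, which is what lets each group specialize on a subset of regions of interest.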







Data Availability
The datasets generated and analysed during the current study are available in COCO (https://cocodataset.org) and Cityscapes (https://www.cityscapes-dataset.com) repositories.
Funding
This work was supported by the Hunan Provincial Natural Science Foundation of China (2023JJ50096, 2022JJ50016), the Science and Technology Plan Project of Hunan Province (2016TP1020), the “14th Five-Year Plan” Key Disciplines and Application-oriented Special Disciplines of Hunan Province (Xiangjiaotong [2022] 351).
Author information
Authors and Affiliations
Contributions
Shuming Cui and Hongwei Deng were responsible for material preparation, data collection, and analysis. Shuming Cui wrote the main manuscript text, and all authors commented on previous versions of the manuscript. All authors read and approved the final manuscript.
Corresponding author
Ethics declarations
Conflict of interest
The authors have no conflicts of interest to declare that are relevant to the content of this article.
Additional information
Publisher's Note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Rights and permissions
Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.
About this article
Cite this article
Cui, S., Deng, H. PMG-DETR: fast convergence of DETR with position-sensitive multi-scale attention and grouped queries. Pattern Anal Applic 27, 58 (2024). https://doi.org/10.1007/s10044-024-01281-0