Abstract
Most detection models employ many detection heads that output their predictions independently. However, the locality of convolutional neural networks (CNNs) causes adjacent convolution kernels to extract very similar features, which leads to duplicate predictions. To remove these duplicates, the hand-designed non-maximum suppression (NMS) procedure was introduced. However, NMS cannot be applied in certain scenarios, such as crowded scenes, and requires careful tuning of hyper-parameters. End-to-end training is therefore necessary to improve detection across a wider range of scenarios. To this end, we propose a model that enables the network to adaptively identify duplicate objects and output non-repetitive results, effectively replacing the hand-designed NMS procedure. By adding differentiated priors to image features and using Multi-Head Attention to enhance global communication between features, our model detects objects in an end-to-end manner. It can be easily applied to traditional one-stage detectors, e.g., FCOS and RetinaNet. While achieving fast convergence and a high recall rate, its accuracy is also significantly better than the baselines and outperforms many one-stage and two-stage methods. It also achieves performance comparable to traditional detectors on the dense-scene dataset CrowdHuman. Evaluation results demonstrate that our model with ResNet-50 achieves 40.5% \({\mathrm{AP}}\) on the COCO dataset and 89.2% \({\mathrm{AP}}_{50}\) on the CrowdHuman dataset.
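The core mechanism described above — injecting differentiated priors into otherwise-similar local features, then letting Multi-Head Attention exchange information globally so duplicates can be suppressed — can be illustrated with a minimal numpy sketch. This is not the authors' RESC implementation; the function name, the random projection weights, and the way priors are added are illustrative assumptions, and the example only shows the generic scaled dot-product multi-head self-attention pattern over flattened per-location feature vectors.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def multi_head_attention(feats, priors, num_heads=4, seed=0):
    """Global self-attention over per-location feature vectors.

    feats:  (N, D) image features, one vector per spatial location.
    priors: (N, D) differentiated priors added to the features, so that
            near-identical neighbouring features become distinguishable
            before they communicate globally (illustrative stand-in for
            the priors described in the abstract).
    Returns the attended features (N, D) and attention maps (H, N, N).
    """
    n, d = feats.shape
    assert d % num_heads == 0
    dh = d // num_heads
    x = feats + priors  # inject the differentiated priors

    # Random query/key/value projections (learned weights in practice).
    rng = np.random.default_rng(seed)
    wq, wk, wv = (rng.standard_normal((d, d)) / np.sqrt(d) for _ in range(3))

    # Split D into num_heads heads of width dh: (H, N, dh).
    q = (x @ wq).reshape(n, num_heads, dh).transpose(1, 0, 2)
    k = (x @ wk).reshape(n, num_heads, dh).transpose(1, 0, 2)
    v = (x @ wv).reshape(n, num_heads, dh).transpose(1, 0, 2)

    # Scaled dot-product attention: every location attends to every other,
    # which is the global communication that local convolutions lack.
    attn = softmax(q @ k.transpose(0, 2, 1) / np.sqrt(dh))  # (H, N, N)
    out = (attn @ v).transpose(1, 0, 2).reshape(n, d)
    return out, attn
```

Note that without the `priors` term, identical neighbouring feature vectors would produce identical attention rows and identical outputs; the priors are what break this symmetry so duplicate detections can be told apart.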
Acknowledgements
We thank many colleagues at Zhejiang University for their help, in particular Dr. Guanghao Ying and Dr. Binling Nie for insightful discussions.
Funding
This study was funded by Center for Balance Architecture, Zhejiang University.
Ethics declarations
Conflict of interest
The authors declare that they have no conflict of interest.
Cite this article
Wang, H., Jiang, R., Xu, J. et al. RESC: REfine the SCore with adaptive transformer head for end-to-end object detection. Neural Comput & Applic 34, 12017–12028 (2022). https://doi.org/10.1007/s00521-022-07089-5