Scale-Insensitive Object Detection via Attention Feature Pyramid Transformer Network

Li, Lingling; Zheng, Changwen; Mao, Cunli; Deng, Haibo; Jin, Taisong

doi:10.1007/s11063-021-10645-0

Published: 19 October 2021

Volume 54, pages 581–595, (2022)

Neural Processing Letters

Abstract

With the progress of deep learning, object detection has attracted great attention in the computer vision community. A key challenge in object detection is that object scale usually varies over a large range, which can cause existing detectors to fail in real applications. To address this problem, we propose a novel end-to-end Attention Feature Pyramid Transformer Network (AFPN) framework that learns object detectors from multi-scale feature maps in a transformer encoder-decoder fashion. AFPN learns to aggregate pyramid feature maps with attention mechanisms. Specifically, transformer-based attention blocks scan each spatial location of the feature maps in the same pyramid layer and update it by aggregating information from deep to shallow layers. Furthermore, inter-level feature aggregation and intra-level information attention are repeated to encode a multi-scale, self-attention feature representation. Extensive experiments on the challenging MS COCO object detection dataset demonstrate that the proposed AFPN outperforms its baseline methods, i.e., DETR and Faster R-CNN, and achieves state-of-the-art results.
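
The deep-to-shallow aggregation described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the full text is not available here, so the sketch assumes plain scaled dot-product attention without learned query/key/value projections, a residual update, and pyramid levels flattened into lists of feature vectors. The function names `cross_level_attention` and `aggregate_pyramid` are hypothetical.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def cross_level_attention(shallow, deep):
    """Update each shallow-level location by attending over deep-level locations.

    shallow: list of N feature vectors (used as queries), each of length d
    deep:    list of M feature vectors (used as keys and values), each of length d
    Returns N vectors: shallow[i] plus the attention-weighted sum of deep vectors.
    Assumption: no learned projections; a real transformer block would add them.
    """
    d = len(shallow[0])
    scale = 1.0 / math.sqrt(d)
    updated = []
    for q in shallow:
        # Scaled dot-product score between this location and every deep location.
        scores = [scale * sum(qi * ki for qi, ki in zip(q, k)) for k in deep]
        weights = softmax(scores)
        # Weighted sum of deep-level vectors (the aggregated context).
        context = [sum(w * v[c] for w, v in zip(weights, deep)) for c in range(d)]
        # Residual update of the shallow-level location.
        updated.append([qi + ci for qi, ci in zip(q, context)])
    return updated


def aggregate_pyramid(levels):
    """Top-down pass over a pyramid.

    levels[0] is the shallowest (highest-resolution) level and levels[-1] the
    deepest. Each level is refined using the already-refined level below it.
    """
    out = list(levels)
    for i in range(len(out) - 2, -1, -1):
        out[i] = cross_level_attention(out[i], out[i + 1])
    return out


if __name__ == "__main__":
    # Toy pyramid: four locations at the shallow level, one at the deep level.
    pyramid = [
        [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [0.5, 0.5]],
        [[2.0, 3.0]],
    ]
    for level in aggregate_pyramid(pyramid):
        print(level)
```

In this sketch the deepest level passes through unchanged, and every shallower level keeps its spatial size while receiving information from the level below it. The intra-level self-attention mentioned in the abstract could be approximated by calling `cross_level_attention(level, level)`, again only as an assumption about the general technique.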

Acknowledgements

This research is supported by the National Natural Science Foundation of China (No. 62072386, No. 61866019); the Yunnan provincial major science and technology special plan projects (No. 202103AA080015, No. 202002AD080001); the Program for Applied & Basic Research of Yunnan Province in Key Areas (No. 2019FA023); and the Candidates of the Young and Middle Aged Academic and Technical Leaders of Yunnan Province (No. 2019HB006).

Author information

Authors and Affiliations

School of Intelligent Engineering, Zhengzhou University of Aeronautics, Zhengzhou, 450046, China
Lingling Li
Institute of Software, Chinese Academy of Sciences, Beijing, 100190, China
Changwen Zheng
Yunnan Key Laboratory of Artificial Intelligence, Faculty of Information Engineering and Automation, Kunming University of Science and Technology, Kunming, 650500, China
Cunli Mao
Beijing Zhonghangzhi Technology Co., Ltd., Beijing, 100176, China
Haibo Deng
School of Informatics, Xiamen University, Xiamen, 361005, China
Taisong Jin

Corresponding author

Correspondence to Cunli Mao.

About this article

Cite this article

Li, L., Zheng, C., Mao, C. et al. Scale-Insensitive Object Detection via Attention Feature Pyramid Transformer Network. Neural Process Lett 54, 581–595 (2022). https://doi.org/10.1007/s11063-021-10645-0

Accepted: 09 September 2021
Published: 19 October 2021
Issue Date: February 2022
DOI: https://doi.org/10.1007/s11063-021-10645-0
