Abstract
With the progress of deep learning, object detection has attracted great attention in the computer vision community. A key challenge in object detection is that object scale usually varies over a large range, which can cause existing detectors to fail in real applications. To address this problem, we propose a novel end-to-end Attention Feature Pyramid Transformer Network (AFPN) framework that learns object detectors from multi-scale feature maps in a transformer encoder-decoder fashion. AFPN learns to aggregate pyramid feature maps with attention mechanisms. Specifically, transformer-based attention blocks scan through each spatial location of the feature maps in the same pyramid layer and update it by aggregating information from deep to shallow layers. Furthermore, inter-level feature aggregation and intra-level information attention are repeated to encode multi-scale and self-attention feature representations. Extensive experiments on the challenging MS COCO object detection dataset demonstrate that the proposed AFPN outperforms its baseline methods, i.e., DETR and Faster R-CNN, and achieves state-of-the-art results.
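The deep-to-shallow aggregation described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: it assumes single-head scaled dot-product attention with no learned projections, no positional encoding, and a plain residual update, and the names `cross_level_attention` and `aggregate_pyramid` are hypothetical. Each pyramid level is represented as a list of feature vectors, one per flattened spatial location.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def dot(a, b):
    """Dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def cross_level_attention(shallow, deep):
    """Update every shallow-level location by attending over a deeper level.

    shallow: list of N feature vectors (queries, higher resolution)
    deep:    list of M feature vectors (keys and values, lower resolution)
    Assumption: keys and values are the raw deep features (no learned weights).
    """
    dim = len(shallow[0])
    scale = 1.0 / math.sqrt(dim)
    updated = []
    for query in shallow:
        # Attention weights of this location over all deep-level locations.
        weights = softmax([dot(query, key) * scale for key in deep])
        # Weighted sum of deep features, one channel at a time.
        context = [
            sum(w * value[c] for w, value in zip(weights, deep))
            for c in range(dim)
        ]
        # Residual update of the shallow feature.
        updated.append([q + ctx for q, ctx in zip(query, context)])
    return updated


def aggregate_pyramid(levels):
    """Top-down pass over levels ordered shallow -> deep.

    The deepest level is left unchanged; each shallower level attends
    to the already-updated level directly above it.
    """
    out = list(levels)
    for i in range(len(out) - 2, -1, -1):
        out[i] = cross_level_attention(out[i], out[i + 1])
    return out


if __name__ == "__main__":
    pyramid = [
        [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [0.0, 0.0]],  # shallow: 4 locations
        [[1.0, 2.0]],                                       # deep: 1 location
    ]
    for level in aggregate_pyramid(pyramid):
        print(level)
```

Calling `cross_level_attention(level, level)` with the same level as both arguments gives a simplified form of the intra-level attention the abstract mentions; in this sketch, alternating the two calls would stand in for the repeated inter-level and intra-level steps.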
Acknowledgements
This research is supported by the National Natural Science Foundation of China (No. 62072386, No. 61866019); the Yunnan Provincial Major Science and Technology Special Plan Projects (No. 202103AA080015, No. 202002AD080001); the Program for Applied & Basic Research of Yunnan Province in Key Areas (No. 2019FA023); and the Candidates of the Young and Middle Aged Academic and Technical Leaders of Yunnan Province (No. 2019HB006).
Cite this article
Li, L., Zheng, C., Mao, C. et al. Scale-Insensitive Object Detection via Attention Feature Pyramid Transformer Network. Neural Process Lett 54, 581–595 (2022). https://doi.org/10.1007/s11063-021-10645-0
DOI: https://doi.org/10.1007/s11063-021-10645-0