Abstract
Despite the impressive performance of recent state-of-the-art detectors, small-target detection, scale variation, and label ambiguity remain challenging. To tackle these issues, we present a coordinate-based anchor-free (CBAF) module for object detection. It can serve as a branch of a single-shot detector (e.g., RetinaNet or SSD) or predict class probabilities and box coordinates directly. The main idea of the CBAF module is to predict the category and the box adjustments of an object from a part feature together with its contextual part features, which are obtained by dividing the feature maps along their spatial coordinates. This is inspired by the fact that human beings can infer an entire object by observing a part of it together with its surrounding environment. During both training and testing, the CBAF module encodes and decodes boxes in an anchor-free manner on each feature map at its resolution. During training, we first use the proposed spatial coordinate partition layer to divide the feature maps into parts of size n × n, and then apply the proposed contextual building layer to fuse each part with its contextual parts. We demonstrate the CBAF module through a concrete implementation. Working in conjunction with the anchor-based RetinaNet, the CBAF module improves AP scores with almost no additional computation. Experimental results on the MS-COCO dataset show that the CBAF module alone improves mAP by 1.1% over RetinaNet, and by 2.2% when combined with the anchor-based RetinaNet.
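The two layers described in the abstract can be illustrated with a small sketch. The following is a minimal NumPy illustration, not the authors' implementation: it assumes an (H, W, C) feature map whose sides are divisible by n, splits it into a grid of n × n parts (the spatial coordinate partition layer), and then fuses each part with its eight spatial neighbours by channel concatenation, zero-padding at the borders (one plausible reading of the contextual building layer). Function names are hypothetical.

```python
import numpy as np

def partition_parts(fmap, n):
    """Split an (H, W, C) feature map into a grid of n x n parts.
    Assumes H and W are divisible by n."""
    H, W, C = fmap.shape
    gh, gw = H // n, W // n
    # Result shape (gh, gw, n, n, C): parts[i, j] is the part
    # covering rows i*n:(i+1)*n and columns j*n:(j+1)*n.
    return fmap.reshape(gh, n, gw, n, C).transpose(0, 2, 1, 3, 4)

def build_context(parts):
    """Fuse every part with its 8 spatial neighbours (zero-padded
    at the borders) by concatenation along the channel axis."""
    gh, gw, n, _, C = parts.shape
    padded = np.zeros((gh + 2, gw + 2, n, n, C), dtype=parts.dtype)
    padded[1:-1, 1:-1] = parts
    fused = []
    for di in range(3):          # 3 x 3 neighbourhood, row offset
        for dj in range(3):      # column offset
            fused.append(padded[di:di + gh, dj:dj + gw])
    # Result shape (gh, gw, n, n, 9 * C): each grid cell now holds
    # its own part plus the features of its contextual parts.
    return np.concatenate(fused, axis=-1)

fmap = np.random.rand(8, 8, 4).astype(np.float32)
parts = partition_parts(fmap, n=2)      # shape (4, 4, 2, 2, 4)
context = build_context(parts)          # shape (4, 4, 2, 2, 36)
```

In a real detector, the fused tensor would then feed the classification and box-regression heads; a learned fusion (e.g., a convolution over the concatenated channels) would replace the plain concatenation used here for clarity.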
References
Bai Y, Zhang Y, Ding M, Ghanem B (2018) Finding tiny faces in the wild with generative adversarial network. In: CVPR
Bell S, Lawrence Zitnick C, Bala K, Girshick R (2016) Inside-outside net: detecting objects in context with skip pooling and recurrent neural networks. In: Proceedings of the IEEE conference on computer vision and pattern recognition, pp 2874–2883
Bhagavatula C, Zhu C, Luu K, Savvides M (2017) Faster than real-time facial alignment: a 3d spatial transformer network approach in unconstrained poses. In: The IEEE international conference on computer vision (ICCV)
Bochkovskiy A, Wang CY, Liao HYM (2020) Yolov4: optimal speed and accuracy of object detection. arXiv:2004.10934
Cai Z, Vasconcelos N (2017) Cascade r-cnn: delving into high quality object detection. arXiv:1712.00726
Cai Z, Fan Q, Feris RS, Vasconcelos N (2016) A unified multi-scale deep convolutional neural network for fast object detection. In: European conference on computer vision. Springer, pp 354–370
Carion N, Massa F, Synnaeve G, Usunier N, Kirillov A, Zagoruyko S (2020) End-to-end object detection with transformers. arXiv:2005.12872v3 (ECCV)
Dai J, Li Y, He K, Sun J (2016) R-fcn: object detection via region-based fully convolutional networks. In: Advances in neural information processing systems, pp 379–387
Deng J, Dong W, Socher R, Li L-J, Li K, Fei-Fei L (2009) Imagenet: a large-scale hierarchical image database. In: CVPR 2009. IEEE conference on computer vision and pattern recognition. IEEE, pp 248–255
Duan K, Bai S, Xie L, Qi H, Huang Q, Tian Q (2019) CenterNet: keypoint triplets for object detection. In: 2019 IEEE/CVF international conference on computer vision (ICCV)
Duan K, Xie L, Qi H, Bai S, Huang Q, Tian Q (2020) Corner proposal network for anchor-free, two-stage object detection. arXiv:2007.13816v1 (ECCV)
Everingham M, Van Gool L, Williams CKI, Winn J, Zisserman A (2007) The PASCAL visual object classes challenge 2007 (VOC2007) results. Available: http://www.pascalnetwork.org/challenges/VOC/voc2007/workshop/index.html
Fu C-Y, Liu W, Ranga A, Tyagi A, Berg AC (2017) Dssd: deconvolutional single shot detector. arXiv:1701.06659
Gidaris S, Komodakis N (2015) Object detection via a multi-region and semantic segmentation-aware cnn model. In: Proceedings of the IEEE international conference on computer vision, pp 1134–1142
Girshick R (2015) Fast r-cnn. In: Proceedings of the IEEE international conference on computer vision, pp 1440–1448
Girshick R, Donahue J, Darrell T, Malik J (2014) Rich feature hierarchies for accurate object detection and semantic segmentation. In: Proceedings of the IEEE conference on computer vision and pattern recognition, pp 580–587
Hariharan B, Arbelaez P, Girshick R, Malik J (2015) Hypercolumns for object segmentation and fine-grained localization. In: Proceedings of the IEEE conference on computer vision and pattern recognition, pp 447–456
He K, Zhang X, Ren S, Sun J (2015) Spatial pyramid pooling in deep convolutional networks for visual recognition. IEEE Trans Pattern Anal Mach Intell 37(9):1904–1916
He K, Gkioxari G, Dollar P, Girshick R (2017) Mask r-cnn. In: Proceedings of the IEEE international conference on computer vision, pp 2961–2969
Huang Y, Dai Q, Lu Y (2019) Decoupling localization and classification in single shot temporal action detection. In: 2019 IEEE international conference on multimedia and expo (ICME)
Kong T, Sun F, Liu H, Jiang Y, Shi J (2020) FoveaBox: beyond anchor-based object detection. IEEE Trans Image Process 29:7389–7398. https://doi.org/10.1109/TIP.2020.3002345
Law H, Deng J (2018) Cornernet: detecting objects as paired keypoints. In: Proceedings of the European conference on computer vision, pp 734–750
Li Z, Peng C, Yu G, Zhang X, Deng Y, Sun J (2018) Detnet: a backbone network for object detection. arXiv:1804.06215
Li Y, Chen Y, Wang N, Zhang Z (2019) Scale-aware trident networks for object detection. arXiv:1901.01892
Liang X, Wang T, Yang L, Xing E (2018) Cirl: controllable imitative reinforcement learning for vision-based self-driving. arXiv:1807.03776
Lin T-Y, Maire M, Belongie S, Hays J, Perona P, Ramanan D, Dollar P, Zitnick CL (2014) Microsoft coco: common objects in context. In: European conference on computer vision. Springer, Cham, pp 740–755
Lin T, Zhao X, Shou Z (2017) Single shot temporal action detection. In: ACM MM. ACM
Lin T-Y, Dollar P, Girshick RB, He K, Hariharan B, Belongie SJ (2017) Feature pyramid networks for object detection. In: CVPR
Lin T-Y, Goyal P, Girshick R, He K, Dollar P (2018) Focal loss for dense object detection. IEEE Trans Pattern Anal Mach Intell
Liu W, Anguelov D, Erhan D, Szegedy C, Reed S, Fu CY, Berg AC (2016) Ssd: single shot multibox detector. In: European conference on computer vision. Springer, pp 21–37
Lu X, Li B, Yue Y, Li Q, Yan J (2019) Grid R-CNN. In: 2019 IEEE/CVF conference on computer vision and pattern recognition (CVPR), pp 7355–7364. https://doi.org/10.1109/CVPR.2019.00754
Redmon J, Farhadi A (2017) Yolo9000: better, faster, stronger. In: Proceedings of the IEEE conference on computer vision and pattern recognition, pp 7263–7271
Redmon J, Farhadi A (2018) YOLOv3: an incremental improvement. arXiv:1804.02767
Redmon J, Divvala S, Girshick R, Farhadi A (2016) You only look once: unified, real-time object detection. In: Proceedings of the IEEE conference on computer vision and pattern recognition, pp 779–788
Ren S, He K, Girshick R, Sun J (2015) Faster r-cnn: towards real-time object detection with region proposal networks. In: Advances in neural information processing systems, pp 91–99
Russakovsky O, et al. (2015) ImageNet large scale visual recognition challenge. Int J Comput Vis 115:211–252
Schroff F, Kalenichenko D, Philbin J (2015) FaceNet: a unified embedding for face recognition and clustering. In: 2015 IEEE conference on computer vision and pattern recognition (CVPR), pp 815–823
Sermanet P, Eigen D, Zhang X, Mathieu M, Fergus R, LeCun Y (2014) Overfeat: integrated recognition, localization and detection using convolutional networks. In: ICLR. 2
Shrivastava A, Gupta A (2016) Contextual priming and feedback for faster r-cnn. In: European conference on computer vision, pp 330–348
Song X, Ma L, et al. (2016) Selfishness- and selflessness-based models of pedestrian room evacuation. Phys A-Stat Mech Appl 447(4):455–466
Song X, Han D, et al. (2018) A data-driven neural network approach to simulate pedestrian movement. Phys A-Stat Mech Appl 509(11):827–844
Song X, Chen K, et al. (2020) Pedestrian trajectory prediction based on deep convolutional LSTM network. IEEE Trans Intell Transp Syst. https://doi.org/10.1109/TITS.2020.2981118
Tan M, Pang R, Le QV (2020) EfficientDet: scalable and efficient object detection. arXiv:1911.09070v7 (CVPR)
Tang Z, Yang J, Pei Z, Song X, Ge B (2019) Multi-process training GAN for identity-preserving face synthesis. IEEE Access 7
Tychsen-Smith L, Petersson L (2017) Denet: scalable realtime object detection with directed sparse sampling. In: Proceedings of the IEEE international conference on computer vision, pp 428–436
Wang J, Yuan Y, Yu G, Jian S (2018) Sface: an efficient network for face detection in large scale variations. arXiv:1804.06559
Wang S, Gong Y, Xing J, Huang L, Huang C, Hu W (2019) RDSNet: a new deep architecture for reciprocal object detection and instance segmentation. arXiv:1912.05070 (AAAI)
Yang Z, Xu Y, Xue H, Zhang Z, Urtasun R, Wang L, Lin S, Hu H (2020) Dense RepPoints: representing visual objects with dense point sets. arXiv:1912.11473v3
Yao L, Xu H, Zhang W, Liang X, Li Z (2020) SM-NAS: structural-to-modular neural architecture search for object detection. In: Proceedings of the AAAI conference on artificial intelligence (AAAI)
Zeng X, Ouyang W, Yang B, Yan J, Wang X (2016) Gated bi-directional cnn for object detection. In: European conference on computer vision. Springer, pp 354–369
Zhao Q, Sheng T, Wang Y, Tang Z, Chen Y, Cai L, Ling H (2019) M2det: a single-shot object detector based on multi-level feature pyramid network. In: Thirty-third AAAI conference on artificial intelligence
Zheng Y, Pal DK, Savvides M (2018) Ring loss: convex feature normalization for face recognition. In: Proceedings of the IEEE conference on computer vision and pattern recognition, pp 5089–5097
Zhong Z, Sun L, Huo Q (2018) An anchor-free region proposal network for faster r-cnn based text detection approaches. arXiv:1804.09003
Zhu Y, Zhao C, Wang J, Zhao X, Wu Y, Lu H et al (2017) Couplenet: coupling global structure with local parts for object detection. In: Proceedings of international conference on computer vision (ICCV)
Zhu C, He Y, Savvides M (2019) Feature selective anchor-free module for single-shot object detection. In: Conference on computer vision and pattern recognition (CVPR)
Acknowledgements
This work was supported by a grant from the Major State Basic Research Development Program of China (973 Program) (No. 2016YFC0802703).
Ethics declarations
Conflict of interest
Author Zhiyong Tang declares that he has no conflict of interest. Author Jianbing Yang declares that he has no conflict of interest. Author Zhongcai Pei declares that he has no conflict of interest. Author Xiao Song declares that he has no conflict of interest. Author Pei Pei declares that he has no conflict of interest.
Additional information
Publisher’s note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
About this article
Cite this article
Tang, Z., Yang, J., Pei, Z. et al. Coordinate-based anchor-free module for object detection. Appl Intell 51, 9066–9080 (2021). https://doi.org/10.1007/s10489-021-02373-8