Delving into the Effectiveness of Receptive Fields: Learning Scale-Transferrable Architectures for Practical Object Detection

Abstract

Scale-sensitive object detection remains challenging: most existing methods do not learn scale explicitly and are not robust to scale variation. Moreover, they are either inefficient to train or slow at inference, which makes them ill-suited to real-time applications. In this paper, we propose a scale-transferrable architecture for practical object detection, based on an analysis of the connection between dilation rate and effective receptive field. Our method first predicts a global continuous scale, shared by all spatial positions, for each convolution filter of each network stage. Second, we average the spatial features and distill the scale from the channels to learn it effectively. Third, for fast deployment, we propose a scale decomposition method that transfers the learned fractional scale into a combination of fixed integral scales for each convolution filter, exploiting dilated convolution. Moreover, to overcome the shortcomings of our method for large-scale object detection, we modify the Feature Pyramid Network structure. Finally, we show that our method is orthogonal to the sampling strategy. We demonstrate its effectiveness on one-stage and two-stage detectors under different configurations and compare it with different dilated convolution blocks. For practical applications, the training strategy of our method is simple and efficient, avoiding complex data sampling or optimization schemes. During inference, we reduce latency by deploying the model with the hardware accelerator TensorRT without extra operations. On COCO test-dev, our model achieves 41.7% mAP with a one-stage detector and 42.5% mAP with a two-stage detector based on ResNet-101, outperforming the baselines by 3.2% and 3.1% mAP, respectively.
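To make the scale-learning steps above concrete, here is a minimal PyTorch sketch of one plausible design, assumed by us rather than taken from the paper's implementation: a head that averages the spatial features and distills one global continuous scale per filter group from the channels.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class GlobalScaleHead(nn.Module):
    """Hypothetical scale head: global average pooling collapses the spatial
    dimensions, and a small bottleneck distills one continuous scale per
    convolution filter group, shared by all spatial positions."""
    def __init__(self, in_channels, num_filters, reduction=16):
        super().__init__()
        self.pool = nn.AdaptiveAvgPool2d(1)                 # average spatial features
        self.fc = nn.Sequential(
            nn.Linear(in_channels, in_channels // reduction),
            nn.ReLU(inplace=True),
            nn.Linear(in_channels // reduction, num_filters),
        )

    def forward(self, x):
        b, c, _, _ = x.shape
        squeezed = self.pool(x).view(b, c)                  # (B, C)
        # softplus keeps the predicted scale continuous and no smaller than 1
        return 1.0 + F.softplus(self.fc(squeezed))          # (B, num_filters)
```

After training, such a continuous fractional scale would be decomposed into fixed integral dilation rates for deployment, as the abstract describes.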

Notes

  1. TensorRT is an SDK for optimizing trained deep learning models for high-performance inference on NVIDIA graphics processing units (GPUs). It can be used to accelerate inference in hyper-scale data centers and on embedded or autonomous-driving platforms. In essence, convolutions such as those in DCN (Dai et al. 2017) or SAC (Zhang et al. 2017) are not supported by TensorRT because of their sampling mechanism.

  2. In the 2-D case, this simplification amounts to a rounding operation. It is necessary to keep the number of convolution channels integral, although it may slightly change the density of connections. Ceil or floor could be used instead of round, but the difference is negligible (see the sketch after these notes).

  3. We provide the learned scales of this model. Scale rates of Res3 are [[1.01, 1.02], [1.05, 1.08], 1.09, [1.06, 1.03]], scale rates of Res4 are [[1.03, 1.07], [1.28, 1.29], [1.43, 1.39], [1.31, 1.28], [1.25, 1.21], [1.40, 1.42], 1.32, [1.21, 1.27], 1.25, [1.03, 1.01], [1.19, 1.22], [1.19, 1.18], [1.26, 1.32], [1.34, 1.47], [1.59, 1.61], [1.34, 1.21], [2.32, 1.71], [1.50, 1.83], [2.20, 2.41], [1.44, 1.21], [1.17, 1.28], [1.37, 1.11], [1.41, 1.45]], and scale rates of Res5 head are [1.51, [3.58, 3.86], [2.45, 2.49]].

  4. We provide the learned scales of this model. Scale rates of Res3 are [[1.0, 1.01], [1.05, 1.08], [1.06, 1.08], [1.03, 1.01]], scale rates of Res4 are [[1.03, 1.06], [1.28, 1.26], [1.41, 1.36], [1.29, 1.29], [1.23, 1.21], [1.34, 1.33], [1.27, 1.27], [1.22, 1.28], [1.23, 1.23], [1.03, 1.03], [1.21, 1.23], [1.21, 1.2], [1.29, 1.31], [1.39, 1.46], [1.58, 1.59], [1.44, 1.46], [1.71, 1.56], [1.4, 1.49], [1.49, 1.55], [1.33, 1.21], [1.22, 1.23], [1.33, 1.23], [1.4, 1.41]], and scale rates of Res5 are [[1.28, 1.24], [2.58, 2.62], [1.8, 1.72]].
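As a hedged illustration of the rounding described in Note 2, the following sketch (our own helper, not the paper's code) splits a fractional dilation scale between the two nearest integer rates and rounds the channel count assigned to each.

```python
import math

def decompose_scale(scale, num_channels):
    """Split a fractional dilation scale into the two nearest integer rates,
    assigning each an integral share of the convolution channels."""
    lo, hi = math.floor(scale), math.ceil(scale)
    if lo == hi:                           # scale is already integral
        return [(lo, num_channels)]
    frac_hi = scale - lo                   # fraction of channels at the larger rate
    n_hi = round(num_channels * frac_hi)   # the rounding referred to in Note 2
    return [(lo, num_channels - n_hi), (hi, n_hi)]
```

For instance, decompose_scale(1.4, 256) gives [(1, 154), (2, 102)]; the channel-weighted average dilation (1·154 + 2·102)/256 ≈ 1.40 approximates the fractional scale, and substituting floor or ceil for round shifts the split by at most one channel, which is why the difference noted above is small.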

References

  • Cai, Z., & Vasconcelos, N. (2018). Cascade R-CNN: Delving into high quality object detection. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 6154–6162.

  • Chen, K., Wang, J., Pang, J., Cao, Y., Xiong, Y., Li, X., Sun, S., Feng, W., Liu, Z., Xu, J., Zhang, Z., Cheng, D., Zhu, C., Cheng, T., Zhao, Q., Li, B., Lu, X., Zhu, R., Wu, Y., Dai, J., Wang, J., Shi, J., Ouyang, W., Loy, C. C., & Lin, D. (2019). MMDetection: Open MMLab detection toolbox and benchmark. arXiv preprint arXiv:1906.07155.

  • Chen, L. C., Papandreou, G., Kokkinos, I., Murphy, K., & Yuille, A. L. (2018). DeepLab: Semantic image segmentation with deep convolutional nets, atrous convolution, and fully connected CRFs. IEEE Transactions on Pattern Analysis and Machine Intelligence, 40(4), 834–848.

  • Chen, L. C., Papandreou, G., Schroff, F., & Adam, H. (2017). Rethinking atrous convolution for semantic image segmentation. arXiv preprint arXiv:1706.05587.

  • Chen, Q., Wang, P., Cheng, A., Wang, W., Zhang, Y., & Cheng, J. (2020). Robust one-stage object detection with location-aware classifiers. Pattern Recognition, 105, 107334.

  • Dai, J., Qi, H., Xiong, Y., Li, Y., Zhang, G., Hu, H., & Wei, Y. (2017). Deformable convolutional networks. In Proceedings of the IEEE International Conference on Computer Vision, pp. 764–773.

  • Fu, C. Y., Liu, W., Ranga, A., Tyagi, A., & Berg, A. C. (2017). DSSD: Deconvolutional single shot detector. arXiv preprint arXiv:1701.06659.

  • Girshick, R. (2015). Fast R-CNN. In Proceedings of the IEEE International Conference on Computer Vision, pp. 1440–1448.

  • Girshick, R., Donahue, J., Darrell, T., & Malik, J. (2014). Rich feature hierarchies for accurate object detection and semantic segmentation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 580–587.

  • Goyal, P., Dollár, P., Girshick, R., Noordhuis, P., Wesolowski, L., Kyrola, A., Tulloch, A., Jia, Y., & He, K. (2017). Accurate, large minibatch SGD: Training ImageNet in 1 hour. arXiv preprint arXiv:1706.02677.

  • He, K., Girshick, R., & Dollár, P. (2019). Rethinking ImageNet pre-training. In Proceedings of the IEEE International Conference on Computer Vision, pp. 4918–4927.

  • He, K., Gkioxari, G., Dollár, P., & Girshick, R. (2017). Mask R-CNN. In Proceedings of the IEEE International Conference on Computer Vision, pp. 2961–2969.

  • He, K., Zhang, X., Ren, S., & Sun, J. (2015). Delving deep into rectifiers: Surpassing human-level performance on ImageNet classification. In Proceedings of the IEEE International Conference on Computer Vision, pp. 1026–1034.

  • He, K., Zhang, X., Ren, S., & Sun, J. (2015). Spatial pyramid pooling in deep convolutional networks for visual recognition. IEEE Transactions on Pattern Analysis and Machine Intelligence, 37(9), 1904–1916.

  • He, K., Zhang, X., Ren, S., & Sun, J. (2016). Deep residual learning for image recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 770–778.

  • Howard, A. G., Zhu, M., Chen, B., Kalenichenko, D., Wang, W., Weyand, T., Andreetto, M., & Adam, H. (2017). MobileNets: Efficient convolutional neural networks for mobile vision applications. arXiv preprint arXiv:1704.04861.

  • Hu, J., Shen, L., & Sun, G. (2018). Squeeze-and-excitation networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 7132–7141.

  • Huang, G., Liu, Z., Van Der Maaten, L., & Weinberger, K. Q. (2017). Densely connected convolutional networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 4700–4708.

  • Ioffe, S., & Szegedy, C. (2015). Batch normalization: Accelerating deep network training by reducing internal covariate shift. In Proceedings of the International Conference on Machine Learning, pp. 448–456.

  • Jaderberg, M., Simonyan, K., Zisserman, A., & Kavukcuoglu, K. (2015). Spatial transformer networks. Advances in Neural Information Processing Systems, 28, 2017–2025.

  • Jeon, Y., & Kim, J. (2017). Active convolution: Learning the shape of convolution for image classification. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 4201–4209.

  • Jiang, B., Luo, R., Mao, J., Xiao, T., & Jiang, Y. (2018). Acquisition of localization confidence for accurate object detection. In Proceedings of the European Conference on Computer Vision, pp. 784–799.

  • Law, H., & Deng, J. (2018). CornerNet: Detecting objects as paired keypoints. In Proceedings of the European Conference on Computer Vision, pp. 734–750.

  • Li, Z., Peng, C., Yu, G., Zhang, X., Deng, Y., & Sun, J. (2018). DetNet: Design backbone for object detection. In Proceedings of the European Conference on Computer Vision, pp. 334–350.

  • Lin, T. Y., Dollár, P., Girshick, R., He, K., Hariharan, B., & Belongie, S. (2017). Feature pyramid networks for object detection. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 2117–2125.

  • Lin, T. Y., Goyal, P., Girshick, R., He, K., & Dollár, P. (2017). Focal loss for dense object detection. In Proceedings of the IEEE International Conference on Computer Vision, pp. 2980–2988.

  • Lin, T. Y., Maire, M., Belongie, S., Hays, J., Perona, P., Ramanan, D., Dollár, P., & Zitnick, C. L. (2014). Microsoft COCO: Common objects in context. In Proceedings of the European Conference on Computer Vision, pp. 740–755.

  • Liu, S., Huang, D., & Wang, Y. (2018). Receptive field block net for accurate and fast object detection. In Proceedings of the European Conference on Computer Vision, pp. 385–400.

  • Liu, W., Anguelov, D., Erhan, D., Szegedy, C., Reed, S., Fu, C. Y., & Berg, A. C. (2016). SSD: Single shot multibox detector. In Proceedings of the European Conference on Computer Vision, pp. 21–37.

  • Peng, J., Sun, M., Zhang, Z., Tan, T., & Yan, J. (2019). POD: Practical object detection with scale-sensitive network. In Proceedings of the IEEE International Conference on Computer Vision, pp. 9607–9616.

  • Redmon, J., Divvala, S., Girshick, R., & Farhadi, A. (2016). You only look once: Unified, real-time object detection. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 779–788.

  • Ren, S., He, K., Girshick, R., & Sun, J. (2015). Faster R-CNN: Towards real-time object detection with region proposal networks. Advances in Neural Information Processing Systems, pp. 91–99.

  • Russakovsky, O., Deng, J., Su, H., Krause, J., Satheesh, S., Ma, S., Huang, Z., Karpathy, A., Khosla, A., Bernstein, M., & Berg, A. C. (2015). ImageNet large scale visual recognition challenge. International Journal of Computer Vision, 115(3), 211–252.

  • Sandler, M., Howard, A., Zhu, M., Zhmoginov, A., & Chen, L. C. (2018). MobileNetV2: Inverted residuals and linear bottlenecks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 4510–4520.

  • Shrivastava, A., Sukthankar, R., Malik, J., & Gupta, A. (2016). Beyond skip connections: Top-down modulation for object detection. arXiv preprint arXiv:1612.06851.

  • Simonyan, K., & Zisserman, A. (2014). Very deep convolutional networks for large-scale image recognition. In International Conference on Learning Representations.

  • Szegedy, C., Liu, W., Jia, Y., Sermanet, P., Reed, S., Anguelov, D., Erhan, D., Vanhoucke, V., & Rabinovich, A. (2015). Going deeper with convolutions. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition.

  • Tychsen-Smith, L., & Petersson, L. (2018). Improving object localization with fitness NMS and bounded IoU loss. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 6877–6885.

  • Uijlings, J. R., Van De Sande, K. E., Gevers, T., & Smeulders, A. W. (2013). Selective search for object recognition. International Journal of Computer Vision, 104(2), 154–171.

  • Xie, S., Girshick, R., Dollár, P., Tu, Z., & He, K. (2017). Aggregated residual transformations for deep neural networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 1492–1500.

  • Xu, H., Lv, X., Wang, X., Ren, Z., Bodla, N., & Chellappa, R. (2018). Deep regionlets for object detection. In Proceedings of the European Conference on Computer Vision, pp. 798–814.

  • Xu, J., Wang, W., Wang, H., & Guo, J. (2020). Multi-model ensemble with rich spatial information for object detection. Pattern Recognition, 99, 107098.

  • Yu, F., & Koltun, V. (2016). Multi-scale context aggregation by dilated convolutions. In International Conference on Learning Representations.

  • Zhang, R., Tang, S., Zhang, Y., Li, J., & Yan, S. (2017). Scale-adaptive convolutions for scene parsing. In Proceedings of the IEEE International Conference on Computer Vision, pp. 2031–2039.

  • Zhang, S., Chi, C., Yao, Y., Lei, Z., & Li, S. Z. (2020). Bridging the gap between anchor-based and anchor-free detection via adaptive training sample selection. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 9759–9768.

  • Zhang, S., Wen, L., Bian, X., Lei, Z., & Li, S. Z. (2018). Single-shot refinement neural network for object detection. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 4203–4212.

  • Zhao, H., Shi, J., Qi, X., Wang, X., & Jia, J. (2017). Pyramid scene parsing network. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 2881–2890.

  • Zhu, X., Hu, H., Lin, S., & Dai, J. (2019). Deformable ConvNets v2: More deformable, better results. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 9308–9316.

Acknowledgements

This work was supported in part by the Major Project for New Generation of AI (No. 2018AAA0100400) and the National Natural Science Foundation of China (Nos. 61836014 and U21B2042).

Author information

Corresponding author

Correspondence to Junran Peng.

Additional information

Communicated by V. Lepetit.

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Zhaoxiang Zhang and Cong Pan are co-first authors.

About this article

Cite this article

Zhang, Z., Pan, C. & Peng, J. Delving into the Effectiveness of Receptive Fields: Learning Scale-Transferrable Architectures for Practical Object Detection. Int J Comput Vis 130, 970–989 (2022). https://doi.org/10.1007/s11263-021-01573-6
