Delving into the Effectiveness of Receptive Fields: Learning Scale-Transferrable Architectures for Practical Object Detection

Abstract

Scale-sensitive object detection remains challenging: most existing methods do not learn scale explicitly and are not robust to scale variation. Moreover, they are either inefficient to train or slow at inference, which makes them ill-suited to real-time applications. In this paper, we propose a scale-transferrable architecture for practical object detection, based on an analysis of the connection between dilation rate and effective receptive field. Our method first predicts a global continuous scale, shared by all spatial positions, for each convolution filter of each network stage. Second, we average the spatial features and distill the scale from the channels to learn it effectively. Third, for fast deployment, we propose a scale decomposition method that transfers the learned fractional scale into a combination of fixed integral scales for each convolution filter, exploiting dilated convolution. Moreover, to overcome the shortcomings of our method for large-scale object detection, we modify the Feature Pyramid Network structure. Finally, we show that our method is orthogonal to the sampling strategy. We demonstrate its effectiveness on one-stage and two-stage detectors under different configurations and compare it with different dilated convolution blocks. For practical applications, the training strategy of our method is simple and efficient, avoiding complex data sampling or optimization schemes. During inference, we reduce latency by deploying the model with the hardware accelerator TensorRT without extra operations. On COCO test-dev, our model achieves 41.7% mAP with a one-stage detector and 42.5% mAP with a two-stage detector based on ResNet-101, outperforming the baselines by 3.2% and 3.1% mAP, respectively.
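To make the scale-learning steps above concrete, here is a minimal PyTorch sketch of one plausible design, assumed by us rather than taken from the paper's implementation: a head that averages the spatial features and distills one global continuous scale per filter group from the channels.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class GlobalScaleHead(nn.Module):
    """Hypothetical scale head: global average pooling collapses the spatial
    dimensions, and a small bottleneck distills one continuous scale per
    convolution filter group, shared by all spatial positions."""
    def __init__(self, in_channels, num_filters, reduction=16):
        super().__init__()
        self.pool = nn.AdaptiveAvgPool2d(1)                 # average spatial features
        self.fc = nn.Sequential(
            nn.Linear(in_channels, in_channels // reduction),
            nn.ReLU(inplace=True),
            nn.Linear(in_channels // reduction, num_filters),
        )

    def forward(self, x):
        b, c, _, _ = x.shape
        squeezed = self.pool(x).view(b, c)                  # (B, C)
        # softplus keeps the predicted scale continuous and no smaller than 1
        return 1.0 + F.softplus(self.fc(squeezed))          # (B, num_filters)
```

After training, such a continuous fractional scale would be decomposed into fixed integral dilation rates for deployment, as the abstract describes.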

Notes

  1. TensorRT is an SDK for optimizing trained deep learning models for high-performance inference on NVIDIA graphics processing units (GPUs). It can be used to accelerate inference in hyper-scale data centers and on embedded or autonomous-driving platforms. In essence, convolutions such as those in DCN (Dai et al. 2017) or SAC (Zhang et al. 2017) are not supported by TensorRT because of their sampling mechanism.

  2. In the 2-D case, this simplification amounts to a rounding operation. It is necessary to keep the number of convolution channels integral, although it may slightly change the density of connections. Ceil or floor could be used instead of round, but the difference is negligible (see the sketch after these notes).

  3. We provide the learned scales of this model. Scale rates of Res3 are [[1.01, 1.02], [1.05, 1.08], 1.09, [1.06, 1.03]], scale rates of Res4 are [[1.03, 1.07], [1.28, 1.29], [1.43, 1.39], [1.31, 1.28], [1.25, 1.21], [1.40, 1.42], 1.32, [1.21, 1.27], 1.25, [1.03, 1.01], [1.19, 1.22], [1.19, 1.18], [1.26, 1.32], [1.34, 1.47], [1.59, 1.61], [1.34, 1.21], [2.32, 1.71], [1.50, 1.83], [2.20, 2.41], [1.44, 1.21], [1.17, 1.28], [1.37, 1.11], [1.41, 1.45]], and scale rates of Res5 head are [1.51, [3.58, 3.86], [2.45, 2.49]].

  4. We provide the learned scales of this model. Scale rates of Res3 are [[1.0, 1.01], [1.05, 1.08], [1.06, 1.08], [1.03, 1.01]], scale rates of Res4 are [[1.03, 1.06], [1.28, 1.26], [1.41, 1.36], [1.29, 1.29], [1.23, 1.21], [1.34, 1.33], [1.27, 1.27], [1.22, 1.28], [1.23, 1.23], [1.03, 1.03], [1.21, 1.23], [1.21, 1.2], [1.29, 1.31], [1.39, 1.46], [1.58, 1.59], [1.44, 1.46], [1.71, 1.56], [1.4, 1.49], [1.49, 1.55], [1.33, 1.21], [1.22, 1.23], [1.33, 1.23], [1.4, 1.41]], and scale rates of Res5 are [[1.28, 1.24], [2.58, 2.62], [1.8, 1.72]].
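As a hedged illustration of the rounding described in Note 2, the following sketch (our own helper, not the paper's code) splits a fractional dilation scale between the two nearest integer rates and rounds the channel count assigned to each.

```python
import math

def decompose_scale(scale, num_channels):
    """Split a fractional dilation scale into the two nearest integer rates,
    assigning each an integral share of the convolution channels."""
    lo, hi = math.floor(scale), math.ceil(scale)
    if lo == hi:                           # scale is already integral
        return [(lo, num_channels)]
    frac_hi = scale - lo                   # fraction of channels at the larger rate
    n_hi = round(num_channels * frac_hi)   # the rounding referred to in Note 2
    return [(lo, num_channels - n_hi), (hi, n_hi)]
```

For instance, decompose_scale(1.4, 256) gives [(1, 154), (2, 102)]; the channel-weighted average dilation (1·154 + 2·102)/256 ≈ 1.40 approximates the fractional scale, and substituting floor or ceil for round shifts the split by at most one channel, which is why the difference noted above is small.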

References

  • Cai, Z., & Vasconcelos, N. (2018). Cascade R-CNN: Delving into high quality object detection. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 6154–6162.

  • Chen, K., Wang, J., Pang, J., Cao, Y., Xiong, Y., Li, X., Sun, S., Feng, W., Liu, Z., Xu, J., Zhang, Z., Cheng, D., Zhu, C., Cheng, T., Zhao, Q., Li, B., Lu, X., Zhu, R., Wu, Y., Dai, J., Wang, J., Shi, J., Ouyang, W., Loy, C. C., & Lin, D. (2019). MMDetection: Open MMLab detection toolbox and benchmark. arXiv preprint arXiv:1906.07155.

  • Chen, L. C., Papandreou, G., Kokkinos, I., Murphy, K., & Yuille, A. L. (2018). DeepLab: Semantic image segmentation with deep convolutional nets, atrous convolution, and fully connected CRFs. IEEE Transactions on Pattern Analysis and Machine Intelligence, 40(4), 834–848.

  • Chen, L. C., Papandreou, G., Schroff, F., & Adam, H. (2017). Rethinking atrous convolution for semantic image segmentation. arXiv preprint arXiv:1706.05587.

  • Chen, Q., Wang, P., Cheng, A., Wang, W., Zhang, Y., & Cheng, J. (2020). Robust one-stage object detection with location-aware classifiers. Pattern Recognition, 105, 107334.

  • Dai, J., Qi, H., Xiong, Y., Li, Y., Zhang, G., Hu, H., & Wei, Y. (2017). Deformable convolutional networks. In Proceedings of the IEEE International Conference on Computer Vision, pp. 764–773.

  • Fu, C. Y., Liu, W., Ranga, A., Tyagi, A., & Berg, A. C. (2017). DSSD: Deconvolutional single shot detector. arXiv preprint arXiv:1701.06659.

  • Girshick, R. (2015). Fast R-CNN. In Proceedings of the IEEE International Conference on Computer Vision, pp. 1440–1448.

  • Girshick, R., Donahue, J., Darrell, T., & Malik, J. (2014). Rich feature hierarchies for accurate object detection and semantic segmentation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 580–587.

  • Goyal, P., Dollár, P., Girshick, R., Noordhuis, P., Wesolowski, L., Kyrola, A., Tulloch, A., Jia, Y., & He, K. (2017). Accurate, large minibatch SGD: Training ImageNet in 1 hour. arXiv preprint arXiv:1706.02677.

  • He, K., Girshick, R., & Dollár, P. (2019). Rethinking ImageNet pre-training. In Proceedings of the IEEE International Conference on Computer Vision, pp. 4918–4927.

  • He, K., Gkioxari, G., Dollár, P., & Girshick, R. (2017). Mask R-CNN. In Proceedings of the IEEE International Conference on Computer Vision, pp. 2961–2969.

  • He, K., Zhang, X., Ren, S., & Sun, J. (2015). Delving deep into rectifiers: Surpassing human-level performance on ImageNet classification. In Proceedings of the IEEE International Conference on Computer Vision, pp. 1026–1034.

  • He, K., Zhang, X., Ren, S., & Sun, J. (2015). Spatial pyramid pooling in deep convolutional networks for visual recognition. IEEE Transactions on Pattern Analysis and Machine Intelligence, 37(9), 1904–1916.

  • He, K., Zhang, X., Ren, S., & Sun, J. (2016). Deep residual learning for image recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 770–778.

  • Howard, A. G., Zhu, M., Chen, B., Kalenichenko, D., Wang, W., Weyand, T., Andreetto, M., & Adam, H. (2017). MobileNets: Efficient convolutional neural networks for mobile vision applications. arXiv preprint arXiv:1704.04861.

  • Hu, J., Shen, L., & Sun, G. (2018). Squeeze-and-excitation networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 7132–7141.

  • Huang, G., Liu, Z., Van Der Maaten, L., & Weinberger, K. Q. (2017). Densely connected convolutional networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 4700–4708.

  • Ioffe, S., & Szegedy, C. (2015). Batch normalization: Accelerating deep network training by reducing internal covariate shift. In Proceedings of the International Conference on Machine Learning, pp. 448–456.

  • Jaderberg, M., Simonyan, K., Zisserman, A., & Kavukcuoglu, K. (2015). Spatial transformer networks. Advances in Neural Information Processing Systems, 28, 2017–2025.

  • Jeon, Y., & Kim, J. (2017). Active convolution: Learning the shape of convolution for image classification. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 4201–4209.

  • Jiang, B., Luo, R., Mao, J., Xiao, T., & Jiang, Y. (2018). Acquisition of localization confidence for accurate object detection. In Proceedings of the European Conference on Computer Vision, pp. 784–799.

  • Law, H., & Deng, J. (2018). CornerNet: Detecting objects as paired keypoints. In Proceedings of the European Conference on Computer Vision, pp. 734–750.

  • Li, Z., Peng, C., Yu, G., Zhang, X., Deng, Y., & Sun, J. (2018). DetNet: Design backbone for object detection. In Proceedings of the European Conference on Computer Vision, pp. 334–350.

  • Lin, T. Y., Dollár, P., Girshick, R., He, K., Hariharan, B., & Belongie, S. (2017). Feature pyramid networks for object detection. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 2117–2125.

  • Lin, T. Y., Goyal, P., Girshick, R., He, K., & Dollár, P. (2017). Focal loss for dense object detection. In Proceedings of the IEEE International Conference on Computer Vision, pp. 2980–2988.

  • Lin, T. Y., Maire, M., Belongie, S., Hays, J., Perona, P., Ramanan, D., Dollár, P., & Zitnick, C. L. (2014). Microsoft COCO: Common objects in context. In Proceedings of the European Conference on Computer Vision, pp. 740–755.

  • Liu, S., Huang, D., & Wang, Y. (2018). Receptive field block net for accurate and fast object detection. In Proceedings of the European Conference on Computer Vision, pp. 385–400.

  • Liu, W., Anguelov, D., Erhan, D., Szegedy, C., Reed, S., Fu, C. Y., & Berg, A. C. (2016). SSD: Single shot multibox detector. In Proceedings of the European Conference on Computer Vision, pp. 21–37.

  • Peng, J., Sun, M., Zhang, Z., Tan, T., & Yan, J. (2019). POD: Practical object detection with scale-sensitive network. In Proceedings of the IEEE International Conference on Computer Vision, pp. 9607–9616.

  • Redmon, J., Divvala, S., Girshick, R., & Farhadi, A. (2016). You only look once: Unified, real-time object detection. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 779–788.

  • Ren, S., He, K., Girshick, R., & Sun, J. (2015). Faster R-CNN: Towards real-time object detection with region proposal networks. Advances in Neural Information Processing Systems, pp. 91–99.

  • Russakovsky, O., Deng, J., Su, H., Krause, J., Satheesh, S., Ma, S., Huang, Z., Karpathy, A., Khosla, A., Bernstein, M., & Berg, A. C. (2015). ImageNet large scale visual recognition challenge. International Journal of Computer Vision, 115(3), 211–252.

  • Sandler, M., Howard, A., Zhu, M., Zhmoginov, A., & Chen, L. C. (2018). MobileNetV2: Inverted residuals and linear bottlenecks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 4510–4520.

  • Shrivastava, A., Sukthankar, R., Malik, J., & Gupta, A. (2016). Beyond skip connections: Top-down modulation for object detection. arXiv preprint arXiv:1612.06851.

  • Simonyan, K., & Zisserman, A. (2014). Very deep convolutional networks for large-scale image recognition. In International Conference on Learning Representations.

  • Szegedy, C., Liu, W., Jia, Y., Sermanet, P., Reed, S., Anguelov, D., Erhan, D., Vanhoucke, V., & Rabinovich, A. (2015). Going deeper with convolutions. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition.

  • Tychsen-Smith, L., & Petersson, L. (2018). Improving object localization with fitness NMS and bounded IoU loss. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 6877–6885.

  • Uijlings, J. R., Van De Sande, K. E., Gevers, T., & Smeulders, A. W. (2013). Selective search for object recognition. International Journal of Computer Vision, 104(2), 154–171.

  • Xie, S., Girshick, R., Dollár, P., Tu, Z., & He, K. (2017). Aggregated residual transformations for deep neural networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 1492–1500.

  • Xu, H., Lv, X., Wang, X., Ren, Z., Bodla, N., & Chellappa, R. (2018). Deep regionlets for object detection. In Proceedings of the European Conference on Computer Vision, pp. 798–814.

  • Xu, J., Wang, W., Wang, H., & Guo, J. (2020). Multi-model ensemble with rich spatial information for object detection. Pattern Recognition, 99, 107098.

  • Yu, F., & Koltun, V. (2016). Multi-scale context aggregation by dilated convolutions. In International Conference on Learning Representations.

  • Zhang, R., Tang, S., Zhang, Y., Li, J., & Yan, S. (2017). Scale-adaptive convolutions for scene parsing. In Proceedings of the IEEE International Conference on Computer Vision, pp. 2031–2039.

  • Zhang, S., Chi, C., Yao, Y., Lei, Z., & Li, S. Z. (2020). Bridging the gap between anchor-based and anchor-free detection via adaptive training sample selection. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 9759–9768.

  • Zhang, S., Wen, L., Bian, X., Lei, Z., & Li, S. Z. (2018). Single-shot refinement neural network for object detection. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 4203–4212.

  • Zhao, H., Shi, J., Qi, X., Wang, X., & Jia, J. (2017). Pyramid scene parsing network. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 2881–2890.

  • Zhu, X., Hu, H., Lin, S., & Dai, J. (2019). Deformable ConvNets v2: More deformable, better results. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 9308–9316.

Acknowledgements

This work was supported in part by the Major Project for New Generation of AI (No. 2018AAA0100400) and the National Natural Science Foundation of China (Nos. 61836014 and U21B2042).

Author information

Corresponding author

Correspondence to Junran Peng.

Additional information

Communicated by V. Lepetit.

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Zhaoxiang Zhang and Cong Pan are co-first authors.

About this article

Cite this article

Zhang, Z., Pan, C. & Peng, J. Delving into the Effectiveness of Receptive Fields: Learning Scale-Transferrable Architectures for Practical Object Detection. Int J Comput Vis 130, 970–989 (2022). https://doi.org/10.1007/s11263-021-01573-6
