
Context and Structure Mining Network for Video Object Detection

Published in: International Journal of Computer Vision

Abstract

Aggregating temporal features from other frames has proven highly effective for video object detection, helping to overcome challenges that defeat still-image detectors, such as occlusion, motion blur, and rare poses. Proposal-level feature aggregation currently dominates this direction, but holistic proposal-level aggregation suffers from two main problems. First, the object proposals generated by the region proposal network ignore the contextual information around the object, which has been shown to help object classification. Second, traditional proposal-level aggregation treats each proposal as a whole, discarding important object structure information; this makes similarity comparison between two proposals less effective when the proposal objects are occluded or their poses are misaligned. To address these problems, we propose the Context and Structure Mining Network for better feature aggregation in video object detection. Our method first encodes spatial-temporal context information into object features in a global manner, which benefits object classification. In addition, each holistic proposal is divided into several patches to capture the object's structure, and cross-patch matching is conducted to alleviate pose misalignment between objects in the target and support proposals. Moreover, an importance weight is learned for each target proposal patch to indicate how informative that patch is for the final aggregation, so that occluded patches can be neglected and the aggregation module can leverage the most informative patches. The proposed framework outperforms the latest state-of-the-art methods on the ImageNet VID dataset by a large margin. The project is publicly available at https://github.com/LiangHann/Context-and-Structure-Mining-Network-for-Video-Object-Detection.
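The cross-patch matching and importance-weighted aggregation described above can be sketched roughly as follows. This is an illustrative toy version in NumPy, not the authors' implementation: the function name, tensor shapes, and the softmax-attention form of the matching are our assumptions, and in the actual network the importance weights and context encoding are learned end-to-end.

```python
import numpy as np

def cross_patch_aggregate(target, supports, importance):
    """Sketch of importance-weighted cross-patch feature aggregation.

    target:     (P, C) -- one target proposal split into P patch features
    supports:   (S, P, C) -- S support proposals from other frames
    importance: (P,) -- weight per target patch (low for occluded patches)
    """
    S, P, C = supports.shape
    sup = supports.reshape(S * P, C)

    # L2-normalise so dot products behave like cosine similarity
    def l2norm(x):
        return x / (np.linalg.norm(x, axis=-1, keepdims=True) + 1e-8)

    # Cross-patch matching: every target patch is compared against ALL
    # support patches, not only the patch at the same spatial index,
    # which tolerates pose misalignment between proposals.
    sim = l2norm(target) @ l2norm(sup).T             # (P, S*P)
    attn = np.exp(sim - sim.max(axis=1, keepdims=True))
    attn /= attn.sum(axis=1, keepdims=True)          # row-wise softmax

    aggregated = attn @ sup                          # (P, C)

    # Importance weighting lets occluded, uninformative patches be
    # neglected in the final fused feature.
    w = (importance / (importance.sum() + 1e-8))[:, None]
    return w * (target + aggregated)                 # (P, C)
```

The key design point the abstract emphasises is visible in the similarity matrix: it has shape (P, S*P), so a target patch can match a support patch at a different spatial position, which a whole-proposal comparison cannot do.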



Author information


Corresponding author

Correspondence to Zhaozheng Yin.

Additional information

Communicated by Dong Xu.



Cite this article

Han, L., Wang, P., Yin, Z. et al. Context and Structure Mining Network for Video Object Detection. Int J Comput Vis 129, 2927–2946 (2021). https://doi.org/10.1007/s11263-021-01507-2
