
Context and Structure Mining Network for Video Object Detection

Published in: International Journal of Computer Vision

Abstract

Aggregating temporal features from other frames has proven highly effective for video object detection, helping to overcome challenges that defeat still-image detectors, such as occlusion, motion blur, and rare poses. Proposal-level feature aggregation currently dominates this direction, but holistic proposal-level aggregation suffers from two main problems. First, the object proposals generated by the region proposal network ignore the contextual information around the object, which has been shown to help object classification. Second, traditional proposal-level aggregation treats each proposal as a whole, discarding important object structure information; this makes similarity comparison between two proposals less effective when the proposal objects are occluded or their poses are misaligned. To address these problems, we propose the Context and Structure Mining Network for better feature aggregation in video object detection. Our method first encodes spatial-temporal context information into object features in a global manner, which benefits object classification. In addition, each holistic proposal is divided into several patches to capture the object's structure, and cross-patch matching is conducted to alleviate pose misalignment between objects in the target and support proposals. Moreover, an importance weight is learned for each target proposal patch to indicate how informative that patch is for the final aggregation, so that occluded patches can be neglected and the aggregation module can leverage the most informative patches. The proposed framework outperforms the latest state-of-the-art methods on the ImageNet VID dataset by a large margin. The project is publicly available at https://github.com/LiangHann/Context-and-Structure-Mining-Network-for-Video-Object-Detection.
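The cross-patch matching and importance-weighted aggregation described above can be sketched roughly as follows. This is an illustrative toy version in NumPy, not the authors' implementation: the function name, tensor shapes, and the softmax-attention form of the matching are our assumptions, and in the actual network the importance weights and context encoding are learned end-to-end.

```python
import numpy as np

def cross_patch_aggregate(target, supports, importance):
    """Sketch of importance-weighted cross-patch feature aggregation.

    target:     (P, C) -- one target proposal split into P patch features
    supports:   (S, P, C) -- S support proposals from other frames
    importance: (P,) -- weight per target patch (low for occluded patches)
    """
    S, P, C = supports.shape
    sup = supports.reshape(S * P, C)

    # L2-normalise so dot products behave like cosine similarity
    def l2norm(x):
        return x / (np.linalg.norm(x, axis=-1, keepdims=True) + 1e-8)

    # Cross-patch matching: every target patch is compared against ALL
    # support patches, not only the patch at the same spatial index,
    # which tolerates pose misalignment between proposals.
    sim = l2norm(target) @ l2norm(sup).T             # (P, S*P)
    attn = np.exp(sim - sim.max(axis=1, keepdims=True))
    attn /= attn.sum(axis=1, keepdims=True)          # row-wise softmax

    aggregated = attn @ sup                          # (P, C)

    # Importance weighting lets occluded, uninformative patches be
    # neglected in the final fused feature.
    w = (importance / (importance.sum() + 1e-8))[:, None]
    return w * (target + aggregated)                 # (P, C)
```

The key design point the abstract emphasises is visible in the similarity matrix: it has shape (P, S*P), so a target patch can match a support patch at a different spatial position, which a whole-proposal comparison cannot do.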



Author information


Corresponding author

Correspondence to Zhaozheng Yin.

Additional information

Communicated by Dong Xu.



Cite this article

Han, L., Wang, P., Yin, Z. et al. Context and Structure Mining Network for Video Object Detection. Int J Comput Vis 129, 2927–2946 (2021). https://doi.org/10.1007/s11263-021-01507-2
