
Video object detection algorithm based on dynamic combination of sparse feature propagation and dense feature aggregation

Published in: Multimedia Tools and Applications

Abstract

Compared with object detection in static images, object detection in video is of greater practical significance for intelligent monitoring and automatic anomaly detection. However, applying state-of-the-art image recognition networks directly to video is challenging: a video contains a very large number of frames, which slows evaluation, and individual frames suffer from motion blur, low resolution, occlusion, and object deformation. To mitigate these deficiencies, we apply sparse feature propagation to improve detection speed and dense feature aggregation to refine detection accuracy, and we adopt a key-frame scheduling strategy based on the consistency of feature information. Combining these techniques steadily improves both detection speed and accuracy, yielding high overall performance. To verify the applicability of the proposed video detection strategy, we used part of the video data in the ImageNet VID training dataset and conducted experiments on the remaining part, including the calculation and comparison of mean average precision (mAP) and frames per second (FPS).
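
To make the combination of the three components concrete, the following is a minimal illustrative Python sketch of how sparse feature propagation, dense feature aggregation, and consistency-based key-frame scheduling can fit together in a per-frame loop. All names here (extract_features, estimate_flow, warp, feature_consistency, run_detection_head), the threshold tau, and the aggregation window are placeholders of our own and not the authors' implementation; a real system would use deep feature and optical-flow networks (e.g., ResNet and FlowNet) in place of these stand-ins.

```python
import numpy as np

def extract_features(frame):
    # Placeholder: a real system uses a heavyweight CNN backbone here.
    return frame.astype(np.float32)

def estimate_flow(src_frame, dst_frame):
    # Placeholder: a real system uses an optical-flow network (FlowNet-style).
    return np.zeros_like(dst_frame, dtype=np.float32)

def warp(features, flow):
    # Placeholder for bilinear warping of features along the flow field.
    return features

def feature_consistency(propagated, current):
    # Placeholder consistency score in (0, 1]; 1 means features still agree.
    return float(np.exp(-np.abs(propagated - current).mean()))

def run_detection_head(features):
    # Placeholder for the lightweight detection head applied to the features.
    return {"score": float(features.mean())}

def detect_video(frames, tau=0.8, window=2):
    """Sketch of dynamic key-frame scheduling:
    - key frames: dense feature aggregation over a small temporal window;
    - non-key frames: sparse propagation of key-frame features via flow."""
    detections = []
    key_idx, key_feat = None, None
    for t, frame in enumerate(frames):
        if key_idx is None:
            is_key = True
        else:
            # Sparse propagation: warp the key-frame features to the current frame.
            propagated = warp(key_feat, estimate_flow(frames[key_idx], frame))
            # Re-select a key frame when feature consistency drops below tau.
            is_key = feature_consistency(propagated, extract_features(frame)) < tau
        if is_key:
            # Dense aggregation: average neighbouring-frame features warped to t.
            neighbours = range(max(0, t - window), min(len(frames), t + window + 1))
            feats = [warp(extract_features(frames[i]), estimate_flow(frames[i], frame))
                     for i in neighbours]
            key_feat, key_idx = np.mean(feats, axis=0), t
            feat = key_feat
        else:
            feat = propagated
        detections.append(run_detection_head(feat))
    return detections

# Example usage on synthetic frames.
frames = [np.random.rand(4, 4) for _ in range(8)]
print(detect_video(frames)[:3])
```

In this sketch the expensive steps (feature extraction and aggregation) run only on key frames, while other frames reuse propagated features, which is the source of the speed/accuracy trade-off the paper targets.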




Acknowledgments

This work was supported by the Yuyou Talent Support Plan of North China University of Technology (107051360019XN132/017), the Fundamental Research Funds for Beijing Universities (110052971803/037), and the Special Research Foundation of North China University of Technology (PXM2017_014212_000014).

Availability of data and material

The datasets supporting the conclusions of this article are available in the Pascal VOC repository [http://host.robots.ox.ac.uk/pascal/VOC/].

Code availability

Not applicable.

Funding

This study was supported by the Yuyou Talent Support Plan of North China University of Technology (grant number 107051360019XN132/017), the Fundamental Research Funds for Beijing Universities (grant number 110052971803/037), and the Special Research Foundation of North China University of Technology (grant number PXM2017_014212_000014).

Author information


Contributions

All authors contributed in important ways. ZXC conducted the experiments; JFM analyzed the results and drafted the manuscript; DYC proposed the structural design and provided valuable suggestions for improving the quality of the manuscript. All authors read and approved the final manuscript.

Corresponding author

Correspondence to Danyang Cao.

Ethics declarations

Competing interests

The authors declare that they have no competing interests.

Additional information

Publisher’s note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.


About this article


Cite this article

Cao, D., Ma, J. & Chen, Z. Video object detection algorithm based on dynamic combination of sparse feature propagation and dense feature aggregation. Multimed Tools Appl 80, 23275–23295 (2021). https://doi.org/10.1007/s11042-020-09827-0

