
Video object detection algorithm based on dynamic combination of sparse feature propagation and dense feature aggregation

Published in: Multimedia Tools and Applications

Abstract

Compared with object detection in static images, object detection in video is of greater practical significance for intelligent monitoring and automatic anomaly detection. However, applying state-of-the-art image recognition networks directly to video is challenging: a video contains a very large number of frames, which slows evaluation, and individual frames suffer from motion blur, low resolution, occlusion, and object deformation. To mitigate these deficiencies, we apply sparse feature propagation to improve detection speed and dense feature aggregation to refine detection accuracy, and we adopt a key-frame scheduling strategy based on the consistency of feature information. Combining these techniques steadily improves both detection speed and accuracy, yielding high overall performance. To verify the applicability of the proposed video detection strategy, we used part of the video data in the ImageNet VID training dataset and conducted experiments on the remaining part, including the calculation and comparison of mean average precision (mAP) and frames per second (FPS).
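
To make the combination of the three components concrete, the following is a minimal illustrative Python sketch of how sparse feature propagation, dense feature aggregation, and consistency-based key-frame scheduling can fit together in a per-frame loop. All names here (extract_features, estimate_flow, warp, feature_consistency, run_detection_head), the threshold tau, and the aggregation window are placeholders of our own and not the authors' implementation; a real system would use deep feature and optical-flow networks (e.g., ResNet and FlowNet) in place of these stand-ins.

```python
import numpy as np

def extract_features(frame):
    # Placeholder: a real system uses a heavyweight CNN backbone here.
    return frame.astype(np.float32)

def estimate_flow(src_frame, dst_frame):
    # Placeholder: a real system uses an optical-flow network (FlowNet-style).
    return np.zeros_like(dst_frame, dtype=np.float32)

def warp(features, flow):
    # Placeholder for bilinear warping of features along the flow field.
    return features

def feature_consistency(propagated, current):
    # Placeholder consistency score in (0, 1]; 1 means features still agree.
    return float(np.exp(-np.abs(propagated - current).mean()))

def run_detection_head(features):
    # Placeholder for the lightweight detection head applied to the features.
    return {"score": float(features.mean())}

def detect_video(frames, tau=0.8, window=2):
    """Sketch of dynamic key-frame scheduling:
    - key frames: dense feature aggregation over a small temporal window;
    - non-key frames: sparse propagation of key-frame features via flow."""
    detections = []
    key_idx, key_feat = None, None
    for t, frame in enumerate(frames):
        if key_idx is None:
            is_key = True
        else:
            # Sparse propagation: warp the key-frame features to the current frame.
            propagated = warp(key_feat, estimate_flow(frames[key_idx], frame))
            # Re-select a key frame when feature consistency drops below tau.
            is_key = feature_consistency(propagated, extract_features(frame)) < tau
        if is_key:
            # Dense aggregation: average neighbouring-frame features warped to t.
            neighbours = range(max(0, t - window), min(len(frames), t + window + 1))
            feats = [warp(extract_features(frames[i]), estimate_flow(frames[i], frame))
                     for i in neighbours]
            key_feat, key_idx = np.mean(feats, axis=0), t
            feat = key_feat
        else:
            feat = propagated
        detections.append(run_detection_head(feat))
    return detections

# Example usage on synthetic frames.
frames = [np.random.rand(4, 4) for _ in range(8)]
print(detect_video(frames)[:3])
```

In this sketch the expensive steps (feature extraction and aggregation) run only on key frames, while other frames reuse propagated features, which is the source of the speed/accuracy trade-off the paper targets.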




Acknowledgments

This work was supported by the Yuyou Talent Support Plan of North China University of Technology (107051360019XN132/017), the Fundamental Research Funds for Beijing Universities (110052971803/037), and the Special Research Foundation of North China University of Technology (PXM2017_014212_000014).

Availability of data and material

The datasets supporting the conclusions of this article are available in the Pascal VOC repository [http://host.robots.ox.ac.uk/pascal/VOC/].

Code availability

Not applicable.

Funding

This study was supported by the Yuyou Talent Support Plan of North China University of Technology (grant number 107051360019XN132/017), the Fundamental Research Funds for Beijing Universities (grant number 110052971803/037), and the Special Research Foundation of North China University of Technology (grant number PXM2017_014212_000014).

Author information


Contributions

All authors contributed in important ways. ZXC conducted the experiments; JFM analyzed the results and drafted the manuscript; DYC proposed the structural design and provided valuable suggestions for improving the quality of the manuscript. All authors read and approved the final manuscript.

Corresponding author

Correspondence to Danyang Cao.

Ethics declarations

Competing interests

The authors declare that they have no competing interests.

Additional information

Publisher’s note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.


About this article


Cite this article

Cao, D., Ma, J. & Chen, Z. Video object detection algorithm based on dynamic combination of sparse feature propagation and dense feature aggregation. Multimed Tools Appl 80, 23275–23295 (2021). https://doi.org/10.1007/s11042-020-09827-0

