A feature temporal attention based interleaved network for fast video object detection

  • Original Research
Journal of Ambient Intelligence and Humanized Computing

Abstract

Object detection in videos is a fundamental technology for applications such as monitoring. Static detectors treat video frames as independent input images, so they ignore the temporal information of objects and perform redundant computation during detection. In this paper, we exploit the spatiotemporal continuity of video objects and propose an attention-guided dynamic video object detection method for fast detection. We classify each frame as either a key frame or a non-key frame and extract complete features from key frames and shallow features from non-key frames. Unlike the fixed key frame strategy used in previous studies, our key frame decision method measures the feature similarity between frames to adaptively determine the attribute of the current frame. For the shallow features extracted from non-key frames, the designed temporal attention based feature propagation module (TAFPM) performs semantic enhancement and feature temporal attention (FTA) based feature propagation to generate high-level semantic features. We evaluate our method on the ImageNet VID dataset. It runs at 21.53 fps, twice the speed of the base detector R-FCN, while the mAP declines by only 0.2% relative to R-FCN. The proposed method thus achieves performance comparable to state-of-the-art methods that focus on speed.
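
The two ideas in the abstract, an adaptive key frame decision based on feature similarity and attention-based propagation of key frame features to non-key frames, can be illustrated with a minimal sketch. The abstract does not give the actual formulas, so everything concrete here is an assumption made for illustration: cosine similarity as the similarity measure, the threshold value 0.9, scaled dot-product attention as the weighting scheme, and the hypothetical function names `is_key_frame` and `propagate_features`. It is not the authors' TAFPM implementation.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity between two flat feature vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def is_key_frame(shallow_cur, shallow_key, threshold=0.9):
    """Assumed decision rule: the current frame becomes a new key frame
    when its shallow features have drifted away from those of the last
    key frame (similarity below an assumed threshold)."""
    return cosine_similarity(shallow_cur, shallow_key) < threshold


def softmax(scores):
    """Numerically stable softmax over a list of scores."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def propagate_features(cur_shallow, key_shallow, key_deep):
    """Assumed propagation rule (scaled dot-product attention).

    cur_shallow: shallow feature vectors, one per position of the non-key frame
    key_shallow: shallow feature vectors, one per position of the key frame
    key_deep:    deep feature vectors of the key frame at the same positions

    Each non-key position receives a weighted mix of the key frame's deep
    features, weighted by how well the shallow features match.
    """
    scale = math.sqrt(len(cur_shallow[0]))
    channels = len(key_deep[0])
    propagated = []
    for query in cur_shallow:
        scores = [
            sum(q * k for q, k in zip(query, key)) / scale
            for key in key_shallow
        ]
        weights = softmax(scores)
        propagated.append([
            sum(w * deep[c] for w, deep in zip(weights, key_deep))
            for c in range(channels)
        ])
    return propagated
```

In an interleaved pipeline built on this sketch, a frame for which `is_key_frame` returns `True` would go through the full backbone and replace the stored key frame features, while any other frame would only need the shallow layers plus one call to `propagate_features`. That skipped deep computation is where the speed-up described in the abstract would come from.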

Acknowledgements

This research was supported by the National Natural Science Foundation of China (62072053), the Fundamental Research Funds for the Central Universities (300102249317), the Natural Science Foundation of Shaanxi Province (2019SF-258), and the Key R&D Project of the Shaanxi Science and Technology Department (2019YFB1600500).

Author information

Corresponding author

Correspondence to Huansheng Song.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

About this article

Cite this article

Yang, Y., Song, H., Sun, S. et al. A feature temporal attention based interleaved network for fast video object detection. J Ambient Intell Human Comput 14, 497–509 (2023). https://doi.org/10.1007/s12652-021-03309-3

  • DOI: https://doi.org/10.1007/s12652-021-03309-3
