A feature temporal attention based interleaved network for fast video object detection

Yang, Yanni; Song, Huansheng; Sun, Shijie; Chen, Yan; Tang, Xinyao; Shi, Qin

doi:10.1007/s12652-021-03309-3

Original Research
Published: 11 May 2021

Volume 14, pages 497–509, (2023)
Journal of Ambient Intelligence and Humanized Computing

Yanni Yang¹,
Huansheng Song¹ (ORCID: orcid.org/0000-0002-8590-0061),
Shijie Sun¹,
Yan Chen¹,
Xinyao Tang¹ &
Qin Shi¹

Abstract

Object detection in videos is a fundamental technology for applications such as monitoring. Static detectors treat video frames as independent input images, so they ignore the temporal information of objects and perform redundant computation during detection. In this paper, based on the spatiotemporal continuity of video objects, we propose an attention-guided dynamic video object detection method for fast detection. We assign each frame one of two attributes, key frame or non-key frame, and extract complete features from key frames and shallow features from non-key frames. Unlike the fixed key frame strategy used in previous studies, we develop a key frame decision method that adaptively determines the attribute of the current frame by measuring the feature similarity between frames. For the shallow features extracted from non-key frames, the designed temporal attention based feature propagation module (TAFPM) performs semantic enhancement and feature temporal attention (FTA) based feature propagation to generate high-level semantic features. Our method is evaluated on the ImageNet VID dataset. It runs at 21.53 fps, twice the speed of the base detector R-FCN, while the mAP declines by only 0.2% compared to R-FCN. The proposed method achieves performance comparable to state-of-the-art methods that focus on speed.
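The adaptive key frame decision described in the abstract can be illustrated with a minimal sketch. The abstract does not state which similarity measure, threshold, or feature layout the authors use, so everything below is an assumption for illustration only: cosine similarity over flattened shallow features, a hypothetical threshold of 0.9, and the hypothetical helper names `cosine_similarity` and `label_frames`. It is not the authors' implementation.

```python
import math
from typing import List, Optional, Sequence


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity between two flattened feature vectors (assumed measure)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # treat an all-zero feature as dissimilar to everything
    return dot / (norm_a * norm_b)


def label_frames(shallow_features: List[Sequence[float]],
                 threshold: float = 0.9) -> List[str]:
    """Label each frame "key" or "non-key".

    The first frame is always a key frame. A later frame becomes a key frame
    when the similarity between its shallow feature and the shallow feature
    of the most recent key frame drops below `threshold` (hypothetical value).
    """
    labels: List[str] = []
    key_feature: Optional[Sequence[float]] = None
    for feature in shallow_features:
        if key_feature is None or cosine_similarity(feature, key_feature) < threshold:
            labels.append("key")       # would run the complete feature extractor
            key_feature = feature
        else:
            labels.append("non-key")   # would reuse propagated key frame features
    return labels


# Toy shallow features: frames 0-1 look alike, frame 2 changes abruptly,
# frame 3 stays close to frame 2.
frames = [
    [1.0, 0.0, 0.0],
    [0.9, 0.1, 0.0],
    [0.0, 1.0, 0.0],
    [0.1, 0.9, 0.1],
]
labels = label_frames(frames)
print(labels)
```

In this toy run the first and third frames are labeled key frames and the other two non-key frames, so only half of the frames would need the complete feature extractor. In the paper's pipeline, non-key frames would then pass through the TAFPM, which this sketch does not model.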

Acknowledgements

This research was supported by the National Natural Science Foundation of China (62072053), the Fundamental Research Funds for the Central Universities (300102249317), the Natural Science Foundation of Shaanxi Province (2019SF-258), and the Key R&D Project of the Shaanxi Science and Technology Department (2019YFB1600500).

Author information

Authors and Affiliations

School of Information Engineering, Chang’an University, Xi’an, 710064, China
Yanni Yang, Huansheng Song, Shijie Sun, Yan Chen, Xinyao Tang & Qin Shi

Corresponding author

Correspondence to Huansheng Song.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

About this article

Cite this article

Yang, Y., Song, H., Sun, S. et al. A feature temporal attention based interleaved network for fast video object detection. J Ambient Intell Human Comput 14, 497–509 (2023). https://doi.org/10.1007/s12652-021-03309-3

Received: 27 October 2020
Accepted: 01 May 2021
Published: 11 May 2021
Issue Date: January 2023
DOI: https://doi.org/10.1007/s12652-021-03309-3
