Abstract
With the rise of the we-media era, live video is attracting ever more web users. Effectively identifying and supervising streamer activities in live video is therefore of great significance for the high-quality development of the live video industry. A streamer activity can be characterized as the temporal composition of a series of actions. To improve the accuracy of streamer temporal action detection, combining temporal action localization with a co-attention mechanism is a promising way to overcome the problem of blurred action boundaries. We therefore propose a streamer temporal action detection method based on co-attention boundary matching in live video. (1) The global spatiotemporal features and the action template features of the live video are extracted by a two-stream convolutional network and an action spatiotemporal attention network, respectively. (2) Boundary probability sequences are generated from the global spatiotemporal features through temporal action evaluation, and boundary-matching confidence maps are produced by confidence evaluation of the global spatiotemporal features and action template features under the co-attention mechanism. (3) Streamer temporal actions are detected from the action proposals generated by the probability sequences and the boundary-matching confidence maps. We build a real-world streamer action dataset, BJUT-SAD, and conduct extensive experiments verifying that our method improves the accuracy of streamer temporal action detection in live video. In particular, our method produces competitive results against prior methods on both temporal action proposal generation and streamer action detection, demonstrating its effectiveness.
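Step (3) above combines boundary probability sequences with a boundary-matching confidence map to form scored action proposals. The following is a minimal illustrative sketch of that pairing-and-scoring idea, not the authors' implementation: the function `generate_proposals`, the threshold value, and the multiplicative scoring rule are all assumptions made for illustration.

```python
def generate_proposals(start_probs, end_probs, bm_conf, thresh=0.5):
    """Pair candidate start/end boundaries and score each interval with a
    boundary-matching confidence map (illustrative sketch only).

    start_probs, end_probs: length-T lists of per-snippet boundary probabilities.
    bm_conf: T x T nested list; bm_conf[s][e] is the confidence that the
             interval [s, e] contains a complete action instance.
    Returns proposals as (start, end, score), highest score first.
    """
    T = len(start_probs)

    def candidates(p):
        # A snippet is a candidate boundary if its probability exceeds the
        # threshold or is a local maximum of the sequence.
        idx = []
        for t in range(T):
            local_max = (t == 0 or p[t] >= p[t - 1]) and \
                        (t == T - 1 or p[t] >= p[t + 1])
            if p[t] > thresh or local_max:
                idx.append(t)
        return idx

    proposals = []
    for s in candidates(start_probs):
        for e in candidates(end_probs):
            if e > s:  # an action must end after it starts
                score = start_probs[s] * end_probs[e] * bm_conf[s][e]
                proposals.append((s, e, score))
    return sorted(proposals, key=lambda x: -x[2])
```

In practice the top-ranked proposals would still be deduplicated (e.g. with Soft-NMS) before classification, but the sketch shows how boundary probabilities and the confidence map jointly rank candidate intervals.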
Data availability
The datasets used or analyzed during the current study are available from the corresponding author Jing Zhang on reasonable request.
Code availability
The code used to generate results shown in this study is available from the corresponding author Jing Zhang upon request.
Funding
This work was supported by the National Natural Science Foundation of China (No. 61971016) and the Beijing Municipal Education Commission Cooperation Beijing Natural Science Foundation (No. KZ201910005007).
Author information
Contributions
CL: investigation, formal analysis, data curation, methodology, writing—original draft, validation. CH: formal analysis, data curation, methodology, writing—review and editing. HZ: investigation, conceptualization, project administration, writing—review and editing. JY: formal analysis, data curation, methodology. JZ: conceptualization, funding acquisition, methodology, project administration, supervision, writing—review and editing. LZ: project administration, supervision.
Ethics declarations
Conflict of interest
All authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.
Additional information
Publisher's Note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Cite this article
Li, C., He, C., Zhang, H. et al. Streamer temporal action detection in live video by co-attention boundary matching. Int. J. Mach. Learn. & Cyber. 13, 3071–3088 (2022). https://doi.org/10.1007/s13042-022-01581-z