Abstract
With the rise of the we-media era, live video is attracting ever more web users. Effectively identifying and supervising streamer activities in live video is therefore of great significance for the high-quality development of the live video industry. A streamer activity can be characterized as the temporal composition of a series of actions. To improve the accuracy of streamer temporal action detection, combining temporal action localization with a co-attention mechanism is a promising way to overcome the problem of blurred action boundaries. We therefore propose a streamer temporal action detection method based on co-attention boundary matching in live video. (1) The global spatiotemporal features and the action template features of the live video are extracted by a two-stream convolutional network and an action spatiotemporal attention network, respectively. (2) Boundary probability sequences are generated from the global spatiotemporal features through temporal action evaluation, and boundary-matching confidence maps are produced by confidence evaluation of the global spatiotemporal features and action template features under the co-attention mechanism. (3) Streamer temporal actions are detected from the action proposals generated by the probability sequences and the boundary-matching confidence maps. We build a real-world streamer action dataset, BJUT-SAD, and conduct extensive experiments verifying that our method improves the accuracy of streamer temporal action detection in live video. In particular, our method produces competitive results against prior methods on both temporal action proposal generation and streamer action detection, demonstrating its effectiveness.
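Step (3) above combines boundary probability sequences with a boundary-matching confidence map to form scored action proposals. The following is a minimal illustrative sketch of that pairing-and-scoring idea, not the authors' implementation: the function `generate_proposals`, the threshold value, and the multiplicative scoring rule are all assumptions made for illustration.

```python
def generate_proposals(start_probs, end_probs, bm_conf, thresh=0.5):
    """Pair candidate start/end boundaries and score each interval with a
    boundary-matching confidence map (illustrative sketch only).

    start_probs, end_probs: length-T lists of per-snippet boundary probabilities.
    bm_conf: T x T nested list; bm_conf[s][e] is the confidence that the
             interval [s, e] contains a complete action instance.
    Returns proposals as (start, end, score), highest score first.
    """
    T = len(start_probs)

    def candidates(p):
        # A snippet is a candidate boundary if its probability exceeds the
        # threshold or is a local maximum of the sequence.
        idx = []
        for t in range(T):
            local_max = (t == 0 or p[t] >= p[t - 1]) and \
                        (t == T - 1 or p[t] >= p[t + 1])
            if p[t] > thresh or local_max:
                idx.append(t)
        return idx

    proposals = []
    for s in candidates(start_probs):
        for e in candidates(end_probs):
            if e > s:  # an action must end after it starts
                score = start_probs[s] * end_probs[e] * bm_conf[s][e]
                proposals.append((s, e, score))
    return sorted(proposals, key=lambda x: -x[2])
```

In practice the top-ranked proposals would still be deduplicated (e.g. with Soft-NMS) before classification, but the sketch shows how boundary probabilities and the confidence map jointly rank candidate intervals.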
Data availability
The datasets used or analyzed during the current study are available from the corresponding author Jing Zhang on reasonable request.
Code availability
The code used to generate results shown in this study is available from the corresponding author Jing Zhang upon request.
Funding
This work was supported by the National Natural Science Foundation of China (No. 61971016) and the Beijing Municipal Education Commission Cooperation Beijing Natural Science Foundation (No. KZ201910005007).
Author information
Contributions
CL: investigation, formal analysis, data curation, methodology, writing—original draft, validation. CH: formal analysis, data curation, methodology, writing—review and editing. HZ: investigation, conceptualization, project administration, writing—review and editing. JY: formal analysis, data curation, methodology. JZ: conceptualization, funding acquisition, methodology, project administration, supervision, writing—review and editing. LZ: project administration, supervision.
Ethics declarations
Conflict of interest
All authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.
Additional information
Publisher's Note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Cite this article
Li, C., He, C., Zhang, H. et al. Streamer temporal action detection in live video by co-attention boundary matching. Int. J. Mach. Learn. & Cyber. 13, 3071–3088 (2022). https://doi.org/10.1007/s13042-022-01581-z