
Streamer temporal action detection in live video by co-attention boundary matching

  • Original Article
  • Published in International Journal of Machine Learning and Cybernetics

Abstract

With the advent of the we-media era, live video is attracting more and more web users. Effectively identifying and supervising streamer activities in live video is of great significance for the high-quality development of the live video industry. A streamer activity can be characterized as the temporal composition of a series of actions. To improve the accuracy of streamer temporal action detection, combining temporal action localization with a co-attention mechanism is a promising way to overcome the problem of blurred action boundaries. We therefore propose a streamer temporal action detection method based on co-attention boundary matching in live video. (1) Global spatiotemporal features and action template features of the live video are extracted by a two-stream convolutional network and an action spatiotemporal attention network, respectively. (2) Probability sequences are generated from the global spatiotemporal features through temporal action evaluation, and boundary matching confidence maps are produced by confidence evaluation of the global spatiotemporal features and action template features under the co-attention mechanism. (3) Streamer temporal actions are detected from the action proposals generated by the probability sequences and boundary matching maps. We build a real-world streamer action dataset, BJUT-SAD, and conduct extensive experiments to verify that our method boosts the accuracy of streamer temporal action detection in live video. In particular, our temporal action proposal generation and streamer action detection results are competitive with prior methods, demonstrating the effectiveness of our approach.
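As a rough illustration of the pipeline sketched in the abstract, the following minimal PyTorch example shows how co-attended features could feed a temporal evaluation head that outputs start/end/actionness probability sequences. All module names, dimensions, and layer choices here are hypothetical assumptions for illustration only, not the authors' implementation; the full method additionally builds boundary-matching confidence maps from the co-attended features before proposal generation.

import torch
import torch.nn as nn

class CoAttention(nn.Module):
    # Cross-attends global spatiotemporal features with action template features.
    def __init__(self, dim):
        super().__init__()
        self.query = nn.Linear(dim, dim)
        self.key = nn.Linear(dim, dim)
        self.value = nn.Linear(dim, dim)

    def forward(self, global_feat, template_feat):
        # global_feat, template_feat: (batch, time, channels)
        scores = self.query(global_feat) @ self.key(template_feat).transpose(1, 2)
        attn = torch.softmax(scores / global_feat.size(-1) ** 0.5, dim=-1)
        return global_feat + attn @ self.value(template_feat)

class TemporalEvaluation(nn.Module):
    # Predicts per-snippet start, end, and actionness probabilities.
    def __init__(self, dim):
        super().__init__()
        self.head = nn.Sequential(
            nn.Conv1d(dim, dim, kernel_size=3, padding=1), nn.ReLU(),
            nn.Conv1d(dim, 3, kernel_size=1), nn.Sigmoid())

    def forward(self, feat):                     # feat: (batch, time, channels)
        return self.head(feat.transpose(1, 2))   # (batch, 3, time)

# Toy usage with random tensors standing in for the two feature streams.
batch, time, channels = 2, 100, 256
global_feat = torch.randn(batch, time, channels)
template_feat = torch.randn(batch, time, channels)
fused = CoAttention(channels)(global_feat, template_feat)
start_prob, end_prob, actionness = TemporalEvaluation(channels)(fused).unbind(dim=1)
# Proposals would then pair high-probability starts with later high-probability
# ends and be re-scored with a boundary-matching confidence map before NMS.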


Data availability

The datasets used or analyzed during the current study are available from the corresponding author Jing Zhang on reasonable request.

Code availability

The code used to generate results shown in this study is available from the corresponding author Jing Zhang upon request.


Funding

This work was supported by the National Natural Science Foundation of China (No. 61971016) and the Beijing Municipal Education Commission Cooperation Beijing Natural Science Foundation (No. KZ201910005007).

Author information

Contributions

CL: investigation, formal analysis, data curation, methodology, writing—original draft, validation. CH: formal analysis, data curation, methodology, writing—review and editing. HZ: investigation, conceptualization, project administration, writing—review and editing. JY: formal analysis, data curation, methodology. JZ: conceptualization, funding acquisition, methodology, project administration, supervision, writing—review and editing. LZ: project administration, supervision.

Corresponding author

Correspondence to Jing Zhang.

Ethics declarations

Conflict of interest

All authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.


About this article


Cite this article

Li, C., He, C., Zhang, H. et al. Streamer temporal action detection in live video by co-attention boundary matching. Int. J. Mach. Learn. & Cyber. 13, 3071–3088 (2022). https://doi.org/10.1007/s13042-022-01581-z

