Two-stream graph convolutional neural network fusion for weakly supervised temporal action detection

  • Original Paper
  • Published:
Signal, Image and Video Processing

Abstract

Weakly supervised temporal action detection is an important and challenging task: it must detect the temporal intervals of actions and identify their categories using only video-level labels. Correctly identifying the transition states between action and background improves detection accuracy; this paper therefore focuses on filtering out transition states and proposes a two-stream graph convolutional neural network fusion for weakly supervised temporal action detection. In general, a transition state changes prominently and lasts only a short time, and its characteristics differ from those of the action itself. The feature difference between two temporally interacting video segments therefore indicates whether a segment belongs to a transition state. Based on the feature similarity and temporal correlation of the segments, a semantic similarity weighted graph and a transition-aware temporal correlation graph are constructed. Finally, a temporal attention sequence over the video segments is extracted from the fused two-stream graph features. The attention-based feature representation is fed to a linear classifier to generate the class activation sequence, from which temporal action detection is performed. Experimental results on public datasets show that the proposed method effectively improves action detection performance.
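
To make this pipeline concrete, the PyTorch sketch below shows one plausible way such a model could be wired together. It is a minimal illustration under stated assumptions rather than the authors' implementation: the 2048-dimensional segment features (as produced by I3D), the cosine-similarity construction of the semantic similarity weighted graph, the neighbor-difference construction of the transition-aware temporal correlation graph, the single-layer graph convolutions, and the additive fusion are all simplifications chosen for readability.

import torch
import torch.nn as nn
import torch.nn.functional as F


def semantic_similarity_graph(x):
    # x: (T, D) segment features of one untrimmed video.
    # Edges weighted by cosine similarity between segment features,
    # row-normalized so each row sums to one.
    x_norm = F.normalize(x, dim=-1)
    sim = torch.relu(x_norm @ x_norm.t())
    return sim / sim.sum(dim=-1, keepdim=True).clamp(min=1e-6)


def transition_aware_graph(x):
    # Connects only temporal neighbors and down-weights edges across abrupt
    # feature changes: a large feature difference between adjacent segments
    # suggests a transition state, so the corresponding edge is weak.
    T = x.size(0)
    adj = torch.eye(T, device=x.device)               # self-loops
    d = (x[1:] - x[:-1]).norm(dim=-1)                 # neighbor feature difference
    w = torch.exp(-d)                                 # small difference -> strong edge
    idx = torch.arange(T - 1, device=x.device)
    adj[idx, idx + 1] = w
    adj[idx + 1, idx] = w
    return adj / adj.sum(dim=-1, keepdim=True).clamp(min=1e-6)


class TwoStreamGraphAttention(nn.Module):
    # Two single-layer graph convolutions (one per graph), additive fusion,
    # a temporal attention branch, and a linear classifier producing the
    # class activation sequence (CAS).
    def __init__(self, feat_dim=2048, hidden_dim=512, num_classes=20):
        super().__init__()
        self.gc_sem = nn.Linear(feat_dim, hidden_dim)
        self.gc_tem = nn.Linear(feat_dim, hidden_dim)
        self.attn = nn.Linear(hidden_dim, 1)
        self.classifier = nn.Linear(hidden_dim, num_classes)

    def forward(self, x):
        # x: (T, D) features, e.g. I3D features of T video segments.
        a_sem = semantic_similarity_graph(x)
        a_tem = transition_aware_graph(x)
        h_sem = torch.relu(self.gc_sem(a_sem @ x))    # graph convolution: A X W
        h_tem = torch.relu(self.gc_tem(a_tem @ x))
        h = h_sem + h_tem                             # fuse the two graph streams
        attn = torch.sigmoid(self.attn(h))            # (T, 1) temporal attention
        cas = self.classifier(attn * h)               # (T, C) class activation sequence
        video_score = (attn * cas).sum(0) / attn.sum(0).clamp(min=1e-6)
        return attn.squeeze(-1), cas, video_score


if __name__ == "__main__":
    model = TwoStreamGraphAttention()
    feats = torch.randn(120, 2048)                    # 120 segments of one video
    attention, cas, video_score = model(feats)
    print(attention.shape, cas.shape, video_score.shape)

In practice, the class activation sequence and attention values would be thresholded over time to obtain action intervals, and the attention-weighted video-level scores would be trained against the video-level labels; those steps are omitted from this sketch.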

Acknowledgements

This work was supported by the National Natural Science Foundation of China under Grants 61771420 and 62001413, the Natural Science Foundation of Hebei Province under Grant F2020203064, and the Doctoral Foundation of Yanshan University under Grant BL18033.

Author information

Corresponding author

Correspondence to Zhengping Hu.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

About this article

Cite this article

Zhao, M., Hu, Z., Li, S. et al. Two-stream graph convolutional neural network fusion for weakly supervised temporal action detection. SIViP 16, 947–954 (2022). https://doi.org/10.1007/s11760-021-02039-5
