Weakly supervised coarse-to-fine learning for human action segmentation in HCI videos

  • Special Issue 1224: New Frontiers in Multimedia-based and Multimodal HCI
  • Published in: Multimedia Tools and Applications

Abstract

Human action segmentation in video analysis for HCI (human-computer interaction) applications has been extensively studied to determine the category and start time of each action that occurs in a video. However, it remains an unsolved problem because large amounts of accurate annotation data, with begin-frame, end-frame, and action-category labels, are rarely available in video analysis applications. To handle this issue, transcript-based weakly supervised action segmentation uses only the ordered list of actions occurring in a long video instead of a specific label for each frame, which significantly reduces the difficulty of obtaining finely labeled video datasets. The task nevertheless remains challenging because the temporal lengths of the actions in the video must still be partitioned. In this paper, we use the Viterbi algorithm to generate an initial, coarse action segmentation as the baseline and then design a coarse-to-fine learning framework to refine the length partition. By connecting the candidate frames around the initial segmentation points in order and constructing a fully connected directed graph, a new coarse-to-fine loss function is designed to learn the scores of valid and invalid segmentation paths, respectively. The framework learns the coarse-to-fine loss in an end-to-end manner to reduce the weights of invalid segmentation paths and obtain the best video segmentation. Compared with state-of-the-art methods, experiments on the Breakfast and 50Salads datasets show that our fine partition model and coarse-to-fine loss function obtain higher frame accuracy and significantly reduce the time spent on human action segmentation in HCI videos. The source code will be made publicly available (https://github.com/WeaklyActionSegmentation).
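For readers unfamiliar with the baseline step, the sketch below illustrates transcript-constrained Viterbi alignment in its simplest form: a dynamic program that assigns a contiguous block of frames to each action in the known order so that the total per-frame log-probability is maximized. The function name and the uniform-transition simplification are ours; the method in this paper additionally uses the coarse-to-fine refinement described above.

    import numpy as np

    def viterbi_segment(log_probs, transcript):
        """Align an ordered transcript of N actions to T frames.

        log_probs: (T, C) array of per-frame log class probabilities.
        transcript: list of N action class indices in their known order.
        Returns the start frame of each action. Assumes T >= N and, as a
        simplification, uses no explicit length model.
        """
        T = log_probs.shape[0]
        N = len(transcript)
        # score[t, n]: best log-score of frames 0..t, with frame t inside action n
        score = np.full((T, N), -np.inf)
        advance = np.zeros((T, N), dtype=bool)  # True if action n starts at frame t

        score[0, 0] = log_probs[0, transcript[0]]
        for t in range(1, T):
            for n in range(min(t + 1, N)):
                stay = score[t - 1, n]
                move = score[t - 1, n - 1] if n > 0 else -np.inf
                advance[t, n] = move > stay
                score[t, n] = max(stay, move) + log_probs[t, transcript[n]]

        # Backtrace from the last frame of the last action to recover boundaries.
        starts = [0] * N
        n = N - 1
        for t in range(T - 1, 0, -1):
            if advance[t, n]:
                starts[n] = t
                n -= 1
        return starts

In the framework described above, boundaries found this way serve only as coarse anchors: candidate frames around them are connected into a fully connected directed graph, and the coarse-to-fine loss then down-weights invalid segmentation paths.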


Data Availability

The two datasets used in the current study are publicly available via the web links given in references [20] and [40]. All data generated or analysed during this study are included in this published article; further details are available from the corresponding author on reasonable request.

Notes

  1. NN-Viterbi denotes the framework of [30], in which action segmentation is learned with a GRU neural network combined with the Viterbi algorithm; a sketch of such a frame classifier follows below.
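For concreteness, here is a minimal sketch of such a frame-wise GRU classifier in PyTorch; the class name, hidden size, and single-layer configuration are illustrative assumptions rather than the exact setup of [30]. Its output log-probabilities are what a Viterbi decoder consumes.

    import torch
    import torch.nn as nn

    class FrameClassifier(nn.Module):
        """GRU over per-frame features, emitting per-frame class
        log-probabilities for a downstream Viterbi decoder.
        Hidden size and depth are illustrative, not those of [30]."""

        def __init__(self, feat_dim: int, num_classes: int, hidden: int = 64):
            super().__init__()
            self.gru = nn.GRU(feat_dim, hidden, batch_first=True)
            self.fc = nn.Linear(hidden, num_classes)

        def forward(self, x: torch.Tensor) -> torch.Tensor:
            # x: (batch, T, feat_dim) -> log-probs of shape (batch, T, num_classes)
            h, _ = self.gru(x)
            return torch.log_softmax(self.fc(h), dim=-1)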

References

  1. AbuFarha Y, Li SJ, Liu Y et al (2020) MS-TCN++: Multi-stage temporal convolutional network for action segmentation. IEEE Transactions on Pattern Analysis and Machine Intelligence

  2. Adiono T, Aska Y, Fuada S et al (2017) Design of an OFDM system for VLC with a Viterbi decoder. IEIE Transactions on Smart Processing and Computing (SPC) 6(6):455–465

  3. Agrawal A, Vishwakarma S (2013) A survey on activity recognition and behavior understanding in video surveillance. The Visual Computer 29:983–1009

  4. Alayrac JB, Agrawal N, Bojanowski P, Laptev I, Lacoste-Julien S, Sivic J (2016) Unsupervised learning from narrated instruction videos. In: IEEE Conference on Computer Vision and Pattern Recognition, pp 4575–4583

  5. Amin S, Andriluka M, Rohrbach M, Schiele B (2012) A database for fine grained activity detection of cooking activities. In: IEEE Conference on Computer Vision and Pattern Recognition, pp 1194–1201

  6. Arora S, Kalsotra R (2021) Background subtraction for moving object detection: explorations of recent developments and challenges. The Visual Computer

  7. Arunlal KS, Hariprasad SA (2012) An efficient Viterbi decoder. International Journal of Computer Science, Engineering and Applications 2(1):95

  8. Bach F, Bojanowski P, Lajugie R, Laptev I, Ponce J, Schmid C, Sivic J (2014) Weakly supervised action labeling in videos under ordering constraints. In: European Conference on Computer Vision, pp 628–643

  9. Bowden R, Koller O, Ney H (2016) Deep hand: How to train a CNN on 1 million hand images when your data is continuous and weakly labelled. In: IEEE Conference on Computer Vision and Pattern Recognition, pp 3793–3802

  10. Buch S, Escorcia V, Shen C et al (2017) SST: Single-stream temporal action proposals. In: IEEE Conference on Computer Vision and Pattern Recognition, pp 2911–2920

  11. Chang CY, Huang DA, Sui Y, Fei-Fei L, Niebles JC (2019) D3TW: Discriminative differentiable dynamic time warping for weakly supervised action alignment and segmentation. In: IEEE Conference on Computer Vision and Pattern Recognition, pp 3546–3555

  12. Dieleman S, van den Oord A, Zen H et al (2016) WaveNet: A generative model for raw audio. In: 9th ISCA Speech Synthesis Workshop, p 125

  13. Ding L, Xu C (2018) Weakly-supervised action segmentation with iterative soft boundary assignment. In: IEEE Conference on Computer Vision and Pattern Recognition, pp 6508–6516

  14. Dollár P, He K, Goyal P, Girshick R, Lin TY (2017) Focal loss for dense object detection. IEEE Transactions on Pattern Analysis and Machine Intelligence

  15. el Yacoubi MA, Granger N (2017) Comparing hybrid NN-HMM and RNN for temporal modeling in gesture recognition. In: International Conference on Neural Information Processing. Springer, Cham, pp 147–156

  16. Farha YA, Gall J (2019) MS-TCN: Multi-stage temporal convolutional network for action segmentation. In: IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp 3575–3584

  17. Fayyaz M, Gall J (2020) SCT: Set constrained temporal transformer for set supervised action segmentation. In: IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp 501–510

  18. Flynn MD, Hager GD, Lea C, Reiter A, Vidal R (2017) Temporal convolutional networks for action segmentation and detection. In: IEEE Conference on Computer Vision and Pattern Recognition, pp 156–165

  19. Flynn MD, Lea C, Vidal R et al (2017) Temporal convolutional networks for action segmentation and detection. In: IEEE Conference on Computer Vision and Pattern Recognition, pp 156–165

  20. Gall J, Kuehne H, Richard A (2017) Weakly supervised action learning with RNN based fine-to-coarse modeling. In: IEEE Conference on Computer Vision and Pattern Recognition, pp 754–763

  21. Gall J, Kuehne H, Richard A (2017) Weakly supervised learning of actions from transcripts. Computer Vision and Image Understanding 163:78–89

  22. Gall J, Kuehne H, Richard A (2018) A hybrid RNN-HMM approach for weakly supervised temporal action segmentation. IEEE Transactions on Pattern Analysis and Machine Intelligence 42(4):765–779

  23. Gall J, Li Z, Farha YA (2021) Temporal action segmentation from timestamp supervision. In: IEEE/CVF Conference on Computer Vision and Pattern Recognition

  24. Gall J, Richard A (2016) Temporal action detection using a statistical language model. In: IEEE Conference on Computer Vision and Pattern Recognition, pp 3551–3558

  25. Gall J, Richard A, Kuehne H (2018) Action sets: Weakly supervised action segmentation without ordering constraints. In: IEEE Conference on Computer Vision and Pattern Recognition, pp 5987–5996

  26. Gall J, Serre T, Kuehne H (2016) An end-to-end generative framework for video segmentation and recognition. In: IEEE Winter Conference on Applications of Computer Vision, pp 1–8

  27. Gao S, Cheng MM, Zhao K et al (2019) Res2Net: A new multi-scale backbone architecture. IEEE Transactions on Pattern Analysis and Machine Intelligence

  28. Gao J, Nevatia R, Yang Z (2017) Cascaded boundary regression for temporal action detection. arXiv:1705.01180

  29. Huang W, Tan M, Zeng R et al (2019) Graph convolutional networks for temporal action localization. In: IEEE/CVF International Conference on Computer Vision, pp 7094–7103

  30. Iqbal A, Gall J, Kuehne H, Richard A (2018) NeuralNetwork-Viterbi: A framework for weakly supervised video learning. In: IEEE Conference on Computer Vision and Pattern Recognition, pp 7386–7395

  31. Jones M, Marks TK, Singh B, Shao M, Tuzel O (2016) A multi-stream bi-directional recurrent neural network for fine-grained action detection. In: IEEE Conference on Computer Vision and Pattern Recognition, pp 1961–1970

  32. Kim DY, Yoon Y, Yu J et al (2020) Action matching network: open-set action recognition using spatio-temporal representation matching. The Visual Computer 36:1457–1471

  33. Koller O, Ney H, Zargaran S (2017) Re-Sign: Re-aligned end-to-end sequence modelling with deep recurrent CNN-HMMs. In: IEEE Conference on Computer Vision and Pattern Recognition, pp 4297–4305

  34. Laptev I, Marszalek M, Rozenfeld B, Schmid C (2008) Learning realistic human actions from movies. In: IEEE Conference on Computer Vision and Pattern Recognition, pp 1–8

  35. Laptev I, Marszalek M, Schmid C (2009) Actions in context. In: IEEE Conference on Computer Vision and Pattern Recognition, pp 2929–2936

  36. Lei P, Li J, Todorovic S (2019) Weakly supervised energy-based learning for action segmentation. In: IEEE Conference on Computer Vision and Pattern Recognition, pp 6243–6251

  37. Lei P, Todorovic S (2018) Temporal deformable residual networks for action segmentation in videos. In: IEEE Conference on Computer Vision and Pattern Recognition, pp 6742–6751

  38. Li J, Todorovic S (2020) Set-constrained Viterbi for set-supervised action segmentation. In: IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp 10820–10829

  39. Li J, Todorovic S (2021) Anchor-constrained Viterbi for set-supervised action segmentation. In: IEEE/CVF Conference on Computer Vision and Pattern Recognition

  40. Mckenna SJ, Stein S (2013) Combining embedded accelerometers with computer vision for recognizing food preparation activities. In: ACM International Joint Conference on Pervasive and Ubiquitous Computing, pp 729–738

  41. Mori G, Russakovsky O, Yeung S et al (2016) End-to-end learning of action detection from frame glimpses in videos. In: IEEE Conference on Computer Vision and Pattern Recognition, pp 2678–2687

  42. Perronnin F, Sánchez J, Mensink T (2010) Improving the Fisher kernel for large-scale image classification. In: European Conference on Computer Vision, pp 143–156

  43. Schmid C, Wang H (2013) Action recognition with improved trajectories. In: IEEE International Conference on Computer Vision, pp 3551–3558

  44. Souri Y et al (2021) Fast weakly supervised action segmentation using mutual consistency. IEEE Transactions on Pattern Analysis and Machine Intelligence

  45. Viterbi AJ (2006) A personal history of the Viterbi algorithm. IEEE Signal Processing Magazine 23(4):120–142

  46. Wang L, Xiong Y, Zhao Y et al (2017) Temporal action detection with structured segment networks. In: IEEE International Conference on Computer Vision, pp 2914–2923

  47. Zhou ZH (2018) A brief introduction to weakly supervised learning. National Science Review 5(1):44–53

Acknowledgements

The authors would like to thank Associate Professor Tian Wang of Beihang University and Associate Professor Fang Wan of the University of Chinese Academy of Sciences for helpful comments and discussions. Longshuai Sheng and Ce Li declare that they have no conflict of interest.

Funding

This study was funded by the National Natural Science Foundation of China (62076016, 61972016, 62176260), the Beijing Nova Program of Science and Technology (Z191100001119106, Z211100002121147), and the Beijing Municipal Natural Science Foundation (4202065), and was supported by the Fundamental Research Funds for the Central Universities (2022YJSJD11, J210409).

Author information

Corresponding author

Correspondence to Ce Li.

Additional information

Publisher’s note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Rights and permissions

Springer Nature or its licensor holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.

About this article

Cite this article

Sheng, L., Li, C. Weakly supervised coarse-to-fine learning for human action segmentation in HCI videos. Multimed Tools Appl 82, 12977–12993 (2023). https://doi.org/10.1007/s11042-022-13792-1
