Abstract
Human action segmentation in video analysis for human-computer interaction (HCI) applications has been studied extensively, with the goal of recovering the category and start time of each action that occurs in a video. It remains an unsolved problem, however, because large amounts of accurate annotation (begin frame, end frame, and action category) are rarely available in video analysis applications. Transcript-based weakly supervised action segmentation addresses this issue by using only the action annotation of the whole sequence in a long video (the transcript) instead of a label for every frame, which significantly reduces the difficulty of obtaining finely labeled video datasets. The task is still challenging because partitioning a video into the temporal lengths of its actions is complex. In this paper, we use the Viterbi algorithm to generate an initial, coarse action segmentation as the baseline and then design a coarse-to-fine learning framework to refine the length partition. By connecting the candidate frames of the initial segmentation points in order and constructing a fully connected directed graph, we design a new coarse-to-fine loss function that learns the scores of valid and invalid segmentation paths, respectively. The framework learns this loss function in an end-to-end manner to reduce the weight of the scores of invalid segmentation paths and obtain the best video segmentation. Experiments on the Breakfast and 50Salads datasets show that, compared with state-of-the-art methods, our fine partition model and coarse-to-fine loss function obtain higher frame accuracy and significantly reduce the time spent on human action segmentation in HCI videos. The source code will be made publicly available (https://github.com/WeaklyActionSegmentation).
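To make the coarse initialization step concrete, the following is a minimal sketch of a transcript-constrained Viterbi alignment. It is not the authors' implementation: the function name `viterbi_align`, the input layout (a `(T, C)` array of per-frame log-probabilities), and the omission of any action-length model are assumptions made for illustration only.

```python
import numpy as np


def viterbi_align(log_probs, transcript):
    """Align T frames to an ordered transcript of N actions (illustrative only).

    log_probs:  array of shape (T, C) with per-frame log-probabilities per class.
    transcript: list of N class indices in the order the actions occur.
    Returns a list of T class labels that follows the transcript order.
    """
    T = log_probs.shape[0]
    N = len(transcript)
    if N > T:
        raise ValueError("transcript is longer than the video")

    # score[t, n]: best score of frames 0..t when frame t lies in segment n
    score = np.full((T, N), -np.inf)
    # back[t, n]: True if frame t is the first frame of segment n
    back = np.zeros((T, N), dtype=bool)
    score[0, 0] = log_probs[0, transcript[0]]

    for t in range(1, T):
        for n in range(min(t + 1, N)):
            stay = score[t - 1, n]                           # remain in segment n
            advance = score[t - 1, n - 1] if n > 0 else -np.inf  # enter segment n
            if advance > stay:
                score[t, n] = advance
                back[t, n] = True
            else:
                score[t, n] = stay
            score[t, n] += log_probs[t, transcript[n]]

    # Backtrack from the last frame, which must lie in the last segment.
    labels = []
    n = N - 1
    for t in range(T - 1, -1, -1):
        labels.append(transcript[n])
        if back[t, n]:
            n -= 1
    return labels[::-1]
```

The boundaries between consecutive labels in the returned sequence are the coarse segmentation points; in the framework described above, candidate frames around such points would then be refined by the coarse-to-fine loss.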
Notes
NN-Viterbi denotes the method of [30], in which action segmentation is learned with a GRU neural network and the Viterbi algorithm.
References
AbuFarha Y, Li SJ, Liu Y, et al. (2020) MS-TCN++: Multi-stage temporal convolutional network for action segmentation. IEEE Transactions on Pattern Analysis and Machine Intelligence
Adiono T, Aska Y, Fuada S, et al. (2017) Design of an OFDM System for VLC with a Viterbi Decoder. IEIE Transaction on Smart Processing and Computing (SPC) 6(6):455–465
Agrawal A, Vishwakarma S (2013) A survey on activity recognition and behavior understanding in video surveillance. Visual Computer 29:983–1009
Alayrac JB, Agrawal N, Bojanowski P, Laptev I, Lacoste-Julien S, Sivic J (2016) Unsupervised learning from narrated instruction videos. In: IEEE Conference Computer Vision Pattern Recognition, pp 4575–4583
Amin S, Andriluka M, Rohrbach M, Schiele B (2012) A database for fine grained activity detection of cooking activities. In: IEEE Conference Computer Vision Pattern Recognition, pp 1194–1201
Arora S, Kalsotra R (2021) Background subtraction for moving object detection: explorations of recent developments and challenges. Visual Computer
Arunlal KS, Hariprasad SA (2012) An efficient viterbi decoder. International Journal of Computer Science, Engineering and Applications 2(1):95
Bach I, Bojanowski P, Lajugie R, Laptev F, Ponce J, Schmid C, Sivic J (2014) Weakly supervised action labeling in videos under ordering constraints. In: Eur. Conf. Comput. Vis., pp 628–643
Bowden R, Koller O, Ney H (2016) Deep hand: How to train a CNN on 1 million hand images when your data is continuous and weakly labelled. In: IEEE Conference Computer Vision Pattern Recognition, pp 3793–3802
Buch S, Escorcia V, Shen C et al (2017) SST: Single-stream temporal action proposals. In: Proceedings of the IEEE conference on Computer Vision and Pattern Recognition, pp 2911–2920
Chang CY, Huang DA, Sui Y, Fei-Fei L, Niebles JC (2019) D3TW: Discriminative differentiable dynamic time warping for weakly supervised action alignment and segmentation. In: IEEE Conference Computer Vision Pattern Recognition, pp 3546–3555
Dieleman S, van den Oord A, Zen H et al (2016) WaveNet: A Generative Model for Raw Audio. In: 9th ISCA Speech Synthesis Workshop, pp 125–125
Ding L, Xu C (2018) Weakly-supervised action segmentation with iterative soft boundary assignment. In: IEEE Conference Computer Vision Pattern Recognition, pp 6508–6516
Dollár P, He K, Goyal P, Girshick R, Lin TY (2017) Focal Loss for Dense Object Detection. IEEE Transaction Pattern Analysis Machine Intelligence
el Yacoubi MA, Granger N (2017) Comparing hybrid NN-HMM and RNN for temporal modeling in gesture recognition. In: International Conference on Neural Information Processing. Springer, Cham, pp 147–156
Farha YA, Gall J (2019) MS-TCN: Multi-stage temporal convolutional network for action segmentation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp 3575–3584
Fayyaz M, Gall J (2020) SCT: Set Constrained Temporal Transformer for Set Supervised Action Segmentation. In: IEEE/CVF Conference Computer Vision Pattern Recognition, pp 501–510
Flynn MD, Hager GD, Lea C, Reiter A, Vidal R (2017) Temporal convolutional networks for action segmentation and detection. In: IEEE Conference Computer Vision Pattern Recognition, pp 156–165
Flynn MD, Lea C, Vidal R, et al. (2017) Temporal convolutional networks for action segmentation and detection. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp 156–165
Gall J, Kuehne H, Richard A (2017) Weakly supervised action learning with RNN based fine-to-coarse modeling. In: IEEE Conference Computer Vision Pattern Recognition, pp 754–763
Gall J, Kuehne H, Richard A (2017) Weakly supervised learning of actions from transcripts. Computer Vision Image Understanding 163:78–89
Gall J, Kuehne H, Richard A (2018) A hybrid RNN-HMM approach for weakly supervised temporal action segmentation. IEEE Trans Pattern Anal Mach Intell 42(4):765–779
Gall J, Li Z, Farha YA (2021) Temporal Action Segmentation from Timestamp Supervision. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition
Gall J, Richard A (2016) Temporal action detection using a statistical language model. In: IEEE Conference Computer Vision Pattern Recognition, pp 3551–3558
Gall J, Richard A, Kuehne H (2018) Action sets: Weakly supervised action segmentation without ordering constraints. In: IEEE Conference Computer Vision Pattern Recognition, pp 5987–5996
Gall J, Serre T, Kuehne H (2016) An end-to-end generative framework for video segmentation and recognition. In: IEEE Winter Conference Application Computer Vision, pp 1–8
Gao S, Cheng MM, Zhao K, et al. (2019) Res2Net: A new multi-scale backbone architecture. IEEE Transactions on Pattern Analysis and Machine Intelligence
Gao J, Nevatia R, Yang Z (2017) Cascaded boundary regression for temporal action detection. arXiv:1705.01180
Huang W, Tan M, Zeng R et al (2019) Graph convolutional networks for temporal action localization. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp 7094–7103
Iqbal A, Gall J, Kuehne H, Richard A (2018) NeuralNetwork-Viterbi: A framework for weakly supervised video learning. In: IEEE Conference Computer Vision Pattern Recognition, pp 7386–7395
Jones M, Marks TK, Singh B, Shao M, Tuzel O (2016) A multi-stream bi-directional recurrent neural network for fine-grained action detection. In: IEEE Conference Computer Vision Pattern Recognition, pp 1961–1970
Kim DY, Yoon Y, Yu J, et al. (2020) Action matching network: open-set action recognition using spatio-temporal representation matching. Vis Comput 36:1457–1471
Koller O, Ney H, Zargaran S (2017) Re-sign: Re-aligned end-to-end sequence modelling with deep recurrent CNN-HMMs. In: IEEE Conference Computer Vision Pattern Recognition, pp 4297–4305
Laptev I, Marszalek M, Rozenfeld B, Schmid C (2008) Learning realistic human actions from movies. In: IEEE Conference Computer Vision Pattern Recognition, pp 1–8
Laptev I, Marszalek M, Schmid C (2009) Actions in context. In: IEEE Conference Computer Vision Pattern Recognition, pp 2929–2936
Lei P, Li J, Todorovic S (2019) Weakly Supervised Energy-Based Learning for Action Segmentation. In: IEEE Conference Computer Vision Pattern Recognition, pp 6243–6251
Lei P, Todorovic S (2018) Temporal deformable residual networks for action segmentation in videos. In: Proceedings of the IEEE conference on computer vision and pattern recognition, pp 6742–6751
Li J, Todorovic S (2020) Set-Constrained Viterbi for Set-Supervised Action Segmentation. In: IEEE/CVF Conference Computer Vision Pattern Recognition, pp 10820–10829
Li J, Todorovic S (2021) Anchor-Constrained Viterbi for Set-Supervised Action Segmentation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition
Mckenna SJ, Stein S (2013) Combining embedded accelerometers with computer vision for recognizing food preparation activities. In: ACM International Joint Conference on Pervasive and Ubiquitous Computing, pp 729–738
Mori G, Russakovsky O, Yeung S et al (2016) End-to-end learning of action detection from frame glimpses in videos. In: Proceedings of the IEEE conference on computer vision and pattern recognition, pp 2678–2687
Perronnin F, Sánchez J, Mensink T (2010) Improving the fisher kernel for large-scale image classification. In: European Conference Computer Vision, pp 143–156
Schmid C, Wang H (2013) Action recognition with improved trajectories. In: IEEE International Conference Computer Vision, pp 3551–3558
Souri Y et al (2021) Fast weakly supervised action segmentation using mutual consistency. IEEE Transactions on Pattern Analysis and Machine Intelligence
Viterbi AJ (2006) A personal history of the Viterbi algorithm. IEEE Signal Process Mag 23(4):120–142
Wang L, Xiong Y, Zhao Y, et al. (2017) Temporal action detection with structured segment networks. In: Proceedings of the IEEE International Conference on Computer Vision, pp 2914–2923
Zhou ZH (2018) A brief introduction to weakly supervised learning. National science review 5(1):44–53
Acknowledgements
The authors would like to thank A.P. Tian Wang of Beihang University and A.P. Fang Wan of the University of Chinese Academy of Sciences for helpful comments and discussions. Longshuai Sheng and Ce Li declare that they have no conflict of interest.
Funding
This study was funded by the National Natural Science Foundation of China (62076016, 61972016, 62176260), the Beijing Nova Program of Science and Technology (Z191100001119106, Z211100002121147), and the Beijing Municipal Natural Science Foundation (4202065), and was supported by the Fundamental Research Funds for the Central Universities (2022YJSJD11, J210409).
About this article
Cite this article
Sheng, L., Li, C. Weakly supervised coarse-to-fine learning for human action segmentation in HCI videos. Multimed Tools Appl 82, 12977–12993 (2023). https://doi.org/10.1007/s11042-022-13792-1
DOI: https://doi.org/10.1007/s11042-022-13792-1