Weakly supervised coarse-to-fine learning for human action segmentation in HCI videos

Sheng, Longshuai; Li, Ce

doi:10.1007/s11042-022-13792-1

1224: New Frontiers in Multimedia-based and Multimodal HCI
Published: 01 December 2022

Volume 82, pages 12977–12993 (2023)
Multimedia Tools and Applications

Abstract

Human action segmentation for HCI (human-computer interaction) video analysis has been studied extensively, with the goal of recovering the category and start time of each action that occurs in a video. It remains an unsolved problem, however, because large amounts of accurate annotation (begin frame, end frame, and action category) are rarely available in video analysis applications. Transcript-based weakly supervised action segmentation addresses this issue by using only the action annotation for the whole sequence of a long video instead of a label for each frame, which significantly reduces the difficulty of obtaining finely labeled video datasets. The task is still challenging, because the temporal length partition of the actions in a video is complex. In this paper, we use the Viterbi algorithm to generate an initial, coarse action segmentation as the baseline and then design a coarse-to-fine learning framework to refine the length partition. We connect the candidate frames of the initial segmentation points in order to construct a fully connected directed graph, and design a new coarse-to-fine loss function that learns the scores of valid and invalid segmentation paths separately. The framework is trained end to end with this loss to reduce the weight of the scores of invalid segmentation paths and obtain the best video segmentation. Experiments on the Breakfast and 50Salads datasets show that, compared with state-of-the-art methods, our fine partition model and coarse-to-fine loss function achieve higher frame accuracy and significantly reduce the time spent on human action segmentation in HCI videos. The source code will be made publicly available (https://github.com/WeaklyActionSegmentation).
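
The coarse stage described above can be illustrated with a small sketch. This is not the authors' implementation: it assumes per-frame class log-probabilities are already available from some network, ignores any length or duration model, and uses hypothetical names (`viterbi_align`, `frame_accuracy`). It only shows how a Viterbi pass can turn an ordered transcript into a frame-level segmentation, and how frame accuracy is computed against a ground-truth labeling.

```python
import numpy as np


def viterbi_align(log_probs, transcript):
    """Align T frames to an ordered transcript of N actions (illustrative).

    log_probs:  (T, C) array of per-frame log-probabilities over C classes.
    transcript: list of N class indices in the order the actions occur.
    Returns a list of T labels in which every transcript action covers at
    least one frame and the actions appear in transcript order.
    """
    T, N = log_probs.shape[0], len(transcript)
    if N == 0 or T < N:
        raise ValueError("need at least one frame per transcript action")
    emit = log_probs[:, transcript]          # (T, N) score of segment n at frame t
    score = np.full((T, N), -np.inf)         # best path score ending at (t, n)
    back = np.zeros((T, N), dtype=int)       # 0 = stayed in n, 1 = came from n - 1
    score[0, 0] = emit[0, 0]                 # a path must start in the first action
    for t in range(1, T):
        for n in range(min(t + 1, N)):
            stay = score[t - 1, n]
            advance = score[t - 1, n - 1] if n > 0 else -np.inf
            if advance > stay:
                score[t, n] = advance + emit[t, n]
                back[t, n] = 1
            else:
                score[t, n] = stay + emit[t, n]
    # Backtrack from the last frame, which must belong to the last action.
    labels = [0] * T
    n = N - 1
    for t in range(T - 1, -1, -1):
        labels[t] = transcript[n]
        n -= back[t, n]
    return labels


def frame_accuracy(pred, truth):
    """Fraction of frames whose predicted label equals the ground truth."""
    return float(np.mean(np.asarray(pred) == np.asarray(truth)))


# Toy input: 6 frames, 3 classes, transcript "0 then 2 then 1" (made-up data).
probs = np.full((6, 3), 0.1)
for frame, cls in enumerate([0, 0, 2, 2, 2, 1]):
    probs[frame, cls] = 0.8
labels = viterbi_align(np.log(probs), [0, 2, 1])
accuracy = frame_accuracy(labels, [0, 0, 2, 2, 1, 1])
```

In this sketch the segmentation points are simply the frames where the label changes; a refinement stage such as the one the abstract describes would then consider candidate frames near those points rather than accepting them as final.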

Data Availability

The two datasets used in the current study are available from the public web links given in references [20] and [40]. All data generated or analysed during this study are included in this published article, and more details are available from the corresponding author on reasonable request.

Notes

NN-Viterbi denotes the method of [30], in which action segmentation is learned with a GRU neural network and the Viterbi algorithm.

References

AbuFarha Y, Li SJ, Liu Y, et al. (2020) MS-TCN++: Multi-stage temporal convolutional network for action segmentation. IEEE Transactions on Pattern Analysis and Machine Intelligence
Adiono T, Aska Y, Fuada S, et al. (2017) Design of an OFDM System for VLC with a Viterbi Decoder. IEIE Transaction on Smart Processing and Computing (SPC) 6(6):455–465
Agrawal A, Vishwakarma S (2013) A survey on activity recognition and behavior understanding in video surveillance. Visual Computer 29:983–1009
Alayrac JB, Agrawal N, Bojanowski P, Laptev I, Lacoste-Julien S, Sivic J (2016) Unsupervised learning from narrated instruction videos. In: IEEE Conference Computer Vision Pattern Recognition, pp 4575–4583
Amin S, Andriluka M, Rohrbach M, Schiele B (2012) A database for fine grained activity detection of cooking activities. In: IEEE Conference Computer Vision Pattern Recognition, pp 1194–1201
Arora S, Kalsotra R (2021) Background subtraction for moving object detection: explorations of recent developments and challenges. Visual Computer
Arunlal KS, Hariprasad SA (2012) An efficient Viterbi decoder. International Journal of Computer Science, Engineering and Applications 2(1):95
Bach I, Bojanowski P, Lajugie R, Laptev F, Ponce J, Schmid C, Sivic J (2014) Weakly supervised action labeling in videos under ordering constraints. In: Eur. Conf. Comput. Vis., pp 628–643
Bowden R, Koller O, Ney H (2016) Deep hand: How to train a CNN on 1 million hand images when your data is continuous and weakly labelled. In: IEEE Conference Computer Vision Pattern Recognition, pp 3793–3802
Buch S, Escorcia V, Shen C et al (2017) SST: Single-stream temporal action proposals. In: Proceedings of the IEEE conference on Computer Vision and Pattern Recognition, pp 2911–2920
Chang CY, Huang DA, Sui Y, Fei-Fei L, Niebles JC (2019) D3TW: Discriminative differentiable dynamic time warping for weakly supervised action alignment and segmentation. In: IEEE Conference Computer Vision Pattern Recognition, pp 3546–3555
Dieleman S, van den Oord A, Zen H et al (2016) WaveNet: A generative model for raw audio. In: 9th ISCA Speech Synthesis Workshop, p 125
Ding L, Xu C (2018) Weakly-supervised action segmentation with iterative soft boundary assignment. In: IEEE Conference Computer Vision Pattern Recognition, pp 6508–6516
Dollár P, He K, Goyal P, Girshick R, Lin TY (2017) Focal loss for dense object detection. IEEE Transactions on Pattern Analysis and Machine Intelligence
el Yacoubi MA, Granger N (2017) Comparing hybrid NN-HMM and RNN for temporal modeling in gesture recognition. In: International Conference on Neural Information Processing. Springer, Cham, pp 147–156
Farha YA, Gall J (2019) MS-TCN: Multi-stage temporal convolutional network for action segmentation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp 3575–3584
Fayyaz M, Gall J (2020) SCT: Set Constrained Temporal Transformer for Set Supervised Action Segmentation. In: IEEE/CVF Conference Computer Vision Pattern Recognition, pp 501–510
Flynn MD, Hager GD, Lea C, Reiter A, Vidal R (2017) Temporal convolutional networks for action segmentation and detection. In: IEEE Conference Computer Vision Pattern Recognition, pp 156–165
Flynn MD, Lea C, Vidal R, et al. (2017) Temporal convolutional networks for action segmentation and detection. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp 156–165
Gall J, Kuehne H, Richard A (2017) Weakly supervised action learning with RNN based fine-to-coarse modeling. In: IEEE Conference Computer Vision Pattern Recognition, pp 754–763
Gall J, Kuehne H, Richard A (2017) Weakly supervised learning of actions from transcripts. Computer Vision Image Understanding 163:78–89
Gall J, Kuehne H, Richard A (2018) A hybrid RNN-HMM approach for weakly supervised temporal action segmentation. IEEE Trans Pattern Anal Mach Intell 42(4):765–779
Gall J, Li Z, Farha YA (2021) Temporal action segmentation from timestamp supervision. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition
Gall J, Richard A (2016) Temporal action detection using a statistical language model. In: IEEE Conference Computer Vision Pattern Recognition, pp 3551–3558
Gall J, Richard A, Kuehne H (2018) Action sets: Weakly supervised action segmentation without ordering constraints. In: IEEE Conference Computer Vision Pattern Recognition, pp 5987–5996
Gall J, Serre T, Kuehne H (2016) An end-to-end generative framework for video segmentation and recognition. In: IEEE Winter Conference Application Computer Vision, pp 1–8
Gao S, Cheng MM, Zhao K, et al. (2019) Res2Net: A new multi-scale backbone architecture. IEEE Transactions on Pattern Analysis and Machine Intelligence
Gao J, Nevatia R, Yang Z (2017) Cascaded boundary regression for temporal action detection. arXiv:1705.01180
Huang W, Tan M, Zeng R et al (2019) Graph convolutional networks for temporal action localization. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp 7094–7103
Iqbal A, Gall J, Kuehne H, Richard A (2018) NeuralNetwork-Viterbi: A framework for weakly supervised video learning. In: IEEE Conference Computer Vision Pattern Recognition, pp 7386–7395
Jones M, Marks TK, Singh B, Shao M, Tuzel O (2016) A multi-stream bi-directional recurrent neural network for fine-grained action detection. In: IEEE Conference Computer Vision Pattern Recognition, pp 1961–1970
Kim DY, Yoon Y, Yu J, et al. (2020) Action matching network: open-set action recognition using spatio-temporal representation matching. Vis Comput 36:1457–1471
Koller O, Ney H, Zargaran S (2017) Re-sign: Re-aligned end-to-end sequence modelling with deep recurrent CNN-HMMs. In: IEEE Conference Computer Vision Pattern Recognition, pp 4297–4305
Laptev I, Marszalek M, Rozenfeld B, Schmid C (2008) Learning realistic human actions from movies. In: IEEE Conference Computer Vision Pattern Recognition, pp 1–8
Laptev I, Marszalek M, Schmid C (2009) Actions in context. In: IEEE Conference Computer Vision Pattern Recognition, pp 2929–2936
Lei P, Li J, Todorovic S (2019) Weakly Supervised Energy-Based Learning for Action Segmentation. In: IEEE Conference Computer Vision Pattern Recognition, pp 6243–6251
Lei P, Todorovic S (2018) Temporal deformable residual networks for action segmentation in videos. In: Proceedings of the IEEE conference on computer vision and pattern recognition, pp 6742–6751
Li J, Todorovic S (2020) Set-Constrained Viterbi for Set-Supervised Action Segmentation. In: IEEE/CVF Conference Computer Vision Pattern Recognition, pp 10820–10829
Li J, Todorovic S (2021) Anchor-Constrained Viterbi for Set-Supervised Action Segmentation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition
McKenna SJ, Stein S (2013) Combining embedded accelerometers with computer vision for recognizing food preparation activities. In: ACM International Joint Conference on Pervasive and Ubiquitous Computing, pp 729–738
Mori G, Russakovsky O, Yeung S et al (2016) End-to-end learning of action detection from frame glimpses in videos. In: Proceedings of the IEEE conference on computer vision and pattern recognition, pp 2678–2687
Perronnin F, Sánchez J, Mensink T (2010) Improving the fisher kernel for large-scale image classification. In: European Conference Computer Vision, pp 143–156
Schmid C, Wang H (2013) Action recognition with improved trajectories. In: IEEE International Conference Computer Vision, pp 3551–3558
Souri Y et al (2021) Fast weakly supervised action segmentation using mutual consistency. IEEE Transactions on Pattern Analysis and Machine Intelligence
Viterbi AJ (2006) A personal history of the Viterbi algorithm. IEEE Signal Process Mag 23(4):120–142
Wang L, Xiong Y, Zhao Y, et al. (2017) Temporal action detection with structured segment networks. In: Proceedings of the IEEE International Conference on Computer Vision, pp 2914–2923
Zhou ZH (2018) A brief introduction to weakly supervised learning. National Science Review 5(1):44–53

Acknowledgements

The authors would like to thank A.P. Tian Wang of Beihang University and A.P. Fang Wan of the University of Chinese Academy of Sciences for helpful comments and discussions. Longshuai Sheng and Ce Li declare that they have no conflict of interest.

Funding

This study was funded by the National Natural Science Foundation of China (62076016, 61972016, 62176260), the Beijing Nova Program of Science and Technology (Z191100001119106, Z211100002121147), and the Beijing Municipal Natural Science Foundation (4202065), and was supported by the Fundamental Research Funds for the Central Universities (2022YJSJD11, J210409).

Author information

Authors and Affiliations

Computer Science and Technology, China University of Mining & Technology, Beijing, Xueyuan Road, Haidian District, Beijing, 100083, People’s Republic of China
Longshuai Sheng & Ce Li
Ant Group, Z Space No. 556 Xixi Road, Hangzhou, Zhejiang, 310099, People’s Republic of China
Longshuai Sheng

Corresponding author

Correspondence to Ce Li.

Additional information

Publisher’s note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Rights and permissions

Springer Nature or its licensor holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.

About this article

Cite this article

Sheng, L., Li, C. Weakly supervised coarse-to-fine learning for human action segmentation in HCI videos. Multimed Tools Appl 82, 12977–12993 (2023). https://doi.org/10.1007/s11042-022-13792-1

Received: 17 November 2021
Revised: 10 July 2022
Accepted: 05 September 2022
Published: 01 December 2022
Issue Date: April 2023
DOI: https://doi.org/10.1007/s11042-022-13792-1
