DOI: 10.1145/3581783.3613795

Localization-assisted Uncertainty Score Disentanglement Network for Action Quality Assessment

Published: 27 October 2023

Abstract

Action Quality Assessment (AQA) has wide applications in various scenarios. For the AQA of long-term figure skating, the main challenges lie in learning semantic context features for Program Component Score (PCS) prediction and performing fine-grained technical subaction analysis for Technical Element Score (TES) prediction. In this paper, we propose a Localization-assisted Uncertainty Score Disentanglement Network (LUSD-Net) to handle both PCS and TES prediction. In LUSD-Net, we design an uncertainty score disentanglement solution, comprising score disentanglement and uncertainty regression, to decouple PCS-oriented and TES-oriented representations from skating sequences, ensuring that differentiated representations are learned for the two types of score prediction. For long-term feature learning, a temporal interaction encoder is presented to model temporal context relations between PCS-oriented and TES-oriented features. To address subactions in TES prediction, weakly-supervised temporal subaction localization is adopted to locate technical subactions in long sequences. For evaluation, we collect a large-scale Fine-grained Figure Skating dataset (FineFS) containing RGB videos and estimated skeleton sequences, with rich annotations for multiple downstream action analysis tasks. Extensive experiments show that the proposed LUSD-Net significantly improves AQA performance and that the FineFS dataset provides a substantial data source for AQA. The source code of LUSD-Net and the FineFS dataset are released at https://github.com/yanliji/FineFS-dataset.
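The "uncertainty regression" component is not detailed on this page. As a rough illustration only, the heteroscedastic regression loss of Kendall and Gal (NeurIPS 2017) — a standard choice that uncertainty-aware score regression plausibly builds on — can be sketched as below. Function names, the triple layout, and the exact loss form are assumptions for exposition, not the authors' implementation:

```python
import math

def uncertainty_nll(pred: float, target: float, log_var: float) -> float:
    # Heteroscedastic regression loss (Kendall & Gal, 2017):
    #   0.5 * exp(-s) * (y - mu)^2 + 0.5 * s,  with s = log(sigma^2).
    # Large predicted variance down-weights the squared error but pays a penalty.
    return 0.5 * math.exp(-log_var) * (target - pred) ** 2 + 0.5 * log_var

def disentangled_score_loss(pcs, tes):
    # pcs and tes are hypothetical (prediction, log_variance, ground_truth)
    # triples, one per disentangled branch; the total loss is their sum,
    # so each branch learns its own score with its own uncertainty.
    return sum(uncertainty_nll(p, y, s) for (p, s, y) in (pcs, tes))

# Example: a confident PCS head that is 1.5 points off, a perfect TES head.
total = disentangled_score_loss((68.5, 0.0, 70.0), (80.0, 0.0, 80.0))  # → 1.125
```

With `log_var = 0` the loss reduces to half the squared error, so the example evaluates to `0.5 * 1.5**2 = 1.125`; letting the network raise `log_var` on hard sequences is what makes the regression uncertainty-aware.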

Supplemental Material

MP4 File
Presentation video of "Localization-assisted Uncertainty Score Disentanglement Network for Action Quality Assessment", covering motivation, algorithm, dataset, experiments, and conclusion.




    Published In

    MM '23: Proceedings of the 31st ACM International Conference on Multimedia
    October 2023
    9913 pages
    ISBN:9798400701085
    DOI:10.1145/3581783
    Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than the author(s) must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected].


    Publisher

    Association for Computing Machinery

    New York, NY, United States



    Author Tags

    1. action quality assessment
    2. figure skating
    3. FineFS dataset

    Qualifiers

    • Research-article

    Funding Sources

    • the Science and Technology Innovation Committee of Shenzhen Municipality Foundation

    Conference

    MM '23: The 31st ACM International Conference on Multimedia
    October 29 - November 3, 2023
    Ottawa, ON, Canada

    Acceptance Rates

    Overall Acceptance Rate 2,145 of 8,556 submissions, 25%


    Article Metrics

    • Downloads (Last 12 months)190
    • Downloads (Last 6 weeks)18
    Reflects downloads up to 05 Mar 2025


    Cited By

    • (2025) Visual-Semantic Alignment Temporal Parsing for Action Quality Assessment. IEEE Transactions on Circuits and Systems for Video Technology 35(3): 2436-2449. Online: Mar 2025. DOI: 10.1109/TCSVT.2024.3487242
    • (2025) Vision-based human action quality assessment: A systematic review. Expert Systems with Applications 263: 125642. Online: Mar 2025. DOI: 10.1016/j.eswa.2024.125642
    • (2024) Self-Supervised Sub-Action Parsing Network for Semi-Supervised Action Quality Assessment. IEEE Transactions on Image Processing 33: 6057-6070. Online: 2024. DOI: 10.1109/TIP.2024.3468870
    • (2024) FineSports: A Multi-Person Hierarchical Sports Video Dataset for Fine-Grained Action Understanding. 2024 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR): 21773-21782. Online: 16 Jun 2024. DOI: 10.1109/CVPR52733.2024.02057
    • (2024) Vision-Language Action Knowledge Learning for Semantic-Aware Action Quality Assessment. Computer Vision – ECCV 2024: 423-440. Online: 2 Oct 2024. DOI: 10.1007/978-3-031-72946-1_24