DOI:10.1145/3689061.3689064
research-article

Audio-Visual Self-Supervision for Frame-Level Player-wise Offensive Shot Detection in Table Tennis Matches

Published: 28 October 2024

Abstract

Understanding decision-making processes is informative for strategic planning. To understand human risk-taking behavior in decision-making, we investigate whether shots in table tennis videos can be classified as offensive or not. We define the problem in a multi-task setting: detecting shots with frame-level precision while classifying shot offensiveness and, as an optional task, predicting which player made the shot. We use commercial table tennis videos as the analysis target and propose audio-visual self-supervised training that leverages web videos with similar camera views. Our local contrastive loss encourages the model to learn frame-wise action locality in collaboration with the traditional segment-wise contrastive loss, which we call the global contrastive loss. Experimental results show that combining the two contrastive losses improves prediction performance.
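The abstract does not give the loss definitions, so the following is only a minimal sketch of how a segment-wise (global) and a frame-wise (local) audio-visual contrastive term might be combined. It assumes a generic InfoNCE formulation with cosine similarity, mean pooling over frames for the segment embedding, and time-aligned audio and video frames as local positives. The names `info_nce`, `mean_pool`, `combined_loss`, `local_weight`, and `temperature` are hypothetical and do not come from the paper.

```python
import math


def cosine(u, v):
    """Cosine similarity between two non-zero vectors given as lists."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm


def info_nce(queries, keys, temperature=0.1):
    """Mean InfoNCE loss; queries[i] and keys[i] are the positive pair."""
    total = 0.0
    for i, q in enumerate(queries):
        logits = [cosine(q, k) / temperature for k in keys]
        m = max(logits)  # subtract the max for a stable log-sum-exp
        log_denom = m + math.log(sum(math.exp(x - m) for x in logits))
        total += log_denom - logits[i]
    return total / len(queries)


def mean_pool(frames):
    """Average per-frame embeddings into one segment embedding (assumed pooling)."""
    n = len(frames)
    return [sum(f[d] for f in frames) / n for d in range(len(frames[0]))]


def combined_loss(video, audio, local_weight=1.0, temperature=0.1):
    """video, audio: lists of segments; each segment is a list of frame embeddings.

    Global term: matching video/audio segments are positives (assumption).
    Local term: within a segment, time-aligned frames are positives (assumption).
    """
    global_term = info_nce(
        [mean_pool(seg) for seg in video],
        [mean_pool(seg) for seg in audio],
        temperature,
    )
    local_term = sum(
        info_nce(v, a, temperature) for v, a in zip(video, audio)
    ) / len(video)
    return global_term + local_weight * local_term


# Toy data: 2 segments, 2 frames each, 2-dimensional embeddings.
video = [[[1.0, 0.0], [0.0, 1.0]], [[1.0, 1.0], [1.0, -1.0]]]
audio_aligned = video
audio_swapped = [[seg[1], seg[0]] for seg in video]  # frame order reversed

print(combined_loss(video, audio_aligned))
print(combined_loss(video, audio_swapped))
```

In this toy setup the mean-pooled segment embeddings do not change when frames are reordered, so the global term alone cannot distinguish the aligned audio from the swapped one. Only the local term penalizes the misaligned frames, which illustrates why a frame-wise term could complement a segment-wise one.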

    Published In

    MMSports '24: Proceedings of the 7th ACM International Workshop on Multimedia Content Analysis in Sports
    October 2024
    113 pages
    ISBN:9798400711985
    DOI:10.1145/3689061

    Publisher

    Association for Computing Machinery

    New York, NY, United States

    Author Tags

    1. audio-visual analysis
    2. risk-taking behavior
    3. self-supervision
    4. table tennis

    Qualifiers

    • Research-article

    Funding Sources

    • JSPS Kakenhi
    • JST PRESTO

    Conference

    MM '24
    MM '24: The 32nd ACM International Conference on Multimedia
    October 28 - November 1, 2024
    Melbourne VIC, Australia

    Acceptance Rates

    Overall Acceptance Rate 29 of 49 submissions, 59%
