Audio-Visual Self-Supervision for Frame-Level Player-wise Offensive Shot Detection in Table Tennis Matches

Authors:

Atsushi Hashimoto,

Hidehito Honda,

Kazutoshi Tanaka

MMSports '24: Proceedings of the 7th ACM International Workshop on Multimedia Content Analysis in Sports

Pages 27 - 33

https://doi.org/10.1145/3689061.3689064

Published: 28 October 2024

Abstract

Understanding decision-making processes is informative for strategic planning. To understand human risk-taking behavior in decision-making, we investigate whether shots in table tennis videos can be classified as offensive or not. We define the problem in a multi-task setting: detecting shots with frame-level precision while classifying shot offensiveness and, as an optional task, predicting which player made the shot. We analyze commercial table tennis videos and propose audio-visual self-supervised training that leverages web videos with similar camera views. Our local contrastive loss encourages the model to learn frame-wise action locality and works together with the traditional segment-wise contrastive loss, which we call the global contrastive loss. Experimental results show that combining the two contrastive losses improves prediction performance.
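
The abstract does not give the loss formulas, so the following is only a minimal sketch of how a segment-wise (global) and a frame-wise (local) audio-visual contrastive loss might be combined. It assumes a generic InfoNCE form with cosine similarity, mean pooling over frames for the segment embedding, and other frames of the same clip as negatives for the local term. All function names, the temperature, and `local_weight` are hypothetical and are not taken from the paper.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm > 0 else 0.0


def info_nce(anchors, candidates, temperature=0.1):
    """Mean InfoNCE loss; candidates[i] is the positive for anchors[i]."""
    total = 0.0
    for i, anchor in enumerate(anchors):
        logits = [cosine(anchor, c) / temperature for c in candidates]
        peak = max(logits)  # subtract the max for a stable log-sum-exp
        log_denominator = peak + math.log(sum(math.exp(x - peak) for x in logits))
        total += log_denominator - logits[i]
    return total / len(anchors)


def mean_pool(frames):
    """Average per-frame embeddings into one segment embedding (assumption)."""
    count = len(frames)
    return [sum(f[d] for f in frames) / count for d in range(len(frames[0]))]


def global_contrastive_loss(video_clips, audio_clips, temperature=0.1):
    """Segment-wise loss: a clip's video matches its own audio, not other clips'."""
    video_segments = [mean_pool(clip) for clip in video_clips]
    audio_segments = [mean_pool(clip) for clip in audio_clips]
    return info_nce(video_segments, audio_segments, temperature)


def local_contrastive_loss(video_clips, audio_clips, temperature=0.1):
    """Frame-wise loss: video frame t matches audio frame t within the same clip."""
    per_clip = [
        info_nce(video, audio, temperature)
        for video, audio in zip(video_clips, audio_clips)
    ]
    return sum(per_clip) / len(per_clip)


def combined_loss(video_clips, audio_clips, local_weight=1.0, temperature=0.1):
    """Weighted sum of the two terms; the weighting scheme is an assumption."""
    return (
        global_contrastive_loss(video_clips, audio_clips, temperature)
        + local_weight * local_contrastive_loss(video_clips, audio_clips, temperature)
    )
```

Each clip here is a list of per-frame embedding vectors, with video and audio frames aligned by index. Because mean pooling ignores frame order, the global term alone cannot tell a correctly synchronized clip from one whose audio frames are shuffled, whereas the local term penalizes the shuffle. This is one plausible reading of why the two losses would complement each other.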

Index Terms

1. Computing methodologies
  1. Machine learning
    1. Learning settings
      1. Semi-supervised learning settings

Published In

MMSports '24: Proceedings of the 7th ACM International Workshop on Multimedia Content Analysis in Sports

October 2024

113 pages

ISBN:9798400711985

DOI:10.1145/3689061

Program Chairs:
Rainer Lienhart (University of Augsburg),
Thomas B. Moeslund (Aalborg University),
Hideo Saito (Keio University)

Copyright © 2024 ACM.

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than the author(s) must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected].

Sponsors

SIGMM: ACM Special Interest Group on Multimedia

Publisher

Association for Computing Machinery

New York, NY, United States

Publication History

Published: 28 October 2024

Qualifiers

Research-article

Funding Sources

JSPS Kakenhi
JST PRESTO

Conference

MM '24

Sponsor:

SIGMM

MM '24: The 32nd ACM International Conference on Multimedia

October 28 - November 1, 2024

Melbourne VIC, Australia

Acceptance Rates

Overall Acceptance Rate 29 of 49 submissions, 59%
