Elsevier

Neurocomputing

Volume 413, 6 November 2020, Pages 360-367

FSD-10: A fine-grained classification dataset for figure skating

https://doi.org/10.1016/j.neucom.2020.06.108

Abstract

Action recognition is an important and challenging problem in video analysis. Although the past decade has witnessed progress in action recognition with the development of deep learning, such progress has been slow in competitive sports content analysis. To promote research on action recognition from competitive sports video clips, we introduce a Figure Skating Dataset (FSD-10) for fine-grained sports content analysis. To this end, we collect 1484 clips from the worldwide figure skating championships in 2017–2018, which consist of 10 different actions in men/ladies programs. Each clip is recorded at 30 frames per second with a resolution of 1080 × 720 and is annotated by experts. To build a baseline for action recognition in figure skating, we evaluate state-of-the-art action recognition methods on FSD-10. Motivated by the idea that domain knowledge is of great importance in the sports field, we propose a keyframe based temporal segment network (KTSN) for classification and achieve remarkable performance. Experimental results demonstrate that FSD-10 is an ideal dataset for benchmarking action recognition algorithms, as it requires models to accurately extract action motions rather than action poses. We hope FSD-10, which is designed to have a large collection of fine-grained actions, can serve as a new challenge for developing more robust and advanced action recognition models.

Introduction

Due to the popularity of media-sharing platforms, sports content analysis (SCA [23]) has become an important research topic in computer vision [29], [8], [22]. A vast number of sports videos have accumulated in computer storage, and they are potential resources for deep learning. In recent years, many enterprises (e.g. Bloomberg, SAP) have focused on SCA [23]. In SCA, datasets are required to reflect the characteristics of competitive sports, which is a prerequisite for training deep learning models. Generally, competitive sports content is a series of diversified, highly professional and ultimate actions. Unfortunately, existing popular human motion datasets (e.g. HMDB51 [14], UCF50 [25]) and action datasets of human sports (e.g. MIT Olympic sports [20], Nevada Olympic sports [18]) are not quite representative of the richness and complexity of competitive sports. In these datasets, the discrimination of an action largely depends on scene, person and object elements [10], which limits the progress of action recognition research. This dependence implies that most human action datasets concern form (content) rather than motion, while both motion and form are important in human action analysis.

To address the above issues, this paper proposes a figure skating dataset called FSD-10. FSD-10 consists of 1484 figure skating videos with 10 different actions manually labeled. These skating videos are segmented from around 80 h of competitions of worldwide figure skating championships in 2017–2018. FSD-10 videos range from 3 s to 30 s, and the camera moves to follow the skater so that the person appears in every frame throughout the action. Compared with existing datasets, our proposed dataset has several appealing properties. First, the actions in FSD-10 originate from figure skating competitions, which are consistent in type and sports environment (including skating rink and auditorium). Second, actions in FSD-10 are complex in content and fast in action switching. For instance, the complex 2-loop-Axel jump is finished in only about 2 s in Fig. 2. It is worth noting that the jump type heavily depends on the take-off process, which is a moment that is hard to capture. These two aspects make it difficult for machine learning models to infer the action type from a single pose or from the background.

Along with the introduction of FSD-10, we propose a key frame indicator called human pose scatter (HPS). Based on HPS, we adopt key frame sampling to improve current video classification methods and evaluate these methods on FSD-10. Furthermore, experimental results validate that key frame sampling is an important approach to improving the performance of frame-based models on FSD-10, which is consistent with the way humans perceive figure skating. The main contributions of this paper can be summarised as follows.

  • To the best of our knowledge, FSD-10 is the first fine-grained, fully motion-based dataset without multi-scene and object elements.

  • To set a baseline for future work, we also benchmark state-of-the-art sports classification methods on FSD-10. In addition, key frame sampling is proposed to capture the pivotal action details in competitive sports, and it achieves better performance than state-of-the-art methods on FSD-10.

In addition, compared to current datasets, we hope FSD-10 will be a challenging benchmark dataset for background-independent action recognition, which makes an excellent contribution to a specialized workshop in sports. The aim of our research is to explore human motion rather than form in video analysis. Motion is an important topic in action research and can be applied in many fields (such as sports content analysis, physical rehabilitation, human environmental behavior, physical emotion analysis in cognitive psychology, and video synthesis). In this regard, motion-related datasets and methods are urgently needed. The new dataset therefore provides a broad scope and challenges researchers with general core problems of computer vision.

Related works

A professional sports dataset (PSD) is a collection of competitive sports actions. Compared with common action datasets, for example UCF101 [25] and HMDB [14], a PSD consists of highly specialized actions instead of actions in daily life. MIT Olympic sports [20] and Nevada Olympic sports [19] are examples of PSDs, which are derived from Olympic competitions (see Table 1).

In PSD, classification is important to attract people’s attention and to highlight athlete’s performance, and even to assist referees

Figure Skating Dataset

In this section, we describe details regarding the setup and protocol followed to capture our dataset. Then, we discuss the temporal segmentation and assessment tasks of FSD-10 and its future extensions.

Keyframe based temporal segment network (KTSN)

In this section, we give a detailed description of our keyframe based temporal segment network. Specifically, we first discuss the motivation of key frame sampling in Section 4.1. Then, the key frame sampling method is presented in Sections 4.2 (Human Pose Scatter (HPS)) and 4.3 (Key frame sampling). Finally, the network structure of KTSN is described in detail in Section 4.4.
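
As a rough illustration of what score-guided key frame sampling over temporal segments could look like, the sketch below may help. It is not the authors' method: this snippet does not define HPS, so the `pose_scatter` function uses a stand-in score (the mean squared distance of 2D keypoints from their centroid), and the sampling rule (one frame per equal-length segment, drawn with probability proportional to its score) is likewise an assumption. All function and parameter names are hypothetical.

```python
import random
from typing import List, Sequence, Tuple

Keypoint = Tuple[float, float]


def pose_scatter(keypoints: Sequence[Keypoint]) -> float:
    """Stand-in per-frame score (ASSUMPTION, not the paper's HPS definition):
    mean squared distance of the 2D keypoints from their centroid."""
    n = len(keypoints)
    if n == 0:
        return 0.0
    cx = sum(x for x, _ in keypoints) / n
    cy = sum(y for _, y in keypoints) / n
    return sum((x - cx) ** 2 + (y - cy) ** 2 for x, y in keypoints) / n


def sample_key_frames(
    scores: Sequence[float], num_segments: int, rng: random.Random
) -> List[int]:
    """Split the clip into equal contiguous segments and draw one frame index
    per segment, with probability proportional to the (non-negative) score."""
    n = len(scores)
    if num_segments <= 0 or n < num_segments:
        raise ValueError("need at least one frame per segment")
    picks = []
    for s in range(num_segments):
        start = s * n // num_segments
        end = (s + 1) * n // num_segments
        weights = list(scores[start:end])
        if sum(weights) <= 0:
            # No informative score in this segment: fall back to uniform sampling.
            picks.append(rng.randrange(start, end))
        else:
            picks.append(rng.choices(range(start, end), weights=weights, k=1)[0])
    return picks


if __name__ == "__main__":
    # Toy clip: 6 frames of 3 keypoints each (made-up coordinates).
    frames = [[(0.0, 0.0), (1.0, 0.0), (0.0, float(i))] for i in range(6)]
    frame_scores = [pose_scatter(f) for f in frames]
    print(sample_key_frames(frame_scores, num_segments=3, rng=random.Random(0)))
```

Keeping one sample per segment preserves coverage of the whole clip, while the weighting biases each draw toward frames with a higher score; whether this matches the sampling described in Section 4.3 cannot be confirmed from the snippet.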

Experiments

To provide a benchmark for our FSD-10 dataset, we evaluate various approaches under three different modalities: RGB, optical flow and anatomical keypoints (skeleton). We also conduct cross-dataset validation experiments. The following describes the details of our experiments and results.

Conclusion

In this paper, we build an action dataset for competitive sports analysis, which is characterised by high action switching speed and complex action content. We find that motion is more valuable than form (content and background) in this task. Therefore, compared with other related datasets, our dataset focuses on the action itself rather than the background. Our dataset enables many interesting tasks, such as fine-grained action classification, action quality assessment and action temporal segmentation.

Declaration of Competing Interest

The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.

Acknowledgments

This work is supported in part by the National Key Research and Development Program of China (2017YFB1300200, 2017YFB1300203) and the Fundamental Research Funds for the Central Universities – No. DUT20RC(5)010.

Shenglan Liu received the Ph.D. degree from the School of Computer Science and Technology, Dalian University of Technology, Dalian, Liaoning, China, in 2015. Currently, he is an associate professor with the School of Innovation and Entrepreneurship, Dalian University of Technology, Dalian, Liaoning, China. His research interests include manifold learning and human perception computing. Dr. Liu is currently an editorial board member of Neurocomputing.

References (34)

  • B.K. Horn et al., Determining optical flow, Artificial Intelligence (1981)
  • S.C.B. Lo et al., Artificial convolution neural network for medical image pattern recognition, Neural Networks (1995)
  • S. Wold et al., Principal component analysis, Chemometrics and Intelligent Laboratory Systems (1987)
  • S. Abu-El-Haija, N. Kothari, J. Lee, P. Natsev, G. Toderici, B. Varadarajan, S. Vijayanarasimhan, Youtube-8m: A...
  • A.C. Bovik, Handbook of Image and Video Processing (2010)
  • Z. Cao, G. Hidalgo, T. Simon, S.E. Wei, Y. Sheikh, OpenPose: realtime multi-person 2D pose estimation using Part...
  • Z. Cao, T. Simon, S.E. Wei, Y. Sheikh, Realtime multi-person 2d pose estimation using part affinity fields, in: CVPR,...
  • J. Carreira et al., Quo vadis, action recognition? A new model and the kinetics dataset
  • H. El-Ghaish et al., Human action recognition based on integrating body pose, part shape, and motion, IEEE Access (2018)
  • S. Giancola et al., Soccernet: A scalable dataset for action spotting in soccer videos
  • J. Gudmundsson et al., Spatio-temporal analysis of team sports, ACM Computing Surveys (CSUR) (2017)
  • K. He et al., Deep residual learning for image recognition, Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (2016)
  • Y. He et al., Human action recognition without human
  • G. Huang et al., Densely connected convolutional networks
  • A. Karpathy et al., Large-scale video classification with convolutional neural networks
  • H. Kuehne et al., Hmdb: a large video database for human motion recognition
  • M. Marszałek, I. Laptev, C. Schmid, Actions in context, in: CVPR 2009-IEEE Conference on Computer Vision & Pattern...

    Xiang Liu received the B.E. degree from the Dalian University of Technology, China, in 2017. He is currently working toward the M.E. degree in the School of Computer Science and Technology, Dalian University of Technology, China. His research interests include visualization, crowd counting and machine learning.

    Gao Huang received the Ph.D. degree from Tsinghua University, China, in 2015. Currently, he is an Assistant Professor in the Department of Automation, Tsinghua University. His research interests include machine learning and computer vision, in particular deep learning, resource-efficient learning and unsupervised learning. His work on DenseNet won the Best Paper Award of CVPR (2017).

    Hong Qiao received the B.E. degree in hydraulics and control and the M.E. degree in robotics and automation from Xi’an Jiaotong University, Xi’an, China, and the Ph.D. degree in robotics control from De Montfort University, Leicester, U.K., in 1995. She was an Assistant Professor with the City University of Hong Kong, Hong Kong, and a Lecturer with the University of Manchester, Manchester, U.K., from 1997 to 2004. She is currently a Professor with the State Key Laboratory of Management and Control for Complex Systems, Institute of Automation, Chinese Academy of Sciences, Beijing, China. Her current research interests include robotics, machine learning and pattern recognition.

    Lianyu Hu received the B.S. degree in Electronics and Information Engineering from Dalian University of Technology, China, in 2018. Currently, he is an M.S. degree candidate in the Faculty of Electronic Information and Electrical Engineering, Dalian University of Technology. His research interests include action recognition, graph convolution networks and skeleton-based video classification.

    Dong Jiang received the B.S. degree in the School of Mechanical Engineering, Dalian University of Technology, China, in 2018. Currently, he is working toward the M.S. degree in the School of Computer Science and Technology, Dalian University of Technology, China. His research interests include graph convolution and video classification.

    Aibin Zhang received the B.S. degree from the School of Microelectronics, Dalian University of Technology, China, in 2020. He is about to study for an M.S. degree in the School of Computer Science and Technology, Dalian University of Technology, China. His future research directions mainly include deep learning and computer vision.

    Yang Liu received his B.S. degree and Ph.D. degree in the School of Computer Science and Technology from Dalian University of Technology, China, in 2013 and 2019 respectively. He is currently a lecturer in Dalian University of Technology, China. His research interests include video analysis, image retrieval and machine learning.

    Ge Guo is currently an undergraduate student majoring in CST at Dalian University of Technology, Dalian, China. Her research focuses on data visualization and human–computer interaction.
