Structure analysis of soccer video with domain knowledge and hidden Markov models

https://doi.org/10.1016/j.patrec.2004.01.005

Abstract

In this paper, we present statistical techniques for parsing the structure of produced soccer programs. The problem is important for applications such as personalized video streaming and browsing systems, in which videos are segmented into different states and important states are selected based on user preferences. While prior work focuses on the detection of special events such as goals or corner kicks, this paper is concerned with generic structural elements of the game. We define two mutually exclusive states of the game, play and break, based on the rules of soccer. Automatic detection of such generic states is a new and challenging problem, because the appearance and temporal dynamics of these states vary widely across videos. We select a salient feature set from the compressed domain, dominant color ratio and motion intensity, based on the special syntax and content characteristics of soccer videos. We then model the stochastic structure of each state of the game with a set of hidden Markov models. Finally, higher-level transitions are taken into account, and dynamic programming techniques are used to obtain the maximum likelihood segmentation of the video sequence. The system achieves a promising classification accuracy of 83.5%, with light-weight computation for feature extraction and model inference, as well as satisfactory accuracy in boundary timing.

Introduction

In this paper, we present algorithms for the analysis of video structure using domain knowledge and supervised learning of statistical models. The domain of interest here is soccer video, and the structure we are interested in is the temporal sequence of high-level game states, namely play and break. The goal of this work is to parse the continuous video stream into a sequence of component state labels automatically, i.e., to jointly segment the video sequence into homogeneous chunks and classify each segment as one of the semantic states. Structure parsing is not only useful for automatic content filtering for general TV audiences and soccer professionals in this special domain; it is also related to the important general problem of video structure analysis and content understanding. While most existing work focuses on the detection of domain-specific events, our approach to generic high-level structure analysis is distinctive and has several important advantages: (1) the generic state information can be used to filter and significantly reduce the video data. For example, typically no more than 60% of the video corresponds to play, so we can achieve significant information reduction; (2) videos in different states clearly have different temporal variations, which can be captured by statistical temporal models such as hidden Markov models (HMMs).

Related work in the literature on sports video analysis has addressed soccer and various other sports. For soccer video, prior work has addressed shot classification (Gong et al., 1995), scene reconstruction (Yow et al., 1995), and rule-based semantic classification (Qian and Tovinkere, 2001). For other sports video, supervised learning was used by Zhong and Chang (2001) to recognize canonical views such as baseball pitching and tennis serve. In the area of video genre segmentation and classification, Wang et al. (2000) have developed HMM-based models for classifying videos into news, commercial, sports and weather reports.

In this work, we first exploit domain-specific video syntax to identify salient high-level structures. Such syntactic structures are usually associated with important semantic meanings in specific domains. Taking soccer as a test case, we identify play and break as two recurrent high-level structures, which correspond well to the semantic states of the game. These observations then lead us to choose two simple but effective features in the compressed domain: dominant color ratio and motion intensity. In our prior work (Xu et al., 2001), we showed that this feature set, when combined with rule-based detection techniques, was indeed effective for play/break detection in soccer. In this paper, we use formal statistical techniques to model domain-specific syntactic constraints rather than relying on heuristic rules only. The stochastic structure within a play or a break is modelled with a set of HMMs, and the transitions among these HMMs are addressed with dynamic programming. Average classification accuracy per segment is 83.5%, and most of the play/break boundaries are correctly detected within a 3-second window (Xie et al., 2002). It is encouraging that high-level domain-dependent video structures can be computed with high accuracy using compressed-domain features and generic statistical tools. We believe that the performance can be attributed to the match between the features and the domain syntax, and to the power of the statistical tools in capturing the perceptual variations and temporal dynamics of the video.

In Section 2, we define the high-level structures of play and break in soccer, and present relevant observations on soccer video syntax; in Section 3 we describe algorithms for feature extraction and validation results for this feature set with rule-based detection; in Section 4 we discuss algorithms for training HMMs and using the models to segment new videos into play and break; experiments and results are presented in Section 5; and in Section 6 we draw conclusions and discuss future work.

The syntax and high-level structures in soccer video

In this section, we present a few observations on soccer video that explore the interesting relations between syntactic structures and semantic states of the video.

Computing informative features

Based on observations relating soccer video semantics, video production syntax and low-level perceptual features, we use one special feature, dominant color ratio, along with one generic feature, motion intensity, to capture the characteristics of soccer video content. Moreover, our attention here is on compressed-domain features, since one of the objectives of the system is real-time performance under constrained resources and diverse device settings.
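As a rough illustration of the two features, the sketch below computes them for a single frame. It is not the paper's implementation: the function names, the hue tolerance, and the definitions used here (dominant color ratio as the fraction of pixels whose hue is close to a given dominant hue, motion intensity as the mean motion-vector magnitude) are assumptions made for the example, and the inputs are assumed to have been decoded from the compressed stream already.

```python
import math


def dominant_color_ratio(hue_pixels, dominant_hue, tolerance=10.0):
    """Fraction of pixels whose hue (in degrees) is within `tolerance` of
    `dominant_hue`. The tolerance value is an illustrative assumption."""
    if not hue_pixels:
        return 0.0

    def circular_diff(a, b):
        # Hue is an angle, so 355 and 5 degrees are only 10 degrees apart.
        d = abs(a - b) % 360.0
        return min(d, 360.0 - d)

    hits = sum(1 for h in hue_pixels if circular_diff(h, dominant_hue) <= tolerance)
    return hits / len(hue_pixels)


def motion_intensity(motion_vectors):
    """Mean magnitude of the (dx, dy) motion vectors of one frame
    (assumed definition, used here only for illustration)."""
    if not motion_vectors:
        return 0.0
    return sum(math.hypot(dx, dy) for dx, dy in motion_vectors) / len(motion_vectors)
```

Both functions touch each pixel or motion vector once, which is in keeping with the light-weight computation the system aims for.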

Play-break segmentation with HMMs

In a sense, distinguishing the two inherent states of a soccer game, play (P) and break (B), is analogous to isolated word recognition (Rabiner, 1989). Here each model corresponds to a class (a phoneme in the speech case, P or B in a soccer video); the sub-structures within each model account for transitions and variations within and between phonemes in speech, and for the switching of shots and the variations of motion in a soccer game. This analogy leads to our use of HMMs for soccer video
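The isolated-word analogy can be sketched as maximum-likelihood classification: score a short observation sequence under one HMM per class with the forward algorithm, and pick the class with the highest likelihood. The code below is a minimal sketch under simplifying assumptions that are not taken from the paper: a single HMM per class rather than a set, one scalar feature with Gaussian emissions, and hand-picked toy parameters in place of trained ones.

```python
import math


def logsumexp(values):
    """Numerically stable log(sum(exp(v))) for a list of log-values."""
    m = max(values)
    if m == -math.inf:
        return -math.inf
    return m + math.log(sum(math.exp(v - m) for v in values))


def gaussian_logpdf(x, mean, var):
    """Log-density of a 1-D Gaussian."""
    return -0.5 * (math.log(2.0 * math.pi * var) + (x - mean) ** 2 / var)


def hmm_log_likelihood(obs, log_start, log_trans, means, variances):
    """Forward algorithm in the log domain; returns log p(obs | model)."""
    n = len(log_start)
    alpha = [log_start[i] + gaussian_logpdf(obs[0], means[i], variances[i])
             for i in range(n)]
    for x in obs[1:]:
        alpha = [
            logsumexp([alpha[i] + log_trans[i][j] for i in range(n)])
            + gaussian_logpdf(x, means[j], variances[j])
            for j in range(n)
        ]
    return logsumexp(alpha)


def classify_window(obs, models):
    """Label of the model under which `obs` is most likely."""
    return max(models, key=lambda label: hmm_log_likelihood(obs, *models[label]))


# Toy two-state models (hypothetical parameters, not trained values).
_START = [math.log(0.5), math.log(0.5)]
_TRANS = [[math.log(0.9), math.log(0.1)], [math.log(0.1), math.log(0.9)]]
MODELS = {
    "P": (_START, _TRANS, [0.8, 0.6], [0.01, 0.01]),  # high dominant color ratio
    "B": (_START, _TRANS, [0.2, 0.4], [0.01, 0.01]),  # low dominant color ratio
}
```

With these toy parameters, `classify_window([0.75, 0.8, 0.7], MODELS)` selects `"P"`, because that sequence is far more likely under the play model than under the break model.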

Experiments

The four soccer video clips used in our experiments are briefly described in Table 1. All clips are in MPEG-1 format, SIF size, at 30 or 25 frames per second. The dominant hue values are adaptively learned for each clip (Section 3.1), and the dominant color ratios are computed on I- and P-frames only. The motion intensities are computed on P-frames and interpolated on I-frames. A three-second window sliding by one second is used to convert the continuous feature stream into short
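The windowing step, and the dynamic programming step that takes higher-level transitions into account, can be sketched as follows. This is an illustration under stated assumptions, not the paper's algorithm: `sliding_windows` and `best_label_sequence` are hypothetical names, the per-window class scores are assumed to be log-likelihoods from the class HMMs, and the play/break transition probabilities are toy values.

```python
import math


def sliding_windows(features, frame_rate, window_sec=3, step_sec=1):
    """Cut a per-frame feature stream into overlapping fixed-length windows."""
    win = int(round(window_sec * frame_rate))
    step = int(round(step_sec * frame_rate))
    return [features[s:s + win] for s in range(0, len(features) - win + 1, step)]


def best_label_sequence(window_scores, log_trans, labels=("P", "B")):
    """Viterbi-style dynamic programming over window labels.

    window_scores[t][label] is the log-likelihood of window t under that
    class; log_trans[a][b] is the log-probability of moving from a to b.
    Returns the label sequence with the highest total score.
    """
    best = {l: window_scores[0][l] for l in labels}
    back = []  # back[t][l] = best predecessor of label l at window t + 1
    for scores in window_scores[1:]:
        prev, new_best = {}, {}
        for l in labels:
            p = max(labels, key=lambda k: best[k] + log_trans[k][l])
            prev[l] = p
            new_best[l] = best[p] + log_trans[p][l] + scores[l]
        back.append(prev)
        best = new_best
    path = [max(labels, key=lambda l: best[l])]
    for prev in reversed(back):
        path.append(prev[path[-1]])
    return path[::-1]


# Toy transition model (assumed values): states tend to persist.
LOG_TRANS = {
    "P": {"P": math.log(0.9), "B": math.log(0.1)},
    "B": {"P": math.log(0.1), "B": math.log(0.9)},
}
```

The benefit of the transition term is smoothing: a window whose scores only weakly favor break, surrounded by windows that clearly favor play, is labeled play rather than producing a spurious one-window break.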

Conclusion

In this paper, we presented new algorithms for soccer video segmentation and classification. First, play and break are defined as the basic semantic elements of a soccer video; second, observations on soccer video syntax are described and a feature set is chosen based on these observations; and then, classification and segmentation are performed with HMMs followed by dynamic programming. The results are evaluated in terms of classification accuracy and segmentation accuracy; extensive statistical

References (13)

  • FIFA, 2002. Laws of the game. Federation Internationale de Football Association,...
  • Gong, Y., Lim, T., Chua, H., May 1995. Automatic parsing of TV soccer programs. In: IEEE International Conference on...
  • Qian, R.J., Tovinkere, V., August 2001. Detecting semantic events in soccer games: Towards a complete solution. In:...
  • Rabiner, L.R., February 1989. A tutorial on hidden Markov models and selected applications in speech recognition. Proceedings of the IEEE.
  • Ramesh, P., Wilpon, J., 1992. Modeling state durations in hidden Markov models for automatic speech recognition. In:...
  • Shook, F. (Ed.), 1995. Television field production and reporting, 2nd Edition. Longman Publisher USA, Sports...
There are more references available in the full text version of this article.
