Audiovisual integration with Segment Models for tennis video parsing

https://doi.org/10.1016/j.cviu.2007.09.002

Abstract

Automatic video content analysis is an emerging research subject with numerous applications to large video databases and personal video recording systems. The aim of this study is to fuse multimodal information in order to automatically parse the underlying structure of tennis broadcasts. The frame-based observation distributions of Hidden Markov Models are too restrictive for modeling heterogeneous audiovisual data. We propose instead the use of segmental features, within the framework of Segment Models, to overcome this limitation and extend the synchronization points to the segment boundaries. Considering each segment as a video scene, auditory and visual features collected inside the scene boundaries can thus be sampled and modeled with their native sampling rates and models. Experimental results on a corpus of 15 h of tennis video demonstrate that Segment Models with synchronous audiovisual fusion outperform Hidden Markov Models. Results with asynchronous fusion, however, are less encouraging.

Introduction

Automatic annotation of video documents is a powerful tool for managing large video databases and, more recently, for the development of sophisticated consumer products that meet high-level user needs such as highlight extraction. One can accomplish this task using explicit hand-crafted, and thus domain-dependent, models, which can perform reasonably well in some cases [1]. However, more effective ways are needed to bridge the gap between high-level user needs and the low-level video features at hand, such as image histograms or speaker excitation. A key question towards this end is an efficient video content representation scheme [2]. The Hidden Markov Model (HMM) [3] is a powerful statistical approach for modeling video content and can be used as a statistical parser of a video sequence [4], borrowing notions from the field of speech recognition.
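
To make the parser analogy concrete, here is a minimal sketch of Viterbi decoding of scene labels from shot-level observation scores. The two-state setup in the usage example is a hypothetical illustration, not the model of this paper.

```python
import numpy as np

def viterbi(log_pi, log_A, log_B):
    """Most likely scene-label sequence given shot-level observations.
    log_pi: (S,) initial state log-probabilities; log_A: (S, S)
    transition log-probabilities; log_B: (T, S) log-likelihood of
    each of the T observations under each of the S states."""
    T, S = log_B.shape
    delta = log_pi + log_B[0]          # best score ending in each state
    psi = np.zeros((T, S), dtype=int)  # backpointers
    for t in range(1, T):
        scores = delta[:, None] + log_A      # scores[i, j]: state i -> j
        psi[t] = scores.argmax(axis=0)
        delta = scores.max(axis=0) + log_B[t]
    path = [int(delta.argmax())]
    for t in range(T - 1, 0, -1):
        path.append(int(psi[t, path[-1]]))
    return path[::-1]                  # one state index per shot

# Hypothetical two-state example ("rally" vs. "break") over three shots.
log_pi = np.log([0.6, 0.4])
log_A = np.log([[0.7, 0.3], [0.4, 0.6]])
log_B = np.log([[0.9, 0.1], [0.2, 0.8], [0.8, 0.2]])
print(viterbi(log_pi, log_A, log_B))   # most likely label per shot
```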

We use Markovian models for the structure analysis of tennis broadcasts. In this type of video, both game rules and production rules result in a structured document. In previous work, we used flat or hierarchical HMMs [5] to parse this structure and to segment raw video data into human-meaningful scenes. The table of contents of the video can then be constructed automatically.

Audiovisual integration with HMMs is generally addressed with a concatenative fusion scheme that assumes homogeneous and synchronous features. However, the visual and auditory modalities are sampled at different rates. In addition, the visual content follows the production rules, while the auditory content captures raw sounds from the court, interleaved with commentary speech. There is thus, firstly, a certain degree of asynchrony between the auditory and visual features and, secondly, the two streams follow different temporal models.
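
As an illustration of what concatenative fusion implies, the sketch below stacks the two streams after forcing the audio features down to the video rate by nearest-neighbour resampling; the feature shapes and the resampling choice are assumptions made for the example, not the paper's exact procedure.

```python
import numpy as np

def concatenative_fusion(video_feats, audio_feats):
    """Stack the two modalities into one synchronous feature stream.
    video_feats: (Tv, Dv) array at the video rate; audio_feats:
    (Ta, Da) array at the (typically higher) audio rate. The audio
    stream is downsampled to the video rate by nearest-neighbour
    indexing, which enforces the frame synchrony discussed above."""
    Tv = len(video_feats)
    idx = np.linspace(0, len(audio_feats) - 1, Tv).round().astype(int)
    return np.hstack([video_feats, audio_feats[idx]])  # (Tv, Dv + Da)

# Illustrative shapes: 100 video frames vs. 4000 audio frames.
fused = concatenative_fusion(np.zeros((100, 4)), np.zeros((4000, 13)))
print(fused.shape)  # (100, 17)
```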

In this study, we introduce video indexing with Segment Models (SMs) [6] as a means of more efficient and versatile multimodal fusion, and provide an experimental comparison with HMM-based fusion. With SMs, the synchrony constraints between the modalities can be relaxed to the scene boundaries, enabling each modality to be processed with its native sampling rate and model. The aim of this work is to explore the many possibilities of audiovisual integration that SMs offer rather than to design an integrated, general-purpose tennis video parser. To this end, we employ a small set of audiovisual features, sufficient for our needs, and restrict ourselves to the production styles of French television.
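
To contrast this with the HMM case, the following sketch scores a single candidate scene segment the SM way: each modality is evaluated by its own model at its native rate, and the two streams must agree only on the segment boundaries. The callable-model interface and the additive combination of log-scores are illustrative assumptions, not the paper's exact formulation.

```python
def segment_log_score(label, video_obs, audio_obs,
                      video_models, audio_models, duration_models):
    """Log-score of one candidate scene segment under a Segment Model.
    Each *_models[label] is assumed to be a callable returning a
    log-likelihood for its own stream (a hypothetical interface)."""
    return (video_models[label](video_obs)             # visual stream
            + audio_models[label](audio_obs)           # auditory stream
            + duration_models[label](len(video_obs)))  # segment length
```

A Viterbi-style search over candidate segment boundaries would then select the segmentation and labeling that maximize the sum of such scores.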

This paper is organized as follows. Related work on HMM-based multimodal fusion is reviewed in Section 2. Basic definitions of the problem at hand are provided in Section 3. Feature extraction is discussed in Section 4. In Section 5 we describe how the visual content is modeled by HMMs and SMs, stressing the conceptual differences between these two modeling alternatives. Audiovisual integration is then discussed in Section 6. Parameter estimation details are provided in Section 7 and experimental results in Section 8. Finally, Section 9 concludes this study.

Section snippets

Previous work on multimodal fusion

For a detailed review of the various aspects of the multimodal video indexing problem, the interested reader is referred to the literature reviews [1], [7], [2], [8]. In this section we focus on HMMs, which are widely used to exploit the temporal aspect of video data. Indeed, depending on the video genre and the production rules, video events occur in a temporal order that ultimately reveals the semantics. HMMs provide a powerful statistical framework for handling sequential data and

Problem definition

Before proceeding to the discussion of the models and audiovisual integration, we give basic definitions of the problem at hand in this section; the audiovisual features are presented in Section 4.

There are a number of invariant characteristics that occur in every tennis video as a result of the game rules and the producer's work before broadcast. When game action occurs, for instance, the camera is switched to a court view. It is extremely rare, although still

Feature extraction

We used a corpus of 6 complete tennis videos, recorded between 1999 and 2001 and kindly provided by INA. Details of the videos are given in Table 1. Every video contains a single tennis match, i.e. there are no court views split in order to display two or more tennis matches. The parts of the broadcasts before and after the actual tennis match were manually removed from the videos. Nevertheless, the programs still contain

The video models

Based on the features defined in the previous section, we now provide details on how HMMs and SMs are applied to the problem of video structure parsing. We first consider a unimodal (video-only) scenario to conceptually point out differences between these two models. The integration of audio is discussed in the following section. The problem of parameter estimation is discussed later in the paper.

In both the HMM and SM frameworks, a video is modeled as an ergodic structure of scenes. Note that

Audiovisual integration

The sound track of the video is an important source of information that should be taken into account in our modeling. For instance, states 1, 3, and 5 of Fig. 2 all correspond to court views and are visually very similar to states 9 and 12, the idle court views. The detection of ball hits in the sound track can greatly help disambiguate between idle and action court views. In this section, we discuss audiovisual integration in HMMs and SMs.
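
As a toy illustration of this disambiguation, assume log-likelihoods of each hypothesis are available from the visual and audio models (a hypothetical interface, not the paper's exact formulation):

```python
def classify_court_view(vis_logp_action, vis_logp_idle,
                        hit_logp_action, hit_logp_idle):
    """Toy disambiguation of visually similar court views: the visual
    log-likelihood of each hypothesis is combined additively with the
    log-likelihood of the ball-hit pattern detected in the sound track."""
    if vis_logp_action + hit_logp_action > vis_logp_idle + hit_logp_idle:
        return "action court view"
    return "idle court view"
```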

Parameter estimation

We manually annotated the video sequences on top of the automatic video segmentation with the state labels defined in Fig. 2. The training sequences contain 807 scenes in total and the test sequences 979.

In order to yield discrete observation distributions, the visual similarity and length features were each uniformly quantized into 10 bins. The number of bins was determined experimentally. Estimating the HMM parameters, i.e. transition and observation probabilities, as relative frequencies of
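
A minimal sketch of the uniform 10-bin quantization and relative-frequency estimation described here; the smoothing constant is an illustrative choice of ours, not the paper's.

```python
import numpy as np

def uniform_quantize(x, n_bins=10):
    """Map continuous feature values to bin indices 0..n_bins-1 over
    their observed range (10 bins, as in the text)."""
    edges = np.linspace(x.min(), x.max(), n_bins + 1)[1:-1]  # interior edges
    return np.digitize(x, edges)

def relative_frequency(counts, smoothing=1.0):
    """Observation probabilities as relative frequencies of bin counts;
    the add-one smoothing is an illustrative assumption."""
    counts = np.asarray(counts, dtype=float) + smoothing
    return counts / counts.sum(axis=-1, keepdims=True)

# Example: quantize a similarity feature, then estimate P(bin | state).
sim = np.random.rand(500)
bins = uniform_quantize(sim)
probs = relative_frequency(np.bincount(bins, minlength=10))
```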

Experimental results

Firstly, let us recall that half of the videos were reserved for testing purposes. In addition, as the ground truth of the videos was collected on top of the automatic video track segmentation, errors of the hard-cut and dissolve detection algorithm are not taken into account in this analysis.

Performance measurements include firstly the percentage C of shots classified with the correct scene label, averaged over the test sequences. This measures the quality of the classification. In

Conclusions

In this study, the framework of SMs has been introduced to video indexing as a means of performing audiovisual integration with relaxed synchrony constraints. Baseline HMM systems suffer from the state-synchrony constraint imposed by frame-based observations. The use of segmental features instead extends the synchronization points between the modalities to the segment boundaries. By modeling each modality inside its own segment, native sampling rates and model topologies can be used. SMs

Acknowledgment

This work was partially supported by funding from EC Network of Excellence MUSCLE (FP6-507752).

References (32)

  • R. Brunelli et al., A survey on the automatic indexing of video data, Journal of Visual Communication and Image Representation (1999)
  • N. Oliver et al., Layered representations for learning and inferring office activity from multiple sensory channels, Computer Vision and Image Understanding (2004)
  • L. Xie et al., Structure analysis of soccer video with domain knowledge and hidden Markov models, Pattern Recognition Letters (2004)
  • C. Snoek et al., Multimodal video indexing: a review of the state-of-the-art, Multimedia Tools and Applications (2005)
  • L. Rabiner, A tutorial on hidden Markov models and selected applications in speech recognition, Proceedings of the IEEE (1989)
  • W. Wolf, Hidden Markov model parsing of video programs, in: Proceedings of ICASSP, 1997, pp. ...
  • E. Kijak et al., Audiovisual integration for tennis broadcast structuring, Multimedia Tools and Applications (2006)
  • M. Ostendorf et al., From HMMs to segment models: a unified view of stochastic modeling for speech recognition, IEEE Transactions on Speech and Audio Processing (1996)
  • Y. Wang et al., Multimedia content analysis, IEEE Signal Processing Magazine (2000)
  • J. Calic, N. Campbell, S. Dasiopoulou, Y. Kompatsiaris, An overview of multimodal video representation for semantic ...
  • J. Huang, Z. Liu, Y. Wang, Y. Chen, E. Wong, Integration of multimodal features for video classification based on HMM, ...
  • J. Boreczky, L. Wilcox, A hidden Markov model framework for video segmentation using audio and image features, in: ...
  • T. Bae, S. Jin, Y. Ro, Video segmentation using hidden Markov model with multimodal features, in: Proceedings of the ...
  • A. Alatan et al., Multi-modal dialog scene detection using hidden Markov models for content-based multimedia indexing, Multimedia Tools and Applications (2001)
  • N. Dimitrova, L. Agnihorti, G. Wei, Video classification based on HMM using text and faces, in: Proceedings of the ...
  • S. Eickeler, S. Muller, Content-based video indexing of TV broadcast news using Hidden Markov Models, in: IEEE Int. ...