Audiovisual integration with Segment Models for tennis video parsing
Introduction
Automatic annotation of video documents is a powerful tool for managing large video databases and, more recently, for developing sophisticated consumer products that meet high-level user needs such as highlight extraction. This task can be accomplished with explicit hand-crafted, and thus domain-dependent, models, which perform reasonably well in some cases [1]. However, more effective ways are needed to bridge the gap between high-level user needs and the low-level video features at hand, such as image histograms or speaker excitation. A key question towards this end is an efficient video content representation scheme [2]. The Hidden Markov Model (HMM) [3] is a powerful statistical approach for modeling video content and can be used as a statistical parser of a video sequence [4], borrowing notions from the field of speech recognition.
We use Markovian models for the structure analysis of tennis broadcasts. In this type of video, game rules as well as production rules result in a structured document. In a previous work, we used flat or hierarchical HMMs [5] to parse this structure and to segment raw video data into semantically meaningful scenes. The table of contents of the video can then be constructed automatically.
Audiovisual integration with HMMs is generally addressed with a concatenative fusion scheme that assumes homogeneous and synchronous features. However, the visual and auditory modalities are sampled at different rates. In addition, the visual content follows the production rules, while the auditory content captures raw sounds from the court, interleaved with commentary speech. There is thus, firstly, a certain degree of asynchrony between the auditory and visual features and, secondly, they follow different temporal models.
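The frame-synchrony constraint can be made concrete with a small sketch: in concatenative (early) fusion, the audio stream must first be forced onto the visual frame rate before the per-frame feature vectors are concatenated. All names, rates, and data below are illustrative, not taken from the paper.

```python
import numpy as np

# Hypothetical feature streams: visual features at the video frame rate and
# audio features at a (much) higher rate. Sizes are invented for illustration.
rng = np.random.default_rng(0)
visual = rng.normal(size=(25, 2))    # 25 video frames, 2 visual features
audio = rng.normal(size=(100, 3))    # 100 audio frames, 3 audio features

def concatenative_fusion(visual, audio):
    """Force both streams onto the visual rate, then concatenate.

    This is the classic early-fusion scheme: each visual frame is paired
    with the nearest audio frame, which imposes frame-level synchrony
    between the modalities."""
    idx = np.linspace(0, len(audio) - 1, num=len(visual)).round().astype(int)
    return np.hstack([visual, audio[idx]])

fused = concatenative_fusion(visual, audio)
print(fused.shape)  # (25, 5): one joint observation vector per video frame
```

The subsampling step is exactly where information is lost: the audio model is forced to operate at the visual rate, which SMs avoid by scoring each modality separately inside a segment.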
In this study, we introduce video indexing with Segment Models (SMs) [6] as a means of more efficient and versatile multimodal fusion, and provide an experimental comparison with HMM-based fusion. With SMs, the synchrony constraints between the modalities can be relaxed to the scene boundaries, making it possible to process each modality at its native sampling rate and with its own model. The aim of this work is to explore the many possibilities of audiovisual integration that SMs offer, rather than to design an integrated, general-purpose tennis video parser. To this end, we employ a small set of audiovisual features, sufficient for our needs, and restrict ourselves to the producer styles of French television.
This paper is organized as follows. Related work on HMM-based multimodal fusion is reviewed in Section 2. Basic definitions for the problem at hand are provided in Section 3. Feature extraction is discussed in Section 4. In Section 5 we show how the visual content is modeled by HMMs and SMs, stressing the conceptual differences between these two modeling alternatives. Audiovisual integration is then discussed in Section 6. Parameter estimation details are provided in Section 7 and experimental results in Section 8. Finally, Section 9 concludes this study.
Previous work on multimodal fusion
For a detailed review of the various aspects of the multimodal video indexing problem, the interested reader is referred to the literature reviews [1], [7], [2], [8]. In this section we will focus on HMMs, which are widely used to exploit the temporal aspect of video data. Indeed, depending on the video genre and the production rules, video events occur with a temporal order that will finally reveal the semantics. HMMs provide a powerful statistical framework for handling sequential data and
Problem definition
Before proceeding to the discussion on the models and audiovisual integration, we give basic definitions on the problem at hand in this section and present the audiovisual features below in Section 4.
A number of invariant characteristics occur in every tennis video as a result of the game rules and of the production work applied before broadcast. When game action occurs, for instance, the camera is switched to a court view. It is extremely rare, although still
Feature extraction
We used a corpus of 6 complete tennis videos, recorded between 1999 and 2001 and kindly provided by INA. Details of the videos are given in Table 1. Every video contains a single tennis match, i.e. there are no court views that are split in order to display two or more tennis matches. The parts of the broadcasts before and after the actual tennis match were manually removed from the videos. Nevertheless, the programs still contain
The video models
Based on the features defined in the previous section, we now provide details on how HMMs and SMs are applied to the problem of video structure parsing. We first consider a unimodal (video-only) scenario to conceptually point out differences between these two models. The integration of audio is discussed in the following section. The problem of parameter estimation is discussed later in the paper.
In both the HMM and SM framework, a video is modeled as an ergodic structure of scenes. Note that
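As a rough illustration of decoding against an ergodic scene structure, the sketch below runs Viterbi decoding over a toy ergodic model. Scene labels, transition and observation probabilities are all invented for the sketch; the paper's actual model uses the states of its Fig. 2.

```python
import numpy as np

# Toy ergodic scene model: every scene label can follow every other.
scenes = ["rally", "replay", "break"]
A = np.array([[0.6, 0.2, 0.2],
              [0.3, 0.5, 0.2],
              [0.3, 0.2, 0.5]])          # transition probabilities
B = np.array([[0.7, 0.2, 0.1],
              [0.2, 0.6, 0.2],
              [0.1, 0.2, 0.7]])          # P(observation symbol | scene)
pi = np.array([0.5, 0.25, 0.25])         # initial scene probabilities

def viterbi(obs):
    """Most likely scene label sequence for discrete observations."""
    T, N = len(obs), len(scenes)
    delta = np.zeros((T, N))             # best log-score ending in each state
    psi = np.zeros((T, N), dtype=int)    # backpointers
    delta[0] = np.log(pi) + np.log(B[:, obs[0]])
    for t in range(1, T):
        scores = delta[t - 1][:, None] + np.log(A)
        psi[t] = scores.argmax(axis=0)
        delta[t] = scores.max(axis=0) + np.log(B[:, obs[t]])
    path = [int(delta[-1].argmax())]
    for t in range(T - 1, 0, -1):
        path.append(int(psi[t][path[-1]]))
    return [scenes[s] for s in reversed(path)]

print(viterbi([0, 0, 1, 1, 2]))
# → ['rally', 'rally', 'replay', 'replay', 'break']
```

With an HMM the observations enter frame by frame as above; with an SM the inner loop would instead score whole candidate segments, with an explicit duration model per scene.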
Audiovisual integration
The sound track of the video is an important source of information that should be taken into consideration in our modeling. For instance, states 1, 3, and 5 of Fig. 2 all correspond to court views, which are visually very similar to states 9 and 12, which correspond to idle court views. The detection of ball hits in the sound track can greatly help in the disambiguation between idle and action court views. In this section, we discuss audiovisual integration in HMMs and SMs.
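The segment-level fusion idea can be sketched as follows: within one candidate scene segment, each modality is scored by its own model at its native rate (e.g. shot-based visual symbols and frame-based ball-hit flags), and the log-scores are combined only at the segment boundary. All models, labels, and numbers below are invented for illustration.

```python
import numpy as np

def sm_segment_logscore(visual_obs, audio_obs,
                        visual_loglik, audio_loglik, len_logprob):
    """Segment-level audiovisual score: each per-modality scoring function
    consumes the whole segment at its own rate; the scores meet only here,
    at the segment boundary, together with a scene-duration prior."""
    return visual_loglik(visual_obs) + audio_loglik(audio_obs) + len_logprob

# Toy per-modality models: independent discrete distributions for an
# (assumed) "action court view" scene.
v_model = np.log([0.6, 0.3, 0.1])    # P(visual symbol | scene)
a_model = np.log([0.2, 0.8])         # P(ball-hit flag | scene)

score = sm_segment_logscore(
    [0, 0, 1],            # 3 shots of visual symbols
    [1, 1, 0, 1, 1, 1],   # 6 audio frames of ball-hit flags
    lambda v: float(np.sum(v_model[v])),
    lambda a: float(np.sum(a_model[a])),
    len_logprob=float(np.log(0.3)),   # explicit scene-duration prior
)
```

Note that the two streams have different lengths (3 shots vs. 6 audio frames): nothing forces them onto a common rate, which is precisely the relaxation SMs provide over frame-synchronous HMM fusion.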
Parameter estimation
We manually annotated the video sequences on top of the automatic video segmentation with the state labels defined in Fig. 2. The training sequences contain 807 scenes in total and the test ones 979.
In order to yield discrete observation distributions, the visual similarity and length features were quantized uniformly into 10 bins each. The number of bins was determined experimentally. Estimating the HMM parameters, i.e. transition and observation probabilities, as relative frequency of
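A minimal sketch of these two estimation steps, uniform quantization into 10 bins and relative-frequency (maximum-likelihood) estimation of the transition probabilities from labeled sequences, with invented data:

```python
import numpy as np

def quantize_uniform(x, n_bins=10):
    """Uniformly quantize a continuous feature into n_bins discrete symbols,
    using equal-width bins over the observed range."""
    x = np.asarray(x, dtype=float)
    edges = np.linspace(x.min(), x.max(), n_bins + 1)
    return np.clip(np.digitize(x, edges[1:-1]), 0, n_bins - 1)

def relative_freq_transitions(state_seqs, n_states):
    """Transition probabilities estimated as relative frequencies of
    labeled state bigrams (the ML estimate from annotated data)."""
    counts = np.zeros((n_states, n_states))
    for seq in state_seqs:
        for s, t in zip(seq[:-1], seq[1:]):
            counts[s, t] += 1
    row = counts.sum(axis=1, keepdims=True)
    return np.divide(counts, row, out=np.zeros_like(counts), where=row > 0)

# Two toy annotated state sequences over 3 states.
A = relative_freq_transitions([[0, 1, 1, 2], [0, 1, 2, 2]], n_states=3)
print(A[0])  # [0. 1. 0.]: state 0 is always followed by state 1 in the data
```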
Experimental results
Firstly, let us recall that half of the videos were reserved for testing purposes. In addition, as the ground truth of the videos was collected on top of the automatic video track segmentation, errors of the hard cut and dissolve detection algorithm are not taken into account in this analysis.
Performance measurements include, firstly, the percentage C of shots classified with the correct scene label, averaged over the test sequences. This measures the quality of the classification. In
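The measurement C can be sketched as follows, on hypothetical reference and hypothesis label sequences:

```python
import numpy as np

def shot_classification_rate(reference_seqs, hypothesis_seqs):
    """Percentage C of shots carrying the correct scene label,
    computed per sequence and then averaged over the test sequences."""
    rates = [np.mean(np.asarray(ref) == np.asarray(hyp)) * 100.0
             for ref, hyp in zip(reference_seqs, hypothesis_seqs)]
    return float(np.mean(rates))

# Invented shot-label sequences (one list per test video).
ref = [[0, 0, 1, 2], [1, 1, 2]]
hyp = [[0, 1, 1, 2], [1, 2, 2]]
print(shot_classification_rate(ref, hyp))  # (75 + 66.67) / 2 ≈ 70.8
```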
Conclusions
In this study, the framework of SMs has been introduced to video indexing as a means of performing audiovisual integration with relaxed synchrony constraints. Baseline HMM systems suffer from the state-synchrony constraint imposed by frame-based observations. The use of segmental features instead moves the synchronization points between the modalities to the segment boundaries. By modeling each modality inside its own segment, native sampling rates and model topologies can be used. SMs
Acknowledgment
This work was partially supported by funding from EC Network of Excellence MUSCLE (FP6-507752).
References (32)
- et al., A survey on the automatic indexing of video data, Journal of Visual Communication and Image Representation (1999)
- et al., Layered representations for learning and inferring office activity from multiple sensory channels, Computer Vision and Image Understanding (2004)
- et al., Structure analysis of soccer video with domain knowledge and hidden Markov models, Pattern Recognition Letters (2004)
- et al., Multimodal video indexing: a review of the state-of-the-art, Multimedia Tools and Applications (2005)
- A tutorial on hidden Markov models and selected applications in speech recognition, Proceedings of the IEEE (1989)
- W. Wolf, Hidden Markov model parsing of video programs, in: Proceedings of ICASSP, 1997, pp. ...
- et al., Audiovisual integration for tennis broadcast structuring, Multimedia Tools and Applications (2006)
- et al., From HMMs to segment models: a unified view of stochastic modeling for speech recognition, IEEE Transactions on Speech and Audio Processing (1996)
- et al., Multimedia content analysis, IEEE Signal Processing Magazine (2000)
- J. Calic, N. Campbell, S. Dasiopoulou, Y. Kompatsiaris, An overview of multimodal video representation for semantic ...
- Multi-modal dialog scene detection using hidden Markov models for content-based multimedia indexing, Multimedia Tools and Applications