Framework for measurement of the intensity of motion activity of video segments

https://doi.org/10.1016/j.jvcir.2004.04.007

Abstract

We present a psychophysical and analytical framework for comparing the performance of motion activity measures for video segments, with respect to a subjective ground truth. We first obtain a ground truth for the motion activity by conducting a psychophysical experiment. Then we present several low-complexity motion activity descriptors computed from compressed domain block motion vectors. In the first analysis, we quantize the descriptors and show that they perform well against the ground truth. The MPEG-7 motion activity descriptor is also among the best performers. In the second analysis, we examine the specific cases where each descriptor fails, using a novel pair-wise comparison method. The analytical measures overestimate or underestimate the intensity of motion activity under strong camera motion or extreme camera angles. We finally discuss the experimental methodology and analysis methods we used, and possible alternatives. We review the applications of motion activity and how our results relate to them.

Introduction

Indexing the vast amounts of digital video content for browsing, retrieval and summarization has been an active research topic in recent years. The MPEG-7 standard for the description of multimedia metadata was developed to cover key technologies in this area. It covers a number of features, or descriptors (Ds), of video content, such as shape, color and motion, that are used for indexing. Motion activity is one of the descriptors included in the MPEG-7 specification (Jeannin and Divakaran, 2001; MPEG, 1999).

The intensity of motion activity is a subjective measure of the perceived intensity, or amount, of motion activity in a video segment. A talking head in an interview is usually a low activity segment, whereas a close-up shot of a slam-dunk in basketball is perceived as high activity. Note that it is different from camera or global motion in that it considers the overall perceived intensity of motion activity in the scene.

A number of low-complexity measures have been used to describe such motion activity characteristics of video segments (Akutsu et al., 1992; Ardizzone et al., 1999; Divakaran and Peker, 2001; Divakaran et al., 2001; Kobla et al., 1997; MPEG, 1999; Wolf, 1996). Pfeiffer et al. use a combination of image and audio features to determine the activity level of video segments, and use that information to select interesting segments from video for summarization (Pfeiffer et al., 1996). Vasconcelos and Lippman use a ‘tangent distance’ between consecutive frames as the measure of motion activity, and use it to characterize video sequences in terms of their action or romance content (Vasconcelos and Lippman, 1997). Wolf uses the mode of motion vector magnitudes as the measure of activity level, which he then uses to find the most still image in a video segment and select it as a representative frame (Wolf, 1996). The motion activity descriptor used in MPEG-7 is the variance of the motion vector magnitudes, which is readily computable from MPEG compressed video streams (MPEG, 1999). Motion activity is interpreted as a measure of “summarizability”, or the entropy of a video segment, in (Divakaran and Peker, 2001; Divakaran et al., 2001), and based on this interpretation, the average of the magnitudes of block motion vectors is used in summarizing video segments. A similar motion activity descriptor is used in detecting interesting events in sports video (Peker et al., 2002; Xie et al., 2002).
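The three magnitude statistics mentioned above (mode, variance and average of motion vector magnitudes) can be illustrated with a minimal sketch. It assumes the block motion vectors of one frame are already available as (dx, dy) pairs. The function name, the input format, the integer rounding used before taking the mode, and the choice of the population variance are illustrative assumptions, not the implementation of any of the cited works.

```python
import math
from collections import Counter
from statistics import fmean, pvariance


def motion_activity_descriptors(motion_vectors):
    """Summarize the block motion vectors of one frame.

    motion_vectors: iterable of (dx, dy) pairs, one per macroblock
    (hypothetical input format; extracting them from an MPEG stream
    is outside this sketch).
    """
    magnitudes = [math.hypot(dx, dy) for dx, dy in motion_vectors]
    # Assumption: magnitudes are rounded to integers before taking the mode,
    # since the mode of raw floating-point values is rarely meaningful.
    mode_value, _ = Counter(round(m) for m in magnitudes).most_common(1)[0]
    return {
        "mean": fmean(magnitudes),          # average magnitude
        "variance": pvariance(magnitudes),  # population variance of magnitudes
        "mode": mode_value,                 # most frequent rounded magnitude
    }


# Toy frame with four blocks: two static, two moving.
example = motion_activity_descriptors([(3, 4), (0, 0), (0, 0), (6, 8)])
```

A per-segment value would additionally require aggregating these per-frame statistics over the frames of the segment; how that aggregation is done is not specified in this passage.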

The motion activity descriptor enables applications such as video browsing, surveillance, video content re-purposing and content based querying of video databases (Manoranjan et al., 1999). It is more effective when the video content consists of semantic units that differ significantly in their motion activity levels. For instance, in a news video, anchorperson shots have very low activity, whereas outdoor footage or sports segments have higher activity levels (Divakaran and Peker, 2001; Divakaran et al., 2001). Still, the strength of motion activity lies in its very low-cost, compressed domain descriptors, which allow very fast pre-filtering of data, or dynamic and interactive browsing and summarization applications in which the user or further automatic processing tolerates the initial low precision. We believe that machine vision techniques, which mostly involve optical flow and segmentation, are not yet easily applicable to natural video, especially within the constraints of applications where the data volume is large and speeds high enough to allow interactivity are desired.

There are two possible approaches to the performance evaluation of a motion activity measure. One is as an estimator of the perceived subjective motion activity as described in the introduction. The second is as an analytical measure that is used within a specific application context, where it is considered successful to the degree that it contributes to the performance of the overall application. In the second case, the conformance of the descriptor to the perceived intensity of motion activity is not of primary concern.

In the case of MPEG-7, the motion activity descriptor is defined as an estimator of the subjective motion activity. Hence, an evaluation of the alternative motion activity measures using a subjective ground truth is necessary. The ground truth in this case is the perceived intensity of motion activity in a given video segment, as evaluated by human subjects. While the MPEG-7 descriptor was developed with a ground truth data-set of 622 video segments (Divakaran et al., 2000a, 2000b), the data-set lacks statistical data about the subjects and also has segments that vary in both length and quality of shot segmentation. This has made it difficult to assess the efficacy of any automatically computed descriptor of the intensity of motion activity. In Peker (2001), Peker and Divakaran (2001), and Peker et al. (2001a), we provide a psychophysically sound basis for subjective and objective measurement of the intensity of motion activity.

In the following sections, we describe the psychophysical framework for the measurement of the intensity of motion activity of video segments, and the evaluation of the performance of different analytical measures of motion activity. First, we construct a test-set of video segments carefully selected so as to cover a wide variety and dynamic range of motion activity. We conduct a psychophysical experiment with 15 subjects to obtain a ground truth for the motion activity. Then we present several low-complexity motion activity descriptors computed from MPEG motion vectors in the compressed domain.

We compare the motion activity of the video segments in the test set as assessed by the subjects and as computed by the analytical descriptors. In the first comparison method, we quantize the analytical descriptors and compute the error with respect to a ground truth computed across the subjects. In the second comparison method, we compare the test video segments in pairs to determine the pairs where one video segment is unanimously rated as higher activity than the other by the subjects. Then, for each analytical descriptor, we find the number of such pairs where the descriptor fails to give the correct ordering. Based on these results, we examine the specific cases (pairs of video segments) where each analytical descriptor, and motion vector based descriptors in general, tend to fail. We verify our initial subjective observation that distance from the camera and strong camera motion are the main cases where motion vector based descriptors tend to overestimate or underestimate the intensity of motion activity. We also show that the variance of the magnitude of motion vectors, on which the MPEG-7 motion activity descriptor is based, is one of the best among the descriptors tested.
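The two comparison methods can be sketched roughly as follows. The sketch assumes that each clip has one continuous descriptor value and a list of integer activity ratings, one per subject. The median is used as the ground truth across subjects; the quantization thresholds, the absolute-difference error, the treatment of descriptor ties as failures, and all function names are hypothetical choices made for illustration only.

```python
from itertools import combinations
from statistics import median


def quantize(value, thresholds):
    """Map a continuous descriptor value to a level in 1..len(thresholds)+1.

    thresholds: ascending cut points (hypothetical; not taken from the paper).
    """
    return 1 + sum(value > t for t in thresholds)


def average_error(descriptor_values, ratings, thresholds):
    """Mean absolute difference between quantized descriptor and median rating."""
    errors = [
        abs(quantize(value, thresholds) - median(clip_ratings))
        for value, clip_ratings in zip(descriptor_values, ratings)
    ]
    return sum(errors) / len(errors)


def unanimous_pairs(ratings):
    """Return (higher, lower) clip index pairs on which all subjects agree."""
    pairs = []
    for i, j in combinations(range(len(ratings)), 2):
        if all(a > b for a, b in zip(ratings[i], ratings[j])):
            pairs.append((i, j))
        elif all(a < b for a, b in zip(ratings[i], ratings[j])):
            pairs.append((j, i))
    return pairs


def count_pair_failures(descriptor_values, ratings):
    """Count unanimous pairs that the descriptor does not order correctly.

    Assumption: equal descriptor values count as a failure.
    """
    return sum(
        1
        for higher, lower in unanimous_pairs(ratings)
        if descriptor_values[higher] <= descriptor_values[lower]
    )


# Toy data: 3 clips, 3 subjects each (ratings[clip][subject]).
ratings = [[1, 1, 2], [3, 4, 3], [2, 2, 2]]
descriptor_values = [0.5, 2.0, 3.0]
avg_err = average_error(descriptor_values, ratings, thresholds=[1.0, 2.5])
failures = count_pair_failures(descriptor_values, ratings)
```

The pairwise count does not depend on the quantization thresholds, which is one reason it can expose individual failure cases that the average error hides.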

Section snippets

Selection of the test clips

We select 294 video segments of length 1.5 s from the whole MPEG-7 test content. The number and duration of clips are chosen as a compromise between viewer fatigue, memory effects, etc., and sufficient data size for analysis, sufficient duration for perception, etc. We select the test clips from over 11 h of MPEG-7 test video sequences through several elimination steps to cover a diverse range of semantic activity types and activity levels. We first use biased random sampling to have a better

Motion vector based measures of motion activity

We use a number of low-complexity measures computed from compressed domain MPEG block motion vectors. Although the compressed domain motion vectors are not precise enough for object motion analysis, they are sufficient for the measurement of the gross motion in video. The low-complexity, compressed domain computation of the measures allows low-cost, high-speed processing of large amounts of video data, enabling applications such as data reduction through pre-filtering, real-time processing,

Average error performance of the descriptors

The first analysis of performance of the descriptors is based on the average error with respect to the ground truth over the whole data set (Peker et al., 2001a). We use the median of the subjects’ evaluations as our ground truth so as to minimize the effect of outliers. Taking the mean of the subjective levels would assume a linear scale for the subjective motion activity, which is not necessarily true. Rounding of the mean, as well, is problematic in the context of discrete activity levels of

Limitations of the average error analysis

The average error analysis described in the previous section validates the proposed descriptors as acceptable estimators of the subjective level of activity, and shows their comparative performance. However, the aforementioned framework of analysis based on derivation of a ground truth from 15 subjective evaluations and quantization of the computed descriptors does not allow for a more precise and detailed performance study. Note that we need to make certain assumptions to overcome a number of

Selection of the data set

The randomness of the test set is an important factor in statistical analysis of experiments. Any such psychophysical experimental test set should not be biased towards a particular subset where some of the measures are particularly strong or weak. In our case, the test set should provide a fair representation of the “universe” of video segments that we want the motion activity descriptor to be applied to. We could not carry out a simple random sampling of the initial MPEG-7 test set because it

Conclusions

We reviewed our past work in which we established a subjective ground truth for motion activity and an analysis framework for the performance of descriptors. We showed that the low-complexity, motion vector based descriptors proposed in this paper are acceptable estimators of motion activity. We analyzed the descriptors’ performance and how they compare to each other in terms of average error. We then presented a novel pairwise comparison analysis method to investigate the performance of the

References (20)

  • Akutsu, A., et al., 1992. Video indexing using motion vectors.
  • Ardizzone, E., LaCascia, M., Avanzato, A., Bruna, A., 1999. Video indexing using MPEG motion compensation vectors. In:...
  • Divakaran, A., Peker, K.A., Sun, H., Vetro, A., 2000a. A supplementary ground-truth dataset for intensity of motion...
  • Divakaran, A., Sun, H., Kim, H., Park, C.S., Manjunath, B.S., Sun, X., Shin, H., Vinod, V.V., et al., 2000b. Report on...
  • Divakaran, A., Peker, K.A., 2001. Video summarization using motion descriptors. In: Proceedings of the SPIE Conference...
  • Divakaran, A., et al., 2001. Video summarization with motion descriptors. J. Electronic Imaging.
  • Jeannin, S., et al., 2001. MPEG-7 visual motion descriptors. Proc. IEEE Trans. Circuits Systems Video Technol.
  • Kobla, V., et al., 1997. Compressed domain video indexing techniques using DCT and motion vector information in MPEG video.
  • Manoranjan, D., Divakaran, A., Manjunath, B.S., Ganesh R, Vinod, V., 1999. Requirements for an activity feature and...
  • MPEG-7, 1999. Visual part of the XM 4.0, ISO/IEC MPEG99/W3068, Maui, USA, December...
There are more references available in the full text version of this article.
