Scene-adaptive video partitioning by semantic object tracking

https://doi.org/10.1016/j.jvcir.2005.02.003

Abstract

An adaptive mechanism for video partitioning by semantic object tracking is proposed. A video scene consists of the sequence of frames between two adjacent scene changes, which can be detected according to the video scene complexity. In general, video scene complexity can be described by two characteristics: temporal-domain motion complexity and spatial-domain activity complexity. For this purpose, we propose a novel spatial-temporal segmentation method as a general segmentation algorithm that combines several types of information, including color and motion. A region within a foreground object is called a foreground region and is characterized as a moving uniform region. An algorithm for object tracking based on the foreground regions is also included in order to recognize camera and object movements and obtain correct video shots. By analyzing foreground objects between consecutive frames, the types of scene change and the types of camera movement can be detected according to the number of entering and exiting regions and the motion vectors, respectively. Based on these parameters, the frames of a video sequence are categorized into normal, cut, fade, and dissolve classes. Adaptation is realized by grouping a variable number of labeled frames into a unit containing a scene change, which is determined automatically by moment-preserving thresholding techniques. Experimental results are presented to demonstrate the improved segmentation performance of the new method.

Introduction

Video segmentation is the first step toward the goal of automatically annotating video data for browsing and retrieval [1], [2], [3]. A video sequence is usually partitioned by a video segmentation algorithm into a set of meaningful and manageable segments (shots), which are the basic elements for indexing. A shot is a continuous sequence of frames taken from one camera. Each shot can be represented by key frames and indexed according to spatial and temporal features. Given a query video, a video stored in the database is retrieved if its feature vector is similar to that of the query video.

There are two basic types of shot transitions: abrupt and gradual [4]. An abrupt transition (cut) occurs in a single frame between two shots, whereas gradual transitions combine two shots through fade-in and fade-out, dissolve, and other cinematic effects. Gradual transitions are more difficult to detect than cuts; it is particularly difficult to detect dissolves between sequences involving intensive motion [5].

Another important issue for video segmentation is the detection of camera operations, which may reveal the viewer's focus and are useful for selecting key frames. Some camera operations, such as panning and zooming, substantially change the content of a scene, thus suggesting the use of more than one key frame. Many methods have been proposed in the literature [4], [6], [7], [8], [9] for camera operation recognition.
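
To make the idea concrete, the following minimal sketch (not any of the cited methods) fits a two-parameter zoom-plus-pan model to block motion vectors by least squares and labels the dominant camera operation; the function name and both thresholds are illustrative assumptions.

```python
import numpy as np

def classify_camera_motion(positions, vectors, zoom_thr=0.01, pan_thr=1.0):
    """Least-squares fit of a zoom-plus-pan model v_i = a * p_i + t to
    block motion vectors, then a label for the dominant camera operation.
    positions: (N, 2) block centers; vectors: (N, 2) motion vectors.
    The two thresholds are illustrative and must be tuned per sequence."""
    p = positions - positions.mean(axis=0)   # center the block coordinates
    t = vectors.mean(axis=0)                 # translation = mean motion vector
    r = vectors - t                          # residual motion after panning
    a = (p * r).sum() / (p * p).sum()        # scalar zoom/divergence factor
    if abs(a) > zoom_thr:
        return "zoom in" if a > 0 else "zoom out"
    if np.hypot(t[0], t[1]) > pan_thr:
        return "pan/tilt"
    return "static"
```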

Algorithms for temporal video segmentation can be classified into two classes according to the domain (compressed or uncompressed) in which a technique operates. The problem of temporal video segmentation has been approached from different aspects. These can be broadly divided into four categories: similarity-based, clustering-based, feature-based, and model-driven temporal video segmentation.

The majority of algorithms are similarity-based temporal video segmentation methods, which separate uncompressed videos on the basis of a similarity measurement between two successive images. When two images are sufficiently dissimilar, there may be a cut. Gradual transitions are found by using cumulative difference measures and more sophisticated thresholding schemes. Pair-wise pixel [10], [11] and block-based [12], [13] comparisons evaluate the differences in intensity or color values of corresponding pixels in two successive frames. Although some irrelevant frame differences are removed, these approaches are sensitive to object and camera movements. In contrast to template-matching measures, sensitivity to camera and object movements can be further reduced by comparing the histograms of successive images [14]. One advantage of histogram-based comparison is that histograms are invariant to object rotation. On the other hand, when this method is applied to segment a film into a set of shots, a problem occurs when two images with similar histogram distributions have completely different content [4].
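
A minimal sketch of the histogram-based comparison described above, assuming 8-bit gray-level frames as numpy arrays; the bin count and cut threshold are illustrative choices, not values from the cited work.

```python
import numpy as np

def histogram_difference(frame_a, frame_b, bins=64):
    """Sum of absolute bin-wise differences between the normalized
    gray-level histograms of two frames: 0 if identical, at most 2."""
    ha = np.histogram(frame_a, bins=bins, range=(0, 256))[0].astype(float)
    hb = np.histogram(frame_b, bins=bins, range=(0, 256))[0].astype(float)
    return np.abs(ha / ha.sum() - hb / hb.sum()).sum()

def detect_cuts(frames, threshold=0.5):
    """Declare a cut wherever successive frames are sufficiently
    dissimilar; the fixed global threshold is the scheme's weak point."""
    return [i + 1 for i, (a, b) in enumerate(zip(frames, frames[1:]))
            if histogram_difference(a, b) > threshold]
```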

Compared with similarity-based segmentation algorithms, cluster-based segmentation algorithms aim to separate video shots without setting similarity thresholds between successive images [15]; such thresholds are typically highly dependent on the type of input video. Basically, the frames in a video sequence can be classified into two classes, scene change and no scene change, and the well-known K-means clustering [16] can be used to cluster frame dissimilarities. A limitation of cluster-based temporal video segmentation is that gradual transitions are difficult to detect.
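
The following sketch illustrates the idea with a simple two-class K-means over the one-dimensional frame-dissimilarity signal; the initialization and stopping rules are assumptions for illustration, not the procedure of [15] or [16].

```python
import numpy as np

def kmeans_scene_changes(dissimilarities, iters=50):
    """Two-class K-means on the 1-D frame-dissimilarity signal: frames
    assigned to the high-mean cluster are declared scene changes, so no
    explicit similarity threshold has to be chosen by hand."""
    d = np.asarray(dissimilarities, dtype=float)
    lo, hi = d.min(), d.max()                    # initial centroids
    for _ in range(iters):
        to_hi = np.abs(d - hi) < np.abs(d - lo)  # nearest-centroid assignment
        if to_hi.all() or not to_hi.any():       # degenerate split, stop
            break
        lo_new, hi_new = d[~to_hi].mean(), d[to_hi].mean()
        if lo_new == lo and hi_new == hi:        # converged
            break
        lo, hi = lo_new, hi_new
    return np.flatnonzero(to_hi)                 # indices flagged as changes
```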

An interesting feature-based approach for temporal video segmentation was proposed by Zabih et al. [5]. It assumes that the locations of intensity edges between two successive images change during a cut or a dissolve. Thus, by counting the entering and exiting edge pixels, cuts, fades, and dissolves are detected and classified. To obtain better results in the presence of object and camera movements, an algorithm for motion compensation is also included: before detecting entering and exiting edge pixels, the global motion is estimated and used to align the frames. However, this technique is not able to handle multiple rapidly moving objects. As the authors have pointed out, another weakness of the approach is the false positives caused by the limitations of the edge detection method. In particular, rapid changes in the overall shot brightness, and very dark or very light frames, may cause false positives.
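
A rough sketch of the edge-change measure underlying this approach, using Canny edges and a dilation to tolerate small displacements; it omits the global motion-compensation step of [5], and the Canny thresholds and dilation size are assumptions.

```python
import cv2
import numpy as np

def edge_change_ratio(prev, curr, dilate_px=5):
    """Fraction of entering and exiting edge pixels between two 8-bit
    gray-level frames; a one-frame spike suggests a cut, a sustained
    plateau suggests a dissolve or fade."""
    e_prev = cv2.Canny(prev, 100, 200) > 0
    e_curr = cv2.Canny(curr, 100, 200) > 0
    kernel = np.ones((dilate_px, dilate_px), np.uint8)
    d_prev = cv2.dilate(e_prev.astype(np.uint8), kernel) > 0
    d_curr = cv2.dilate(e_curr.astype(np.uint8), kernel) > 0
    # Entering edges: present now, far from any previous edge (and vice versa).
    entering = (e_curr & ~d_prev).sum() / max(e_curr.sum(), 1)
    exiting = (e_prev & ~d_curr).sum() / max(e_prev.sum(), 1)
    return max(entering, exiting)
```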

An interesting approach to separating a video sequence into several meaningful video shots is to analyze the trajectories of the semantic video objects, which correspond to meaningful entities in the input data. To provide new functionalities for multimedia applications, such as content-based video retrieval, the video coding standard MPEG-4 [22] treats video scenes as compositions of audio-visual objects. A real-world object in a frame is represented by a video object plane (VOP), which is a snapshot of a moving object at a given time. Two video scenes containing different video objects can then be regarded as semantically different; consequently, video shots cannot be uniquely characterized by a single low-level feature such as motion, texture, or color.

Video object segmentation is generally far more difficult than low-level segmentation due to the lack of complete image understanding models. Although human eyes can identify video objects easily, automatic segmentation of semantic video objects remains one of the fundamental research problems in the image analysis community. During the past two decades, intensive research has been carried out in the automatic segmentation domain [23], [24], [25], [26], [34]. These techniques achieve efficient segmentation by subdividing a frame into a number of arbitrarily shaped moving objects and the background according to a homogeneous color criterion, a homogeneous motion criterion, and object tracking. A single homogeneous color or motion criterion does not, in general, lead to satisfactory extraction of a complete semantic video object, because a semantic video object may contain multiple colors and multiple regions. A feasible solution is for a human and a computer to cooperate in a semi-automatic semantic object segmentation algorithm [23], [24], [25], [26], [27], [28]: the user first identifies a semantic object through a tracing interface, and the computer then automatically tracks the segmented object through the successive frames. Another interesting development for object tracking is based on particle filtering, an inference technique for estimating the unknown motion state from a noisy collection of observations arriving in a sequential fashion [34].
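
As a sketch of the particle-filtering idea (a generic bootstrap filter, not the specific tracker of [34]), one predict-weight-resample cycle for a 2-D object position might look as follows; the random-walk motion model and Gaussian observation model are assumptions for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

def particle_filter_step(particles, weights, observation,
                         motion_std=4.0, obs_std=8.0):
    """One predict-weight-resample cycle of a bootstrap particle filter
    tracking a 2-D object position from noisy position measurements."""
    # Predict: propagate each particle with a random-walk motion model.
    particles = particles + rng.normal(0.0, motion_std, particles.shape)
    # Weight: Gaussian likelihood of the observed position
    # (small epsilon keeps the normalization safe if all weights vanish).
    err = np.linalg.norm(particles - observation, axis=1)
    weights = weights * np.exp(-0.5 * (err / obs_std) ** 2) + 1e-12
    weights /= weights.sum()
    # Resample: multinomial resampling to avoid weight degeneracy.
    idx = rng.choice(len(particles), len(particles), p=weights)
    return particles[idx], np.full(len(particles), 1.0 / len(particles))
```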

This paper presents an adaptive video scene analysis based on a spatial-temporal segmentation method and an object tracking scheme. The foreground object, which consists of a number of moving regions between two successive frames of a video sequence, is first extracted in order to identify the content of the video scene. For this purpose, we propose a novel spatial-temporal segmentation method as a general segmentation algorithm combining several types of information, including color and motion. A region within a foreground object is called a foreground region and is characterized as a moving uniform region. An algorithm for object tracking based on the foreground regions is also included in order to recognize camera and object movements and obtain correct video shots. By analyzing foreground regions between consecutive frames, the types of scene change and the types of camera movement can be detected according to the number of entering and exiting regions and the motion vectors, respectively. Moreover, adequate attention is also paid to separating key frames from the video sequence by analyzing the tracking results of foreground regions with moment-preserving thresholding techniques [17], [18]. Experimental results are presented to demonstrate the performance of the new method in terms of better segmentation and computational efficiency.
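
For reference, moment-preserving thresholding [17] selects the threshold so that a two-level image preserves the first three gray-level moments of the input; a compact sketch for 8-bit images follows (the function and variable names are ours, and degenerate histograms are not handled).

```python
import numpy as np

def moment_preserving_threshold(image):
    """Tsai's moment-preserving thresholding [17] for an 8-bit image:
    find two representative levels z0 < z1 and a fraction p0 preserving
    the moments m1, m2, m3, then threshold at the p0-fractile."""
    hist = np.bincount(image.ravel(), minlength=256).astype(float)
    p = hist / hist.sum()
    z = np.arange(256, dtype=float)
    m1, m2, m3 = (p * z).sum(), (p * z**2).sum(), (p * z**3).sum()
    cd = m2 - m1 * m1                       # determinant (m0 = 1)
    c0 = (m1 * m3 - m2 * m2) / cd           # z0 * z1
    c1 = (m1 * m2 - m3) / cd                # -(z0 + z1)
    root = np.sqrt(c1 * c1 - 4.0 * c0)
    z0, z1 = (-c1 - root) / 2.0, (-c1 + root) / 2.0
    p0 = (z1 - m1) / (z1 - z0)              # fraction of below-threshold pixels
    return int(np.searchsorted(np.cumsum(p), p0))  # p0-fractile gray level
```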

The remainder of this paper is organized as follows. Section 2 presents the foreground/background separation procedure and the spatial-temporal segmentation algorithm. In Section 3, the proposed video scene detection method is discussed. Some experimental tests to illustrate the effectiveness of the proposed method are shown in Section 4. Finally, conclusions are drawn in Section 5.

Section snippets

Spatial-temporal video object segmentation

Fig. 1 presents the block diagram of an integrated object-based video segmentation, compression, and retrieval system built on region-based analysis. In such a system, automatic video object segmentation and tracking are the two essential modules. The basic video object segmentation and tracking algorithm, based on spatial and temporal information, is described below.

Algorithm spatial-temporal video region segmentation and tracking (STVRST).

1. Read in the first F frame images from an
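
Since the listing is only excerpted here, the following hypothetical sketch conveys just the general idea stated earlier, not the STVRST algorithm itself: a foreground region is a color-uniform connected patch whose pixels also exhibit significant frame-to-frame change. The quantization scheme and all parameters are illustrative assumptions.

```python
import numpy as np
from scipy import ndimage

def foreground_regions(prev, curr, color_bins=8, motion_thr=15, min_size=50):
    """Label moving uniform regions between two 8-bit gray-level frames:
    coarsely quantize the current frame for spatial uniformity, then keep
    connected patches whose pixels mostly changed since the last frame."""
    moving = np.abs(curr.astype(int) - prev.astype(int)) > motion_thr
    quantized = curr // (256 // color_bins)      # coarse uniformity classes
    labels = np.zeros(curr.shape, dtype=int)
    n = 0
    for level in np.unique(quantized):
        comp, k = ndimage.label(quantized == level)  # connected uniform patches
        for i in range(1, k + 1):
            region = comp == i
            # Keep patches that are large enough and predominantly moving.
            if region.sum() >= min_size and moving[region].mean() > 0.5:
                n += 1
                labels[region] = n
    return labels, n
```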

Proposed video scene detection method

In this paper, an automatic video scene detection method is proposed by analyzing the change of foreground objects in consecutive frames. The useful change patterns are: (1) during a cut or a dissolve, new regions appear far from the locations of the old regions; (2) during a fade, old regions gracefully disappear and new regions gracefully appear; (3) compared with a cut, the frame differences of a dissolve increase gradually; (4) similar to a cut, a fade has at least an abrupt change
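
A toy decision rule over these change patterns might look as follows; the counts, thresholds, and ordering of tests are illustrative assumptions rather than the classifier developed in this section.

```python
def classify_frame(entering, exiting, diff, prev_diff,
                   region_thr=5, diff_jump=30.0):
    """Toy frame labeling over the change patterns listed above.
    entering / exiting: counts of new and vanished foreground regions;
    diff / prev_diff: frame-difference magnitudes at t and t - 1."""
    burst = entering + exiting >= region_thr     # many regions replaced
    if burst and diff - prev_diff > diff_jump:
        return "cut"                             # abrupt single-frame change
    if burst and exiting > entering:
        return "fade"                            # old regions vanish first
    if burst:
        return "dissolve"                        # gradual mutual replacement
    return "normal"
```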

Experimental results

To evaluate the proposed approach, a series of experiments was conducted on an Intel Pentium III 800 MHz PC using a set of video sequences, including Foreman, News Reporter, Puppetry Show, Star Fighters, Planets, Movie, and Sports, where each sequence contains a different number of frames of size 352 × 288.

The performance of the proposed video segmentation algorithm is compared with that of Tsaig and Averbuch’s method [23] in terms of segmentation quality and computational complexity. The

Conclusions

This paper has presented an adaptive video scene analysis method based on a video object segmentation algorithm and an object tracking scheme. The contributions of this paper include: (1) a novel spatial-temporal segmentation method, a general segmentation algorithm combining several types of information including color and motion; (2) a region-based motion estimation algorithm that represents regions as region shortest-path trees; (3) an automatic video scene

References (34)

  • I. Koprinska et al., Temporal video segmentation: a survey, Signal Process.: Image Commun. (2001)
  • N.V. Patel et al., Video shot detection and characterization for video databases, Pattern Recognit. (1997)
  • W. Xiong et al., Efficient scene change detection and camera motion annotation for video classification, Comput. Vis. Image Understand. (1998)
  • M. Flickner et al., Query by image and video content: the QBIC system, IEEE Computer (1995)
  • S.-F. Chang, W. Chen, H.J. Meng, H. Sundaram, D. Zhong, VideoQ: an automated content based video search system using...
  • M. Smith, T. Kanade, Video skimming and characterization through the combination of image and language understanding,...
  • R. Zabih et al., A feature-based algorithm for detecting and classifying production effects, Multimedia Syst. (1999)
  • A. Akutsu, Y. Tonomura, H. Hashimoto, Y. Ohba, Video indexing using motion vectors, in: Proceedings of SPIE: Visual...
  • H.J. Zhang, C.Y. Low, Y.H. Gong, S.W. Smoliar, Video parsing using compressed data, in: Proceedings of SPIE Conference...
  • T. Kikukawa et al., Development of an automatic summary editing system for the audio-visual resources, Trans. Electron. Inform. (1992)
  • A. Nagasaka et al., Automatic video indexing and full-video search for object appearances
  • R. Kasturi et al., Dynamic vision
  • B. Shahraray, Scene change detection and content-based sampling of video sequences, in: Proceedings of IS & T/SPIE,...
  • I.K. Sethi, N. Patel, A statistical approach to scene change detection, in: Proceedings of SPIE Conference on Storage...
  • B. Günsel et al., Temporal video segmentation using unsupervised clustering and semantic object tracking, J. Electron. Imag. (1998)
  • T.N. Pappas, An adaptive clustering algorithm for image segmentation, IEEE Trans. Signal Process. (1992)
  • W.H. Tsai, Moment preserving thresholding: a new approach, Comput. Vis. Graph. Image Process. (1984)
This research was supported in part by the National Science Council, ROC, under Contract NSC92-2213-E-327-011.
