Content based video matching using spatiotemporal volumes

https://doi.org/10.1016/j.cviu.2007.09.016

Abstract

This paper presents a novel framework for matching video sequences using the spatiotemporal segmentation of videos. Instead of using appearance features for region correspondence across frames, we use interest point trajectories to generate video volumes. Point trajectories, generated using the SIFT operator, are clustered into motion segments by analyzing their motion and spatial properties. Temporal correspondence between the estimated motion segments is then established based on the most common SIFT correspondences, and a two-pass correspondence algorithm handles splitting and merging regions. Spatiotemporal volumes are extracted from the consistently tracked motion segments. Next, a set of features including color, texture, motion, and SIFT descriptors is extracted to represent each volume. We employ an Earth Mover’s Distance (EMD) based approach for the comparison of volume features. Given two videos, a bipartite graph is constructed by modeling the volumes as vertices and their similarities as edge weights. Maximum matching of this graph produces volume correspondences between the videos, and these volume matching scores are combined into the final video matching score. Experiments for video retrieval were performed on a variety of videos obtained from different sources, including the BBC Motion Gallery, and promising results were achieved. We present qualitative and quantitative analysis of the retrieval along with a comparison with two baseline methods.

Introduction

The amount of digital content generated in the form of video has seen tremendous growth over the last decade. Key elements providing impetus for this growth are the proliferation of inexpensive digital cameras and hand-held devices, the popularity of web-based video streaming, and the adoption of digital video by the broadcast industry as part of its distribution services. As a record number of video clips are generated and added to digital libraries every day all over the world, the need to manage this content through efficient storage, indexing, and retrieval has never been more pressing. Recent major search initiatives in the video domain by companies such as Google, Yahoo, and MSN show that the industry recognizes the need for proper management of this video content. These companies aim to build upon their experience in text-based search to develop video search engines. Pivotal to achieving this goal will be a viable search methodology capable of computing video content similarities.

Content based video matching is considered to be a complex task. One main reason for this is the amount of intra-class variation: the same semantic concept can occur under different illumination, appearance, and scene settings, to name a few. For example, videos containing a person riding a bicycle can vary in viewpoint, size, appearance, bicycle type, and camera motion. Most of the research in the area of content based video matching is therefore aimed at addressing these challenges.

In this paper we present a content based video matching framework that aims to address certain limitations of existing methods. The crux of the proposed approach is to use features computed from spatiotemporal volumes as the basic building blocks. The intuition behind this representation stems from the observation that several factors should be considered when deciding whether two videos are similar: similarity of the foreground objects, object motion, background appearance, camera motion, etc. The method presented in this paper addresses these issues by detecting important regions in the scene (both foreground and background), extracting features that are less sensitive to the aforementioned variations, and finally employing a volume correspondence technique that handles partial video matches.

Image and video retrieval has been an active area of research in the multimedia community and provides the foundation for tasks like video similarity matching. Over the years, several methods and systems have been proposed for content based image retrieval (CBIR). Most of the earlier systems, like MIT’s Photobook [1] and IBM’s QBIC [2], were based on global image features. However, in most cases a user of a CBIR system is interested in searching for images of a particular object (e.g. car, boat, airplane) or a semantic concept, which are functions of local image features. Therefore, CBIR systems relying only on global image features are expected to have limited performance in such scenarios. To overcome this problem, researchers proposed region based image features. This content representation and modelling approach has been used in a variety of ways; see [3], [4], [5], [6], [7] for some of the region based image retrieval (RBIR) systems. The RBIR systems have been shown to perform better than CBIR systems based only on global image features.

A comprehensive video matching system should fuse information from all available media types that can be extracted from a video, including audio, video, caption, and text transcript. Some of the earlier video retrieval systems, like [8], [9], [10], focused on the integration of these different types of media. An important issue here is to ensure that the content extraction and matching of any individual medium is accurate and robust. This challenging aspect of Content Based Video Retrieval (CBVR) has been addressed by several researchers [11], [12], [13], [14], [15], [16]. Similar to the paradigm of RBIR, many CBVR approaches also rely on region based features. Often these are spatial regions belonging to keyframes of the video [16]. However, since video is a spatiotemporal entity, spatial region based approaches can be extended to represent spatiotemporal regions of the video volume. The approach described in this paper belongs to this category of methods, which rely on motion based spatiotemporal segmentation.

Region based video retrieval starts with the computation of spatial regions for every frame, which are then extended to spatiotemporal regions. For instance, the methods proposed in [11], [12], [15], [17] compute a spatial color segmentation of every frame in the video, followed by the temporal correspondence of these regions. However, in highly textured scenes these approaches do not perform adequately due to over-segmentation, which leads to incorrect region matches. In addition, a complex video can have significant variations in the appearance of the same object throughout the video. Therefore, a simple color segmentation, which is known to give inconsistent results under varying noise and illumination conditions, is not a viable option. This, in turn, limits the effectiveness of several CBVR methods that rely on color based spatial segmentation.

In this line of research, a few approaches have also used global and local motion information to recover coherent image regions. For example, the motion segmentation and object tracking method presented in [11] relies on color segmentation and optical flow computation. The accuracy and reliability of optical flow are known to be limited in the case of large motion or textureless regions (the aperture problem). Region tracking in [17] also relies on appearance features computed from regions. Again, the performance of these approaches is limited by the poor quality of color segmentation.

Recently, vocabulary based text retrieval techniques have been applied in [18] for object matching in videos. However, their method does not perform explicit object extraction before the matching step. In [19], spatiotemporal volumes were extracted that were specific to faces in the video sequence. This approach relies on the facial structure and the appearance features related to it. In contrast, the framework presented in this paper is more general and applicable to a wide variety of objects and scenarios. Furthermore, [20] presented a framework where specific objects were recognized using tracked salient regions. However, it requires the user to manually select the particular object to be searched in the query video. The main difference of the proposed approach from their technique is that they focus on specific object recognition, whereas our emphasis is on object/scene category matching. Moreover, in our method, we consider the entire content of the query video and automatically compute the matching between different foreground and background volumes. In short, we propose a more general framework that can be used to match video shots with similar kinds of objects and scenes. In another recent work, [21] addresses the matching of similar shots and presents a solution based on three-dimensional models of scene content, which are built using affine covariant patches. Another interesting work for matching background scenes in movie shots was presented in [22]. Their matching technique relies on the local similarity of features, an epipolar constraint, and a temporal constraint. Unlike their approach, we consider static as well as moving objects in the foreground to match the video shots.

We feel that there is a need for a better content based video matching approach that can handle partial matches based on similar types of foreground objects and background scene. We consider motion to be a strong cue in a video and believe that it should be utilized to extract more reliable video content. For the video retrieval task, it is desirable to build a system that does not require extensive training for each semantic concept. The following section presents the proposed approach that addresses these issues.

The proposed framework comprises two major components: video volume extraction and video matching using volume features. Unlike conventional approaches, we utilize the interest point trajectories in the video sequence to extract spatiotemporal video volumes. Interest points and their correspondences are established using the Scale Invariant Feature Transform (SIFT [23]) operator. The point correspondences are used to generate trajectories, which are further refined by performing velocity prediction to merge broken trajectories. These trajectories are then grouped into clusters based on their motion similarity and spatial proximity. The temporal correspondence between the estimated motion segments is then established based on the highest number of SIFT correspondences. A two-pass algorithm is used to handle region noise, splitting, and merging. The tracked regions are then stacked together to produce spatiotemporal volumes. Each volume encompasses an independently moving region, which can belong either to the scene background or to a foreground object. This provides more structured information about the scene for the task of video matching. A set of features including color, texture, motion, and SIFT descriptors is extracted from each volume. The weighted combination of feature similarities between two volumes provides a measure of their similarity, where the degree of similarity between individual features is computed through the Earth Mover’s Distance. The two videos to be matched are modeled as a bipartite graph, where volumes are represented by vertices and similarities between them as edge weights. The maximum matching of this graph is then used to establish the correspondences between the volumes, and the scores of the matched volume pairs are combined into the final video matching score. The proposed video matching framework is tested on several videos for the task of content based video retrieval.
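To make the matching step concrete, the sketch below pairs volumes across two videos by solving a maximum-weight bipartite matching over a precomputed similarity matrix. The use of scipy's Hungarian solver and the simple averaging of matched scores are illustrative assumptions, not necessarily the paper's exact procedure.

```python
# Sketch of the volume-level bipartite matching step, assuming each pairwise
# volume similarity sim[i][j] in [0, 1] has already been computed from the
# feature distances. scipy's Hungarian solver stands in for maximum matching.
import numpy as np
from scipy.optimize import linear_sum_assignment

def match_videos(similarity):
    """similarity: (n_volumes_A x n_volumes_B) matrix of volume similarities."""
    sim = np.asarray(similarity, dtype=float)
    # linear_sum_assignment minimizes cost, so negate to maximize similarity.
    rows, cols = linear_sum_assignment(-sim)
    matched = list(zip(rows, cols, sim[rows, cols]))
    # Combine the matched volume scores into a single video-level score;
    # a simple average is used here as an illustrative choice.
    video_score = float(sim[rows, cols].mean()) if len(rows) else 0.0
    return matched, video_score

# Example: 3 volumes in the query video vs. 4 in a database video.
sim = np.random.rand(3, 4)
pairs, score = match_videos(sim)
print(pairs, score)
```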

It should be noted that our framework is not designed to search for exact matches of an object observed in a video shot, as suggested by [20]. Instead, our approach is more suitable for establishing similarity among videos based on similar types of foreground objects and background scene. The novelty of our approach for video matching lies in (a) the extraction of spatiotemporal volumes that correspond to meaningful foreground and background objects, and (b) a partial video matching framework based on several strong features from the volumes.

The details of the proposed framework are discussed in the following sections. Steps involved in the extraction of volumes are described in Section 2. Section 3 discusses the volume features used and their role in the matching task. The graph based video matching technique is described in Section 4. The experimental results and performance analysis are presented in Section 5. Finally, the conclusions and future directions are discussed in Section 6.

Section snippets

Spatiotemporal volume extraction

In this paper we propose a framework that relies on spatiotemporal regions (volumes) for solving the video matching problem. For a given video, we first extract interest point trajectories using SIFT correspondences (see Section 2.1). These trajectories are then used to recover different motion segments in each frame (see Section 2.2). The correspondence between the motion segments is then resolved using a two pass algorithm (see Section 2.3). In this paper, the term foreground refers to the
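As a rough illustration of the trajectory-generation step, the sketch below links SIFT correspondences between consecutive frames into point tracks using OpenCV. The ratio-test threshold, the greedy frame-to-frame linking, and the helper names are assumptions made for this illustration; the paper's velocity-based merging of broken trajectories is omitted here.

```python
# Minimal sketch of interest-point trajectory generation from SIFT
# correspondences between consecutive frames (OpenCV).
import cv2

def frame_correspondences(prev_gray, curr_gray, sift, matcher, ratio=0.7):
    """SIFT correspondences between two consecutive grayscale frames."""
    kp1, des1 = sift.detectAndCompute(prev_gray, None)
    kp2, des2 = sift.detectAndCompute(curr_gray, None)
    if des1 is None or des2 is None:
        return []
    pairs = []
    for match in matcher.knnMatch(des1, des2, k=2):
        if len(match) < 2:
            continue
        m, n = match
        if m.distance < ratio * n.distance:       # Lowe's ratio test
            pairs.append((kp1[m.queryIdx].pt, kp2[m.trainIdx].pt))
    return pairs

def build_trajectories(frames):
    """frames: list of grayscale images; returns a list of point tracks."""
    sift = cv2.SIFT_create()
    matcher = cv2.BFMatcher(cv2.NORM_L2)
    tracks = []        # each track is a list of (x, y) positions over time
    active = {}        # rounded last position of each live track -> index
    for prev, curr in zip(frames[:-1], frames[1:]):
        new_active = {}
        for p, q in frame_correspondences(prev, curr, sift, matcher):
            idx = active.get((round(p[0]), round(p[1])))
            if idx is None:                       # start a new trajectory
                tracks.append([p])
                idx = len(tracks) - 1
            tracks[idx].append(q)
            new_active[(round(q[0]), round(q[1]))] = idx
        active = new_active
    return tracks
```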

Volume features extraction

Once the volumes are available for a complete video shot, we extract features that are used in the video matching step. We use features that capture interest point descriptors, color, texture, and motion of video volumes. These features are local to each video volume, as opposed to global video features. A common representation of these features is used in the form of a set of clusters in the corresponding feature space. Each cluster in this set is represented by the mean
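As a rough illustration of this cluster-set representation, the sketch below builds a per-volume feature signature (cluster means and normalized weights) from already-extracted feature vectors. The use of k-means, the number of clusters, and the function name are assumptions for illustration only.

```python
# Sketch of a cluster-based feature signature for one volume, assuming the
# per-pixel (or per-frame) feature vectors have already been extracted.
import numpy as np
from sklearn.cluster import KMeans

def volume_signature(feature_vectors, k=8):
    """feature_vectors: (n_samples x dim) array of e.g. color/texture/motion
    descriptors sampled from one spatiotemporal volume.
    Returns (means, weights): cluster centroids and their relative weights."""
    X = np.asarray(feature_vectors, dtype=float)
    k = min(k, len(X))
    km = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X)
    means = km.cluster_centers_
    counts = np.bincount(km.labels_, minlength=k).astype(float)
    weights = counts / counts.sum()            # normalized cluster weights
    return means, weights
```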

Volume based video matching

This section explains the method used to determine the similarity between two given videos. In this framework, it is desirable that the matching technique be able to handle partial matches between videos. For instance, in the case of two very similar foreground objects observed in two dissimilar backgrounds, the system should still be able to generate a high similarity score. Different parts of the scene are captured by volumes and corresponding set of features computed from each of these volumes
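For concreteness, the EMD comparison of two cluster signatures can be sketched as a transportation linear program, as below. The use of scipy's LP solver, the Euclidean ground distance, and the function name are assumptions made for this illustration.

```python
# Sketch of Earth Mover's Distance between two cluster signatures
# (means, weights), solving the transportation LP with scipy.
import numpy as np
from scipy.optimize import linprog

def emd(means_p, weights_p, means_q, weights_q):
    P, Q = np.asarray(means_p, float), np.asarray(means_q, float)
    w, u = np.asarray(weights_p, float), np.asarray(weights_q, float)
    m, n = len(w), len(u)
    # Ground distance between every pair of cluster centroids (Euclidean).
    D = np.linalg.norm(P[:, None, :] - Q[None, :, :], axis=2)
    c = D.ravel()                                  # cost of each flow f_ij
    # Row capacities: sum_j f_ij <= w_i ; column capacities: sum_i f_ij <= u_j.
    A_ub = np.zeros((m + n, m * n))
    for i in range(m):
        A_ub[i, i * n:(i + 1) * n] = 1.0
    for j in range(n):
        A_ub[m + j, j::n] = 1.0
    b_ub = np.concatenate([w, u])
    # Total flow must equal the smaller of the two total weights.
    A_eq = np.ones((1, m * n))
    b_eq = [min(w.sum(), u.sum())]
    res = linprog(c, A_ub=A_ub, b_ub=b_ub, A_eq=A_eq, b_eq=b_eq,
                  bounds=(0, None), method="highs")
    assert res.success
    flow = res.x.reshape(m, n)
    return float((flow * D).sum() / flow.sum())    # normalized EMD
```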

Experimental results

Several experiments were performed to verify the effectiveness of the proposed framework. Section 5.1 presents some implementation details along with the results of volume extraction. An application of the proposed framework for the task of content based video retrieval is presented in Section 5.2. We also compare our approach with two baseline methods, and present qualitative and quantitative analysis of the retrieval.

We have performed experiments on a dataset of 337 videos obtained from

Conclusions

In this paper, we have proposed a novel and robust video matching framework by analyzing properties of spatiotemporal volumes in videos. Volumes are constructed based on the clustering of the interest point trajectories. Multiple features are extracted to model the appearance of the volumes, including color, texture, motion, and interest point descriptors. Similarity between two videos is computed by solving the maximum matching problem of the graph formed by the volumes. Utilizing the proposed

References (37)

  • F. Schaffalitzky et al., Automated location matching in movies, Computer Vision and Image Understanding (2003).
  • G. Ahanger et al., A survey of technologies for parsing and indexing digital video, Journal of Visual Communication and Image Representation (1996).
  • A. Pentland et al., Photobook: content-based manipulation of image databases, International Journal of Computer Vision (1996).
  • C. Faloutsos et al., Efficient and effective querying by image content, Journal of Intelligent Information Systems (1994).
  • J. Smith, S. Chang, VisualSEEk: a fully automated content-based image query system, in: Proceedings of the 4th ACM...
  • C. Carson et al., Blobworld: image segmentation using expectation-maximization and its application to image querying, IEEE Transactions on Pattern Analysis and Machine Intelligence (2002).
  • H. Greenspan, G. Dvir, Y. Rubner, Region correspondence for image matching via EMD flow, in: Proceedings of IEEE...
  • F. Jing, M. Li, H. Zhang, B. Zhang, Region-based relevance feedback in image retrieval, in: IEEE International...
  • J. Wang et al., SIMPLIcity: semantics-sensitive integrated matching for picture libraries, IEEE Transactions on Pattern Analysis and Machine Intelligence (2001).
  • R. Mohan, Text-based search of TV news stories, Proceedings of SPIE (1996).
  • A. Hampapur, A. Gupta, B. Horowitz, C. Shu, C. Fuller, J. Bach, M. Gorkani, R. Jain, Virage video engine, in:...
  • A. Hauptmann et al., Informedia: news-on-demand multimedia information acquisition and retrieval, Intelligent Multimedia Information Retrieval (1997).
  • S. Chang, W. Chen, H. Meng, H. Sundaram, VideoQ: an automated content based video search system using visual cues, in:...
  • J. Lee, J. Oh, S. Hwang, STRG-Index: spatio-temporal region graph indexing for large video databases, in: Proceedings...
  • S. Dagtas et al., Models for motion-based video indexing and retrieval, IEEE Transactions on Image Processing (2000).
  • S. Sav, N. O'Connor, A. Smeaton, N. Murphy, Associating low-level features with semantic concepts using video objects...
  • A. Smeaton, H. Le Borgne, N. O'Connor, T. Adamek, O. Smyth, S. De Burca, Coherent segmentation of video into syntactic...
  • E. Ardizzone, M. La Cascia, D. Molinelli, Motion and color based video indexing and retrieval, in: Proceedings of the...