A spatio-temporal pyramid matching for video retrieval☆
Highlights
► We introduce a content-based video retrieval system for a query video shot.
► The shot boundaries are found using a classifier learnt from a boosting algorithm.
► The similarity of video shots is calculated by spatio-temporal pyramid matching.
► The pyramid-matching kernel incorporates the temporal dimension into the matching schema.
► Experiments on sports videos and the UCF50 dataset show the effectiveness of our method.
Introduction
Convenient access to networked multimedia devices and multimedia hosting services has contributed to a huge increase in network traffic and data storage. Recent reports indicate that 34% of current cell phone users record video [1] and that video accounts for 40% of consumer Internet traffic [2]. In addition to existing video hosting services such as YouTube [3] and Vimeo [4], major IT companies such as Google [5] and Apple [6] have started to offer cloud audio/video storage services to customers.
Compared to the recent efforts and deployments of content-based image search, such as automatic tagging based on face recognition [7], [8], content-based video search is still under-developed. We make two main observations to explain this gap.
First, the temporal dimension of video adds complexity to the data, so queries can be more complex than typical text-based ones. In addition, the query representations generated by simple sketch tools [9], [10] are so primitive or generic compared with text queries that they lead to either wrong or overly diverse results. More complex querying systems (such as dynamic construction of hierarchical structures on target videos [11]) require more elaborate queries from users, which can be more error-prone.
Second, it has been assumed that the user does not have sample videos at hand for a query, so additional querying tools are required. However, this assumption no longer holds: mobile devices such as digital cameras, PDAs, and camera phones with solid-state memory enable instant image and video recording, and such recordings can serve directly as video queries.
Taking advantage of this opportunity from mobile and ubiquitous multimedia, our content-based video query system takes a sample video clip as a query, searches a collection of videos typically stored in a multimedia portal service (such as YouTube, Vimeo, Google Video [12], or Yahoo! Video [13]), and suggests similar video clips from the database with relevance scores. As shown in Fig. 1, our system performs two main functions – (1) offline population of the video database for new video entries and (2) online video matching for a new query video. When a video is added to the database, it is partitioned into multiple clips by clip boundary detection based on feature analysis and classification. The partitioned clips are stored along with metadata in the database. For a new query video, the clip is analyzed and matched against the stored videos, and the relevance scores are calculated by our spatio-temporal pyramid matching system.
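The offline and online stages above can be sketched as follows. This is a minimal illustration: `detect_clip_boundaries`, `populate_database`, and `query` are our hypothetical names, and the single-threshold boundary test is a stand-in for the boosted classifier used in the actual system.

```python
def detect_clip_boundaries(frames):
    """Placeholder boundary detector: cut wherever consecutive frame
    descriptors differ sharply (the real system uses a boosted classifier).
    Frames are represented here as scalar descriptors for simplicity."""
    cuts = [0]
    for i in range(1, len(frames)):
        if abs(frames[i] - frames[i - 1]) > 0.5:  # toy dissimilarity test
            cuts.append(i)
    cuts.append(len(frames))
    return [(cuts[j], cuts[j + 1]) for j in range(len(cuts) - 1)]

def populate_database(db, video_id, frames):
    """Offline step: partition a new video and store its clips."""
    for start, end in detect_clip_boundaries(frames):
        db.append({"video": video_id, "clip": frames[start:end]})

def query(db, clip, similarity):
    """Online step: score every stored clip and rank by relevance."""
    scored = [(similarity(clip, entry["clip"]), entry["video"])
              for entry in db]
    return sorted(scored, reverse=True)
```

In the full system the `similarity` argument is the spatio-temporal pyramid matching kernel described below; any clip-to-clip scoring function can be plugged in here.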
The rest of this paper is organized as follows. In Section 2, related work on image and video retrieval is discussed. Our spatio-temporal pyramid matching system is presented in Section 3. We also analyze formal conditions where our spatio-temporal pyramid matching system gets benefits from temporal information in Section 4. The experimental results are presented in Section 5, and finally we conclude our paper.
Section snippets
Related work
The challenges and characteristics of content-based image and video query systems are well discussed in [14]. A significant amount of research on automatic image annotation [15], [16], [17], [18], [19], [20] has been done, and recently researchers have focused more on automatic video annotation [21], [22], [23], [24]. In particular, Ulges et al. [25] and Ando et al. [26] discussed video tagging and scene recognition problems, which have similar goals to ours but take different approaches. Dynamic
System design
Given a video clip as a new entry to the database, the clip boundary detection in our system divides it into multiple video clips, which are stored in our video matching database. Once the system receives a query video clip, the similarity between the query clip and the clips already stored in the database is measured by our Spatio-Temporal Pyramid Matching (STPM) kernel. The measured similarity is used to rank the video matches. The higher rank of a query and a video clip in
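A minimal sketch of such a pyramid-match similarity is shown below, assuming local features quantized into visual words with normalized (x, y, t) coordinates. The function names are ours, and the level weighting follows the standard spatial-pyramid scheme extended to a temporal axis; it is an illustration of the idea, not the paper's exact formulation.

```python
import numpy as np

def cell_histograms(feats, cells, vocab):
    """Visual-word histogram per spatio-temporal cell.
    feats: iterable of (x, y, t, word) with x, y, t in [0, 1)."""
    h = np.zeros((cells, cells, cells, vocab))
    for x, y, t, w in feats:
        h[int(x * cells), int(y * cells), int(t * cells), int(w)] += 1.0
    return h

def stpm_score(fa, fb, levels=2, vocab=8):
    """Sum histogram intersections over pyramid levels; finer levels
    (smaller spatio-temporal cells) receive larger weights."""
    score = 0.0
    for level in range(levels + 1):
        cells = 2 ** level
        inter = np.minimum(cell_histograms(fa, cells, vocab),
                           cell_histograms(fb, cells, vocab)).sum()
        weight = (1.0 / 2 ** levels if level == 0
                  else 1.0 / 2 ** (levels - level + 1))
        score += weight * inter
    return score
```

Because the level weights sum to one, a clip matched against itself scores its feature count, and any mismatch in space or time lowers the score.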
Analysis of spatio-temporal pyramid matching kernel
In this section, we analyze a formal condition in which temporal information contributes to the video matching.
First, for a fair comparison of SPM and STPM, we use the conventional weighted sum of the matching scores of key frames for SPM, without any temporal information. It is straightforward to show that the matching score of STPM is equal to or greater than that of SPM. Second, we introduce the category gain to represent the gain from temporal information, which is calculated as
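The gain from temporal information can be illustrated with a toy example of our own (not the paper's formula): two clips that share the same overall visual-word histogram, and thus match perfectly when time is ignored, are separated as soon as the temporal axis is subdivided.

```python
import numpy as np

vocab = 2
# clip A shows word 0 early and word 1 late; clip B is the reverse
clip_a = [(0.2, 0), (0.8, 1)]  # (time, word), time in [0, 1)
clip_b = [(0.2, 1), (0.8, 0)]

def hist(clip, t_bins):
    """Visual-word histogram per temporal bin."""
    h = np.zeros((t_bins, vocab))
    for t, w in clip:
        h[int(t * t_bins), w] += 1
    return h

# whole-clip histograms match perfectly...
no_time = np.minimum(hist(clip_a, 1), hist(clip_b, 1)).sum()    # -> 2.0
# ...but per-half histograms do not match at all
with_time = np.minimum(hist(clip_a, 2), hist(clip_b, 2)).sum()  # -> 0.0
```

A purely spatial matcher cannot tell the two clips apart, while the temporal subdivision assigns them a lower similarity, which is the kind of gain the category-gain measure quantifies.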
Experimental evaluation
In this section, we present our experimental settings, including the datasets and features that we use, the performance criteria for video matching, and the experimental evaluations. We use two datasets for benchmarking – (1) the UCF50 dataset [56] and (2) sports videos that we collected from YouTube. The performance of video matching with our spatio-temporal pyramid matching is evaluated with two parameters – (1)
Conclusions
In this paper, we addressed the problem of classifying video clips for content-based video query. The clip boundaries are found using a strong classifier learnt from a boosting algorithm on top of weak classifiers. Then, the similarity of video clips is calculated by our spatio-temporal pyramid matching kernel, which incorporates the temporal dimension into the matching schema.
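The boundary classifier can be sketched as AdaBoost over threshold stumps on a one-dimensional frame-dissimilarity feature. The feature choice, stump form, and function names here are illustrative assumptions, not the paper's exact design.

```python
import math

def train_stump(xs, ys, weights):
    """Pick the lowest-weighted-error threshold stump on a 1-D feature
    (e.g. a frame-to-frame dissimilarity score). Labels are +1/-1."""
    best = None
    for thr in sorted(set(xs)):
        for sign in (1, -1):
            err = sum(w for x, y, w in zip(xs, ys, weights)
                      if (sign if x >= thr else -sign) != y)
            if best is None or err < best[0]:
                best = (err, thr, sign)
    return best

def adaboost(xs, ys, rounds=5):
    """Combine weak threshold stumps into a strong boundary classifier.
    ys: +1 for a shot boundary, -1 for frames within the same shot."""
    n = len(xs)
    weights = [1.0 / n] * n
    model = []
    for _ in range(rounds):
        err, thr, sign = train_stump(xs, ys, weights)
        err = max(err, 1e-10)  # avoid division by zero on perfect stumps
        alpha = 0.5 * math.log((1 - err) / err)
        model.append((alpha, thr, sign))
        # re-weight: boost the misclassified examples
        weights = [w * math.exp(-alpha * y * (sign if x >= thr else -sign))
                   for x, y, w in zip(xs, ys, weights)]
        z = sum(weights)
        weights = [w / z for w in weights]
    return model

def predict(model, x):
    s = sum(alpha * (sign if x >= thr else -sign)
            for alpha, thr, sign in model)
    return 1 if s >= 0 else -1
```

In practice the weak learners would operate on richer per-frame features (color histograms, edge statistics, motion), but the weighting and combination scheme is the same.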
Our experimental evaluation using sports videos and the standard UCF50 dataset shows that the temporal dimension is an effective
Acknowledgement
This work was supported by an INHA UNIVERSITY Research Grant. This research was supported by the Basic Science Research Program through the National Research Foundation of Korea (NRF), funded by the Ministry of Education, Science and Technology (No. 2012R1A1A1044658).
References (58)
- et al., Sketch retrieval and relevance feedback with biased SVM classification, Pattern Recognition Letters (2008)
- et al., Maximum entropy model-based baseball highlight detection and classification, International Journal of Computer Vision and Image Understanding (2004)
- et al., A distance measure for video sequences, Vision and Image Understanding (1999)
- A. Smith, Mobile access 2010, July 2010.
- Cisco, Cisco visual networking index: Forecast and methodology, 2010–2015, June 2011.
- YouTube, YouTube – Broadcast Yourself.
- Vimeo, Vimeo.
- Google, Google play.
- Cnet, Apple trying to store your video in the cloud.
- Google, Picasa.
- Query by sketch and relevance feedback for content-based image retrieval over the web, Journal of Visual Language and Computing
- VisualSEEk: a fully automated content-based image query system
- Content-based multimedia information retrieval: state of the art and challenges, ACM Transactions on Multimedia Computing, Communications and Applications
- Automatic video annotation by semi-supervised learning with kernel density estimation
- Automatic video annotation using ontologies extended with visual information
- ☆ This paper has been recommended for acceptance by Chung-Sheng Li.
- 1 Present address: Lawrence Berkeley National Laboratory, 1 Cyclotron Road, Berkeley, CA 94720, USA.
- 2 Co-corresponding author.