A spatio-temporal pyramid matching for video retrieval☆
Highlights
► We introduce a content-based video retrieval system for a query video shot.
► The shot boundaries are found using a classifier learnt from a boosting algorithm.
► The similarity of video shots is calculated by spatio-temporal pyramid matching.
► The pyramid-matching kernel incorporates the temporal dimension into the matching schema.
► Experiments on sports videos and the UCF50 dataset show the effectiveness of our method.
Introduction
Convenient access to networked multimedia devices and multimedia hosting services has contributed to a huge increase in network traffic and data storage. Recent reports indicate that 34% of current cell phone users record video [1] and that video accounts for 40% of consumer Internet traffic [2]. In addition to existing video hosting services such as YouTube [3] and Vimeo [4], major IT companies such as Google [5] and Apple [6] have started to offer cloud audio/video storage services to customers.
Compared to the recent efforts and deployments of content-based image search, such as automatic tagging based on face recognition [7], [8], content-based video search is still under-developed. We make two main observations to explain this gap.
First, the temporal dimension of video adds complexity to the data, so queries can be more complex than typical text-based ones. In addition, the query representations generated by simple sketch tools [9], [10] are so primitive or generic compared with text queries that they lead to either wrong or overly diverse results. More complex querying systems (such as dynamic construction of hierarchical structures on target videos [11]) require more elaborate queries from users, which can be more error-prone.
Second, it has been assumed that the user does not have sample videos at hand for a query, so additional querying tools are required. However, this assumption no longer holds: mobile devices such as digital cameras, PDAs, and camera phones with solid-state memory enable instant image and video recording, and such recordings can serve directly as video queries.
Taking advantage of this opportunity from mobile and ubiquitous multimedia, our content-based video query system takes a sample video clip as a query, searches a collection of videos typically stored in a multimedia portal service (such as YouTube, Vimeo, Google Video [12], or Yahoo! Video [13]), and suggests similar video clips from the database with relevance scores. As shown in Fig. 1, our system performs two main functions – (1) offline population of the video database for new video entries and (2) online video matching for a new query video. When a video is added to the database, it is partitioned into multiple clips by clip boundary detection based on feature analysis and classification. The partitioned clips are stored along with metadata in the database. For a new query video, the clip is analyzed and matched against the stored videos, and the relevance scores are calculated by our spatio-temporal pyramid matching system.
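The offline and online stages above can be sketched as follows. This is a minimal illustration: `detect_clip_boundaries`, `populate_database`, and `query` are our hypothetical names, and the single-threshold boundary test is a stand-in for the boosted classifier used in the actual system.

```python
def detect_clip_boundaries(frames):
    """Placeholder boundary detector: cut wherever consecutive frame
    descriptors differ sharply (the real system uses a boosted classifier).
    Frames are represented here as scalar descriptors for simplicity."""
    cuts = [0]
    for i in range(1, len(frames)):
        if abs(frames[i] - frames[i - 1]) > 0.5:  # toy dissimilarity test
            cuts.append(i)
    cuts.append(len(frames))
    return [(cuts[j], cuts[j + 1]) for j in range(len(cuts) - 1)]

def populate_database(db, video_id, frames):
    """Offline step: partition a new video and store its clips."""
    for start, end in detect_clip_boundaries(frames):
        db.append({"video": video_id, "clip": frames[start:end]})

def query(db, clip, similarity):
    """Online step: score every stored clip and rank by relevance."""
    scored = [(similarity(clip, entry["clip"]), entry["video"])
              for entry in db]
    return sorted(scored, reverse=True)
```

In the full system the `similarity` argument is the spatio-temporal pyramid matching kernel described below; any clip-to-clip scoring function can be plugged in here.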
The rest of this paper is organized as follows. In Section 2, related work on image and video retrieval is discussed. Our spatio-temporal pyramid matching system is presented in Section 3. We also analyze formal conditions where our spatio-temporal pyramid matching system gets benefits from temporal information in Section 4. The experimental results are presented in Section 5, and finally we conclude our paper.
Section snippets
Related work
The challenges and characteristics of content-based image and video query systems are well discussed in [14]. A significant amount of research on automatic image annotation [15], [16], [17], [18], [19], [20] has been done, and recently researchers have focused more on automatic video annotation [21], [22], [23], [24]. In particular, Ulges et al. [25] and Ando et al. [26] discussed video tagging and scene recognition problems, which have similar goals to ours but take different approaches. Dynamic
System design
Given a video clip as a new entry to the database, the clip boundary detection in our system divides it into multiple video clips, which are stored in our video matching database. Once the system receives a query video clip, the similarity between the query clip and the clips already stored in the database is measured by our Spatio-Temporal Pyramid Matching (STPM) kernel. The measured similarity is used to rank the video matches. The higher rank of a query and a video clip in
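A minimal sketch of such a pyramid-match similarity is shown below, assuming local features quantized into visual words with normalized (x, y, t) coordinates. The function names are ours, and the level weighting follows the standard spatial-pyramid scheme extended to a temporal axis; it is an illustration of the idea, not the paper's exact formulation.

```python
import numpy as np

def cell_histograms(feats, cells, vocab):
    """Visual-word histogram per spatio-temporal cell.
    feats: iterable of (x, y, t, word) with x, y, t in [0, 1)."""
    h = np.zeros((cells, cells, cells, vocab))
    for x, y, t, w in feats:
        h[int(x * cells), int(y * cells), int(t * cells), int(w)] += 1.0
    return h

def stpm_score(fa, fb, levels=2, vocab=8):
    """Sum histogram intersections over pyramid levels; finer levels
    (smaller spatio-temporal cells) receive larger weights."""
    score = 0.0
    for level in range(levels + 1):
        cells = 2 ** level
        inter = np.minimum(cell_histograms(fa, cells, vocab),
                           cell_histograms(fb, cells, vocab)).sum()
        weight = (1.0 / 2 ** levels if level == 0
                  else 1.0 / 2 ** (levels - level + 1))
        score += weight * inter
    return score
```

Because the level weights sum to one, a clip matched against itself scores its feature count, and any mismatch in space or time lowers the score.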
Analysis of spatio-temporal pyramid matching kernel
In this section, we analyze a formal condition in which temporal information contributes to the video matching.
First, for a fair comparison of SPM and STPM, we use the conventional weighted sum of the matching scores of key frames for SPM, without any temporal information. It is straightforward to show that the matching score of STPM is equal to or greater than that of SPM. Second, we introduce the category gain to represent the gain from temporal information, which is calculated as
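The gain from temporal information can be illustrated with a toy example of our own (not the paper's formula): two clips that share the same overall visual-word histogram, and thus match perfectly when time is ignored, are separated as soon as the temporal axis is subdivided.

```python
import numpy as np

vocab = 2
# clip A shows word 0 early and word 1 late; clip B is the reverse
clip_a = [(0.2, 0), (0.8, 1)]  # (time, word), time in [0, 1)
clip_b = [(0.2, 1), (0.8, 0)]

def hist(clip, t_bins):
    """Visual-word histogram per temporal bin."""
    h = np.zeros((t_bins, vocab))
    for t, w in clip:
        h[int(t * t_bins), w] += 1
    return h

# whole-clip histograms match perfectly...
no_time = np.minimum(hist(clip_a, 1), hist(clip_b, 1)).sum()    # -> 2.0
# ...but per-half histograms do not match at all
with_time = np.minimum(hist(clip_a, 2), hist(clip_b, 2)).sum()  # -> 0.0
```

A purely spatial matcher cannot tell the two clips apart, while the temporal subdivision assigns them a lower similarity, which is the kind of gain the category-gain measure quantifies.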
Experimental evaluation
In this section, we present our experimental settings, including the datasets and features that we use, the performance criteria for video matching, and the experimental evaluations. We use two datasets for benchmarking – (1) the UCF50 dataset [56] and (2) sports videos that we collected from YouTube. The performance of video matching with our spatio-temporal pyramid matching is evaluated with two parameters – (1)
Conclusions
In this paper, we addressed the problem of classifying video clips for content-based video query. The clip boundaries are found using a strong classifier learnt from a boosting algorithm on top of weak classifiers. Then, the similarity of video clips is calculated by our spatio-temporal pyramid matching kernel, which incorporates the temporal dimension into the matching schema.
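The boundary classifier can be sketched as AdaBoost over threshold stumps on a one-dimensional frame-dissimilarity feature. The feature choice, stump form, and function names here are illustrative assumptions, not the paper's exact design.

```python
import math

def train_stump(xs, ys, weights):
    """Pick the lowest-weighted-error threshold stump on a 1-D feature
    (e.g. a frame-to-frame dissimilarity score). Labels are +1/-1."""
    best = None
    for thr in sorted(set(xs)):
        for sign in (1, -1):
            err = sum(w for x, y, w in zip(xs, ys, weights)
                      if (sign if x >= thr else -sign) != y)
            if best is None or err < best[0]:
                best = (err, thr, sign)
    return best

def adaboost(xs, ys, rounds=5):
    """Combine weak threshold stumps into a strong boundary classifier.
    ys: +1 for a shot boundary, -1 for frames within the same shot."""
    n = len(xs)
    weights = [1.0 / n] * n
    model = []
    for _ in range(rounds):
        err, thr, sign = train_stump(xs, ys, weights)
        err = max(err, 1e-10)  # avoid division by zero on perfect stumps
        alpha = 0.5 * math.log((1 - err) / err)
        model.append((alpha, thr, sign))
        # re-weight: boost the misclassified examples
        weights = [w * math.exp(-alpha * y * (sign if x >= thr else -sign))
                   for x, y, w in zip(xs, ys, weights)]
        z = sum(weights)
        weights = [w / z for w in weights]
    return model

def predict(model, x):
    s = sum(alpha * (sign if x >= thr else -sign)
            for alpha, thr, sign in model)
    return 1 if s >= 0 else -1
```

In practice the weak learners would operate on richer per-frame features (color histograms, edge statistics, motion), but the weighting and combination scheme is the same.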
Our experimental evaluation using sports videos and the standard UCF50 dataset shows that the temporal dimension is an effective
Acknowledgement
This work was supported by an INHA UNIVERSITY Research Grant. This research was supported by the Basic Science Research Program through the National Research Foundation of Korea (NRF), funded by the Ministry of Education, Science and Technology (No. 2012R1A1A1044658).
References (58)
- et al., Sketch retrieval and relevance feedback with biased SVM classification, Pattern Recognition Letters (2008)
- et al., Maximum entropy model-based baseball highlight detection and classification, International Journal of Computer Vision and Image Understanding (2004)
- et al., A distance measure for video sequences, Vision and Image Understanding (1999)
- A. Smith, Mobile access 2010, July 2010.
- Cisco, Cisco visual networking index: Forecast and methodology, 2010–2015, June 2011.
- YouTube, YouTube – Broadcast Yourself.
- Vimeo, Vimeo.
- Google, Google play.
- Cnet, Apple trying to store your video in the cloud.
- Google, Picasa.
- Query by sketch and relevance feedback for content-based image retrieval over the web, Journal of Visual Language and Computing
- VisualSEEk: a fully automated content-based image query system
- Content-based multimedia information retrieval: state of the art and challenges, ACM Transactions on Multimedia Computing, Communications and Applications
- Automatic video annotation by semi-supervised learning with kernel density estimation
- Automatic video annotation using ontologies extended with visual information
- ☆ This paper has been recommended for acceptance by Chung-Sheng Li.
- 1 Present address: Lawrence Berkeley National Laboratory, 1 Cyclotron Road, Berkeley, CA 94720, USA.
- 2 Co-corresponding author.