Automatic scene detection for advanced story retrieval

https://doi.org/10.1016/j.eswa.2008.07.009

Abstract

Browsing video scenes unfolds the story scenarios of a long video archive and helps users locate their desired video segments quickly and efficiently. Automatic scene detection in a long video stream is therefore the first and crucial step toward a concise and comprehensive content-based representation for indexing, browsing and retrieval. In this paper, we present a novel scene detection scheme for various video types. We first detect video shots using a coarse-to-fine algorithm. Key frames that carry no useful information are then identified and removed by template matching. Finally, spatio-temporally coherent shots are grouped into the same scene based on a temporal constraint on video content and the visual similarity of shot activity. The proposed algorithm has been evaluated on various types of videos, including movies and TV programs. Promising experimental results show that the proposed method supports efficient retrieval of video content of interest.

Introduction

In recent years, rapid technological advances in multimedia information processing and the World Wide Web have made vast amounts of digital information available at a reasonable price. Meanwhile, the prices of digital devices and storage media, such as digital video cameras, hard disks, and memory cards, have dropped at a surprising speed. Combined with people's natural desire to record memorable moments and the fact that little special skill is needed to capture video, this has led to an ever-growing amount of digital data in both professional and amateur environments, with applications such as visual commerce, distance learning over the Internet, professionally produced movies, and home-made video.

The exponential growth of visual data in digital format has led to a corresponding search problem. For example, to locate a segment of interest in a large collection of video archives, a user has to watch each video from the beginning, relying on fast-forward and fast-backward operations. Given the huge size of such archives, manual searching is inefficient and time-consuming. How to manage large video archives efficiently, and what kind of index to provide so that users can browse and retrieve content quickly, therefore remain open problems. To date, the most common approach for browsing and retrieving content of interest is still keyword search, which works well for text-driven databases. For visual information with rich inherent semantics, however, text-based browsing and retrieval is clearly not an effective solution. As a result, methods are needed to organize, index and retrieve video archives at the semantic level.

From a general perspective, a video file has a hierarchical structure. From bottom to top, the levels are frame, shot, scene and video. A frame is the lowest level of the hierarchy. A shot is the basic unit of a video and consists of a series of adjacent frames with an unchanged background. Compared with browsing and retrieving over the complete visual content of a video track, browsing and retrieving based on shots operates at a higher and more meaningful level. In some cases, such as videos with little camera motion, shot-based methods are sufficient for browsing and retrieval. For typically produced video, however, frequent camera motion makes the number of shots large, so a more concise and compact semantic segmentation is needed to improve browsing and retrieval performance. A scene groups semantically related shots and reflects a certain topic or theme; accordingly, browsing and retrieving based on scenes is naturally more efficient than browsing based on shots. At the top level of the hierarchy is the video stream, which is composed of all the scenes. From our observation and intuition, this four-level hierarchical structure helps users locate their desired segments effectively and quickly from a semantic perspective.
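To make the containment relation concrete, the following minimal Python sketch models the four levels; the class and field names are our own illustration and do not come from the paper.

from dataclasses import dataclass, field
from typing import List

@dataclass
class Shot:
    start_frame: int                                       # index of the first frame in the shot
    end_frame: int                                         # index of the last frame (inclusive)
    key_frames: List[int] = field(default_factory=list)    # representative frames of the shot

@dataclass
class Scene:
    shots: List[Shot] = field(default_factory=list)        # semantically related shots

@dataclass
class Video:
    scenes: List[Scene] = field(default_factory=list)      # the whole stream sits at the top level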

Since manually segmenting a long video stream is time-consuming and inefficient, automatic segmentation has become an increasingly important problem.

The problem of segmenting a video stream into more meaningful scenes for browsing and retrieval has become a research topic of increasing importance, and recent years have seen an explosive increase in research on automatic video segmentation techniques. In line with the discussion above, automatic segmentation of a video stream involves three important steps. The first step is shot boundary detection. According to whether the transition is abrupt or gradual, shot boundaries can be classified into two types: cuts and gradual transitions. Based on the editing effect used (Kobla, DeMenthon, & Doermann, 1999), gradual transitions can be further categorized into more detailed types such as dissolve, wipe and fade-out/fade-in. Shot boundary detection, sometimes also called temporal video segmentation, identifies the transitions between contiguous shots and thereby separates a video stream into different shots. Much work has been reported in this area and highly accurate results have been obtained (Albanese & Chianese, 2004; Boccignone & Chianese, 2005; Cernekova et al., 2003; Dimitrova & Zhang, 2002; Hanjalic, 2002; Yuan et al., 2005). The second step is key frame selection. This step extracts a characteristic set of either independent frames or short sequences from each detected shot, according to both the complexity of its activity and its duration. However many frames are finally selected from each shot, the extracted key frames should best depict the content of that shot. In some cases, the key frames selected from the detected shots can be used directly as a condensed representation of the entire video stream, and accordingly for indexing, comparison and categorization. Representative contributions include Girgensohn and Boreczky (2000), Hasebe et al. (2004), Ren and Singh (2003), Thormaehlen et al. (2004), Togawa et al. (2005) and Xiong and Zhou (2006). The last step is scene segmentation, also known as semantic-level organization of video content: the detected shots are clustered into meaningful segments at the semantic level, and the resulting scenes provide a concise and compact index for browsing and retrieval. How to define a scene in terms of human understanding is still an open question, and scene segmentation is accordingly the crucial step in the whole video segmentation process. A large number of techniques have recently been reported in this area (Boreczky et al., 1998; Rasheed & Shah, 2003; Rui et al., 1999; Sundaram et al., 2000; Tavanapong & Zhou, 2004; Yeung et al., 1998; Zhai & Shah, 2005).
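As a concrete illustration of the first step only, the sketch below marks a cut wherever consecutive gray-level histograms differ sharply. This is a common baseline, assuming OpenCV and NumPy are available; it is not the coarse-to-fine algorithm proposed in this paper, and the bin count and threshold are arbitrary choices.

import cv2
import numpy as np

def detect_cuts(video_path: str, threshold: float = 0.5):
    """Return frame indices where an abrupt shot change (cut) is suspected."""
    cap = cv2.VideoCapture(video_path)
    boundaries, prev_hist, idx = [], None, 0
    while True:
        ok, frame = cap.read()
        if not ok:
            break
        gray = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)
        hist = cv2.calcHist([gray], [0], None, [64], [0, 256]).ravel()
        hist /= hist.sum() + 1e-9                  # normalise to a probability distribution
        if prev_hist is not None and np.abs(hist - prev_hist).sum() > threshold:
            boundaries.append(idx)                 # large histogram jump suggests a cut
        prev_hist, idx = hist, idx + 1
    cap.release()
    return boundaries

Gradual transitions (dissolves, wipes, fades) spread the change over many frames and need more than a single-frame difference, which is one reason the dedicated detectors cited above were developed.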

Generally speaking, video streams can be classified into several types: news video, sports video, feature films, television video, home video, and so on, and much work has been reported on each type. Jiang, Lin, and Zhang (2000) used audio and visual features of the video to segment news and sports video into scenes. Audio scene boundaries and visual scene boundaries are first detected separately using the respective features; the final scene boundaries are then placed where the two kinds of boundaries coincide. Sundaram et al. (2000) described a strategy for segmenting a film. As in Jiang et al. (2000), the audio and video data are first segmented into scenes separately; unlike the integration strategy of Jiang et al. (2000), however, the final scene boundaries are determined using a nearest-neighbor algorithm and a time-constrained refinement. Boreczky et al. (1998) used hidden Markov models to search for boundaries in different types of video, including television shows, news broadcasts, movies, commercials, and cartoons. Audio, visual and motion features are first computed and then combined into a vector for training the hidden Markov model; the model contains seven states, and the trained parameters include seven transition probabilities as well as the means and variances of several Gaussian distributions. Yeung et al. (1998) presented an approach for segmenting video into story units by applying time-constrained clustering and a scene transition graph (STG). Each node in an STG is a story unit, and each edge represents the transition from one story unit to the next. Using the complete-link technique for hierarchical clustering, the STG is split into several subgraphs, each representing a scene. Rasheed and Shah (2003) proposed a graphical representation of feature films by constructing a shot similarity graph (SSG), whose nodes and edges have the same meaning as in Yeung et al. (1998). Shot similarities are first computed from the audio and motion information of the video, and the normalized-cut method is then used to split the SSG into smaller graphs representing story units; a schematic sketch of this kind of time-constrained, similarity-based grouping is given after this paragraph. Tavanapong and Zhou (2004) exploited film-making conventions and found that certain regions of video frames capture the essential information needed to cluster visually similar shots into the same scene. Each shot is represented by its first and last frames, and, based on the average of the DC coefficients of the Y color component in certain regions of the key frames within each shot, shots are clustered through forward comparison, backward comparison and temporal constraints. However, several of the above methods for news video rely on a distinctive characteristic of that genre: the location of the anchor persons and the corresponding background remain unchanged. Furthermore, news video has a comparatively fixed structure in which anchor-person shots appear at regular intervals, which helps to locate different news scenes. Such characteristics do not hold for feature films. On the other hand, the remaining methods discussed above do not fully take into account the characteristics of film editing, such as how the linking of shots into scenes is determined and by which cues.
Zhai and Shah (2005) used a Markov chain Monte Carlo (MCMC) technique to determine the boundaries between video scenes. The Markov chain uses two types of updates, diffusions and jumps, and the model parameters comprise the number of scenes and their boundary locations. The temporal scene boundaries are detected by maximizing the posterior probability of the model parameters, combining the model priors with the data likelihood. However, the MCMC technique is very time-consuming.
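The schematic sketch below illustrates the time-constrained, similarity-based grouping idea shared by the STG- and SSG-style methods above: a shot joins the current scene if it is visually similar to some shot seen within a recent temporal window, and otherwise starts a new scene. The feature extraction, window size and threshold are placeholders, not parameters taken from any of the cited papers.

import numpy as np

def cosine(a, b):
    a, b = np.asarray(a, dtype=float), np.asarray(b, dtype=float)
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-9))

def group_shots(shot_features, window: int = 5, sim_threshold: float = 0.8):
    """shot_features: list of 1-D feature vectors, one per shot, in temporal order."""
    if not shot_features:
        return []
    scenes, current = [], [0]
    for i in range(1, len(shot_features)):
        recent = current[-window:]                 # temporal constraint: only look back a few shots
        sims = [cosine(shot_features[i], shot_features[j]) for j in recent]
        if max(sims) >= sim_threshold:
            current.append(i)                      # visually coherent with the current scene
        else:
            scenes.append(current)                 # otherwise close the scene and start a new one
            current = [i]
    scenes.append(current)
    return scenes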

In this paper, a novel scene segmentation and semantic representation scheme is proposed for the efficient retrieval of video data using a combination of low-level and high-level features. To characterize the properties of different shots, a gray-level variance histogram and a wavelet texture variance histogram are chosen as the features for shot detection and key frame selection. To segment video data accurately into different scenes and represent scene contents semantically, redundant frames in the set of key frames are recognized and removed using template matching. Shots that lie within a temporal constraint and have visually similar activity are then grouped into the same scene. To retrieve video content of interest in the way users are accustomed to, scene content must be represented at the semantic level, e.g. as a conversation scene, a suspense scene or an action scene.
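The construction of the gray-level variance histogram is not spelled out at this point, so the sketch below shows one plausible reading, stated as an assumption: the frame is divided into small blocks, the gray-level variance of each block is computed, and the distribution of those variances is histogrammed into a normalised feature vector. The block size and bin range are our own choices and may differ from the paper's.

import numpy as np

def variance_histogram(gray_frame: np.ndarray, block: int = 16, bins: int = 32):
    """gray_frame: 2-D array of 8-bit gray values; returns a normalised histogram of block variances."""
    h, w = gray_frame.shape
    variances = []
    for y in range(0, h - block + 1, block):
        for x in range(0, w - block + 1, block):
            variances.append(gray_frame[y:y + block, x:x + block].var())
    max_var = (255.0 ** 2) / 4                     # largest possible variance of 8-bit gray values
    hist, _ = np.histogram(variances, bins=bins, range=(0.0, max_var))
    return hist / (hist.sum() + 1e-9)              # normalise so frames of different sizes compare

Two shots can then be compared through the distance between the variance histograms of their key frames, in the same spirit as the shot coherence comparison mentioned in the experimental section.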

The remainder of the paper is organized as follows. Section 2 presents the segmentation of scene boundaries. Section 3 applies the general framework to the segmentation of feature films. Finally, Section 4 provides the conclusion and a discussion of the proposed framework.

Section snippets

Procedure of segmentation

According to Webster's dictionary, a shot consists of a continuous sequence of video frames recorded by an uninterrupted camera operation with consistent background settings, and a scene can be considered a group of semantically related shots that represent different views of the same theme and contain the same object of interest. These definitions of shot and scene are the basis on which editors assemble video material, and they are also the basis of our proposed algorithm.

Fig. 1 shows the flowchart of

Experimental results

This section describes experimental results of the analyses applied to three test feature films and one TV interview. The segmentation of the original video sequences into semantic scene units is based on the temporal locality constraint and the shot coherence comparison discussed in detail in Sections 2.5 and 3.5. All experiments were run on an Intel Pentium 3.0 GHz machine running Windows. Performance measures and the scene matching rule are presented in Section 3.1. Section 3.2

Discussion and conclusion

In this paper, a new method is presented for the temporal scene segmentation of a video track. The detected scenes closely approximate the actual video episodes. Our segmentation is based on examining the visual information between adjacent scenes, as well as on the assumption that the visual content within a scene is similar. Consequently, a two-dimensional entropy model is used to capture the representative character of different scenes. To depict the content of a shot more

Acknowledgements

This work was supported by the National High-Tech Research and Development Plan of China (973) under Grant No. 2006CB303103, and by the National Natural Science Foundation of China under Grant No. 60675017. Any opinions, findings, conclusions or recommendations expressed in this material are those of the authors and do not reflect the views of the P.R.C. Government.

References (24)

  • M. Albanese et al.

    A formal model for video shot segmentation and its application via animate vision

    Multimedia Tools and Applications

    (2004)
  • G. Boccignone et al.

    Foveated shot detection for video segmentation

    IEEE Transactions on Circuits and Systems for Video Technology

    (2005)
  • Boreczky, John S., & Wilcox, Lynn D. (1998). A hidden markov model framework for video segmentation using audio and...
  • Cernekova, Z., Kotropoulos, C., & Pitas, I. (2003). Video shot segmentation using singular value decomposition. In:...
  • N. Dimitrova et al.

    Applications of video content analysis and retrieval

    IEEE Transactions on Multimedia

    (2002)
  • A. Girgensohn et al.

    Time-constrained key frame selection technique

    Multimedia Tools and Applications

    (2000)
  • A. Hanjalic

    Shot-boundary detection: Unraveled and resolved?

    IEEE Transactions on Circuits and Systems for Video Technology

    (2002)
  • Hasebe, Satoshi, Nagumo, Makoto, et al. (2004). Video key frame selection by clustering wavelet coefficients. In:...
  • Jiang, Hao, Lin, Tong, & Zhang, Hongjiang. (2000). Video segmentation with the support of audio segmentation and...
  • Kobla, V., DeMenthon, D., & Doermann, D. (1999). Special effect edit detection using video trails: A comparison with...
  • Niblack, W., Barber, R., et al. (1993). The QBIC project: Querying images by content using color, texture and shape....
  • Rasheed, Zeeshan, & Shah, Mubarak. (2003). A graph theoretic approach for scene detection in produced videos. In:...