
Information Sciences

Volume 577, October 2021, Pages 483-509

Graph-based structural difference analysis for video summarization

https://doi.org/10.1016/j.ins.2021.07.012

Abstract

Keyframe extraction is an effective way to achieve video summarization. More recent studies using deep learning networks are heavily dependent on massive historical datasets for training. For practicality in real applications, we focus more on unsupervised online analysis and present a novel graph-based structural difference analysis method for this purpose. Unlike traditional methods of video representation based on raw features, undirected weighted graphs are constructed from the resulting features to represent video frames. The detailed structural changes between graphs are more consistent with the actual changes between video frames than raw features, thus making the newly proposed method robust for detecting various types of shot transitions, such as hard cuts, dissolves, wipes, and fade-ins/fade-outs. Then, considering the local influence between successive frames, a structural difference analysis of graphs is performed to detect the video shot boundaries. Finally, the median graph of each shot is obtained to extract the corresponding keyframe. Extensive experiments are conducted on three video summarization benchmark datasets. Quantitative and qualitative comparisons are made between the proposed method and other state-of-the-art methods, with the proposed method yielding remarkable improvements from 1.9% to 3.1% in terms of the F-score on the three datasets.

Introduction

In recent years, many new photography technologies and video applications have emerged, which have led to enormous growth in the variety of videos on the Internet. It is increasingly necessary to quickly browse and view massive video data in a limited time to facilitate video browsing and video retrieval [1]. This is a widespread concern in related fields that require extensive video data storage, archival, analysis, and visualization. Techniques involving automatic video summarization (VS) can solve these problems by generating a concise version of the video stream that retains only the most informative and representative content.

VS has been extensively studied, and the relevant methods can be divided into two categories: static VS (static keyframes) [2], [3] and dynamic VS (dynamic video skimming) [4]. Static VS methods are designed to select the most informative and representative keyframes from the original video sequence and arrange them in chronological order to represent the main content of the video. The keyframe set is not limited by any timing or synchronization issues, thus providing excellent flexibility and adaptability. Dynamic VS methods focus on extracting video clips, that is, a set of frames representing the most exciting and meaningful content of the video, including basic audio and visual motion elements. Although this approach is more attractive to users than just looking at a series of static keyframes, video skimming requires advanced semantic analysis. In contrast, static VS methods based on keyframe extraction have been widely studied due to their simplicity, flexibility, and practicality.

The traditional VS methods based on keyframe extraction are mainly implemented by video frame clustering [5], [6] or shot segmentation-based approaches [7], [8], [9], [10]. However, the former methods are limited by an a priori fixed number of clusters, and in most real-world scenarios, the number of clusters is unknown in advance. The latter methods are widely used since they do not require any prior knowledge and involve unsupervised online analysis. Therefore, this paper focuses on a keyframe extraction method based on shot boundary detection.

Generally, shot boundary detection can be realized by analysing the differences between consecutive frames, where a significant difference indicates that there is probably a boundary at the currently detected location. The components of shot detection algorithms generally include feature representation, a distance metric and a decision-making method [11]. Choosing appropriate features to represent the content of the video is a challenge. Frequently used video features include pixels [12], colour histograms [13], MPEG-7 visual descriptors [14], and motion vectors [15]. The methods based on such features are effective in detecting sudden shot changes (e.g., hard cuts). However, the main limitation that affects the performance of existing methods is insufficient capacity for detecting subtle transitions. For gradual transitions, such as dissolves, wipes and fades, the interframe changes in progressive shots are subtle; thus, it is relatively difficult to detect such transitions based only on the low-level features adopted in traditional methods. Notably, low-level features, such as pixels, pixel blocks, and histograms, cannot express the detailed structural information implied in video features, and such information plays a critical role in distinguishing the small differences between successive frames in gradually changing shots.

Due to the above limitations, there is a clear motivation for investigating a new feature model that can represent the sophisticated structural information to reflect the dynamic changes that may occur in video frames. Therefore, we propose a new keyframe extraction framework for VS based on graph modelling, as shown in Fig. 1. This method consists of the following three steps.

(1) Graph modelling-based video representation (VR). The feature representation of each video frame is a histogram composed of frequency components (i.e., bins) in terms of the colour and gradient direction. This approach can provide the relevant statistical information for video frame content. However, to reveal the corresponding structural characteristics, an appropriate modelling method is required. Therefore, we introduce a graph-based modelling method with two main advantages: (a) it represents each video frame from a global perspective through a graph structure rather than directly using one-dimensional linear bins in the histogram; consequently, the bins represented by the nodes in the graph are intimately connected, which is the key to determining the structural characteristics of the histogram extracted from the corresponding video frame; and (b) it can better capture the long-range dependencies among the bins in the histogram to reflect the relationships between nonadjacent bins.

(2) Structural difference analysis-based shot boundary detection (SBD). The standard difference analysis method [16], [17] uses a predefined metric that can quantify the difference between two adjacent frames in a dynamic video stream, such that a decision can be made regarding shot boundary detection. Commonly used analysis methods include the likelihood ratio test (LRT), symmetric K-L divergence, and the cumulative sum metric (CUSUM). To further adapt to the proposed graph model, we use the metric of differences in edge weight values [18] to calculate the distance between graphs and adopt hypothesis testing to determine the shot boundaries.

(3) Median graph calculation-based keyframe extraction (KE). Based on the detected shots, we calculate the median graphs and extract the corresponding frames as the keyframes for the shots.
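As a minimal sketch of step (1), the fragment below models a frame's histogram as an undirected weighted graph over its bins. The specific edge-weighting rule (the mass shared by two bins, discounted by their bin distance) is a hypothetical stand-in for illustration only, since the paper's exact construction is not reproduced in this excerpt.

```python
def histogram_to_graph(hist):
    """Model a frame histogram as an undirected weighted graph: each bin
    becomes a node, and every pair of nodes is connected, so that
    long-range dependencies between nonadjacent bins are represented.
    The weighting rule (shared mass of two bins, discounted by their
    bin distance) is a hypothetical choice, not the paper's formula."""
    n = len(hist)
    w = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1, n):
            w[i][j] = w[j][i] = min(hist[i], hist[j]) / (1 + abs(i - j))
    return w  # symmetric weight matrix; w[i][j] is the edge weight between bins i and j
```

Because the graph is complete, even nonadjacent histogram bins are directly linked, which is what lets the model capture the long-range dependencies mentioned in advantage (b).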

The main contributions of this paper are summarized as follows.

  • We propose a novel graph-based structural difference analysis model to formulate the VS problem, where the structural information in the feature of each video frame is considered and modelled in graphs to bridge the gap between the actual semantic structural information and the raw features of video frames.

  • We develop a graph-based metric to measure the dissimilarity between frames. This graph structural difference can reflect the potential discrepancies between continuous frames, and median graphs can be obtained as the corresponding keyframes to reflect the overall trend of the video.

  • We present comprehensive experimental results for the proposed method and demonstrate that it outperforms other state-of-the-art methods on three different datasets.

The rest of this paper is organized as follows. Section 2 reviews the related work. Section 3 describes the video representation method based on graph modelling. Section 4 presents the structural difference analysis for shot boundary detection. Section 5 provides a method for keyframe extraction by calculating median graphs. The pseudocode of the algorithm is given in Section 6. Section 7 shows the experimental results. Finally, the conclusions are drawn in Section 8.


Related work

In recent decades, many keyframe extraction-based VS methods have been proposed, and the main techniques can be divided into the following categories.

Video representation based on graph modelling

For a given video V, we extract the frame set F = {f1, f2, f3, …, fN} containing N frames at a predefined sampling rate, where frame fi represents the frame at time i. Based on our method, the most representative subset F′ = {kf1, kf2, kf3, …, kfm} is finally extracted to represent the static video summary, where F′ ⊂ F is the selected set of keyframes and m ≪ N. The variable m represents the number of shots in F, and kfm is the keyframe of the mth shot. In this section, we provide the modelling process for each

Shot boundary detection based on structural difference analysis

The next step in our method is video shot detection. Fig. 4 shows a flowchart of the proposed shot detection method. By comparing the graph models of k pairs of patches corresponding to the positions of two consecutive frames, k graph difference scores between two frames can be obtained. Based on these scores, a commonly used hypothesis testing method is employed to make a shot detection decision.
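The flow above can be illustrated with a simplified sketch: the graph distance is a sum of absolute edge-weight differences, in the spirit of the metric of [18], and a mean-plus-k-sigma decision rule stands in for the paper's hypothesis test. Function names and the decision rule are illustrative assumptions; for brevity one graph per frame is compared instead of the paper's k patch pairs.

```python
import statistics

def graph_distance(w1, w2):
    """Distance between two graphs over the same node set: the sum of
    absolute differences of corresponding edge weights (upper triangle
    of the symmetric weight matrices), in the spirit of [18]."""
    n = len(w1)
    return sum(abs(w1[i][j] - w2[i][j]) for i in range(n) for j in range(i + 1, n))

def detect_boundaries(frame_graphs, k=3.0):
    """Flag frame index i as a shot boundary when the distance between
    frames i-1 and i exceeds the mean of all successive distances by
    more than k standard deviations. This mean-plus-k-sigma rule is a
    simple illustrative stand-in for the paper's hypothesis test."""
    dists = [graph_distance(a, b) for a, b in zip(frame_graphs, frame_graphs[1:])]
    mu = statistics.mean(dists)
    sigma = statistics.pstdev(dists)
    return [i + 1 for i, d in enumerate(dists) if sigma > 0 and d > mu + k * sigma]
```

A hard cut produces one large distance against a background of near-zero within-shot distances, so it clears the k-sigma threshold; gradual transitions would instead require examining the local run of elevated scores, which the paper's patch-wise analysis addresses.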

Keyframe extraction based on median graphs

After detecting shots in the video, the next step is to select the most representative frame in each shot to obtain the target summary. The objective is to determine the frame that is most similar to the rest of the frames in a shot. Therefore, we introduce the concept of the median graph for keyframe extraction. In graph theory, median graphs are an effective tool for representing a set of graphs [48]. As shown in Fig. 5, given a graph set S = {G1, G2, …
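The set-median idea reduces to a small computation: pick the member of the shot whose total distance to all other members is smallest. The sketch below assumes the edge-weight distance from the previous section; the function names are illustrative, not the paper's.

```python
def edge_weight_distance(w1, w2):
    """Sum of absolute edge-weight differences between two graphs
    sharing the same node set (upper triangle of the weight matrices)."""
    n = len(w1)
    return sum(abs(w1[i][j] - w2[i][j]) for i in range(n) for j in range(i + 1, n))

def median_graph_index(graphs, dist=edge_weight_distance):
    """Index of the set median graph: the member of the shot whose
    total distance to all other members is smallest. The frame whose
    graph wins is taken as the shot's keyframe."""
    totals = [sum(dist(g, h) for h in graphs) for g in graphs]
    return min(range(len(graphs)), key=totals.__getitem__)
```

Restricting the median to members of the set (a "set median" rather than a generalized median) keeps the computation quadratic in the shot length and guarantees the keyframe is an actual frame of the video.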

Algorithm of the proposed method

The techniques described in the above sections can be summarized in the following steps.

  (1) Collect the frame set of the video at a predefined sampling rate;

  (2) Extract the HSV histogram and HOG histogram from each frame patch to construct the HSV-HOG histogram;

  (3) Construct a graph for each frame patch based on the HSV-HOG histogram;

  (4) Measure the dissimilarity scores between the constructed graphs;

  (5) Detect the shot boundary by hypothesis testing. If a shot boundary is detected, mark the
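Putting these steps together, the toy pipeline below runs end-to-end under heavy simplifications: frames are short 1-D intensity lists, one patch per frame, a fixed distance threshold replaces hypothesis testing, and the histogram and edge-weighting rules are illustrative assumptions rather than the paper's HSV-HOG construction.

```python
def histogram(frame, bins=4, lo=0, hi=16):
    """Toy stand-in for the HSV-HOG histogram: bin raw intensities."""
    h = [0] * bins
    for v in frame:
        h[min(bins - 1, (v - lo) * bins // (hi - lo))] += 1
    return h

def to_graph(h):
    """Bins -> nodes; hypothetical edge weight: shared mass / (1 + bin gap)."""
    n = len(h)
    return [[0.0 if i == j else min(h[i], h[j]) / (1 + abs(i - j))
             for j in range(n)] for i in range(n)]

def dist(w1, w2):
    """Sum of absolute edge-weight differences (upper triangle)."""
    n = len(w1)
    return sum(abs(w1[i][j] - w2[i][j]) for i in range(n) for j in range(i + 1, n))

def summarize(frames, thresh):
    """Steps (1)-(5): model frames as graphs, cut shots where the
    successive graph distance exceeds a fixed threshold (a stand-in
    for hypothesis testing), and return one median-graph keyframe
    index per shot."""
    graphs = [to_graph(histogram(f)) for f in frames]
    cuts = [0] + [i for i in range(1, len(graphs))
                  if dist(graphs[i - 1], graphs[i]) > thresh] + [len(graphs)]
    keyframes = []
    for a, b in zip(cuts, cuts[1:]):
        shot = graphs[a:b]
        totals = [sum(dist(g, h) for h in shot) for g in shot]
        keyframes.append(a + min(range(len(shot)), key=totals.__getitem__))
    return keyframes
```

On a synthetic clip of two constant shots, the pipeline recovers one keyframe per shot; real frames would be split into patches with per-patch graphs, as the paper describes.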

Experiments

In this section, we first introduce the datasets and evaluation indicators. We then study several factors that affect the performance of the proposed method to obtain the best experimental setup. Finally, we compare the test results with several variations and state-of-the-art methods to verify the effectiveness of our method.

Conclusion

In this paper, we presented a novel keyframe extraction framework for VS. On the one hand, undirected weighted graphs representing each video frame are exploited to maintain the detailed structural information in frame features. On the other hand, a structural difference analysis is adopted to identify the potential differences between continuous frames, such that various types of shot transitions, e.g., hard cuts, dissolves, wipes and fades, can be detected based on hypothesis testing. By

Declaration of Competing Interest

The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.

Acknowledgments

This work is financially supported in part by the National Natural Science Foundation of China (61602286, 61976127), in part by the Shandong Key Research and Development Program (2018GGX101003), and in part by the Shandong Province Higher Educational Science and Technology Program (J16LN09).


References (50)

  • R. Hannane et al., MSKVS: Adaptive mean shift-based keyframe extraction for video summarization and a new objective verification approach, J. Vis. Commun. Image Represent. (2018)
  • G. Litjens et al., A survey on deep learning in medical image analysis, Medical Image Anal. (2017)
  • Z. Gao et al., Key-frame selection for video summarization: an approach of multidimensional time series analysis, Multidimension. Syst. Signal Process. (2018)
  • E. Asadi, N.M. Charkari, Video summarization using fuzzy c-means clustering, in: 20th Iranian Conference on Electrical...
  • J. Wu et al., A novel clustering method for static video summarization, Multimedia Tools Appl. (2017)
  • N.D. Doulamis et al., Efficient summarization of stereoscopic video sequences, IEEE Trans. Circuits Syst. Video Technol. (2000)
  • A.D. Doulamis et al., Efficient video summarization based on a fuzzy video content representation
  • R. Hannane et al., An efficient method for video shot boundary detection and keyframe extraction using SIFT-point distribution histogram, Int. J. Multimedia Inform. Retrieval (2016)
  • C. Cotsaces et al., Video shot detection and condensed representation: a review, IEEE Signal Processing Magazine (2006)
  • H. Zhang et al., Automatic partitioning of full-motion video, Multimedia Syst. (1993)
  • M. Asim et al., A key frame based video summarization using color features
  • J.-H. Lee et al., Automatic video summarizing tool using MPEG-7 descriptors for personal video recorder, IEEE Trans. Consum. Electron. (2003)
  • A.M. Amel et al., Video shot boundary detection using motion activity descriptor, Telecommunications (2010)
  • L. Ciabattoni et al., Statistical spectral analysis for fault diagnosis of rotating machines, IEEE Trans. Industr. Electron. (2018)
  • T. Wang et al., Graph-based change detection for condition monitoring of rotating machines: techniques for graph similarity, IEEE Trans. Reliab. (2018)

    Chunlei Chai received her bachelor’s degree from Qufu Normal University, Shandong, China, in 2018 and is now a master’s candidate at Shandong Normal University, Jinan, China. Her research interests include computer vision, multimedia processing, and pattern recognition.

    Guoliang Lu received his bachelor’s and master’s degrees from Shandong University, Jinan, China, in 2006 and 2009, respectively, and his Ph.D. degree from the Graduate School of Information Science and Technology, Hokkaido University, Sapporo, Japan, in March 2013. He is currently an associate professor at Shandong University. His research interests include computer vision, visual servo control, signal processing and machine monitoring.

    Ruyun Wang is currently working towards a bachelor’s degree at the School of Information Science and Engineering, Shandong Normal University, Jinan, China. Her research interests include computer vision and multimedia processing.

    Chen Lyu received his Ph.D. degree from the Institute of Computing Technology, Chinese Academy of Sciences, Beijing, China, in 2015. He is currently an associate professor with the School of Information Science and Engineering, Shandong Normal University, Jinan, China. His research interests include computer vision, multimedia information processing and artificial intelligence.

    Lei Lyu received his Ph.D. degree in computer application technology from the University of Chinese Academy of Sciences in 2013. He is currently an associate professor with the School of Information Science and Engineering, Shandong Normal University, Jinan, China. His research interests include computer vision and software engineering.

    Peng Zhang received his Ph.D. degree from the Institute of Computing Technology, Chinese Academy of Sciences, Beijing, China, in 2013. He is currently an associate professor with the Institute of Information Engineering, Chinese Academy of Sciences, Beijing, China. His research interests mainly include enterprise-level big data processing, distributed systems and service computing.

    Hong Liu, Professor and Ph.D., is the supervisor of the School of Information Science and Engineering at Shandong Normal University. She received her Ph.D. degree in engineering from the Institute of Computing Technology, Chinese Academy of Science, Beijing, China, in 1998. She is an academic leader in computer science and technology. Her research is the cross study of distributed artificial intelligence, software engineering, and computer-aided design, including the research of multi-agent systems and co-evolutionary computing technology.
