Graph-based structural difference analysis for video summarization
Introduction
In recent years, new photography technologies and video applications have driven enormous growth in the volume and variety of videos on the Internet. Quickly browsing massive video data in limited time has become increasingly necessary for video browsing and retrieval [1], and it is a widespread concern in fields that require extensive video data storage, archival, analysis, and visualization. Automatic video summarization (VS) techniques address these problems by generating a concise version of the video stream that retains only the most informative and representative content.
VS has been extensively studied, and the relevant methods can be divided into two categories: static VS (static keyframes) [2], [3] and dynamic VS (dynamic video skimming) [4]. Static VS methods are designed to select the most informative and representative keyframes from the original video sequence and arrange them in chronological order to represent the main content of the video. The keyframe set is not limited by any timing or synchronization issues, thus providing excellent flexibility and adaptability. Dynamic VS methods focus on extracting video clips, that is, a set of frames representing the most exciting and meaningful content of the video, including basic audio and visual motion elements. Although this approach is more attractive to users than just looking at a series of static keyframes, video skimming requires advanced semantic analysis. In contrast, static VS methods based on keyframe extraction have been widely studied due to their simplicity, flexibility, and practicality.
The traditional VS methods based on keyframe extraction are mainly implemented by video frame clustering [5], [6] or shot segmentation-based approaches [7], [8], [9], [10]. However, the former methods are limited by an a priori fixed number of clusters, and in most real-world scenarios, the number of clusters is unknown in advance. The latter methods are widely used since they do not require any prior knowledge and involve unsupervised online analysis. Therefore, this paper focuses on a keyframe extraction method based on shot boundary detection.
Generally, shot boundary detection can be realized by analysing the differences between consecutive frames; a significant difference indicates that there is probably a boundary at the current location. Shot detection algorithms generally comprise a feature representation, a distance metric, and a decision-making method [11]. Choosing appropriate features to represent the content of the video is a challenge. Frequently used video features include pixels [12], colour histograms [13], MPEG-7 visual descriptors [14], and motion vectors [15]. Methods based on such features are effective in detecting abrupt shot changes (e.g., hard cuts). However, the main limitation of existing methods is their insufficient capacity to detect gradual transitions. For transitions such as dissolves, wipes, and fades, the interframe changes are subtle; thus, it is relatively difficult to detect them based only on the low-level features adopted in traditional methods. Notably, low-level features, such as pixels, pixel blocks, and histograms, cannot express the detailed structural information implied in video content, and such information plays a critical role in distinguishing the small differences between successive frames in gradually changing shots.
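As a concrete illustration of the traditional histogram-based baseline described above, the following sketch flags a hard cut when the L1 distance between normalized per-channel histograms of consecutive frames is large. The bin count, the L1 metric, and the synthetic frames are illustrative choices, not taken from this paper.

```python
import numpy as np

def per_channel_histogram(frame, bins=16):
    """Toy per-channel histogram; a real system would compute it on HSV pixels."""
    hist = np.concatenate([
        np.histogram(frame[..., c], bins=bins, range=(0, 256))[0]
        for c in range(frame.shape[-1])
    ]).astype(float)
    return hist / hist.sum()  # normalize so frames of any size are comparable

def hist_distance(h1, h2):
    """L1 distance between normalized histograms; large values suggest a cut."""
    return np.abs(h1 - h2).sum()

# Two synthetic frames: identical content vs. an abrupt change in intensity.
rng = np.random.default_rng(0)
frame_a = rng.integers(0, 80, size=(32, 32, 3))
frame_b = frame_a.copy()                             # same shot
frame_c = rng.integers(170, 256, size=(32, 32, 3))   # abrupt transition

assert hist_distance(per_channel_histogram(frame_a), per_channel_histogram(frame_b)) < 0.1
assert hist_distance(per_channel_histogram(frame_a), per_channel_histogram(frame_c)) > 1.0
```

As the paper notes, such a detector works for abrupt cuts but struggles with gradual transitions, where consecutive-frame distances stay small throughout.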
Due to the above limitations, there is a clear motivation for investigating a new feature model that can represent sophisticated structural information and reflect the dynamic changes that may occur in video frames. Therefore, we propose a new keyframe extraction framework for VS based on graph modelling, as shown in Fig. 1. This method consists of the following three steps.
(1) Graph modelling-based video representation (VR). The feature representation of each video frame is a histogram composed of frequency components (i.e., bins) in terms of colour and gradient direction. This approach provides the relevant statistical information for video frame content; however, to reveal the corresponding structural characteristics, an appropriate modelling method is required. Therefore, we introduce a graph-based modelling method with two main advantages: (a) it represents each video frame from a global perspective through a graph structure rather than directly using one-dimensional linear bins in the histogram, so the bins represented by the nodes in the graph are intimately connected, which is the key to determining the structural characteristics of the histogram extracted from the corresponding video frame; and (b) it better captures the long-range dependencies among the bins in the histogram, reflecting the relationships between nonadjacent bins.
(2) Structural difference analysis-based shot boundary detection (SBD). The standard difference analysis method [16], [17] uses a predefined metric that quantifies the difference between two adjacent frames in a dynamic video stream, such that a decision can be made regarding shot boundaries. Commonly used analysis methods include the likelihood ratio test (LRT), symmetric K-L divergence, and the cumulative sum metric (CUSUM). To further adapt to the proposed graph model, we use the metric of differences in edge weight values [18] to calculate the distance between graphs and adopt hypothesis testing to determine the shot boundaries.
(3) Median graph calculation-based keyframe extraction (KE). Based on the detected shots, we calculate the median graphs and extract the corresponding frames as the keyframes for the shots.
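To make the graph distance concrete, here is a minimal sketch under assumed choices: nodes are histogram bins, the weight of edge (i, j) is the absolute difference between bin values (a hypothetical construction, not the paper's), and the distance between two frames is the total absolute difference in edge weights, in the spirit of the edge-weight metric of [18].

```python
import numpy as np

def histogram_graph(hist):
    """Hypothetical graph construction: nodes are histogram bins, and the
    weight of edge (i, j) is |h_i - h_j|. The paper's exact construction is
    not reproduced here; this is one illustrative choice."""
    h = np.asarray(hist, dtype=float)
    return np.abs(h[:, None] - h[None, :])  # symmetric weighted adjacency matrix

def graph_distance(w1, w2):
    """Edge-weight difference metric: sum of absolute weight differences,
    with each undirected edge counted once."""
    return np.abs(w1 - w2).sum() / 2

h_prev = [0.5, 0.3, 0.2, 0.0]
h_same = [0.5, 0.3, 0.2, 0.0]   # same shot: identical histogram
h_new  = [0.0, 0.1, 0.2, 0.7]   # new shot: mass moved to other bins

assert graph_distance(histogram_graph(h_prev), histogram_graph(h_same)) == 0.0
assert graph_distance(histogram_graph(h_prev), histogram_graph(h_new)) > 0.5
```

Because every pair of bins is connected by an edge, a shift in mass between distant bins changes many edge weights at once, which is the intuition behind capturing long-range dependencies among nonadjacent bins.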
The main contributions of this paper are summarized as follows.
- We propose a novel graph-based structural difference analysis model to formulate the VS problem, where the structural information in the feature of each video frame is considered and modelled in graphs to bridge the gap between the actual semantic structural information and the raw features of video frames.
- We develop a graph-based metric to measure the dissimilarity between frames. This graph structural difference can reflect the potential discrepancies between continuous frames, and median graphs can be obtained as the corresponding keyframes to reflect the overall trend of the video.
- We present comprehensive experimental results for the proposed method and demonstrate that our method outperforms other state-of-the-art methods on three different datasets.

The rest of this paper is organized as follows. Section 2 reviews the related work. Section 3 describes the video representation method based on graph modelling. Section 4 presents the structural difference analysis for shot boundary detection. Section 5 provides a method for keyframe extraction by calculating median graphs. The pseudocode of the algorithm is given in Section 6. Section 7 shows the experimental results. Finally, the conclusions are drawn in Section 8.
Related work
In recent decades, many keyframe extraction-based VS methods have been proposed, and the main techniques can be divided into the following categories.
Video representation based on graph modelling
For a given video V, we extract the frame set F = {f_1, f_2, …, f_N} containing N frames at a predefined sampling rate, where f_i represents the frame at time i. Based on our method, the most representative subset S = {s_1, s_2, …, s_m} is finally extracted to represent the static video summary, where S is the selected set of keyframes and S ⊆ F. The variable m represents the number of shots in F, and s_m is the keyframe of the mth shot. In this section, we provide the modelling process for each
Shot boundary detection based on structural difference analysis
The next step in our method is video shot detection. Fig. 4 shows a flowchart of the proposed shot detection method. By comparing the graph models of k pairs of patches at corresponding positions in two consecutive frames, k graph difference scores between the two frames can be obtained. Based on these scores, a commonly used hypothesis testing method is employed to make a shot detection decision.
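The decision step can be sketched as a simple one-sided test on the k patch-level difference scores: flag a boundary when the current mean score deviates far from recent within-shot statistics. The alpha threshold and the use of a rolling history are illustrative assumptions, not the paper's exact hypothesis test.

```python
import statistics

def is_shot_boundary(history, scores, alpha=3.0):
    """Hedged illustration of the decision step: treat the k patch-level graph
    difference scores of the current frame pair as a sample, and flag a
    boundary when their mean exceeds the recent within-shot mean by more than
    alpha standard deviations (a simple one-sided test)."""
    mu = statistics.mean(history)
    sigma = statistics.stdev(history)
    return statistics.mean(scores) > mu + alpha * sigma

within_shot = [0.10, 0.12, 0.09, 0.11, 0.10, 0.13]  # recent difference means
quiet = [0.11, 0.10, 0.12, 0.09]                    # k scores, same shot
jump  = [0.90, 0.85, 0.95, 0.88]                    # k scores, transition

assert not is_shot_boundary(within_shot, quiet)
assert is_shot_boundary(within_shot, jump)
```

Using all k patch scores jointly, rather than a single frame-level score, makes the decision less sensitive to localized motion within an otherwise stable shot.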
Keyframe extraction based on median graphs
After detecting shots in the video, the next step is to select the most representative frame in each shot to obtain the target summary. The basic objective of this approach is to determine a representative frame that is the most similar frame to the rest of the frames in a shot. Therefore, we introduce the concept of the median graph for keyframe extraction. In graph theory, median graphs are an effective tool for representing a set of graphs [48]. As shown in Fig. 5, given a graph set (
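A common way to approximate the median graph of a shot is the set median: the member graph whose total distance to all other graphs in the shot is minimal (the generalized median need not belong to the set; the set median is a standard approximation). The sketch below assumes the edge-weight difference metric and tiny 2-node toy graphs.

```python
import numpy as np

def graph_distance(w1, w2):
    """Edge-weight difference between two weighted adjacency matrices."""
    return np.abs(np.asarray(w1) - np.asarray(w2)).sum()

def set_median_index(graphs):
    """Index of the set median: the member graph minimizing the sum of
    distances to all graphs in the set."""
    totals = [sum(graph_distance(g, h) for h in graphs) for g in graphs]
    return int(np.argmin(totals))

# Toy shot: 2-node graphs whose single edge weight drifts slightly, plus an outlier.
def g(w):
    return np.array([[0.0, w], [w, 0.0]])

shot = [g(0.9), g(1.0), g(1.05), g(1.1), g(5.0)]  # last graph is an outlier frame
assert set_median_index(shot) == 2  # the middle graph is the most representative
```

The frame whose graph is the set median is then taken as the shot's keyframe, since it is the frame most similar to all the others in the shot.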
Algorithm of the proposed method
The techniques described in the above sections can be summarized in the following steps.
(1) Collect the frame set of the video at a predefined sampling rate;
(2) Extract the HSV-histogram and HOG-histogram from each frame patch to construct the HSV-HOG histogram;
(3) Construct a graph for each frame patch based on the HSV-HOG histogram;
(4) Measure the dissimilarity scores between constructed graphs;
(5) Detect the shot boundary by hypothesis testing. If a shot boundary is detected, mark the
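The steps above can be strung together on a toy example. Every helper here is a hypothetical stand-in: simple grayscale histograms replace the HSV-HOG features, a bin-difference graph replaces the paper's construction, a fixed threshold replaces the hypothesis test, and constant-valued frames replace real video.

```python
import numpy as np

def frame_histogram(frame, bins=8):
    # Step (2), simplified: one grayscale histogram per frame (not HSV-HOG).
    h, _ = np.histogram(frame, bins=bins, range=(0, 256))
    return h.astype(float) / h.sum()

def histogram_graph(hist):
    # Step (3), simplified: nodes = bins, edge weight = |h_i - h_j|.
    return np.abs(hist[:, None] - hist[None, :])

def graph_score(g1, g2):
    # Step (4): dissimilarity as total edge-weight difference.
    return np.abs(g1 - g2).sum()

def summarize(frames, threshold=1.0):
    graphs = [histogram_graph(frame_histogram(f)) for f in frames]
    scores = [graph_score(graphs[i], graphs[i + 1]) for i in range(len(graphs) - 1)]
    # Step (5), simplified: a fixed threshold stands in for hypothesis testing.
    cuts = [0] + [i + 1 for i, s in enumerate(scores) if s > threshold] + [len(frames)]
    keyframes = []
    for a, b in zip(cuts, cuts[1:]):
        shot = graphs[a:b]
        totals = [sum(graph_score(g, h) for h in shot) for g in shot]
        keyframes.append(a + int(np.argmin(totals)))  # set-median frame of the shot
    return keyframes

# Step (1), simplified: two "shots" of constant frames separated by a hard cut.
dark = [np.full((8, 8), 30) for _ in range(4)]
bright = [np.full((8, 8), 220) for _ in range(4)]
assert summarize(dark + bright) == [0, 4]  # one keyframe per detected shot
```

On this toy input, consecutive frames within each shot have identical graphs, so the only large difference score falls at the cut, and the set-median frame of each shot becomes its keyframe.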
Experiments
In this section, we first introduce the datasets and evaluation indicators. Based on these results, we study several factors that affect the performance of the proposed method to obtain the best experimental setup. Finally, we compare the test results with several variations and state-of-the-art methods to verify the effectiveness of our method.
Conclusion
In this paper, we presented a novel keyframe extraction framework for VS. On the one hand, unweighted directed graphs representing each video frame are exploited to maintain the detailed structural information in frame features. On the other hand, a structural difference analysis is adopted to identify the potential differences between continuous frames, such that various types of shot transitions, e.g., hard cuts, dissolves, wipes and fades, can be detected based on hypothesis testing. By
Declaration of Competing Interest
The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.
Acknowledgments
This work is financially supported in part by the National Natural Science Foundation of China (61602286, 61976127), in part by the Shandong Key Research and Development Program (2018GGX101003), and in part by the Shandong Province Higher Educational Science and Technology Program (J16LN09).
Chunlei Chai received her bachelor’s degree from Qufu Normal University, Shandong, China, in 2018 and is now a master’s candidate at Shandong Normal University, Jinan, China. Her research interests include computer vision, multimedia processing, and pattern recognition.
References (50)
- et al., A comprehensive survey of multi-view video summarization, Pattern Recogn. (2021)
- et al., Query-aware sparse coding for web multi-video summarization, Inf. Sci. (2019)
- et al., Creating memorable video summaries that satisfy the user's intention for taking the videos, Neurocomputing (2018)
- et al., Browsing and exploration of video sequences: a new scheme for key frame extraction and 3D visualization using entropy based Jensen divergence, Inf. Sci. (2014)
- et al., A novel framework of change-point detection for machine monitoring, Mech. Syst. Signal Process. (2017)
- et al., VSumm: a mechanism designed to produce static video summaries and a novel evaluation method, Pattern Recogn. Lett. (2011)
- et al., DFP-ALC: automatic video summarization using distinct frame patch index and appearance based linear clustering, Pattern Recogn. Lett. (2019)
- et al., Video summarization via minimum sparse reconstruction, Pattern Recogn. Lett. (2015)
- et al., A salient dictionary learning framework for activity video summarization via key-frame extraction, Inf. Sci. (2018)
- et al., Video summarization via block sparse dictionary selection, Neurocomputing (2020)
- MSKVS: adaptive mean shift-based keyframe extraction for video summarization and a new objective verification approach, J. Vis. Commun. Image Represent.
- A survey on deep learning in medical image analysis, Medical Image Anal.
- Key-frame selection for video summarization: an approach of multidimensional time series analysis, Multidimension. Syst. Signal Process.
- A novel clustering method for static video summarization, Multimedia Tools Appl.
- Efficient summarization of stereoscopic video sequences, IEEE Trans. Circuits Syst. Video Technol.
- Efficient video summarization based on a fuzzy video content representation
- An efficient method for video shot boundary detection and keyframe extraction using SIFT-point distribution histogram, Int. J. Multimedia Inform. Retrieval
- Video shot detection and condensed representation: a review, IEEE Signal Processing Magazine
- Automatic partitioning of full-motion video, Multimedia Syst.
- A key frame based video summarization using color features
- Automatic video summarizing tool using MPEG-7 descriptors for personal video recorder, IEEE Trans. Consum. Electron.
- Video shot boundary detection using motion activity descriptor, Telecommunications
- Statistical spectral analysis for fault diagnosis of rotating machines, IEEE Trans. Industr. Electron.
- Graph-based change detection for condition monitoring of rotating machines: techniques for graph similarity, IEEE Trans. Reliab.
Guoliang Lu received his bachelor’s and master’s degrees from Shandong University, Jinan, China, in 2006 and 2009, respectively, and his Ph.D. degree from the Graduate School of Information Science and Technology, Hokkaido University, Sapporo, Japan, in March 2013. He is currently an associate professor at Shandong University. His research interests include computer vision, visual servo control, signal processing and machine monitoring.
Ruyun Wang is currently working towards a bachelor’s degree at the School of Information Science and Engineering, Shandong Normal University, Jinan, China. Her research interests include computer vision and multimedia processing.
Chen Lyu received his Ph.D. degree from the Institute of Computing Technology, Chinese Academy of Sciences, Beijing, China, in 2015. He is currently an associate professor with the School of Information Science and Engineering, Shandong Normal University, Jinan, China. His research interests include computer vision, multimedia information processing and artificial intelligence.
Lei Lyu received his Ph.D. degree in computer application technology from the University of Chinese Academy of Sciences in 2013. He is currently an associate professor with the School of Information Science and Engineering, Shandong Normal University, Jinan, China. His research interests include computer vision and software engineering.
Peng Zhang received his Ph.D. degree from the Institute of Computing Technology, Chinese Academy of Sciences, Beijing, China, in 2013. He is currently an associate professor with the Institute of Information Engineering, Chinese Academy of Sciences, Beijing, China. His research interests mainly include enterprise-level big data processing, distributed systems and service computing.
Hong Liu, Professor and Ph.D., is the supervisor of the School of Information Science and Engineering at Shandong Normal University. She received her Ph.D. degree in engineering from the Institute of Computing Technology, Chinese Academy of Science, Beijing, China, in 1998. She is an academic leader in computer science and technology. Her research is the cross study of distributed artificial intelligence, software engineering, and computer-aided design, including the research of multi-agent systems and co-evolutionary computing technology.