
Information Sciences

Volume 577, October 2021, Pages 483-509

Graph-based structural difference analysis for video summarization

https://doi.org/10.1016/j.ins.2021.07.012

Abstract

Keyframe extraction is an effective way to achieve video summarization. More recent studies using deep learning networks are heavily dependent on massive historical datasets for training. For practicality in real applications, we focus more on unsupervised online analysis and present a novel graph-based structural difference analysis method for this purpose. Unlike traditional methods of video representation based on raw features, undirected weighted graphs are constructed from the resulting features to represent video frames. The detailed structural changes between graphs are more consistent with the actual changes between video frames than raw features, thus making the newly proposed method robust for detecting various types of shot transitions, such as hard cuts, dissolves, wipes, and fade-ins/fade-outs. Then, considering the local influence between successive frames, a structural difference analysis of graphs is performed to detect the video shot boundaries. Finally, the median graph of each shot is obtained to extract the corresponding keyframe. Extensive experiments are conducted on three video summarization benchmark datasets. Quantitative and qualitative comparisons are made between the proposed method and other state-of-the-art methods, with the proposed method yielding remarkable improvements from 1.9% to 3.1% in terms of the F-score on the three datasets.

Introduction

In recent years, many new photography technologies and video applications have emerged, which have led to enormous growth in the variety of videos on the Internet. It is increasingly necessary to quickly browse and view massive video data in a limited time to facilitate video browsing and video retrieval [1]. This is a widespread concern in related fields that require extensive video data storage, archival, analysis, and visualization. Techniques involving automatic video summarization (VS) can solve these problems by generating a concise version of the video stream that retains only the most informative and representative content.

VS has been extensively studied, and the relevant methods can be divided into two categories: static VS (static keyframes) [2], [3] and dynamic VS (dynamic video skimming) [4]. Static VS methods are designed to select the most informative and representative keyframes from the original video sequence and arrange them in chronological order to represent the main content of the video. The keyframe set is not limited by any timing or synchronization issues, thus providing excellent flexibility and adaptability. Dynamic VS methods focus on extracting video clips, that is, a set of frames representing the most exciting and meaningful content of the video, including basic audio and visual motion elements. Although this approach is more attractive to users than just looking at a series of static keyframes, video skimming requires advanced semantic analysis. In contrast, static VS methods based on keyframe extraction have been widely studied due to their simplicity, flexibility, and practicality.

The traditional VS methods based on keyframe extraction are mainly implemented by video frame clustering [5], [6] or shot segmentation-based approaches [7], [8], [9], [10]. However, the former methods are limited by an a priori fixed number of clusters, and in most real-world scenarios, the number of clusters is unknown in advance. The latter methods are widely used since they do not require any prior knowledge and involve unsupervised online analysis. Therefore, this paper focuses on a keyframe extraction method based on shot boundary detection.

Generally, shot boundary detection can be realized by analysing the differences between consecutive frames, where a significant difference indicates that there is probably a boundary at the currently detected location. The components of shot detection algorithms generally include feature representation, a distance metric and a decision-making method [11]. Choosing appropriate features to represent the content of the video is a challenge. Frequently used video features include pixels [12], colour histograms [13], MPEG-7 visual descriptors [14], and motion vectors [15]. The methods based on such features are effective in detecting sudden shot changes (e.g., hard cuts). However, the main limitation that affects the performance of existing methods is insufficient capacity for detecting subtle transitions. For gradual transitions, such as dissolves, wipes and fades, the interframe changes in progressive shots are subtle; thus, it is relatively difficult to detect such transitions based only on the low-level features adopted in traditional methods. Notably, low-level features, such as pixels, pixel blocks, and histograms, cannot express the detailed structural information implied in video features, and such information plays a critical role in distinguishing the small differences between successive frames in gradually changing shots.

Due to the above limitations, there is a clear motivation for investigating a new feature model that can represent the sophisticated structural information to reflect the dynamic changes that may occur in video frames. Therefore, we propose a new keyframe extraction framework for VS based on graph modelling, as shown in Fig. 1. This method consists of the following three steps.

(1) Graph modelling-based video representation (VR). The feature representation of each video frame is a histogram composed of frequency components (i.e., bins) in terms of the colour and gradient direction. This approach can provide the relevant statistical information for video frame content. However, to reveal the corresponding structural characteristics, an appropriate modelling method is required. Therefore, we introduce a graph-based modelling method with two main advantages: (a) it represents each video frame from a global perspective through a graph structure rather than directly using one-dimensional linear bins in the histogram; consequently, the bins represented by the nodes in the graph are intimately connected, which is the key to determining the structural characteristics of the histogram extracted from the corresponding video frame; and (b) it can better capture the long-range dependencies among the bins in the histogram to reflect the relationships between nonadjacent bins.

(2) Structural difference analysis-based shot boundary detection (SBD). The standard difference analysis method [16], [17] uses a predefined metric that can quantify the difference between two adjacent frames in a dynamic video stream, such that a decision can be made regarding shot boundary detection. Commonly used analysis methods include the likelihood ratio test (LRT), symmetric K-L divergence, and the cumulative sum metric (CUSUM). To further adapt to the proposed graph model, we use the metric of differences in edge weight values [18] to calculate the distance between graphs and adopt hypothesis testing to determine the shot boundaries.

(3) Median graph calculation-based keyframe extraction (KE). Based on the detected shots, we calculate the median graphs and extract the corresponding frames as the keyframes for the shots.
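As a minimal sketch of step (1), the fragment below models a frame's histogram as an undirected weighted graph over its bins. The specific edge-weighting rule (the mass shared by two bins, discounted by their bin distance) is a hypothetical stand-in for illustration only, since the paper's exact construction is not reproduced in this excerpt.

```python
def histogram_to_graph(hist):
    """Model a frame histogram as an undirected weighted graph: each bin
    becomes a node, and every pair of nodes is connected, so that
    long-range dependencies between nonadjacent bins are represented.
    The weighting rule (shared mass of two bins, discounted by their
    bin distance) is a hypothetical choice, not the paper's formula."""
    n = len(hist)
    w = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1, n):
            w[i][j] = w[j][i] = min(hist[i], hist[j]) / (1 + abs(i - j))
    return w  # symmetric weight matrix; w[i][j] is the edge weight between bins i and j
```

Because the graph is complete, even nonadjacent histogram bins are directly linked, which is what lets the model capture the long-range dependencies mentioned in advantage (b).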

The main contributions of this paper are summarized as follows.

  • We propose a novel graph-based structural difference analysis model to formulate the VS problem, where the structural information in the feature of each video frame is considered and modelled in graphs to bridge the gap between the actual semantic structural information and the raw features of video frames.

  • We develop a graph-based metric to measure the dissimilarity between frames. This graph structural difference can reflect the potential discrepancies between continuous frames, and median graphs can be obtained as the corresponding keyframes to reflect the overall trend of the video.

  • We present comprehensive experimental results for the proposed method and demonstrate that it outperforms other state-of-the-art methods on three different datasets.

The rest of this paper is organized as follows. Section 2 reviews the related work. Section 3 describes the video representation method based on graph modelling. Section 4 presents the structural difference analysis for shot boundary detection. Section 5 provides a method for keyframe extraction by calculating median graphs. The pseudocode of the algorithm is given in Section 6. Section 7 shows the experimental results. Finally, the conclusions are drawn in Section 8.


Related work

In recent decades, many keyframe extraction-based VS methods have been proposed, and the main techniques can be divided into the following categories.

Video representation based on graph modelling

For a given video V, we extract the frame set F = {f1, f2, f3, …, fN} containing N frames at a predefined sampling rate, where frame fi represents the frame at time i. Based on our method, the most representative subset F′ = {kf1, kf2, kf3, …, kfm} is finally extracted to represent the static video summary, where F′ ⊂ F is the selected set of keyframes and m ≪ N. The variable m represents the number of shots in F, and kfm is the keyframe of the mth shot. In this section, we provide the modelling process for each

Shot boundary detection based on structural difference analysis

The next step in our method is video shot detection. Fig. 4 shows a flowchart of the proposed shot detection method. By comparing the graph models of k pairs of patches corresponding to the positions of two consecutive frames, k graph difference scores between two frames can be obtained. Based on these scores, a commonly used hypothesis testing method is employed to make a shot detection decision.
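The flow above can be illustrated with a simplified sketch: the graph distance is a sum of absolute edge-weight differences, in the spirit of the metric of [18], and a mean-plus-k-sigma decision rule stands in for the paper's hypothesis test. Function names and the decision rule are illustrative assumptions; for brevity one graph per frame is compared instead of the paper's k patch pairs.

```python
import statistics

def graph_distance(w1, w2):
    """Distance between two graphs over the same node set: the sum of
    absolute differences of corresponding edge weights (upper triangle
    of the symmetric weight matrices), in the spirit of [18]."""
    n = len(w1)
    return sum(abs(w1[i][j] - w2[i][j]) for i in range(n) for j in range(i + 1, n))

def detect_boundaries(frame_graphs, k=3.0):
    """Flag frame index i as a shot boundary when the distance between
    frames i-1 and i exceeds the mean of all successive distances by
    more than k standard deviations. This mean-plus-k-sigma rule is a
    simple illustrative stand-in for the paper's hypothesis test."""
    dists = [graph_distance(a, b) for a, b in zip(frame_graphs, frame_graphs[1:])]
    mu = statistics.mean(dists)
    sigma = statistics.pstdev(dists)
    return [i + 1 for i, d in enumerate(dists) if sigma > 0 and d > mu + k * sigma]
```

A hard cut produces one large distance against a background of near-zero within-shot distances, so it clears the k-sigma threshold; gradual transitions would instead require examining the local run of elevated scores, which the paper's patch-wise analysis addresses.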

Keyframe extraction based on median graphs

After detecting shots in the video, the next step is to select the most representative frame in each shot to obtain the target summary. The objective is to determine the frame that is most similar to the rest of the frames in a shot. Therefore, we introduce the concept of the median graph for keyframe extraction. In graph theory, median graphs are an effective tool for representing a set of graphs [48]. As shown in Fig. 5, given a graph set S = {G1, G2, …
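The set-median idea reduces to a small computation: pick the member of the shot whose total distance to all other members is smallest. The sketch below assumes the edge-weight distance from the previous section; the function names are illustrative, not the paper's.

```python
def edge_weight_distance(w1, w2):
    """Sum of absolute edge-weight differences between two graphs
    sharing the same node set (upper triangle of the weight matrices)."""
    n = len(w1)
    return sum(abs(w1[i][j] - w2[i][j]) for i in range(n) for j in range(i + 1, n))

def median_graph_index(graphs, dist=edge_weight_distance):
    """Index of the set median graph: the member of the shot whose
    total distance to all other members is smallest. The frame whose
    graph wins is taken as the shot's keyframe."""
    totals = [sum(dist(g, h) for h in graphs) for g in graphs]
    return min(range(len(graphs)), key=totals.__getitem__)
```

Restricting the median to members of the set (a "set median" rather than a generalized median) keeps the computation quadratic in the shot length and guarantees the keyframe is an actual frame of the video.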

Algorithm of the proposed method

The techniques described in the above sections can be summarized in the following steps.

  (1) Collect the frame set of the video at a predefined sampling rate;

  (2) Extract the HSV histogram and HOG histogram from each frame patch to construct the HSV-HOG histogram;

  (3) Construct a graph for each frame patch based on the HSV-HOG histogram;

  (4) Measure the dissimilarity scores between the constructed graphs;

  (5) Detect the shot boundary by hypothesis testing. If a shot boundary is detected, mark the
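Putting these steps together, the toy pipeline below runs end-to-end under heavy simplifications: frames are short 1-D intensity lists, one patch per frame, a fixed distance threshold replaces hypothesis testing, and the histogram and edge-weighting rules are illustrative assumptions rather than the paper's HSV-HOG construction.

```python
def histogram(frame, bins=4, lo=0, hi=16):
    """Toy stand-in for the HSV-HOG histogram: bin raw intensities."""
    h = [0] * bins
    for v in frame:
        h[min(bins - 1, (v - lo) * bins // (hi - lo))] += 1
    return h

def to_graph(h):
    """Bins -> nodes; hypothetical edge weight: shared mass / (1 + bin gap)."""
    n = len(h)
    return [[0.0 if i == j else min(h[i], h[j]) / (1 + abs(i - j))
             for j in range(n)] for i in range(n)]

def dist(w1, w2):
    """Sum of absolute edge-weight differences (upper triangle)."""
    n = len(w1)
    return sum(abs(w1[i][j] - w2[i][j]) for i in range(n) for j in range(i + 1, n))

def summarize(frames, thresh):
    """Steps (1)-(5): model frames as graphs, cut shots where the
    successive graph distance exceeds a fixed threshold (a stand-in
    for hypothesis testing), and return one median-graph keyframe
    index per shot."""
    graphs = [to_graph(histogram(f)) for f in frames]
    cuts = [0] + [i for i in range(1, len(graphs))
                  if dist(graphs[i - 1], graphs[i]) > thresh] + [len(graphs)]
    keyframes = []
    for a, b in zip(cuts, cuts[1:]):
        shot = graphs[a:b]
        totals = [sum(dist(g, h) for h in shot) for g in shot]
        keyframes.append(a + min(range(len(shot)), key=totals.__getitem__))
    return keyframes
```

On a synthetic clip of two constant shots, the pipeline recovers one keyframe per shot; real frames would be split into patches with per-patch graphs, as the paper describes.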

Experiments

In this section, we first introduce the datasets and evaluation indicators. We then study several factors that affect the performance of the proposed method to obtain the best experimental setup. Finally, we compare the test results with several variations and state-of-the-art methods to verify the effectiveness of our method.

Conclusion

In this paper, we presented a novel keyframe extraction framework for VS. On the one hand, undirected weighted graphs representing each video frame are exploited to maintain the detailed structural information in frame features. On the other hand, a structural difference analysis is adopted to identify the potential differences between continuous frames, such that various types of shot transitions, e.g., hard cuts, dissolves, wipes and fades, can be detected based on hypothesis testing. By

Declaration of Competing Interest

The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.

Acknowledgments

This work is financially supported in part by the National Natural Science Foundation of China (61602286, 61976127), in part by the Shandong Key Research and Development Program (2018GGX101003), and in part by the Shandong Province Higher Educational Science and Technology Program (J16LN09).


References (50)

  • R. Hannane et al., MSKVS: Adaptive mean shift-based keyframe extraction for video summarization and a new objective verification approach, J. Vis. Commun. Image Represent. (2018)
  • G. Litjens et al., A survey on deep learning in medical image analysis, Medical Image Anal. (2017)
  • Z. Gao et al., Key-frame selection for video summarization: an approach of multidimensional time series analysis, Multidimension. Syst. Signal Process. (2018)
  • E. Asadi, N.M. Charkari, Video summarization using fuzzy c-means clustering, in: 20th Iranian Conference on Electrical...
  • J. Wu et al., A novel clustering method for static video summarization, Multimedia Tools Appl. (2017)
  • N.D. Doulamis et al., Efficient summarization of stereoscopic video sequences, IEEE Trans. Circuits Syst. Video Technol. (2000)
  • A.D. Doulamis et al., Efficient video summarization based on a fuzzy video content representation
  • R. Hannane et al., An efficient method for video shot boundary detection and keyframe extraction using SIFT-point distribution histogram, Int. J. Multimedia Inform. Retrieval (2016)
  • C. Cotsaces et al., Video shot detection and condensed representation: a review, IEEE Signal Processing Magazine (2006)
  • H. Zhang et al., Automatic partitioning of full-motion video, Multimedia Syst. (1993)
  • M. Asim et al., A key frame based video summarization using color features
  • J.-H. Lee et al., Automatic video summarizing tool using MPEG-7 descriptors for personal video recorder, IEEE Trans. Consum. Electron. (2003)
  • A.M. Amel et al., Video shot boundary detection using motion activity descriptor, Telecommunications (2010)
  • L. Ciabattoni et al., Statistical spectral analysis for fault diagnosis of rotating machines, IEEE Trans. Industr. Electron. (2018)
  • T. Wang et al., Graph-based change detection for condition monitoring of rotating machines: techniques for graph similarity, IEEE Trans. Reliab. (2018)

    Chunlei Chai received her bachelor’s degree from Qufu Normal University, Shandong, China, in 2018 and is now a master’s candidate at Shandong Normal University, Jinan, China. Her research interests include computer vision, multimedia processing, and pattern recognition.

    Guoliang Lu received his bachelor’s and master’s degrees from Shandong University, Jinan, China, in 2006 and 2009, respectively, and his Ph.D. degree from the Graduate School of Information Science and Technology, Hokkaido University, Sapporo, Japan, in March 2013. He is currently an associate professor at Shandong University. His research interests include computer vision, visual servo control, signal processing and machine monitoring.

    Ruyun Wang is currently working towards a bachelor’s degree at the School of Information Science and Engineering, Shandong Normal University, Jinan, China. Her research interests include computer vision and multimedia processing.

    Chen Lyu received his Ph.D. degree from the Institute of Computing Technology, Chinese Academy of Sciences, Beijing, China, in 2015. He is currently an associate professor with the School of Information Science and Engineering, Shandong Normal University, Jinan, China. His research interests include computer vision, multimedia information processing and artificial intelligence.

    Lei Lyu received his Ph.D. degree in computer application technology from the University of Chinese Academy of Sciences in 2013. He is currently an associate professor with the School of Information Science and Engineering, Shandong Normal University, Jinan, China. His research interests include computer vision and software engineering.

    Peng Zhang received his Ph.D. degree from the Institute of Computing Technology, Chinese Academy of Sciences, Beijing, China, in 2013. He is currently an associate professor with the Institute of Information Engineering, Chinese Academy of Sciences, Beijing, China. His research interests mainly include enterprise-level big data processing, distributed systems and service computing.

    Hong Liu, Professor and Ph.D., is the supervisor of the School of Information Science and Engineering at Shandong Normal University. She received her Ph.D. degree in engineering from the Institute of Computing Technology, Chinese Academy of Science, Beijing, China, in 1998. She is an academic leader in computer science and technology. Her research is the cross study of distributed artificial intelligence, software engineering, and computer-aided design, including the research of multi-agent systems and co-evolutionary computing technology.
