Neurocomputing, Volume 332, 7 March 2019, Pages 406-416

Multi-video summarization with query-dependent weighted archetypal analysis

https://doi.org/10.1016/j.neucom.2018.12.038

Abstract

Given the tremendous growth of web videos, video summarization is becoming increasingly important for improving users' browsing experience. Since most existing methods focus on generating an informative summarization from a single video and often fail to produce satisfying results for multiple videos, we propose an unsupervised framework for summarizing a set of topic-related videos. We develop a Multi-Video Summarization via Multi-modal Weighted Archetypal Analysis (MVS-MWAA) method to extract a concise summarization that is both representative and informative. To ensure that the summarization is query-dependent, we design a multi-modal graph to guide the generation of the weight in WAA, which we call the query-dependent weight. Specifically, the multi-modal graph fuses information from video frames, tags, and query-dependent web images. Furthermore, we present a Ranking from Bottom to Top (RBT) approach to make the summarization easy to follow. Extensive experimental results demonstrate that our approach clearly outperforms state-of-the-art methods.

Introduction

With the development of multimedia technology and the popularity of handheld devices, there is an urgent need for efficient techniques to index and manage the increasing volume of unstructured videos. For example, people often wish to capture the main story of a video, or of several videos, especially news videos, as quickly as possible. As one of the most promising techniques, video summarization [1], [2], [3], [4] aims at condensing a long video or many short videos into a compact form [5], [6], and has drawn much attention in recent years.

Video summarization can be static or dynamic. Typically, a static summarization is formed from a number of keyframes, while a dynamic one is composed of a succession of video clips. In this paper, we focus on static summarization. In addition, according to the number of videos to be summarized, video summarization can be categorized into Single-Video Summarization (SVS) and Multi-Video Summarization (MVS). Most existing methods focus on SVS, whose purpose is to summarize a long video into a compact form [7], [8], [9], [10]. Recently, with the popularity of online news videos and personal videos, MVS has received increasing attention [11], [12], [13]. MVS in this paper refers to query-dependent summarization, whose aim is to condense a large number of searched videos into a concise summarization. It enables users to quickly browse and comprehend the main idea of massive videos retrieved by the same query, and is thus able to appeal to more potential users. Generally, MVS is more challenging than SVS for three main reasons. (1) Since these videos come from the same query, they usually have high content redundancy. (2) These videos contain plenty of irrelevant content, which demands that MVS be query-aware to narrow the search intention gap. (3) An SVS result is usually presented in the chronological order of the original video, whereas MVS needs to handle and analyze a large number of short videos, each only a few minutes long. Accordingly, it is difficult to establish an easily understandable presentation order, since the keyframes come from different videos.

In recent years, important progress has been made in MVS research. However, generating a summarization from a series of topic-related videos is still a challenging problem. Some studies summarize videos of specific genres by utilizing genre-specific information [14], [15]. For example, [14] proposes to apply meta-data sensor information related to geographical areas to summarize multiple sensor-rich topic-related videos. However, the reliance on such genre-specific information in turn restricts this type of method to a narrower range of applications. Therefore, some recent attempts exploit user search intents to narrow the search intention gap for query-dependent summarization. For instance, Wang et al. [16] propose a method for event-driven web video summarization by tag localization and key-shot mining, where searched images are used to estimate their similarities to the keyframe of each shot during key-shot identification. In [17], the authors propose a multi-task deep visual-semantic embedding model, where query-dependent video thumbnails are generated based on both visual and side information (e.g., title, description, and query). Besides, Yao et al. [18] apply a supervised learning method to video summarization. They propose a novel pairwise deep ranking model to learn the relationship between highlight and non-highlight video segments. A two-stream network, which represents video segments with complementary information on frame appearance and temporal dynamics across frames, is developed for video highlight detection. The summarization is then extracted according to the highlight scores produced by the trained detection model. However, how to efficiently reflect the search intent in MVS is still an open and challenging problem.

Recently, the Archetypal Analysis (AA) algorithm has proliferated across different fields, such as economics [19], pattern recognition [20], document summarization [21], and computer vision [22]. It represents each individual in a dataset as a mixture of individuals of pure type, or archetypes [23]. It has also been applied to video summarization [24]. Specifically, the authors propose a novel Co-Archetypal Analysis (CAA) algorithm, which learns canonical visual concepts shared between a video and web images by finding a joint factorial representation of the two datasets. Frame-level importance is measured based on the learned factorial representation of the video and then combined into shot-level scores, from which a summarization of fixed length is generated.
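This excerpt does not give the exact CAA scoring formula, but the general idea of reading frame importance off simplex-constrained AA coefficients can be illustrated with a small sketch. The peakedness measure below is our own illustrative assumption, not the formula from [24]:

```python
import numpy as np

def frame_importance_from_aa(A):
    """Illustrative only: A is a (k, n) matrix of simplex-constrained AA
    coefficients, one column per frame. A frame whose reconstruction
    concentrates on a few archetypes has a peaked column and thus gets a
    higher score; the actual CAA importance measure may differ."""
    return (A ** 2).sum(axis=0)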

In this paper, we present an alternative way to apply AA to MVS. We propose a query-dependent MVS method based on a weighted AA algorithm. Different from the idea of CAA in [24], we exploit the web images searched by the same query and the tags surrounding the videos as query-dependent information, guiding the AA algorithm so that the resulting summarization is query-dependent. Fig. 1 depicts the framework of this paper.

The main contributions of this paper lie in the following three aspects:

  • (1) A novel MVS method with Weighted Archetypal Analysis (WAA) is proposed, which we call Multi-Video Summarization via Multi-modal Weighted AA (MVS-MWAA).

  • (2) To ensure that the summarization is query-dependent, we design a multi-modal graph to guide the generation of the weight in WAA, which we call the query-dependent weight. Specifically, the multi-modal graph exploits not only the video data, but also the associated tags and query-dependent web images (a rough illustrative sketch follows this list).

  • (3) To make the summarization logical and readable, a novel Ranking from Bottom to Top (RBT) method is developed.
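As a rough illustration of how such query-dependent weights could be derived, the sketch below builds per-modality affinities and scores each candidate keyframe by its closeness to the query's web images. All function names, the RBF kernel, and the one-step propagation over the fused graph are our own assumptions, not the paper's exact construction:

```python
import numpy as np

def rbf_affinity(F, sigma=1.0):
    """Pairwise RBF affinities between row-wise feature vectors."""
    sq = np.sum(F ** 2, axis=1)
    d2 = sq[:, None] + sq[None, :] - 2 * F @ F.T
    return np.exp(-np.maximum(d2, 0) / (2 * sigma ** 2))

def query_dependent_weights(frame_feats, image_feats, tag_sim, alpha=0.5):
    """Fuse visual and tag affinities into per-frame weights (sketch).
    frame_feats: (n, d) candidate-keyframe features
    image_feats: (m, d) features of web images retrieved with the same query
    tag_sim:     (n, n) tag-based similarity between the frames' source videos
    Returns w: (n,) weights, larger for frames closer to the query's web images."""
    # cross-affinity between frames and the query-dependent web images
    sq_f = np.sum(frame_feats ** 2, axis=1)[:, None]
    sq_i = np.sum(image_feats ** 2, axis=1)[None, :]
    cross = np.exp(-(sq_f + sq_i - 2 * frame_feats @ image_feats.T) / 2.0)
    visual_score = cross.mean(axis=1)
    # propagate one step over the fused frame graph so related frames share relevance
    G = alpha * rbf_affinity(frame_feats) + (1 - alpha) * tag_sim
    G = G / G.sum(axis=1, keepdims=True)
    w = G @ visual_score
    return w / w.sum()
```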

The rest of this paper is organized as follows. Section 2 reviews related work. Section 3 briefly introduces WAA. Section 4 describes the details of the proposed MVS-MWAA method. Section 5 introduces the proposed RBT method for MVS presentation. Experiments are presented and analyzed in Section 6. Section 7 concludes the paper.

Section snippets

Related work

Recently, MVS has attracted increasing attention, and great progress has been made. Existing work can be roughly divided into three categories: graph-based approaches, multi-modal fusion-based approaches, and decomposition-based approaches.

Graph-based approaches. The graph model is beneficial for exploring the relationships among a large number of video frames. For example, Yeo et al. [25] employ a complete multipartite graph to model the semantic relationships between the extracted subsequences in

A brief review of weighted archetypal analysis

Archetypal analysis (AA) [36] represents each individual in a dataset as a mixture of individuals of pure, not necessarily observed, types, or archetypes. The archetypes themselves are restricted to being mixtures of the individuals in the dataset and lie on the dataset boundary. Generally, the AA model can be regarded as a technique fusing the ideas of clustering and low-rank approximation, combining the advantages of clustering with the flexibility of matrix factorization.
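To make the weighted variant concrete, here is a minimal NumPy sketch of weighted AA under the standard objective min over simplex-constrained A and B of sum_i w_i ||x_i - X B a_i||^2. The projected-gradient solver and the step size are illustrative choices, not the optimizer used in the paper:

```python
import numpy as np

def project_simplex(V):
    """Project each column of V onto the probability simplex."""
    u = np.sort(V, axis=0)[::-1]                      # sort descending per column
    css = np.cumsum(u, axis=0) - 1.0
    ind = np.arange(1, V.shape[0] + 1)[:, None]
    cond = u - css / ind > 0
    rho = cond.cumsum(axis=0).argmax(axis=0)          # last index where cond holds
    theta = css[rho, np.arange(V.shape[1])] / (rho + 1)
    return np.maximum(V - theta, 0.0)

def weighted_aa(X, k, w, n_iter=200, lr=1e-2, seed=0):
    """Weighted archetypal analysis (illustrative projected-gradient solver).
    X: (d, n) data, columns are candidate keyframes; w: (n,) sample weights.
    Archetypes Z = X @ B are convex combinations of data points, and each
    sample is approximated as a convex combination Z @ A of archetypes."""
    d, n = X.shape
    rng = np.random.default_rng(seed)
    A = project_simplex(rng.random((k, n)))           # mixture coefficients
    B = project_simplex(rng.random((n, k)))           # archetype loadings
    for _ in range(n_iter):
        Z = X @ B
        R = (X - Z @ A) * w                           # weighted residual (d, n)
        A = project_simplex(A + lr * (Z.T @ R))       # gradient step on A
        R = (X - X @ B @ A) * w
        B = project_simplex(B + lr * (X.T @ R @ A.T)) # gradient step on B
    return A, B
```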

The proposed MVS-MWAA framework

The proposed MVS-MWAA method employs the candidate keyframes, web images, and the tags around each video to extract meaningful segments from multiple videos. The framework of MVS-MWAA is illustrated in Fig. 1. It consists of four components, i.e., the multi-modal graph construction module, the query-dependent WAA module, the summarization generation module, and the final summarization presentation module with the RBT algorithm. Algorithm 1 outlines the procedure of the proposed MVS-MWAA approach. The technical
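Since Algorithm 1 itself is not included in this excerpt, the following is only hypothetical glue code showing how the four modules could fit together, reusing the two sketches above; the dominant-coefficient selection rule in module 3 is our assumption about how archetypes map back to keyframes:

```python
import numpy as np

def mvs_mwaa_pipeline(frame_feats, image_feats, tag_sim, k=10):
    """Hypothetical end-to-end flow (not the paper's Algorithm 1).
    frame_feats: (n, d) candidate keyframes; image_feats: (m, d) web images;
    tag_sim: (n, n) tag-based similarities."""
    w = query_dependent_weights(frame_feats, image_feats, tag_sim)  # module 1: multi-modal graph
    A, _ = weighted_aa(frame_feats.T, k, w)                         # module 2: query-dependent WAA
    # module 3: for each archetype, keep the keyframe with the largest coefficient
    summary_idx = np.unique(A.argmax(axis=1))
    return summary_idx                                              # module 4 (RBT ordering) follows
```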

MVS presentation

A good summarization is expected to have high logicality and readability so that it is easy for users to understand. In SVS, the generated keyframes are logically presented according to the video's play order. However, a summarization obtained from multiple videos does not have such a chronological order, so it is difficult to give users a satisfactory presentation. Therefore, we develop a Ranking from Bottom to Top (RBT) method to provide a user-friendly summarization representation based on

Experimental settings

Most existing MVS datasets are either publicly unavailable or small in scale. To the best of our knowledge, the recently introduced MVS1K dataset [33] is the largest publicly available annotated dataset. It contains 936 videos from 10 queries, with a total duration of 113,516 seconds. Table 1 lists its details and Fig. 2 illustrates it. We use the same settings as in [33]. In particular, the visual feature is a 4352D vector, composed of a 4096D VGGNet-19 CNN feature [38] and a 256D HSV
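The 4352D descriptor can be reproduced, at least in spirit, by concatenating a VGGNet-19 fc-layer activation with a 256-bin HSV color histogram. The snippet cuts off before specifying the binning, so the 16x4x4 split below is only our guess at one split that yields 256 dimensions:

```python
import numpy as np
import cv2  # OpenCV, used here for color conversion and histograms

def hsv_histogram(bgr_frame, bins=(16, 4, 4)):
    """256-bin HSV color histogram (16 x 4 x 4 = 256), L1-normalized.
    The exact binning in the paper is not given in this excerpt."""
    hsv = cv2.cvtColor(bgr_frame, cv2.COLOR_BGR2HSV)
    hist = cv2.calcHist([hsv], [0, 1, 2], None, list(bins),
                        [0, 180, 0, 256, 0, 256])  # OpenCV hue range is [0, 180)
    hist = hist.flatten()
    return hist / (hist.sum() + 1e-8)

def frame_descriptor(bgr_frame, vgg_fc_features):
    """Concatenate a 4096D VGGNet-19 fc feature (computed elsewhere) with the
    256D HSV histogram, giving the 4352D descriptor used in the experiments."""
    return np.concatenate([vgg_fc_features, hsv_histogram(bgr_frame)])
```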

Conclusions

This paper proposes a query-dependent MVS-MWAA approach that meets the MVS criteria of representativeness, conciseness, and informativeness. In this unsupervised framework, we jointly use the information of video frames, searched web images, and tags to explore the relationships among the candidate keyframes with a multi-modal graph. Then, to generate a representative and concise summarization, we exploit query-dependent WAA to cluster all candidate keyframes into archetypes with distinct

Acknowledgements

This work was supported by the National Natural Science Foundation of China under Grants 61472273, 61632018, and 61771329, the National Basic Research Program of China (Grant No. 2014CB340400), and Nokia.

References (42)

  • K. Zhang et al., Video summarization with long short-term memory, Proceedings of the European Conference on Computer Vision (2016)
  • H. Sun et al., GlanceNets - efficient convolutional neural networks with adaptive hard example mining, Sci. China Inf. Sci. (2018)
  • Y. Pang et al., Cascade learning by optimally partitioning, IEEE Trans. Cybern. (2016)
  • M. Gygli et al., Creating summaries from user videos, Proceedings of the European Conference on Computer Vision (2014)
  • Y. He et al., Graph coloring based surveillance video synopsis, Neurocomputing (2016)
  • W. Zhang et al., Web video thumbnail recommendation with content-aware analysis and query-sensitive matching, Multimed. Tools Appl. (2014)
  • L. Nie et al., Perceptual attributes optimization for multivideo summarization, IEEE Trans. Cybern. (2016)
  • H. Li et al., Localizing relevant frames in web videos using topic model and relevance filtering, Mach. Vis. Appl. (2014)
  • Y. Zhang et al., Multi-video summary and skim generation of sensor-rich videos in geo-space, Proceedings of the ACM SIGMM Conference on Multimedia Systems (2012)
  • Y. Li et al., Multimedia maximal marginal relevance for multi-video summarization, Multimed. Tools Appl. (2016)
  • M. Wang et al., Event driven web video summarization by tag localization and key-shot identification, IEEE Trans. Multimed. (2012)

Zhong Ji received the Ph.D. degree in signal and information processing from Tianjin University, Tianjin, China, in 2008.

He is currently an Associate Professor with the School of Electrical and Information Engineering, Tianjin University, Tianjin, China. His current research interests include multimedia understanding, computer vision, and deep learning. He has published more than 50 scientific papers.

Yuanyuan Zhang is a Master's student in the School of Electrical and Information Engineering, Tianjin University, Tianjin, China. Her research interests include video summarization and computer vision.

Yanwei Pang received the Ph.D. degree in electronic engineering from the University of Science and Technology of China, Hefei, China, in 2004.

He is currently a Professor with the School of Electrical and Information Engineering, Tianjin University, Tianjin, China. His current research interests include object detection and recognition, vision in bad weather, and image processing. He has published more than 100 scientific papers.

Xuelong Li is a full Professor with the School of Computer Science and the Center for OPTical IMagery Analysis and Learning (OPTIMAL), Northwestern Polytechnical University, Xi'an 710072, China.

Jing Pan received her B.S. degree in Mechanical Engineering from the North China Institute of Technology (now North University of China), Taiyuan, China, in 2002, and her M.S. degree in Precision Instrument and Mechanism from the University of Science and Technology of China, Hefei, China, in 2007. She is currently an Associate Professor with the School of Electronic Engineering, Tianjin University of Technology and Education, Tianjin, China. Meanwhile, she is pursuing her Ph.D. degree at Tianjin University, China. Her research interests include computer vision and pattern recognition.
