Neurocomputing, Volume 332, 7 March 2019, Pages 406-416

Multi-video summarization with query-dependent weighted archetypal analysis

https://doi.org/10.1016/j.neucom.2018.12.038

Abstract

Given the tremendous growth of web videos, video summarization is becoming increasingly important for improving users' browsing experience. Since most existing methods focus on generating an informative summarization from a single video and often fail to produce satisfying results for multiple videos, we propose an unsupervised framework for summarizing a set of topic-related videos. We develop a Multi-Video Summarization via Multi-modal Weighted Archetypal Analysis (MVS-MWAA) method to extract a concise summarization that is both representative and informative. To ensure that the summarization is query-dependent, we design a multi-modal graph to guide the generation of the weight in WAA, which we call the query-dependent weight. Specifically, the multi-modal graph fuses information from video frames, tags, and query-dependent web images. Furthermore, we present a Ranking from Bottom to Top (RBT) approach to make the summarization easy to follow. Extensive experimental results demonstrate that our approach clearly outperforms state-of-the-art methods.

Introduction

With the development of multimedia technology and the popularity of handheld devices, there is an urgent need for efficient techniques to index and manage the increasing volume of unstructured videos. For example, people often wish to capture the main story of a video, or of several videos, especially news videos, as quickly as possible. As one of the most promising techniques, video summarization [1], [2], [3], [4] aims at condensing a long video or many short videos into a compact form [5], [6], and has drawn much attention in recent years.

Video summarization can be static or dynamic. Typically, a static summarization is formed from a number of keyframes, while a dynamic one is composed of a succession of video clips. In this paper, we focus on static summarization. In addition, according to the number of videos to be summarized, video summarization can be categorized into Single-Video Summarization (SVS) and Multi-Video Summarization (MVS). Most existing methods focus on SVS, whose purpose is to summarize a long video into a compact form [7], [8], [9], [10]. Recently, with the popularity of online news videos and personal videos, MVS has received increasing attention [11], [12], [13]. MVS in this paper refers to query-dependent summarization, whose aim is to condense a large number of searched videos into a concise summarization. It enables users to quickly browse and comprehend the main idea of massive videos retrieved by the same query, and is thus able to appeal to more potential users. Generally, MVS is more challenging than SVS for three main reasons. (1) Since these videos come from the same query, they usually have high content redundancy. (2) These videos contain plenty of irrelevant content, which demands that MVS be query-aware to narrow the search intention gap. (3) An SVS result is usually presented in the chronological order of the original video, whereas MVS needs to handle and analyze a large number of short videos, each only a few minutes long. Accordingly, it is difficult to establish an easily understandable presentation order, since the keyframes come from different videos.

In recent years, important progress has been made in MVS research. However, generating a summarization from a series of topic-related videos is still a challenging problem. Some studies summarize videos of specific genres by utilizing genre-specific information [14], [15]. For example, [14] proposes to apply meta-data sensor information related to geographical areas to summarize multiple sensor-rich topic-related videos. However, the reliance on such genre-specific information in turn restricts this type of method to a narrower range of applications. Therefore, some recent attempts exploit user search intents to narrow the search intention gap for query-dependent summarization. For instance, Wang et al. [16] propose a method for event-driven web video summarization by tag localization and key-shot mining, where searched images are used to estimate their similarities to the keyframe of each shot during key-shot identification. In [17], the authors propose a multi-task deep visual-semantic embedding model, where query-dependent video thumbnails are generated based on both visual and side information (e.g., title, description, and query). Besides, Yao et al. [18] apply a supervised learning method to video summarization. They propose a novel pairwise deep ranking model to learn the relationship between highlight and non-highlight video segments. A two-stream network, which represents video segments with complementary information on frame appearance and temporal dynamics across frames, is developed for video highlight detection. The summarization is then extracted according to the highlight scores produced by the trained detection model. However, how to efficiently reflect the search intent in MVS is still an open and challenging problem.

Recently, the Archetypal Analysis (AA) algorithm has proliferated across different fields, such as economics [19], pattern recognition [20], document summarization [21], and computer vision [22]. It represents each individual in a dataset as a mixture of individuals of pure type, or archetypes [23]. It has also been applied to video summarization [24]. Specifically, the authors propose a novel Co-Archetypal Analysis (CAA) algorithm, which learns canonical visual concepts shared between a video and web images by finding a joint factorial representation of the two datasets. Frame-level importance is measured based on the learned factorial representation of the video and then combined into shot-level scores, from which a summarization of fixed length is generated.
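This excerpt does not give the exact CAA scoring formula, but the general idea of reading frame importance off simplex-constrained AA coefficients can be illustrated with a small sketch. The peakedness measure below is our own illustrative assumption, not the formula from [24]:

```python
import numpy as np

def frame_importance_from_aa(A):
    """Illustrative only: A is a (k, n) matrix of simplex-constrained AA
    coefficients, one column per frame. A frame whose reconstruction
    concentrates on a few archetypes has a peaked column and thus gets a
    higher score; the actual CAA importance measure may differ."""
    return (A ** 2).sum(axis=0)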

In this paper, we present an alternative way to apply AA to MVS. We propose a query-dependent MVS method based on a weighted AA algorithm. Different from the idea of CAA in [24], we exploit the web images searched by the same query and the tags surrounding the videos as query-dependent information, guiding the AA algorithm so that the resulting summarization is query-dependent. Fig. 1 depicts the framework of this paper.

The main contributions of this paper lie in the following three aspects:

  • (1) A novel MVS method with Weighted Archetypal Analysis (WAA) is proposed, which we call Multi-Video Summarization via Multi-modal Weighted AA (MVS-MWAA).

  • (2) To ensure that the summarization is query-dependent, we design a multi-modal graph to guide the generation of the weight in WAA, which we call the query-dependent weight. Specifically, the multi-modal graph exploits not only the video data, but also the associated tags and query-dependent web images (a rough illustrative sketch follows this list).

  • (3) To make the summarization logical and readable, a novel Ranking from Bottom to Top (RBT) method is developed.
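As a rough illustration of how such query-dependent weights could be derived, the sketch below builds per-modality affinities and scores each candidate keyframe by its closeness to the query's web images. All function names, the RBF kernel, and the one-step propagation over the fused graph are our own assumptions, not the paper's exact construction:

```python
import numpy as np

def rbf_affinity(F, sigma=1.0):
    """Pairwise RBF affinities between row-wise feature vectors."""
    sq = np.sum(F ** 2, axis=1)
    d2 = sq[:, None] + sq[None, :] - 2 * F @ F.T
    return np.exp(-np.maximum(d2, 0) / (2 * sigma ** 2))

def query_dependent_weights(frame_feats, image_feats, tag_sim, alpha=0.5):
    """Fuse visual and tag affinities into per-frame weights (sketch).
    frame_feats: (n, d) candidate-keyframe features
    image_feats: (m, d) features of web images retrieved with the same query
    tag_sim:     (n, n) tag-based similarity between the frames' source videos
    Returns w: (n,) weights, larger for frames closer to the query's web images."""
    # cross-affinity between frames and the query-dependent web images
    sq_f = np.sum(frame_feats ** 2, axis=1)[:, None]
    sq_i = np.sum(image_feats ** 2, axis=1)[None, :]
    cross = np.exp(-(sq_f + sq_i - 2 * frame_feats @ image_feats.T) / 2.0)
    visual_score = cross.mean(axis=1)
    # propagate one step over the fused frame graph so related frames share relevance
    G = alpha * rbf_affinity(frame_feats) + (1 - alpha) * tag_sim
    G = G / G.sum(axis=1, keepdims=True)
    w = G @ visual_score
    return w / w.sum()
```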

The rest of this paper is organized as follows. Section 2 reviews related work. Section 3 briefly introduces WAA. Section 4 describes the details of the proposed MVS-MWAA method. Section 5 introduces the proposed RBT method for MVS presentation. Experiments are presented and analyzed in Section 6. Section 7 concludes the paper.

Section snippets

Related work

Recently, MVS has attracted increasing attention, and great progress has been made. Existing work can be roughly divided into three categories: graph-based approaches, multi-modal fusion-based approaches, and decomposition-based approaches.

Graph-based approaches. The graph model is beneficial for exploring the relationships among a large number of video frames. For example, Yeo et al. [25] employ a complete multipartite graph to model the semantic relationships between the extracted subsequences in

A brief review of weighted archetypal analysis

Archetypal analysis (AA) [36] represents each individual in a dataset as a mixture of individuals of pure, not necessarily observed, types, or archetypes. The archetypes themselves are restricted to being mixtures of the individuals in the dataset and lie on the dataset boundary. Generally, the AA model can be regarded as a technique fusing the ideas of clustering and low-rank approximation, combining the advantages of clustering with the flexibility of matrix factorization.
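To make the weighted variant concrete, here is a minimal NumPy sketch of weighted AA under the standard objective min over simplex-constrained A and B of sum_i w_i ||x_i - X B a_i||^2. The projected-gradient solver and the step size are illustrative choices, not the optimizer used in the paper:

```python
import numpy as np

def project_simplex(V):
    """Project each column of V onto the probability simplex."""
    u = np.sort(V, axis=0)[::-1]                      # sort descending per column
    css = np.cumsum(u, axis=0) - 1.0
    ind = np.arange(1, V.shape[0] + 1)[:, None]
    cond = u - css / ind > 0
    rho = cond.cumsum(axis=0).argmax(axis=0)          # last index where cond holds
    theta = css[rho, np.arange(V.shape[1])] / (rho + 1)
    return np.maximum(V - theta, 0.0)

def weighted_aa(X, k, w, n_iter=200, lr=1e-2, seed=0):
    """Weighted archetypal analysis (illustrative projected-gradient solver).
    X: (d, n) data, columns are candidate keyframes; w: (n,) sample weights.
    Archetypes Z = X @ B are convex combinations of data points, and each
    sample is approximated as a convex combination Z @ A of archetypes."""
    d, n = X.shape
    rng = np.random.default_rng(seed)
    A = project_simplex(rng.random((k, n)))           # mixture coefficients
    B = project_simplex(rng.random((n, k)))           # archetype loadings
    for _ in range(n_iter):
        Z = X @ B
        R = (X - Z @ A) * w                           # weighted residual (d, n)
        A = project_simplex(A + lr * (Z.T @ R))       # gradient step on A
        R = (X - X @ B @ A) * w
        B = project_simplex(B + lr * (X.T @ R @ A.T)) # gradient step on B
    return A, B
```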

The proposed MVS-MWAA framework

The proposed MVS-MWAA method employs the candidate keyframes, web images, and the tags around each video to extract meaningful segments from multiple videos. The framework of MVS-MWAA is illustrated in Fig. 1. It consists of four components, i.e., the multi-modal graph construction module, the query-dependent WAA module, the summarization generation module, and the final summarization presentation module with the RBT algorithm. Algorithm 1 outlines the procedure of the proposed MVS-MWAA approach. The technical
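Since Algorithm 1 itself is not included in this excerpt, the following is only hypothetical glue code showing how the four modules could fit together, reusing the two sketches above; the dominant-coefficient selection rule in module 3 is our assumption about how archetypes map back to keyframes:

```python
import numpy as np

def mvs_mwaa_pipeline(frame_feats, image_feats, tag_sim, k=10):
    """Hypothetical end-to-end flow (not the paper's Algorithm 1).
    frame_feats: (n, d) candidate keyframes; image_feats: (m, d) web images;
    tag_sim: (n, n) tag-based similarities."""
    w = query_dependent_weights(frame_feats, image_feats, tag_sim)  # module 1: multi-modal graph
    A, _ = weighted_aa(frame_feats.T, k, w)                         # module 2: query-dependent WAA
    # module 3: for each archetype, keep the keyframe with the largest coefficient
    summary_idx = np.unique(A.argmax(axis=1))
    return summary_idx                                              # module 4 (RBT ordering) follows
```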

MVS presentation

A good summarization is expected to have high logicality and readability so that it is easy for users to understand. In SVS, the generated keyframes are logically presented according to the video's play order. However, a summarization obtained from multiple videos does not have such a chronological order, so it is difficult to give users a satisfactory presentation. Therefore, we develop a Ranking from Bottom to Top (RBT) method to provide a user-friendly summarization representation based on

Experimental settings

Most existing MVS datasets are either publicly unavailable or small in scale. To the best of our knowledge, the recently introduced MVS1K dataset [33] is the largest publicly available annotated dataset. It contains 936 videos from 10 queries, with a total duration of 113,516 seconds. Table 1 lists its details and Fig. 2 illustrates it. We use the same settings as in [33]. In particular, the visual feature is a 4352D vector, composed of a 4096D VGGNet-19 CNN feature [38] and a 256D HSV
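The 4352D descriptor can be reproduced, at least in spirit, by concatenating a VGGNet-19 fc-layer activation with a 256-bin HSV color histogram. The snippet cuts off before specifying the binning, so the 16x4x4 split below is only our guess at one split that yields 256 dimensions:

```python
import numpy as np
import cv2  # OpenCV, used here for color conversion and histograms

def hsv_histogram(bgr_frame, bins=(16, 4, 4)):
    """256-bin HSV color histogram (16 x 4 x 4 = 256), L1-normalized.
    The exact binning in the paper is not given in this excerpt."""
    hsv = cv2.cvtColor(bgr_frame, cv2.COLOR_BGR2HSV)
    hist = cv2.calcHist([hsv], [0, 1, 2], None, list(bins),
                        [0, 180, 0, 256, 0, 256])  # OpenCV hue range is [0, 180)
    hist = hist.flatten()
    return hist / (hist.sum() + 1e-8)

def frame_descriptor(bgr_frame, vgg_fc_features):
    """Concatenate a 4096D VGGNet-19 fc feature (computed elsewhere) with the
    256D HSV histogram, giving the 4352D descriptor used in the experiments."""
    return np.concatenate([vgg_fc_features, hsv_histogram(bgr_frame)])
```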

Conclusions

This paper proposes a query-dependent MVS-MWAA approach that meets the MVS criteria of representativeness, conciseness, and informativeness. In this unsupervised framework, we jointly use the information of video frames, searched web images, and tags to explore the relationships among the candidate keyframes with a multi-modal graph. Then, to generate a representative and concise summarization, we exploit query-dependent WAA to cluster all candidate keyframes into archetypes with distinct

Acknowledgements

This work was supported by the National Natural Science Foundation of China under Grants 61472273, 61632018, and 61771329, the National Basic Research Program of China (Grant No. 2014CB340400), and Nokia.

References (42)

  • K. Zhang et al., Video summarization with long short-term memory, Proceedings of the European Conference on Computer Vision (2016)
  • H. Sun et al., GlanceNets - efficient convolutional neural networks with adaptive hard example mining, Sci. China Inf. Sci. (2018)
  • Y. Pang et al., Cascade learning by optimally partitioning, IEEE Trans. Cybern. (2016)
  • M. Gygli et al., Creating summaries from user videos, Proceedings of the European Conference on Computer Vision (2014)
  • Y. He et al., Graph coloring based surveillance video synopsis, Neurocomputing (2016)
  • W. Zhang et al., Web video thumbnail recommendation with content-aware analysis and query-sensitive matching, Multimed. Tools Appl. (2014)
  • L. Nie et al., Perceptual attributes optimization for multivideo summarization, IEEE Trans. Cybern. (2016)
  • H. Li et al., Localizing relevant frames in web videos using topic model and relevance filtering, Mach. Vis. Appl. (2014)
  • Y. Zhang et al., Multi-video summary and skim generation of sensor-rich videos in geo-space, Proceedings of the ACM SIGMM Conference on Multimedia Systems (2012)
  • Y. Li et al., Multimedia maximal marginal relevance for multi-video summarization, Multimed. Tools Appl. (2016)
  • M. Wang et al., Event driven web video summarization by tag localization and key-shot identification, IEEE Trans. Multimed. (2012)

Zhong Ji received the Ph.D. degree in signal and information processing from Tianjin University, Tianjin, China, in 2008.

He is currently an Associate Professor with the School of Electrical and Information Engineering, Tianjin University, Tianjin, China. His current research interests include multimedia understanding, computer vision, and deep learning. He has published more than 50 scientific papers.

Yuanyuan Zhang is a Master's student in the School of Electrical and Information Engineering, Tianjin University, Tianjin, China. Her research interests include video summarization and computer vision.

Yanwei Pang received the Ph.D. degree in electronic engineering from the University of Science and Technology of China, Hefei, China, in 2004.

He is currently a Professor with the School of Electrical and Information Engineering, Tianjin University, Tianjin, China. His current research interests include object detection and recognition, vision in bad weather, and image processing. He has published more than 100 scientific papers.

Xuelong Li is a full Professor with the School of Computer Science and the Center for OPTical IMagery Analysis and Learning (OPTIMAL), Northwestern Polytechnical University, Xi'an 710072, China.

Jing Pan received her B.S. degree in Mechanical Engineering from the North China Institute of Technology (now North University of China), Taiyuan, China, in 2002, and her M.S. degree in Precision Instrument and Mechanism from the University of Science and Technology of China, Hefei, China, in 2007. She is currently an Associate Professor with the School of Electronic Engineering, Tianjin University of Technology and Education, Tianjin, China. Meanwhile, she is pursuing her Ph.D. degree at Tianjin University, China. Her research interests include computer vision and pattern recognition.
