Scalable gastroscopic video summarization via similar-inhibition dictionary selection

https://doi.org/10.1016/j.artmed.2015.08.006

Highlights

  • We design a dictionary selection model via the similar-inhibition constraint.

  • We propose a scalable gastroscopic video summarization algorithm.

  • We build the first gastroscopic video summarization dataset with 30 videos.

Abstract

Objective

This paper aims to develop an automated gastroscopic video summarization algorithm that assists clinicians in reviewing the abnormal contents of a video more effectively.

Methods and materials

To select the most representative frames from the original video sequence, we formulate gastroscopic video summarization as a dictionary selection problem. Unlike traditional dictionary selection methods, which consider only the number and reconstruction ability of the selected key frames, our model introduces a similar-inhibition constraint to reinforce the diversity of the selected key frames. We compute an attention cost by merging gaze and content change into a prior cue, which helps select the frames carrying more high-level semantic information. Moreover, we adopt an image quality evaluation process to eliminate the interference of poor-quality images and a segmentation process to reduce the computational complexity.

Results

For the experiments, we build a new gastroscopic video dataset captured from 30 volunteers, containing more than 400k images, and compare our method with state-of-the-art approaches against the ground truth using three metrics: content consistency, index consistency and content-index consistency. Compared with all competitors, our method obtains the best results on 23 of 30 videos under content consistency, 24 of 30 videos under index consistency and all 30 videos under content-index consistency.

Conclusions

For gastroscopic video summarization, we propose an automated annotation method via similar-inhibition dictionary selection. Our model achieves better performance than other state-of-the-art models and supplies key frames better suited for diagnosis. The developed algorithm can be adapted automatically to various real applications, such as the training of young clinicians, computer-aided diagnosis or medical report generation.

Introduction

The number of people suffering from stomach diseases is large and continues to rise [1]. As an effective technique for directly visualizing the interior of the stomach, gastroscopy has been widely used in clinical examination, especially for the early detection of gastric cancer. Usually, the entire procedure lasts approximately 20 min, and a video containing approximately 15,000 frames is captured. However, visually inspecting such a large number of frames is challenging, even for the most experienced clinicians. To more easily browse such a video archive, a clinician manually records approximately 20–50 images during the examination for diagnosis and later generates a medical report. Nonetheless, this manual annotation has the following shortcomings:

  • Because of the need to perform multiple tasks simultaneously, clinicians may miss some important information for the final diagnosis.

  • Due to insufficient experience, some junior clinicians cannot guarantee accuracy when continuously analyzing such massive data; in particular, when the capture operation is not timed well, they may select poor-quality images.

  • Once the manual selection is complete, the number of selected frames is fixed; it cannot be adapted to different scenarios and may increase the time cost of re-analysis.

In fact, the above process is a typical video summarization procedure, i.e., selecting the frames with the most important and meaningful semantic content from a full-length video sequence [2], [3], [4], [5]. Therefore, in this paper, we design a computer-aided gastroscopic video summarization algorithm to overcome these problems and assist clinicians in reviewing the abnormal contents of the video more effectively. A computer-aided system based on our algorithm can be adopted in real applications, such as the training of young clinicians, computer-aided diagnosis or medical report generation.

For video summarization [6], [7], [8], [9], [10], most state-of-the-art methods focus on structured videos, such as sports, cartoons or surveillance videos. In comparison, the automatic summarization of unstructured data, e.g., gastroscopic videos, is much more challenging. First, gastroscopic videos contain deformable and low-texture content, which makes it more difficult to extract semantic information. Second, due to the complexity of the inner human cavity and the arbitrary movement of the camera, some gastroscopic images are of poor quality, which makes accurate video summarization difficult. Finally, the objective of gastroscopic video summarization is diagnosis, so the summarization result should highlight the suspected regions. Some previous models, e.g., the group sparsity dictionary selection model [2] from our previous work, cannot handle these challenges well: for gastroscopic videos, the result cannot encompass all video content, and similar frames are frequently selected as key frames. Therefore, we design a new similar-inhibition dictionary selection model that adopts a similar-inhibition constraint to select elements that are more diverse from one another. Through this constraint, the video structure information is taken into account so as to cover as much video content as possible, in contrast with traditional sparse dictionary selection models. Furthermore, we integrate an attention prior into the group sparsity term to reduce the gap between low-level features and high-level concepts. The main contributions of this paper reside in three aspects:

  • We design a new dictionary selection model by adopting the similar-inhibition constraint, which reinforces the diversity of the selected subset.

  • By taking into account the attention prior, we propose a scalable gastroscopic video summarization algorithm via similar-inhibition dictionary selection, which can select key frames with the most semantic information efficiently.

  • We collect and build a new gastroscopic video summarization dataset from 30 volunteers with approximately 432,000 frames, and we annotate the ground truth for evaluation as well. To the best of our knowledge, this dataset is the first gastroscopic video summarization dataset.

The rest of this paper is organized as follows. Section 2 discusses the related works. In Section 3, we present the formulation of the problem. Section 4 describes the implementation of our video summarization. Section 5 presents various experiments and comparisons. Finally, Section 6 concludes the paper.

Section snippets

Related works

The problem of video summarization has attracted significant attention, especially over the past few years, and detailed reviews of existing techniques are given in [9], [11]. To capture the content changes in a video sequence, most existing approaches first segment the whole video into shots using shot detection methods and then select key frames from each shot [12], [13]. The simplest method is to select the first/middle/last frame of each shot as a key frame. Shahraray and Gibbon [14] propose a

Similar-inhibition dictionary selection model

The purpose of video summarization is to select the most representative frames from the underlying video source so that they properly represent the video contents. In this paper, we formulate gastroscopic video summarization as a dictionary selection problem, i.e., selecting an optimal subset of the original video frames via dictionary learning under various constraints. The video sequence can be represented as an initial dictionary B = [b1, b2, ..., bN] ∈ ℝ^(d×N) (N is the number of frames and d
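The full objective is truncated in this snippet. As an illustrative sketch only (not the authors' exact formulation; the weights w_i, the trade-off parameters λ, γ and the similarity kernel K_ij are assumptions), a dictionary selection objective combining the three ingredients described above — reconstruction error, attention-weighted group sparsity and similar-inhibition — could take the form:

```latex
\min_{S \in \mathbb{R}^{N \times N}}
  \underbrace{\lVert B - BS \rVert_F^2}_{\text{reconstruction error}}
  \;+\; \lambda \underbrace{\sum_{i=1}^{N} w_i \lVert S^{i} \rVert_2}_{\text{attention-weighted group sparsity}}
  \;+\; \gamma \underbrace{\sum_{i \neq j} K_{ij}\, \lVert S^{i} \rVert_2 \lVert S^{j} \rVert_2}_{\text{similar-inhibition}}
```

Here S^i denotes the i-th row of the selection matrix S, w_i is an attention-based weight for frame i, and K_ij measures the visual similarity between frames i and j. Frames whose rows of S are nonzero are selected as key frames; the similar-inhibition term penalizes jointly selecting two highly similar frames.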

Implementation of our method

In this section, we present the implementation of our gastroscopic video summarization; an overview of the framework is illustrated in Fig. 1. First, we evaluate gastroscopic image quality via a supervised framework and detect non-informative frames in the gastroscopic video sequence. Second, the video is segmented into shots efficiently, based on dramatic changes between consecutive frames. Finally, with the help of our new similar-inhibition dictionary selection
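The pipeline above (quality filtering, shot segmentation, then key-frame selection) is only summarized in this snippet. The following is a minimal, self-contained sketch of the core selection step, using plain group-sparse dictionary selection (proximal gradient on an l2,1-regularized reconstruction objective) followed by a greedy pass that suppresses near-duplicate frames. This is a stand-in for the paper's model, not its exact algorithm: the function name, the parameters `lam` and `gamma`, and the greedy inhibition step are all illustrative assumptions.

```python
import numpy as np

def select_keyframes(B, lam=0.1, gamma=0.95, n_iter=200):
    """Toy key-frame selection on a frame-feature matrix B (d x N).

    Solves min_S 0.5*||B - B S||_F^2 + lam * sum_i ||S_i||_2 by
    proximal gradient, then greedily keeps high-scoring frames while
    skipping frames too similar (cosine >= gamma) to ones already kept.
    """
    d, N = B.shape
    S = np.zeros((N, N))
    # step size 1/L, with L the Lipschitz constant of the smooth term
    step = 1.0 / (np.linalg.norm(B, 2) ** 2 + 1e-12)
    # cosine similarity between frames, used for the inhibition pass
    Bn = B / (np.linalg.norm(B, axis=0, keepdims=True) + 1e-12)
    sim = Bn.T @ Bn
    for _ in range(n_iter):
        grad = B.T @ (B @ S - B)          # gradient of 0.5*||B - BS||_F^2
        S = S - step * grad
        norms = np.linalg.norm(S, axis=1, keepdims=True)
        # proximal operator of the row-wise l2,1 norm (group shrinkage)
        S = np.maximum(0.0, 1.0 - step * lam / (norms + 1e-12)) * S
    scores = np.linalg.norm(S, axis=1)    # per-frame representativeness
    keep = []
    for i in np.argsort(-scores):         # greedy similar-inhibition pass
        if scores[i] <= 1e-6:
            break
        if all(sim[i, j] < gamma for j in keep):
            keep.append(i)
    return sorted(keep)
```

On a toy video with two groups of duplicate frames, this sketch returns one representative per group; in the paper's full method the inhibition is instead built into the selection objective itself and is combined with the attention prior and per-shot processing.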

Experiments

In this section, we build a new gastroscopic video summarization dataset and validate our method by comparing it with state-of-the-art approaches. Since video summarization algorithms can be roughly divided into three categories [62], i.e., sequential algorithms, clustering-based algorithms and optimization-based algorithms, we select several algorithms from each category for a fair comparison, such as evenly spaced key frames (ESKF) [63], the k-means-based method [20], k-medoids-based

Conclusions

To better navigate gastroscopic video content for diagnosis and future research, a new scheme for gastroscopic video summarization has been proposed in this paper. By representing each video frame as a feature vector, we convert the gastroscopic video summarization problem into a sparse dictionary selection problem with three terms, namely, reconstruction error, group sparsity and similar-inhibition. Moreover, we compute an attention score by merging two cues, i.e., gaze and content change, and

Acknowledgements

This work is supported by the National Science and Technology Support Program (2012BAI14B03), NSFC (61105013, 61375014, 61533015) and also the Foundation of Chinese Scholarship Council.

References (65)

  • M. Hafner et al., Computer-assisted pit-pattern classification in different wavelet domains for supporting dignity assessment of colonic polyps, Pattern Recogn (2009)

  • C.S. Bell et al., Image partitioning and illumination in image-based pose detection for teleoperated flexible endoscopes, Artif Intell Med (2013)

  • P. Szczypinski et al., Texture and color based image segmentation and pathology detection in capsule endoscopy videos, Comput Method Program Biomed (2014)

  • D.K. Iakovidis et al., Reduction of capsule endoscopy reading times by unsupervised image mining, Comput Med Imaging Gr (2010)

  • M. Komosinski et al., Evolutionary weighting of image features for diagnosing of CNS tumors, Artif Intell Med (2000)

  • L. Nanni et al., Local binary patterns variants as texture descriptors for medical image analysis, Artif Intell Med (2010)

  • N. Ejaz et al., Efficient visual attention based framework for extracting key frames from videos, Signal Process Image Commun (2013)

  • Q. Xu et al., Browsing and exploration of video sequences: a new scheme for key frame extraction and 3D visualization using entropy based Jensen divergence, Inf Sci (2014)

  • A. Jemal et al., Global cancer statistics, CA Cancer J Clin (2011)

  • Y. Cong et al., Towards scalable summarization of consumer videos via sparse dictionary selection, IEEE Trans Multimed (2012)

  • M.M. Yeung et al., Video visualization for compact presentation and fast browsing of pictorial content, IEEE Trans Circuit Syst Video Technol (1997)

  • A. Ekin et al., Automatic soccer video analysis and summarization, IEEE Trans Image Process (2003)

  • M. Yufei et al., A generic framework of user attention model and its application in video summarization, IEEE Trans Multimed (2005)

  • Z. Cernekova et al., Information theory-based shot cut/fade detection and video summarization, IEEE Trans Circuit Syst Video Technol (2006)

  • B.T. Truong et al., Video abstraction: a systematic review and classification, ACM Trans Multimed Comput Commun Appl (TOMCCAP) (2007)

  • F. Chen et al., Resource allocation for personalized video summarization, IEEE Trans Multimed (2013)

  • A. Hanjalic, Shot-boundary detection: unraveled and resolved?, IEEE Trans Circuit Syst Video Technol (2002)

  • M. Wang et al., Event driven web video summarization by tag localization and key-shot identification, IEEE Trans Multimed (2012)

  • B. Shahraray et al., Automatic generation of pictorial transcripts of video programs

  • C. Panagiotakis et al., Equivalent key frames selection based on iso-content principles, IEEE Trans Circuit Syst Video Technol (2009)

  • D. Liu et al., Within and between shot information utilisation in video key frame extraction, J Inf Knowl Manag (2011)

  • Y. Zhuang et al., Adaptive key frame extraction using unsupervised clustering
