Scalable gastroscopic video summarization via similar-inhibition dictionary selection
Introduction
The number of people suffering from stomach diseases is large and rising [1]. As an effective technique for directly visualizing the interior of the stomach, gastroscopy has been widely used in clinical examination, especially for the early detection of gastric cancer. A typical procedure lasts approximately 20 minutes and captures a video of approximately 15,000 frames. Visually inspecting such a large number of frames is challenging even for the most experienced clinicians. To make such a video archive easier to browse, a clinician manually records approximately 20–50 images during the examination for diagnosis and later generates a medical report. Nonetheless, this manual annotation has the following shortcomings:
- Because of the need to perform multiple tasks simultaneously, clinicians may miss information that is important for the final diagnosis.
- Due to a lack of experience, some junior clinicians cannot guarantee accuracy when analyzing massive data continuously; in particular, when the operation is not timely, they may select poor-quality images.
- Once the manual selection is complete, the number of selected frames is fixed, which cannot meet the needs of different scenarios and may increase the time cost of re-analysis.
In fact, the above process is a typical video summarization procedure, i.e., selecting the frames with the most important and meaningful semantic content from a full-length video sequence [2], [3], [4], [5]. Therefore, in this paper, we design a computer-aided gastroscopic video summarization algorithm to overcome these problems and assist clinicians in reviewing the abnormal content of the video more effectively. A computer-aided system based on our algorithm can be adopted in real applications, such as the training of young clinicians, computer-aided diagnosis or medical report generation.
For video summarization [6], [7], [8], [9], [10], most state-of-the-art methods focus on structured videos, such as sports, cartoon or surveillance videos. The automatic summarization of unstructured data, e.g., gastroscopic videos, is much more challenging. First, gastroscopic videos contain deformable, low-texture content, which makes semantic information harder to extract. Second, due to the complexity of the inner human cavity and the arbitrary movement of the camera, some gastroscopic images are of poor quality, which hampers accurate summarization. Finally, the objective of gastroscopic video summarization is diagnosis, so the result should highlight the suspected regions. Previous models, e.g., the group sparsity dictionary selection model [2] in our previous work, cannot handle these challenges well: for gastroscopic videos, the result does not cover all video content, and similar frames are frequently selected as key frames. We therefore design a new similar-inhibition dictionary selection model, adopting a similar-inhibition constraint to select elements with greater mutual diversity. Through this constraint, video structure information is taken into account to cover as much video content as possible, in contrast to traditional sparse dictionary selection models. Furthermore, we integrate an attention prior into the group sparsity term to narrow the gap between low-level features and high-level concepts. The main contributions of this paper are threefold:
- We design a new dictionary selection model with a similar-inhibition constraint, which reinforces the diversity of the selected subset.
- By taking the attention prior into account, we propose a scalable gastroscopic video summarization algorithm via similar-inhibition dictionary selection, which efficiently selects the key frames carrying the most semantic information.
- We collect and build a new gastroscopic video summarization dataset from 30 volunteers with approximately 432,000 frames, and annotate the ground truth for evaluation. To the best of our knowledge, this is the first gastroscopic video summarization dataset.
The rest of this paper is organized as follows. Section 2 discusses related works. Section 3 presents the formulation of the problem. Section 4 describes the implementation of our video summarization. Section 5 presents the experiments and comparisons. Finally, Section 6 concludes the paper.
Related works
The problem of video summarization has attracted significant attention, especially over the past few years, and [9], [11] propose detailed reviews of existing techniques. To capture the content changes in a video sequence, most existing approaches first segment the whole video into shots using shot detection methods and then select key frames from each shot [12], [13]. The simplest method is to select the first/middle/last frame of each shot as key frames. Shahraray and Gibbon [14] propose a
Similar-inhibition dictionary selection model
The purpose of video summarization is to select the most representative frames from the underlying video source that represent the video contents properly. In this paper, we formulate the problem of gastroscopic video summarization as a dictionary selection issue, i.e., to select an optimal subset from the original video frames via dictionary learning under various constraints. The video sequence can be represented as an initial dictionary (N is the number of frames and d
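The formulation is cut off above. As a rough sketch of the kind of objective involved (the notation below is our own illustration under the three terms named in this paper, not the paper's exact formulation), a group-sparsity dictionary selection with an added similar-inhibition penalty could take the form:

```latex
\min_{W \in \mathbb{R}^{N \times N}}
  \underbrace{\left\| X - X W \right\|_F^2}_{\text{reconstruction error}}
  \; + \; \lambda \underbrace{\sum_{i=1}^{N} a_i \left\| w^i \right\|_2}_{\text{group sparsity with attention prior}}
  \; + \; \gamma \underbrace{\sum_{i \neq j} S_{ij} \left\| w^i \right\|_2 \left\| w^j \right\|_2}_{\text{similar-inhibition}}
```

Here $X \in \mathbb{R}^{d \times N}$ stacks the $N$ frame features of dimension $d$, $w^i$ denotes the $i$-th row of $W$ (a frame enters the summary when its row is nonzero), $a_i$ is an assumed per-frame weight derived from the attention score, and $S_{ij}$ is an assumed frame-similarity matrix whose penalty discourages jointly selecting near-duplicate frames, which is the intuition behind the diversity of the selected subset.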
Implementation of our method
In this section, we provide the implementation of our gastroscopic video summarization; an overview of the framework is illustrated in Fig. 1. First, we evaluate gastroscopic image quality via a supervised framework and detect non-informative frames in the gastroscopic video sequence. Second, the video is efficiently segmented into shots based on dramatic changes between consecutive frames. Finally, with the help of our new similar-inhibition dictionary selection
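The shot segmentation step can be illustrated with a minimal sketch: thresholding the histogram difference between consecutive frames, a standard way to detect dramatic content changes. The function name, bin count and threshold below are our assumptions for illustration, not the paper's implementation.

```python
import numpy as np

def histogram_diff_shots(frames, bins=16, threshold=0.4):
    """Segment a frame sequence into shots by thresholding the
    normalized grayscale-histogram difference between consecutive
    frames. `frames` is a list of 2-D uint8 arrays; returns a list
    of (start, end) index pairs with `end` exclusive."""
    hists = []
    for f in frames:
        h, _ = np.histogram(f, bins=bins, range=(0, 256))
        hists.append(h / h.sum())  # normalize to a distribution
    boundaries = [0]
    for i in range(1, len(frames)):
        # L1 distance between consecutive histograms; a large jump
        # indicates a dramatic content change, i.e., a shot boundary
        if np.abs(hists[i] - hists[i - 1]).sum() > threshold:
            boundaries.append(i)
    boundaries.append(len(frames))
    return [(boundaries[j], boundaries[j + 1])
            for j in range(len(boundaries) - 1)]
```

Key frames are then selected per shot, so an over-segmentation here only costs efficiency, not coverage.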
Experiments
In this section, we build a new gastroscopic video summarization dataset and validate our method by comparing it with the state-of-the-art. Since video summarization algorithms can be roughly divided into three categories [62], i.e., sequential algorithms, clustering-based algorithms and optimization-based algorithms, we select several algorithms from each category for a fair comparison, such as evenly spaced key frames (ESKF) [63], the k-means-based method [20], k-medoids-based
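Of the baselines compared, evenly spaced key frames (ESKF) is simple enough to sketch: it ignores content entirely and spreads a fixed budget of key frames uniformly over the video. The function below is our hedged illustration of that baseline (name and signature are ours).

```python
import numpy as np

def evenly_spaced_key_frames(n_frames, n_keys):
    """ESKF baseline: pick `n_keys` frame indices spread uniformly
    over [0, n_frames - 1], independent of frame content."""
    if n_keys >= n_frames:
        return list(range(n_frames))
    # np.linspace spaces the picks evenly, including both endpoints;
    # spacing exceeds 1 here, so rounded indices never collide
    return [int(round(p)) for p in np.linspace(0, n_frames - 1, n_keys)]
```

For a 15,000-frame gastroscopic video and a 30-frame budget, this picks roughly every 500th frame, which shows why content-blind baselines miss short abnormal segments.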
Conclusions
To better navigate gastroscopic video content for diagnosis and future research, a new scheme of gastroscopic video summarization has been proposed in this paper. By representing each video frame as a feature vector, we convert the gastroscopic video summarization problem into a sparse dictionary selection problem under three terms, namely, reconstruction error, group sparsity and similar-inhibition. Moreover, we compute an attention score by merging two cues, i.e., gaze and content change, and
Acknowledgements
This work is supported by the National Science and Technology Support Program (2012BAI14B03), NSFC (61105013, 61375014, 61533015) and also the Foundation of Chinese Scholarship Council.
References (65)
- Wevos-visom: an ensemble summarization algorithm for enhanced data visualization. Neurocomputing (2012).
- Topic aspect-oriented summarization via group selection. Neurocomputing (2015).
- Video summarisation: a conceptual framework and survey of the state of the art. J Vis Commun Image Represent (2008).
- Iterative key frame selection in the rate-constraint environment. Signal Process Image Commun (2003).
- Video abstraction based on the visual attention model and online clustering. Signal Process Image Commun (2013).
- Image collection summarization via dictionary learning for sparse representation. Pattern Recogn (2013).
- A decision support system to facilitate management of patients with acute gastrointestinal bleeding. Artif Intell Med (2008).
- Computer-aided small bowel tumor detection for capsule endoscopy. Artif Intell Med (2011).
- Scattering features for lung cancer detection in fibered confocal fluorescence microscopy images. Artif Intell Med (2014).
- Graph based construction of textured large field of view mosaics for bladder cancer diagnosis. Pattern Recogn (2012).
- Computer-assisted pit-pattern classification in different wavelet domains for supporting dignity assessment of colonic polyps. Pattern Recogn.
- Image partitioning and illumination in image-based pose detection for teleoperated flexible endoscopes. Artif Intell Med.
- Texture and color based image segmentation and pathology detection in capsule endoscopy videos. Comput Method Program Biomed.
- Reduction of capsule endoscopy reading times by unsupervised image mining. Comput Med Imaging Gr.
- Evolutionary weighting of image features for diagnosing of CNS tumors. Artif Intell Med.
- Local binary patterns variants as texture descriptors for medical image analysis. Artif Intell Med.
- Efficient visual attention based framework for extracting key frames from videos. Signal Process Image Commun.
- Browsing and exploration of video sequences: a new scheme for key frame extraction and 3d visualization using entropy based jensen divergence. Inf Sci.
- Global cancer statistics. CA Cancer J Clin.
- Towards scalable summarization of consumer videos via sparse dictionary selection. IEEE Trans Multimed.
- Video visualization for compact presentation and fast browsing of pictorial content. IEEE Trans Circuit Syst Video Technol.
- Automatic soccer video analysis and summarization. IEEE Trans Image Process.
- A generic framework of user attention model and its application in video summarization. IEEE Trans Multimed.
- Information theory-based shot cut/fade detection and video summarization. IEEE Trans Circuit Syst Video Technol.
- Video abstraction: a systematic review and classification. ACM Trans Multimed Comput Commun Appl (TOMCCAP).
- Resource allocation for personalized video summarization. IEEE Trans Multimed.
- Shot-boundary detection: unraveled and resolved? IEEE Trans Circuit Syst Video Technol.
- Event driven web video summarization by tag localization and key-shot identification. IEEE Trans Multimed.
- Automatic generation of pictorial transcripts of video programs.
- Equivalent key frames selection based on iso-content principles. IEEE Trans Circuit Syst Video Technol.
- Within and between shot information utilisation in video key frame extraction. J Inf Knowl Manag.
- Adaptive key frame extraction using unsupervised clustering.