Event patches: Mining effective parts for event detection and understanding
Introduction
Event detection, which targets complex events such as “Birthday party” among numerous long video sequences, has recently attracted growing interest from both academia and industry [1], [2], [3], [4], [5]. It remains a challenging video analysis task due to the tremendous intra-class variation of events. Landing a fish, for example, can be done in different scenes with different fishing tools.
Recently, deep neural networks (DNNs), especially convolutional neural networks (CNNs) [6], [7], have demonstrated remarkable power in learning feature representations, leading to record-breaking improvements on almost all computer vision tasks, e.g. image classification [7], [8], [9], [10], object detection [11], [12], [13], [14], saliency detection [15], [16], [17], [18], visual tracking [19], [20], semantic segmentation [21], [22], [23], and super-resolution [24], [25]. Among these tasks, some treat the CNN as a black-box feature extractor and use the deep features directly, while others conduct in-depth studies of the properties of CNN features pre-trained offline on massive image data for the ImageNet classification task. Karpathy et al. [26] detected objects in every image with Region Convolutional Neural Networks (RCNN) and utilized a 4096-dimensional activation of the fully connected layer as the image representation; their visual-semantic alignment model produces state-of-the-art results. In [19], the properties of CNN features are studied in depth, which helps the design of effective CNN-based visual trackers.
However, previous approaches [27], [28], [29], [30], [31], [32] in video and event analysis suffered from huge computational costs in their multiple feature extraction and classification processes. Instead, Xu et al. [33] proposed a discriminative CNN video representation for event detection: they represented each video with a global VLAD [34], [35] encoding of CNN descriptors, using a set of latent concept descriptors as frame descriptors, and achieved a new state-of-the-art performance in event detection. Despite these successes, Xu’s method samples all video frames at the same rate to build global video representations, without considering that motions in a video may vary greatly in speed and semantic importance. Research on human activity prediction [36], an active topic that predicts human activities from only partially observed videos, supports this view. Because videos for event detection differ in many aspects, such as length, size, situation, and viewpoint, the information carried by video parts is patchy [37]: only some parts are effective for detection, while others may be redundant or even noisy. A multirate sampling solution [38] was later proposed to account for variation in content motion speed. However, it remains unclear whether a given video part is important or redundant, i.e., whether or not it contributes to defining the event. The authors of [39] proposed a temporal segment network for action recognition, which divides each video into three temporal segments, pools snippets from them, and models the long-range temporal structure of action videos. Their work inspires us to exploit the semantic and temporal information of video segments for better event understanding.
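The global VLAD encoding of CNN descriptors that the above line of work builds on can be illustrated with a minimal NumPy sketch. This assumes pre-computed frame descriptors and k-means centers; the function name `vlad_encode` and the signed square-root plus L2 normalization follow common VLAD practice, not necessarily the exact variant used in [33].

```python
import numpy as np

def vlad_encode(descriptors, centers):
    """VLAD: for each cluster center, accumulate the residuals of the
    descriptors hard-assigned to it, then flatten, apply signed
    square-root (power) normalization, and L2-normalize."""
    n_centers, dim = centers.shape
    # hard-assign each descriptor to its nearest center
    dists = np.linalg.norm(descriptors[:, None, :] - centers[None, :, :], axis=2)
    assignments = np.argmin(dists, axis=1)
    vlad = np.zeros((n_centers, dim))
    for k in range(n_centers):
        assigned = descriptors[assignments == k]
        if len(assigned):
            vlad[k] = (assigned - centers[k]).sum(axis=0)
    vlad = vlad.ravel()                              # (n_centers * dim,)
    vlad = np.sign(vlad) * np.sqrt(np.abs(vlad))     # power normalization
    norm = np.linalg.norm(vlad)
    return vlad / norm if norm > 0 else vlad
```

The resulting vector has dimension K × D (e.g. 256 × 256 with the settings of [33]), which is why PCA is typically applied to the descriptors beforehand.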
In this work, we strive to mine the effective parts that contribute most to defining events and utilize them for event detection. We name these parts Event Patches. Our approach aims to mine appropriate video representations and use them for better event detection. Concretely, the contributions of this work are threefold:
- We propose the idea that only parts of a video are effective for event detection and strive to mine these definitive Event Patches. Unlike previous event detection methods that sample video frames uniformly for global video representations, without considering that some video parts might be redundant or noisy, we argue that the event patches mined by our method should play a more important role in the task.
- We propose to use deep features and VLAD encoding for a concatenated video segment representation, and then introduce a Sparse Coding based Video Patches Mining (SVPM) method for event patch mining and event detection. For each event, its event patches often relate to one or several common visual concepts. Despite the high intra-class variation of event videos, our method successfully obtains reasonable “epitomes” of events.
- We validate the mined Event Patches on event detection tasks. Compared with the previous CNN video representation method, our approach achieves an impressive improvement.
The remainder of this paper is organized as follows. Section 2 reviews related work. We then introduce the proposed video segment representation and our sparse-coding-based event patch mining method in Section 3. Section 4 presents the datasets and experimental results. Finally, Section 5 concludes the paper.
Section snippets
Related work
Event detection. Based on the difficulty, event detection can be categorized into simple event detection and multimedia event detection (MED).
Simple event detection includes the detection of news events [40], sports events [41], and unusual surveillance events or those with repetitive patterns [42], [43]. Compared with MED videos, these events are usually well defined and describable by short video sequences.
Multimedia event detection was first introduced in the TRECVID competition by NIST for
Method
We follow the work of Xu et al. [33] in using CNN representations of videos for event detection, which has been demonstrated to be the best single feature for the task. In their work, detection performance is best when the PCA dimension is 256 and the number of VLAD centers K is set to 256. To enable comparison with their method, we follow their protocol of sampling every fifth frame and use the same settings for PCA and K. Also, as their work used global representations for
Experiments
To evaluate the effectiveness of the proposed method, we conduct experiments on the TRECVID MED 2011 dataset.
Conclusions
In this paper, we mined the effective parts that contribute most to defining events and utilized them for event detection. To this end, we introduced the idea of Event Patches: only parts of a video are effective for event detection. We first divided videos into segments and proposed a concatenated video segment representation (CVSeg) based on deep features and VLAD encoding. Event detection performances of our CVSeg method and the previous CNN video representation method are
Acknowledgments
This work was supported by the National Natural Science Foundation of China (Nos. 61472103, 61772158, 61701273 and 61702136) and the Project Funded by China Postdoctoral Science Foundation (No. 2017M610897).
References (52)
- et al., Multimedia social event detection in microblog, MultiMedia Modeling, 2015.
- et al., ImageNet large scale visual recognition challenge, Int. J. Comput. Vis., 2015.
- et al., A large-scale benchmark dataset for event recognition in surveillance video, IEEE Conference on Computer Vision and Pattern Recognition, 2011.
- et al., Machine recognition of human activities: a survey, IEEE Trans. Circuits Syst. Video Technol., 2008.
- et al., Real-time multimedia social event detection in microblog, IEEE Trans. Cybern., 2017.
- et al., Gradient-based learning applied to document recognition, Proc. IEEE, 1998.
- et al., Deep residual learning for image recognition, IEEE Conference on Computer Vision and Pattern Recognition, 2016.
- et al., ImageNet classification with deep convolutional neural networks, Advances in Neural Information Processing Systems, 2012.
- et al., Very deep convolutional networks for large-scale image recognition, CoRR, 2014.
- et al., Continuous probability distribution prediction of image emotions via multitask shared sparse regression, IEEE Trans. Multimed., 2017.