Elsevier

Image and Vision Computing

Volume 42, October 2015, Pages 13-21

Complex event recognition using constrained low-rank representation

https://doi.org/10.1016/j.imavis.2015.06.007

Highlights

  • We propose a novel low-rank model for complex event representation.

  • Semantic cues are induced in our model by constraining it to follow human annotation.

  • We report extensive experiments on TRECVID MED 11 and MED 12.

  • We compare our method with seven recent methods and achieve state-of-the-art results.

Abstract

Complex event recognition is the problem of recognizing events in long and unconstrained videos. In this extremely challenging task, concept-based approaches have recently shown a promising direction: core low-level events (referred to as concepts) are annotated and modeled using a portion of the training data, and each complex event is then described using concept scores, which are features representing the occurrence confidence of the concepts in the event. However, because of the complex nature of the videos, both the concept models and the corresponding concept scores are significantly noisy. To address this problem, we propose a novel low-rank formulation, which combines the precisely annotated videos used to train the concepts with the rich concept scores. Our approach finds a new representation for each event that is not only low-rank, but also constrained to adhere to the concept annotation, thus suppressing the noise and maintaining a consistent occurrence of the concepts in each event. Extensive experiments on the large-scale, real-world TRECVID Multimedia Event Detection (MED) 2011 and 2012 datasets demonstrate that our approach consistently improves the discriminability of the concept scores by a significant margin.

Introduction

The increasing popularity of digital cameras has driven tremendous growth in social media websites such as YouTube. With the growing number of user-uploaded videos, the need to automatically detect and recognize the activities occurring in these videos has become crucial. However, in such unconstrained videos, automatic content understanding is a very challenging task due to the large intra-class variation, dynamic and heterogeneous backgrounds, and different capturing conditions. Therefore, this problem has recently gained significant attention.

Most activity recognition methods are developed for constrained and short videos (5–10 s), as in [3], [4], [5], [6]. These videos contain simple and well-defined human actions such as waving, running, and jumping. In contrast, in this paper we consider more practical videos with realistic events, complicated contents, and significantly variable lengths (refer to Fig. 2). Standard activity recognition methods do not incorporate the evidence needed to detect a particular action/event when dealing with such unconstrained videos. To this end, the most recent approaches resort to using low-level events, called “concepts”, as an intermediate representation [2], [1]. In this representation, a complex event is described using concept scores, i.e., the occurrence confidences of the concepts in the video. For example, the event Birthday Party can be described as the occurrence of singing, laughing, blowing candles, jumping, etc.
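
To make this concrete, a concept-based representation of a single video is simply a fixed-length vector of detector confidences, one per concept. The minimal sketch below (with hypothetical concept names and score values, not taken from the paper) shows what such a descriptor might look like for a Birthday Party clip.

    # Hypothetical concept-score descriptor for one "Birthday Party" video:
    # each entry is the occurrence confidence of one predefined concept.
    concept_names = ["singing", "laughing", "blowing_candles",
                     "jumping", "person_dancing", "person_falling"]
    concept_scores = [0.82, 0.64, 0.91, 0.12, 0.47, 0.05]

    descriptor = dict(zip(concept_names, concept_scores))
    print(descriptor)  # the event is described by these confidences, not by raw pixels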

In the context of concept-based event representation, substantial consequences arise from the complex nature of these unconstrained videos. First, the examples used to train each concept have significant variations, and thus the resulting concept models are noisy. Second, the concept content may still vary among the samples of each event, mainly because of the variable temporal lengths and capturing conditions; therefore, the concept scores used to describe each event are also significantly noisy. Third, the automatically obtained concept representation relies solely on local visual features and lacks the context and semantic cues that humans naturally infer (refer to Fig. 1). In this paper, we address these consequences using a novel low-rank formulation, which combines the precisely annotated videos used to train the concepts with the rich concept scores. Our approach is based on two principles. First, the videos of the same event should share similar concepts, and thus should have consistent responses to the concept detectors; therefore, the matrix containing the concept scores of each event must be of low rank. Second, since the videos used to train the concept models were manually and carefully annotated, the resulting low-rank matrix should also follow the annotation. For example, concepts like person falling or person flipping may falsely fire in the Birthday Party event, while concepts like person opening present or person dancing may fail to fire in some Birthday Party videos where they actually occur. Enforcing our constraints avoids such failures.

Fig. 3 summarizes the steps involved in our method. We split the training data into two sets: (1) the event-level annotated data, which has only event labels, and (2) the concept-level annotated data, which has both event-level and concept-level labels. We use the concept-level annotated data to train concept detectors, which we run on the event-level annotated data to obtain its concept scores. We then stack the concept scores for each event in a matrix and find their low-rank representation such that it also follows the basis of the concept annotation. The resulting training data combines the two sets into one rich and consistent training set.
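
As a rough sketch of this pipeline, assuming per-video features have already been extracted and using off-the-shelf linear SVMs as stand-ins for the concept detectors (the synthetic data and helper choices below are illustrative, not the authors' exact implementation), the per-event concept-score matrices can be assembled as follows.

    import numpy as np
    from sklearn.svm import LinearSVC

    rng = np.random.default_rng(0)
    n_concepts, n_events, feat_dim = 93, 15, 128  # 93 concepts as in the paper; the rest is illustrative

    # Set (2): concept-level annotated videos (features + concept labels), 6 clips per concept.
    z_labels = np.repeat(np.arange(n_concepts), 6)
    z_feats = rng.normal(size=(z_labels.size, feat_dim))

    # Train one-vs-rest concept detectors on the concept-annotated subset.
    detectors = [LinearSVC(max_iter=5000).fit(z_feats, (z_labels == c).astype(int))
                 for c in range(n_concepts)]

    # Set (1): event-level annotated videos (features + event labels only).
    m_labels = rng.integers(0, n_events, size=200)
    m_feats = rng.normal(size=(200, feat_dim))

    # Concept scores: signed distance of each video to each detector's hyperplane.
    scores = np.stack([d.decision_function(m_feats) for d in detectors], axis=1)  # (200, 93)

    # Stack the scores of each event's videos into its own matrix M_i; these are the
    # matrices whose constrained low-rank representation is computed next.
    event_matrices = {e: scores[m_labels == e] for e in range(n_events)}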

The low-rank constraint has been widely employed in different computer vision problems such as tracking [7], feature fusion [8], face recognition [9], and saliency detection [10]. However, to the best of our knowledge, low-rank estimation of concept scores has not been explored before. More importantly, our formulation is more general than standard RPCA [11] in that we allow the estimated low-rank matrix to follow a prior pattern (the annotation in our scenario). Moreover, since we exploit the low-rank constraint, our method is more robust against noisy concepts and cluttered backgrounds than [2], [1], [12], and significantly outperforms the state of the art, as we demonstrate in the experiments.

The main contribution of this paper is a novel low-rank formulation, through which we find a new representation for each event that is not only low-rank, but also constrained by the concept annotation, thus suppressing the noise and maintaining a consistent occurrence of the concepts in each event. Our constrained low-rank representation is not restricted to a certain type of features, which allows us to employ a combination of state-of-the-art features, including STIP [6], DTF-MBH [3], and DTF-HOG [3].

The rest of the paper is organized as follows: Section 2 reviews the related work. Section 3 describes the process of computing the constrained low-rank event representation. In Section 4, we describe how to find the optimal solution for the low-rank optimization. The experimental results are presented in Section 5. Finally, Section 6 concludes the paper.

Section snippets

Related work

Compared to the traditional action recognition, complex event recognition is more challenging, mainly because the complex events have significantly longer lengths and diverse contents. Early methods for complex event recognition used low-level features such as SIFT [13], MBH [3], and MFCC [6], and showed promising results as in [14], [15], [16]. Additionally, pooling of these low-level features was proposed in [17], where features such as SIFT and color were fused in order to improve the

Low-rank complex events

Given training event samples $X = \{x_k\}$ with event labels $Y = \{y_k\}$, we manually annotate a portion of the training data with 93 predefined low-level events, which occur frequently. These low-level events are called concepts, and they are similar to the concepts used in [1]. Compared to [1], which used 62 action concepts, we selected more concepts in order to cover the events in MED 12 as well. This generates two training subsets: M, which has only event-level annotation (the labels), and Z, which has both
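
One plausible reading of the annotation constraint used later is that each event i gets a concept-space basis U_i computed from its manually annotated subset Z_i, onto which the automatically obtained scores are projected. The sketch below makes that assumption explicit (the top-r right singular vectors of the binary annotation matrix); the paper's exact construction may differ.

    import numpy as np

    def annotation_basis(Z_i, r=10):
        """Concept-space basis for one event from its manually annotated videos.

        Z_i : (n_annotated_videos, n_concepts) binary concept-annotation matrix.
        Returns U_i of shape (n_concepts, r): top-r right singular vectors, spanning
        the concept patterns marked by the annotators (an illustrative assumption).
        """
        _, _, vt = np.linalg.svd(Z_i, full_matrices=False)
        r = min(r, vt.shape[0])
        return vt[:r].T

    # Hypothetical usage: 40 annotated clips of one event, 93 concepts.
    rng = np.random.default_rng(1)
    Z_i = (rng.random((40, 93)) > 0.8).astype(float)
    U_i = annotation_basis(Z_i, r=10)
    print(U_i.shape)  # (93, 10); M_i @ U_i @ U_i.T projects concept scores onto this basis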

Optimizing the constrained low-rank problem

Our method decomposes the matrix containing the examples of an event by extracting the noise such that the resulting matrix is both low-rank and follows the concept annotation. This is achieved using Eqs. (1), (2) as discussed in the previous section. When solving Eq. (1), it is convenient to consider the Lagrange form of the problem:

$$\min_{A_i, E_i} \operatorname{Rank}(A_i) + \lambda \lVert E_i \rVert_0 + \frac{\tau}{2} \lVert A_i - M_i U_i U_i^T \rVert_F^2 \quad \text{s.t.} \quad M_i = A_i + E_i, \tag{3}$$

where $\lambda$ and $\tau$ are the weighting parameters. The optimization of Eq. (3) is not directly tractable since the matrix
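
Since the rank and the $\ell_0$ norm make Eq. (3) intractable directly, a standard route is to relax them to the nuclear norm and the $\ell_1$ norm and solve the relaxation with an inexact augmented Lagrange multiplier (ALM) scheme. The sketch below is a minimal NumPy illustration of that generic relaxation, assuming the score matrix M_i and the annotation target B_i = M_i U_i U_i^T are given; it is not the authors' exact solver.

    import numpy as np

    def svt(X, t):
        """Singular value thresholding: proximal operator of t * nuclear norm."""
        U, s, Vt = np.linalg.svd(X, full_matrices=False)
        return (U * np.maximum(s - t, 0.0)) @ Vt

    def shrink(X, t):
        """Soft thresholding: proximal operator of t * l1 norm."""
        return np.sign(X) * np.maximum(np.abs(X) - t, 0.0)

    def constrained_low_rank(M, B, lam=0.1, tau=1.0, mu=1.0, rho=1.5, iters=200):
        """Inexact ALM for  min ||A||_* + lam*||E||_1 + (tau/2)*||A - B||_F^2
                            s.t.  M = A + E,
        a convex relaxation of Eq. (3); B stands for the annotation target M U U^T."""
        A = np.zeros_like(M)
        E = np.zeros_like(M)
        Y = np.zeros_like(M)                      # Lagrange multiplier
        for _ in range(iters):
            # A-step: merge the annotation and penalty quadratics, then threshold singular values.
            C = M - E + Y / mu
            A = svt((tau * B + mu * C) / (tau + mu), 1.0 / (tau + mu))
            # E-step: the sparse term absorbs what the low-rank part cannot explain.
            E = shrink(M - A + Y / mu, lam / mu)
            # Dual ascent and penalty increase.
            Y = Y + mu * (M - A - E)
            mu = min(mu * rho, 1e6)
        return A, E

    # Hypothetical usage with the quantities sketched earlier:
    # A_i, E_i = constrained_low_rank(M_i, M_i @ U_i @ U_i.T)

In this reading, the recovered A_i replaces the raw score matrix M_i as the training representation of event i, while E_i absorbs the sparse noise.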

Experiments

We extensively evaluated our method on a subset of the most challenging multimedia event datasets, TRECVID MED 2011 and 2012, which exhibit a wide range of challenges including camera motion, cluttered background, and illumination changes. Inarguably, one of the biggest challenges in these datasets is the significantly varying video length, which ranges from 30 s to over 30 min. The frame rate also ranges from 12 to 30 fps, and the resolution ranges from 320 × 480 to 1280 × 2000. We report our

Conclusion

We presented a novel, simple, and easily implementable method for complex event recognition. We first divide the training data into two sets: one in which we annotate the concepts manually, and another in which we detect the concepts automatically with models trained using the first set. We then exploit the inherent low-rank structure in the examples of an event, and combine the two training sets into one set that is not only low-rank but also encouraged to follow the annotation. Thus,

Acknowledgment

This work was supported by the Intelligence Advanced Research Projects Activity (IARPA) via the Department of Interior National Business Center contract number D11PC20066. The U.S. Government is authorized to reproduce and distribute reprints for Governmental purposes notwithstanding any copyright annotation thereon. Disclaimer: The views and conclusions contained herein are those of the authors and should not be interpreted as necessarily representing the official policies or endorsements,

References (38)

  • H. Izadinia et al., Recognizing complex events using large margin joint low-level event model, ECCV (2012)
  • Y. Yang et al., Complex events detection using data-driven concepts, ECCV (2012)
  • H. Wang et al., Action recognition by dense trajectories
  • X. Liu et al., Automatic concept detector refinement for large-scale video semantic annotation
  • P. Dollar et al., Behavior recognition via sparse spatio-temporal features
  • I. Laptev et al., Space–time interest points
  • T. Zhang et al., Low-rank sparse learning for robust visual tracking, ECCV (2012)
  • G. Ye et al., Robust late fusion with multi-task low rank minimization, CVPR (2012)
  • C. Chen et al., Low-rank matrix recovery with structural incoherence for robust face recognition, CVPR (2011)
  • X. Shen et al., A unified approach to salient object detection via low rank matrix recovery, CVPR (2012)
  • E.J. Candes et al., Robust principal component analysis?
  • A. Tamrakar et al., Evaluation of low-level features and their combinations for complex event detection in open source videos
  • D.G. Lowe, Distinctive image features from scale-invariant keypoints
  • C. Schuldt et al., Recognizing human actions: a local SVM approach
  • J. Liu et al., Recognizing realistic actions from videos in the wild
  • H. Wang et al., Evaluation of local spatio-temporal features for action recognition
  • L. Cao et al., Scene aligned pooling for complex video recognition
  • A. Loui et al., Kodak's consumer video benchmark data set: concept definition and annotation
  • T. Althoff et al., Detection bank: an object detection based video representation for multimedia event recognition, ACM MM (2012)