Large scale continuous visual event recognition using max-margin Hough transformation framework

https://doi.org/10.1016/j.cviu.2012.11.008

Abstract

In this paper we propose a novel method for continuous visual event recognition (CVER) on a large scale video dataset using a max-margin Hough transformation framework. Due to the large scale of the data, the diversity of real environmental conditions and the wide scene variability, direct application of action recognition/detection methods, such as spatio-temporal interest point (STIP) based local feature techniques, to the whole dataset is practically infeasible. To address this problem, we apply a motion region extraction technique, based on motion segmentation and region clustering, to identify possible candidate “events of interest” as a preprocessing step. On these candidate regions a STIP detector is applied and local motion features are computed. For activity representation we use a generalized Hough transform framework where each feature point casts a weighted vote for a possible activity class centre. A max-margin framework is applied to learn the weights of the feature codebook. For activity detection, peaks in the Hough voting space are taken into account and an initial event hypothesis is generated using the spatio-temporal information of the participating STIPs. For event recognition a verification Support Vector Machine is used. An extensive evaluation on a benchmark large scale video surveillance dataset (VIRAT) as well as on a small scale benchmark dataset (MSR) shows that the proposed method is applicable to a wide range of continuous visual event recognition applications under extremely challenging conditions.

Highlights

► In this paper we address activity detection in large scale video datasets.
► A novel region extraction method is applied to reduce the initial action search space.
► A max-margin Hough transformation framework is used for activity detection.
► A verification SVM is applied to obtain the final score of each detected event hypothesis.
► State-of-the-art results are reported on both large and small scale benchmark datasets.

Introduction

Visual event recognition, i.e. the recognition of semantic spatio-temporal visual patterns such as “waving”, “boxing”, “getting into a vehicle” and “running”, is a fundamental computer vision problem. An enormous amount of work on this topic can be found in the literature surveys [1], [2], [3], [4]. Recently, research in this field has been moving towards continuous visual event recognition (CVER), where the goal is both to recognize an event and to localize the corresponding space–time volume in long continuous video [5], analogous to object detection in images where only the spatial location matters. This setting is more closely related to real world video surveillance analytics needs than current research that aims to classify a prerecorded video clip of a single event. Accurate CVER would have a direct and far-reaching impact on surveillance, video-guided human behaviour analysis, assistive technology and video archive analysis.

The task of CVER, i.e. activity detection on a large scale real world video surveillance dataset, is extremely challenging, and current state-of-the-art methods for 2D small scale action recognition become infeasible to apply. One of the main challenges for CVER is scalability; e.g. a CVER dataset like the VIRAT dataset [5] contains 23 event types distributed throughout 29 h of video. The other difficulties are due to (i) natural appearance, since the events are recorded in a real world scenario, (ii) huge spatial and temporal coverage, which affects the effective video resolution, e.g. the human heights within the videos range from 25 to 200 pixels, constituting 2.4–20% of the heights of the recorded videos, with an average of about 7%, (iii) diverse event types and (iv) huge variability in view-points, scenes and subjects (see Fig. 1) [5].

Among all the above mentioned difficulties, action detection in video (both small and large scale) is challenging mainly due to the scalability of its search space. Without knowing the location, temporal duration and spatial scale (spatial resolution of the activity) of the action, an exhaustive search is computationally intractable. For example, a 1 min video sequence of size 160 × 120 × 1800 contains more than 10^14 sub-volumes of various sizes and locations [6]. To solve this issue there are methods such as discriminative sub-volume search [6] and unsupervised random forest indexing [7]. Although promising, these works always use small scale video datasets like the KTH and MSR action datasets [6], where the challenges present in CVER, mentioned above, are absent.
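As a quick check of this order of magnitude, the number of axis-aligned sub-volumes of a W × H × T grid is the product of the number of intervals along each axis. The short Python sketch below is an illustrative calculation of this count (it is not code from the paper) and reproduces the figure for a 160 × 120 × 1800 volume.

    # Count axis-aligned sub-volumes of a W x H x T spatio-temporal grid.
    # An axis of length n admits n * (n + 1) / 2 discrete intervals, so the
    # total number of sub-volumes is the product over the three axes.

    def num_intervals(n: int) -> int:
        return n * (n + 1) // 2

    def num_subvolumes(w: int, h: int, t: int) -> int:
        return num_intervals(w) * num_intervals(h) * num_intervals(t)

    print(f"{num_subvolumes(160, 120, 1800):.3e}")  # ~1.516e+14, i.e. more than 10^14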

To reduce the search space complexity of CVER, it is necessary to apply a motion region identifier that roughly detects the motion regions of interest where the events to be searched may appear. Oh et al. [5] apply multi-object tracking based on frame differencing as a preprocessing step, and the obtained tracks are divided into detection units, resulting in over 20 K units. Dividing tracks into detection units of a fixed length inevitably misses some events of different durations. In our approach, on the other hand, a motion segmentation method similar to [8] is first applied to obtain the primary candidate region set. The obtained regions are further joined using a region clustering technique based on action heuristics. Finally, we obtain on average about 3 K candidate regions, as opposed to the 20 K of [5], with a higher recall rate. This has a major impact on search space reduction and on achieving faster event detection at large scale.
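The exact region extraction pipeline is detailed later in the paper; purely to illustrate the general idea of frame differencing, connected-component extraction and merging of nearby regions, a minimal Python sketch could look as follows. The function names, thresholds and the greedy merging heuristic are assumptions made for illustration, not the authors' implementation.

    import numpy as np
    from scipy import ndimage

    def candidate_motion_regions(prev_frame, frame, diff_thresh=25, min_area=100):
        """Rough motion-region proposal: threshold the frame difference and
        return bounding boxes (x0, y0, x1, y1) of connected components."""
        diff = np.abs(frame.astype(np.int16) - prev_frame.astype(np.int16))
        mask = diff.max(axis=-1) > diff_thresh if diff.ndim == 3 else diff > diff_thresh
        labels, _ = ndimage.label(mask)
        boxes = []
        for sl in ndimage.find_objects(labels):
            ys, xs = sl
            if (ys.stop - ys.start) * (xs.stop - xs.start) >= min_area:
                boxes.append((xs.start, ys.start, xs.stop, ys.stop))
        return boxes

    def merge_close_boxes(boxes, gap=20):
        """Greedily merge boxes whose gap-expanded extents overlap, mimicking
        the idea of clustering fragments that belong to the same activity."""
        boxes, merged = [list(b) for b in boxes], True
        while merged:
            merged, out = False, []
            while boxes:
                b = boxes.pop()
                for o in out:
                    if (b[0] - gap < o[2] and o[0] - gap < b[2] and
                            b[1] - gap < o[3] and o[1] - gap < b[3]):
                        o[0], o[1] = min(o[0], b[0]), min(o[1], b[1])
                        o[2], o[3] = max(o[2], b[2]), max(o[3], b[3])
                        merged = True
                        break
                else:
                    out.append(b)
            boxes = out
        return [tuple(b) for b in boxes]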

Our method for event detection is related to several ideas recurring in the literature. First, we use a STIP detector, which has been successfully applied to 2D action recognition problems [9], [10], [11], [12], [13]. Several local features, such as the histogram of oriented optical flow (HOF) [14], the histogram of oriented gradients in 3D (HOG3D) [15] and extended SURF (ESURF) [16], are computed at the detected STIPs. We use the idea of a local appearance codebook [17], following the bag-of-words approach [18], [19], to group the detected features into a set of visual words that represent an event class.
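For concreteness, such a bag-of-words codebook is typically built by clustering the pooled local descriptors with k-means and then assigning new descriptors to their nearest cluster centre. The sketch below uses scikit-learn with an arbitrarily chosen codebook size and is only meant to illustrate this standard step, not the paper's exact settings.

    import numpy as np
    from sklearn.cluster import KMeans

    def build_codebook(descriptors, n_words=500, seed=0):
        """Cluster pooled local descriptors (N x D array) into visual words."""
        return KMeans(n_clusters=n_words, random_state=seed, n_init=10).fit(descriptors)

    def quantize(codebook, descriptors):
        """Assign each descriptor to the index of its nearest visual word."""
        return codebook.predict(descriptors)

    # Example usage: the descriptors could be concatenated HOF/HOG3D/ESURF
    # vectors computed at detected STIPs, stacked over all training events.
    # codebook = build_codebook(np.vstack(all_training_descriptors))
    # word_ids = quantize(codebook, test_event_descriptors)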

The next idea is to carry the generalized Hough transform (GHT) framework from object detection in images over to event detection in videos. Originally developed for detecting straight lines [20], the Hough transform was later generalized to detect generic parametric shapes [21]. Recently, the GHT scheme has been successfully used for detecting object class instances, for tracking and for action recognition [22], [23], [24], [25], [26], [27].

The concept of GHT usually refers to any detection process based on an additive aggregation of evidence, the Hough votes, coming from local image/video elements. Such aggregation is performed in a parametric space, called the Hough space, where each point corresponds to the existence of an instance in a particular configuration. The Hough space may be a product set of different locations, scales, aspects, etc. The detection process is then reduced to finding peaks in the sum of all Hough votes over the Hough space, where the location of each peak gives the configuration of a particular detected object/event instance.
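In code, this additive aggregation amounts to scattering weighted votes into a discretized accumulator and reading off its maxima. The following sketch uses an arbitrary three-dimensional discretization of the Hough space purely to illustrate the mechanics; it is not the paper's parameterization.

    import numpy as np

    def hough_accumulate(votes, weights, shape):
        """Sum weighted votes into a discretized Hough accumulator.

        votes   : (N, 3) integer bin indices (x, y, t) of hypothesized centres
        weights : (N,) vote strengths
        shape   : accumulator dimensions, e.g. (nx, ny, nt)
        """
        acc = np.zeros(shape)
        np.add.at(acc, (votes[:, 0], votes[:, 1], votes[:, 2]), weights)
        return acc

    def hough_peaks(acc, n_peaks=3):
        """Return the n strongest accumulator cells as candidate configurations."""
        order = np.argsort(acc, axis=None)[::-1][:n_peaks]
        return [np.unravel_index(i, acc.shape) for i in order]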

The implicit shape model of Leibe et al. [23] and the max-margin Hough transform of Maji and Malik [25] serve as the baseline for our work. These works mainly focus on object detection. During training, they augment each visual word in the codebook with the spatial distribution of the displacements between the object centre and the respective visual word location. The weight of each visual word is learned in a max-margin setup. At detection time, these spatial distributions are converted into Hough votes within the Hough transform framework, and the learned visual word weights provide additional information for weighting the votes.
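A heavily simplified stand-in for this max-margin weight learning, assuming each training example is summarized by the per-codeword vote mass it accumulates at a hypothesized centre (true object/event centres for positives, background locations for negatives), is sketched below. It uses an off-the-shelf linear SVM rather than the constrained formulation of [25], so it should be read as an approximation of the idea, not a reimplementation.

    import numpy as np
    from sklearn.svm import LinearSVC

    def learn_codeword_weights(pos_activations, neg_activations, C=1.0):
        """Learn per-codeword weights from activation vectors (rows = examples,
        columns = codewords) by fitting a linear SVM and using its coefficients
        as weights; negative coefficients are clipped to keep weights >= 0."""
        X = np.vstack([pos_activations, neg_activations])
        y = np.hstack([np.ones(len(pos_activations)), np.zeros(len(neg_activations))])
        weights = LinearSVC(C=C).fit(X, y).coef_.ravel()
        return np.maximum(weights, 0.0)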

To incorporate this idea into CVER, we need to extend the dimensionality of the voting space, since each STIP now votes for the centre of a spatio-temporal parallelepiped, i.e. the event centre. Concretely, we scale each candidate event into a normalized cube and, during training, the distribution of interest point (feature) displacements relative to the cube centre is learned for each event class. The scale information is also saved, so that a simple reverse conversion transforms the normalized cube back into the actual event parallelepiped. After obtaining a set of visual words from the detected event features, a max-margin framework similar to [25] is applied to learn the weight of each visual word for each event class. For a test candidate region, the detected interest points (features) are matched with the event class visual words, and weighted votes for the possible event centre are cast in the Hough voting space. The votes corresponding to the peaks of the Hough space yield the hypotheses of the detected events in the actual video. Finally, a verification Support Vector Machine (SVM) designed for the particular event class is used to obtain the recognition score.
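The coordinate handling behind this normalization and its inverse can be illustrated with a small sketch; the box layout (x0, y0, t0, x1, y1, t1) and the definition of the displacement as the offset from a normalized STIP to the cube centre are assumptions made here for illustration, not the paper's notation.

    import numpy as np

    def normalize_stips(stips, box):
        """Map STIP coordinates (N x 3, as x, y, t) inside a candidate event box
        (x0, y0, t0, x1, y1, t1) into the unit cube; also return origin/scale so
        the mapping can be inverted later."""
        x0, y0, t0, x1, y1, t1 = box
        origin = np.array([x0, y0, t0], dtype=float)
        scale = np.array([x1 - x0, y1 - y0, t1 - t0], dtype=float)
        return (stips - origin) / scale, origin, scale

    def vote_for_centre(norm_stips, displacements, origin, scale):
        """Cast centre votes in original video coordinates: add the learned
        normalized displacement (centre minus STIP) and undo the scaling."""
        return (norm_stips + displacements) * scale + origin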

The main advantage of using a GHT framework is that it avoids the exhaustive search of sliding window techniques, which is infeasible to apply in CVER. The GHT works directly on the STIPs and the local features that are extracted from the candidate motion regions. A probabilistic score for the activity centre can be obtained instantly, based on which an activity hypothesis is generated. Once the activity hypothesis has been generated by the GHT, a verification SVM provides a more robust recognition.

To test our approach we use the large scale CVER dataset VIRAT, proposed in [5]. Our results show state-of-the-art performance on this dataset. To show the wide applicability of our method, we also evaluate on the small scale video search dataset MSR [6] and again obtain results above the state of the art.

Section snippets

Related work

Action categorization/recognition and detection are important research topics, and a large number of works can be found in the literature [1], [2], [3], [4]. One type of approach uses motion trajectories to represent actions and requires target tracking [28], [29]. Another type of approach uses background subtraction to obtain a sequence of silhouettes or body contours to model actions [2], [30]. Recently, action categorization has used local spatio-temporal features computed on the detected

Region clustering based motion segmentation

To tackle the scalability issue of CVER it is important to reduce the action search space. Towards this goal, we apply a motion segmentation technique to identify roughly the motion regions where the event of interest may appear.

This step is important as it reduces the search space. Due to the high video resolution it is practically infeasible to apply any state-of-the-art STIP detector like [9], [10], [11], [12], [13], [16], [31] on the whole video. But after the region extraction process, the candidate

Max-margin Hough transform framework for event detection

The general idea of applying a Hough transform framework [23] to an action detection problem is to compute a probabilistic score obtained by adding up the votes, cast in a Hough space H ⊂ R^H, from D-dimensional feature vectors extracted from a candidate video event. In our case we apply a spatio-temporal interest point (STIP) detector [9] on the candidate event (Fig. 5), and the feature vector is the concatenation of HOF, HOG3D and ESURF.
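Although the formal development is abbreviated in this snippet, the voting score in such a framework typically follows the implicit shape model and max-margin Hough transform of [23], [25]. As a hedged reconstruction (the notation below is ours, not necessarily the paper's), the score of a hypothesis x can be written as

    S(x) \;=\; \sum_{i}\sum_{k} w_k \, p\!\left(x \mid c_k, \ell_i\right) \, p\!\left(c_k \mid f_i\right)

where f_i is the descriptor of the i-th STIP, \ell_i its location, c_k the k-th visual word and w_k the max-margin weight learned for that word.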

So formally, let A be a candidate event having

Experimental results

To validate our proposed approach, experiments on two benchmark datasets are performed: the VIRAT dataset [5] is used for large scale event detection, and the Microsoft Research Action (MSR) Dataset II [49], [6] is used for small scale activity detection.

VIRAT video dataset: In our experiments we use Release 1.0 of the dataset, which was published in the CVPR’11 activity

Conclusion

In this paper we present a novel approach for event detection in a large scale activity dataset using a max-margin Hough transform framework. We tackle the large search space by applying a region extraction algorithm based on motion segmentation and region clustering. This algorithm is simple, fast and obtains better recall compared to tracking based approaches. For activity detection, the generalized Hough transform technique is applied, which is popular in the field of object

Acknowledgements

This work has been supported by the Spanish Research Programs Consolider-Ingenio 2010: MIPRCV (CSD2007-00018); Avanza I + D ViCoMo (TSI-020400-2009-133); EU Project VIDI-Video IST-045547; along with the Spanish Projects TIN2009-14501-C02-01 and TIN2009-14501-C02-02. Moreover, Bhaskar Chakraborty acknowledges the support from the Generalitat de Catalunya through an AGAUR FI predoctoral grant (IUE/2658/2007).

References (57)

  • R. Poppe

    A survey on vision-based human action recognition

    Image Vis. Comput.

    (2010)
  • D. Ballard

    Generalizing the Hough transform to detect arbitrary shapes

    Pattern Recognition

    (1981)
  • A. Galata et al.

    Learning variable-length Markov models of behavior

    Comput. Vis. Image Underst.

    (2001)
  • J. Aggarwal et al.

    Human activity analysis: a review

    ACM Comput. Surv.

    (2011)
  • T. Moeslund et al.

    A survey of advances in vision-based human motion capture and analysis

    Comput. Vis. Image Underst.

    (2006)
  • P. Turaga et al.

    Machine recognition of human activities: a survey

    IEEE Trans. Circ. Syst. Video Technol.

    (2008)
  • S. Oh, A. Hoogs, A. Perera, N. Cuntoor, C.-C. Chen, J.T. Lee, S. Mukherjee, J.K. Aggarwal, H. Lee, L. Davis, E. Swears,...
  • J. Yuan, Z. Liu, Y. Wu, Discriminative subvolume search for efficient action detection, in: CVPR’09: Proceedings of the...
  • G. Yu, J. Yuan, Z. Liu, Unsupervised random forest indexing for fast action search, in: CVPR’11: Proceedings of the...
  • C. Stauffer, W.E.L. Grimson, Adaptive background mixture models for real-time tracking, in: CVPR’99: Proceedings of the...
  • B. Chakraborty, M.B. Holte, T.B. Moeslund, J. Gonzàlez, A selective spatio-temporal interest point detector for human...
  • P. Dollár, V. Rabaud, G. Cottrell, S. Belongie, Behavior recognition via sparse spatio-temporal features, in:...
  • I. Laptev

    On space-time interest points

    Int. J. Comput. Vis.

    (2005)
  • J. Liu, J. Luo, M. Shah, Recognizing realistic actions from videos in the wild, in: CVPR,...
  • C. Schuldt, I. Laptev, B. Caputo, Recognizing human actions: a local SVM approach, in: ICPR’04: Proceedings of the...
  • R. Chaudhry, A. Ravichandran, G.D. Hager, R. Vidal, Histograms of oriented optical flow and Binet-Cauchy kernels on...
  • N. Buch, J. Orwell, S.A. Velastin, 3D extended histogram of oriented gradients (3DHOG) for classification of road users...
  • G. Willems et al.

    An efficient dense and scale-invariant spatio-temporal interest point detector

  • J. Sivic, A. Zisserman, Video Google: a text retrieval approach to object matching in videos, in: ICCV’03: Proceedings...
  • J. Niebles et al.

    Unsupervised learning of human action categories using spatial-temporal words

    Int. J. Comput. Vis.

    (2008)
  • G. Csurka, C. Dance, L. Fan, J. Willamowski, C. Bray, Visual categorization with bags of keypoints, in: Workshop...
  • R. Duda et al.

    Use of the Hough transformation to detect lines and curves in pictures

    Commun. ACM

    (1972)
  • J. Gall et al.

    Hough forests for object detection, tracking, and action recognition

    IEEE Trans. Pattern Anal. Mach. Intell.

    (2011)
  • B. Leibe et al.

    Robust object detection with interleaved categorization and segmentation

    Int. J. Comput. Vis.

    (2008)
  • J. Liebelt, C. Schmid, K. Schertler, Viewpoint-independent object class detection using 3D feature maps, in: CVPR’08:...
  • S. Maji, J. Malik, Object detection using a max-margin Hough transform, in: CVPR’09: Proceedings of the IEEE Conference...
  • B. Ommer, J. Malik, Multi-scale object detection by clustering lines, in: ICCV’09: Proceedings of the International...
  • A. Opelt et al.

    Learning an alphabet of shape and appearance for multi-class object detection

    Int. J. Comput. Vis.

    (2009)