Large scale continuous visual event recognition using max-margin Hough transformation framework

https://doi.org/10.1016/j.cviu.2012.11.008

Abstract

In this paper we propose a novel method for continuous visual event recognition (CVER) on a large scale video dataset using a max-margin Hough transformation framework. Due to the large scale of the data, the diversity of real environmental conditions and the wide scene variability, direct application of action recognition/detection methods, such as spatio-temporal interest point (STIP) based local feature techniques, to the whole dataset is practically infeasible. To address this problem, we apply a motion region extraction technique, based on motion segmentation and region clustering, to identify possible candidate “events of interest” as a preprocessing step. On these candidate regions a STIP detector is applied and local motion features are computed. For activity representation we use a generalized Hough transform framework where each feature point casts a weighted vote for a possible activity class centre. A max-margin framework is applied to learn the weights of the feature codebook. For activity detection, peaks in the Hough voting space are taken into account and an initial event hypothesis is generated using the spatio-temporal information of the participating STIPs. For event recognition a verification Support Vector Machine is used. An extensive evaluation on a benchmark large scale video surveillance dataset (VIRAT) as well as on a small scale benchmark dataset (MSR) shows that the proposed method is applicable to a wide range of continuous visual event recognition applications under extremely challenging conditions.

Highlights

► In this paper we address activity detection in large scale video datasets.
► A novel region extraction method is applied to reduce the initial action search space.
► A max-margin Hough transformation framework is used for activity detection.
► A verification SVM is applied to obtain the final score of each detected event hypothesis.
► State-of-the-art results are reported on both large and small scale benchmark datasets.

Introduction

Visual event recognition, i.e. the recognition of semantic spatio-temporal visual patterns such as “waving”, “boxing”, “getting into a vehicle” and “running”, is a fundamental computer vision problem. An enormous amount of work on this topic can be found in the literature surveys [1], [2], [3], [4]. Recently, research in this field has been moving towards continuous visual event recognition (CVER), where the goal is both to recognize an event and to localize the corresponding space–time volume in long continuous video [5], analogous to object detection in images where only the spatial location matters. This setting is more closely related to real world video surveillance analytics needs than current research that aims to classify a prerecorded video clip of a single event. Accurate CVER would have a direct and far-reaching impact on surveillance, video-guided human behaviour analysis, assistive technology and video archive analysis.

The task of CVER, i.e. activity detection on a large scale real world video surveillance dataset, is extremely challenging, and current state-of-the-art methods for 2D small scale action recognition become infeasible to apply. One of the main challenges for CVER is scalability; e.g. a CVER dataset like the VIRAT dataset [5] contains 23 event types distributed throughout 29 h of video. The other difficulties are due to (i) natural appearance, since the events are recorded in a real world scenario, (ii) huge spatial and temporal coverage, which affects the effective video resolution, e.g. the human heights within the videos range from 25 to 200 pixels, constituting 2.4–20% of the heights of the recorded videos, with an average of about 7%, (iii) diverse event types and (iv) huge variability in view-points, scenes and subjects (see Fig. 1) [5].

Among all the above mentioned difficulties, action detection in video (both small and large scale) is challenging mainly due to the scalability of its search space. Without knowing the location, temporal duration and spatial scale (spatial resolution of the activity) of the action, an exhaustive search is computationally intractable. For example, a 1 min video sequence of size 160 × 120 × 1800 contains more than 10^14 sub-volumes of various sizes and locations [6]. To solve this issue there are methods such as discriminative sub-volume search [6] and unsupervised random forest indexing [7]. Although promising, these works always use small scale video datasets like the KTH and MSR action datasets [6], where the challenges present in CVER, mentioned above, are absent.
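As a quick check of this order of magnitude, the number of axis-aligned sub-volumes of a W × H × T grid is the product of the number of intervals along each axis. The short Python sketch below is an illustrative calculation of this count (it is not code from the paper) and reproduces the figure for a 160 × 120 × 1800 volume.

    # Count axis-aligned sub-volumes of a W x H x T spatio-temporal grid.
    # An axis of length n admits n * (n + 1) / 2 discrete intervals, so the
    # total number of sub-volumes is the product over the three axes.

    def num_intervals(n: int) -> int:
        return n * (n + 1) // 2

    def num_subvolumes(w: int, h: int, t: int) -> int:
        return num_intervals(w) * num_intervals(h) * num_intervals(t)

    print(f"{num_subvolumes(160, 120, 1800):.3e}")  # ~1.516e+14, i.e. more than 10^14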

To reduce the search space complexity of CVER, it is necessary to apply a motion region identifier that roughly detects the motion regions of interest where the events to be searched may appear. Oh et al. [5] apply multi-object tracking based on frame differencing as a preprocessing step, and the obtained tracks are divided into detection units, resulting in over 20 K units. Dividing tracks into detection units of a fixed length inevitably misses some events of different durations. In our approach, on the other hand, a motion segmentation method similar to [8] is first applied to obtain the primary candidate region set. The obtained regions are further joined using a region clustering technique based on action heuristics. Finally, we obtain on average about 3 K candidate regions, as opposed to the 20 K of [5], with a higher recall rate. This has a major impact on search space reduction and on achieving faster event detection at large scale.
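The exact region extraction pipeline is detailed later in the paper; purely to illustrate the general idea of frame differencing, connected-component extraction and merging of nearby regions, a minimal Python sketch could look as follows. The function names, thresholds and the greedy merging heuristic are assumptions made for illustration, not the authors' implementation.

    import numpy as np
    from scipy import ndimage

    def candidate_motion_regions(prev_frame, frame, diff_thresh=25, min_area=100):
        """Rough motion-region proposal: threshold the frame difference and
        return bounding boxes (x0, y0, x1, y1) of connected components."""
        diff = np.abs(frame.astype(np.int16) - prev_frame.astype(np.int16))
        mask = diff.max(axis=-1) > diff_thresh if diff.ndim == 3 else diff > diff_thresh
        labels, _ = ndimage.label(mask)
        boxes = []
        for sl in ndimage.find_objects(labels):
            ys, xs = sl
            if (ys.stop - ys.start) * (xs.stop - xs.start) >= min_area:
                boxes.append((xs.start, ys.start, xs.stop, ys.stop))
        return boxes

    def merge_close_boxes(boxes, gap=20):
        """Greedily merge boxes whose gap-expanded extents overlap, mimicking
        the idea of clustering fragments that belong to the same activity."""
        boxes, merged = [list(b) for b in boxes], True
        while merged:
            merged, out = False, []
            while boxes:
                b = boxes.pop()
                for o in out:
                    if (b[0] - gap < o[2] and o[0] - gap < b[2] and
                            b[1] - gap < o[3] and o[1] - gap < b[3]):
                        o[0], o[1] = min(o[0], b[0]), min(o[1], b[1])
                        o[2], o[3] = max(o[2], b[2]), max(o[3], b[3])
                        merged = True
                        break
                else:
                    out.append(b)
            boxes = out
        return [tuple(b) for b in boxes]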

Our method for event detection is related to several ideas recurring in the literature. First, we use a STIP detector, which has been successfully applied to 2D action recognition problems [9], [10], [11], [12], [13]. Several local features, such as the histogram of oriented optical flow (HOF) [14], the histogram of oriented gradients in 3D (HOG3D) [15] and extended SURF (ESURF) [16], are computed at the detected STIPs. We use the idea of a local appearance codebook [17], following the bag-of-words approach [18], [19], to group the detected features into a set of visual words that represent an event class.
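For concreteness, such a bag-of-words codebook is typically built by clustering the pooled local descriptors with k-means and then assigning new descriptors to their nearest cluster centre. The sketch below uses scikit-learn with an arbitrarily chosen codebook size and is only meant to illustrate this standard step, not the paper's exact settings.

    import numpy as np
    from sklearn.cluster import KMeans

    def build_codebook(descriptors, n_words=500, seed=0):
        """Cluster pooled local descriptors (N x D array) into visual words."""
        return KMeans(n_clusters=n_words, random_state=seed, n_init=10).fit(descriptors)

    def quantize(codebook, descriptors):
        """Assign each descriptor to the index of its nearest visual word."""
        return codebook.predict(descriptors)

    # Example usage: the descriptors could be concatenated HOF/HOG3D/ESURF
    # vectors computed at detected STIPs, stacked over all training events.
    # codebook = build_codebook(np.vstack(all_training_descriptors))
    # word_ids = quantize(codebook, test_event_descriptors)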

The next idea is to carry the generalized Hough transform (GHT) framework from object detection in images over to event detection in videos. Originally developed for detecting straight lines [20], the Hough transform was later generalized to detect generic parametric shapes [21]. Recently, the GHT scheme has been successfully used for detecting object class instances, for tracking and for action recognition [22], [23], [24], [25], [26], [27].

The concept of GHT usually refers to any detection process based on an additive aggregation of evidence, the Hough votes, coming from local image/video elements. Such aggregation is performed in a parametric space, called the Hough space, where each point corresponds to the existence of an instance in a particular configuration. The Hough space may be a product set of different locations, scales, aspects, etc. The detection process is then reduced to finding peaks in the sum of all Hough votes over the Hough space, where the location of each peak gives the configuration of a particular detected object/event instance.
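In code, this additive aggregation amounts to scattering weighted votes into a discretized accumulator and reading off its maxima. The following sketch uses an arbitrary three-dimensional discretization of the Hough space purely to illustrate the mechanics; it is not the paper's parameterization.

    import numpy as np

    def hough_accumulate(votes, weights, shape):
        """Sum weighted votes into a discretized Hough accumulator.

        votes   : (N, 3) integer bin indices (x, y, t) of hypothesized centres
        weights : (N,) vote strengths
        shape   : accumulator dimensions, e.g. (nx, ny, nt)
        """
        acc = np.zeros(shape)
        np.add.at(acc, (votes[:, 0], votes[:, 1], votes[:, 2]), weights)
        return acc

    def hough_peaks(acc, n_peaks=3):
        """Return the n strongest accumulator cells as candidate configurations."""
        order = np.argsort(acc, axis=None)[::-1][:n_peaks]
        return [np.unravel_index(i, acc.shape) for i in order]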

The implicit shape model of Leibe et al. [23] and the max-margin Hough transform of Maji and Malik [25] serve as the baseline for our work. These works mainly focus on object detection. During training, they augment each visual word in the codebook with the spatial distribution of the displacements between the object centre and the respective visual word location. The weight of each visual word is learned in a max-margin setup. At detection time, these spatial distributions are converted into Hough votes within the Hough transform framework, and the learned visual word weights provide additional information for weighting the votes.
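A heavily simplified stand-in for this max-margin weight learning, assuming each training example is summarized by the per-codeword vote mass it accumulates at a hypothesized centre (true object/event centres for positives, background locations for negatives), is sketched below. It uses an off-the-shelf linear SVM rather than the constrained formulation of [25], so it should be read as an approximation of the idea, not a reimplementation.

    import numpy as np
    from sklearn.svm import LinearSVC

    def learn_codeword_weights(pos_activations, neg_activations, C=1.0):
        """Learn per-codeword weights from activation vectors (rows = examples,
        columns = codewords) by fitting a linear SVM and using its coefficients
        as weights; negative coefficients are clipped to keep weights >= 0."""
        X = np.vstack([pos_activations, neg_activations])
        y = np.hstack([np.ones(len(pos_activations)), np.zeros(len(neg_activations))])
        weights = LinearSVC(C=C).fit(X, y).coef_.ravel()
        return np.maximum(weights, 0.0)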

To incorporate this idea into CVER, we need to extend the dimensionality of the voting space, since each STIP now votes for the centre of a spatio-temporal parallelepiped, i.e. the event centre. Concretely, we scale each candidate event into a normalized cube and, during training, the distribution of interest point (feature) displacements relative to the cube centre is learned for each event class. The scale information is also saved, so that a simple reverse conversion transforms the normalized cube back into the actual event parallelepiped. After obtaining a set of visual words from the detected event features, a max-margin framework similar to [25] is applied to learn the weight of each visual word for each event class. For a test candidate region, the detected interest points (features) are matched with the event class visual words, and weighted votes for the possible event centre are cast in the Hough voting space. The votes corresponding to the peaks of the Hough space yield the hypotheses of the detected events in the actual video. Finally, a verification Support Vector Machine (SVM) designed for the particular event class is used to obtain the recognition score.
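The coordinate handling behind this normalization and its inverse can be illustrated with a small sketch; the box layout (x0, y0, t0, x1, y1, t1) and the definition of the displacement as the offset from a normalized STIP to the cube centre are assumptions made here for illustration, not the paper's notation.

    import numpy as np

    def normalize_stips(stips, box):
        """Map STIP coordinates (N x 3, as x, y, t) inside a candidate event box
        (x0, y0, t0, x1, y1, t1) into the unit cube; also return origin/scale so
        the mapping can be inverted later."""
        x0, y0, t0, x1, y1, t1 = box
        origin = np.array([x0, y0, t0], dtype=float)
        scale = np.array([x1 - x0, y1 - y0, t1 - t0], dtype=float)
        return (stips - origin) / scale, origin, scale

    def vote_for_centre(norm_stips, displacements, origin, scale):
        """Cast centre votes in original video coordinates: add the learned
        normalized displacement (centre minus STIP) and undo the scaling."""
        return (norm_stips + displacements) * scale + origin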

The main advantage of using a GHT framework is that it avoids the exhaustive search of sliding window techniques, which is infeasible to apply in CVER. The GHT works directly on the STIPs and the local features that are extracted from the candidate motion regions. A probabilistic score for the activity centre can be obtained instantly, based on which an activity hypothesis is generated. Once the activity hypothesis has been generated by the GHT, a verification SVM provides a more robust recognition.

To test our approach we use the large scale CVER dataset VIRAT, proposed in [5]. Our results show state-of-the-art performance on this dataset. To show the wide applicability of our method, we also evaluate on the small scale video search dataset MSR [6] and again obtain results above the state of the art.

Section snippets

Related work

Action categorization/recognition and detection are important research topics, and a large number of works can be found in the literature [1], [2], [3], [4]. One type of approach uses motion trajectories to represent actions and requires target tracking [28], [29]. Another type of approach uses background subtraction to obtain a sequence of silhouettes or body contours to model actions [2], [30]. Recently, action categorization has used local spatio-temporal features computed on the detected

Region clustering based motion segmentation

To tackle the scalability issue of CVER it is important to reduce the action search space. Towards this goal, we apply a motion segmentation technique to identify roughly the motion regions where the event of interest may appear.

This step is important as it reduces the search space. Due to the high video resolution it is practically infeasible to apply any state-of-the-art STIP detector like [9], [10], [11], [12], [13], [16], [31] on the whole video. But after the region extraction process, the candidate

Max-margin Hough transform framework for event detection

The general idea of applying a Hough transform framework [23] to an action detection problem is to compute a probabilistic score obtained by adding up the votes, cast in a Hough space H ⊂ R^H, from D-dimensional feature vectors extracted from a candidate video event. In our case we apply a spatio-temporal interest point (STIP) detector [9] on the candidate event (Fig. 5), and the feature vector is the concatenation of HOF, HOG3D and ESURF.
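Although the formal development is abbreviated in this snippet, the voting score in such a framework typically follows the implicit shape model and max-margin Hough transform of [23], [25]. As a hedged reconstruction (the notation below is ours, not necessarily the paper's), the score of a hypothesis x can be written as

    S(x) \;=\; \sum_{i}\sum_{k} w_k \, p\!\left(x \mid c_k, \ell_i\right) \, p\!\left(c_k \mid f_i\right)

where f_i is the descriptor of the i-th STIP, \ell_i its location, c_k the k-th visual word and w_k the max-margin weight learned for that word.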

So formally, let A be a candidate event having

Experimental results

To validate our proposed approach, experiments on two benchmark datasets are performed: the VIRAT dataset [5] is used for large scale event detection, and the Microsoft Research Action (MSR) Dataset II [49], [6] is used for small scale activity detection.

VIRAT video dataset: In our experiments we use Release 1.0 of the dataset, which was published in the CVPR’11 activity

Conclusion

In this paper we present a novel approach for event detection in a large scale activity dataset using a max-margin Hough transform framework. We tackle the large search space by applying a region extraction algorithm based on motion segmentation and region clustering. This algorithm is simple, fast and obtains better recall compared to tracking based approaches. For activity detection, the generalized Hough transform technique is applied, which is popular in the field of object

Acknowledgements

This work has been supported by the Spanish Research Programs Consolider-Ingenio 2010: MIPRCV (CSD2007-00018); Avanza I + D ViCoMo (TSI-020400-2009-133); EU Project VIDI-Video IST-045547; along with the Spanish Projects TIN2009-14501-C02-01 and TIN2009-14501-C02-02. Moreover, Bhaskar Chakraborty acknowledges the support from the Generalitat de Catalunya through an AGAUR FI predoctoral grant (IUE/2658/2007).

References (57)

  • R. Poppe

    A survey on vision-based human action recognition

    Image Vis. Comput.

    (2010)
  • D. Ballard

    Generalizing the Hough transform to detect arbitrary shapes

    Pattern Recognition

    (1981)
  • A. Galata et al.

    Learning variable-length Markov models of behavior

    Comput. Vis. Image Underst.

    (2001)
  • J. Aggarwal et al.

    Human activity analysis: a review

    ACM Comput. Surv.

    (2011)
  • T. Moeslund et al.

    A survey of advances in vision-based human motion capture and analysis

    Comput. Vis. Image Underst.

    (2006)
  • P. Turaga et al.

    Machine recognition of human activities: a survey

    IEEE Trans. Circ. Syst. Video Technol.

    (2008)
  • S. Oh, A. Hoogs, A. Perera, N. Cuntoor, C.-C. Chen, J.T. Lee, S. Mukherjee, J.K. Aggarwal, H. Lee, L. Davis, E. Swears,...
  • J. Yuan, Z. Liu, Y. Wu, Discriminative subvolume search for efficient action detection, in: CVPR’09: Proceedings of the...
  • G. Yu, J. Yuan, Z. Liu, Unsupervised random forest indexing for fast action search, in: CVPR’11: Proceedings of the...
  • C. Stauffer, W.E.L. Grimson, Adaptive background mixture models for real-time tracking, in: CVPR’99: Proceedings of the...
  • B. Chakraborty, M.B. Holte, T.B. Moeslund, J. Gonzàlez, A selective spatio-temporal interest point detector for human...
  • P. Dollár, V. Rabaud, G. Cottrell, S. Belongie, Behavior recognition via sparse spatio-temporal features, in:...
  • I. Laptev

    On space-time interest points

    Int. J. Comput. Vis.

    (2005)
  • J. Liu, J. Luo, M. Shah, Recognizing realistic actions from videos in the wild, in: CVPR,...
  • C. Schuldt, I. Laptev, B. Caputo, Recognizing human actions: a local SVM approach, in: ICPR’04: Proceedings of the...
  • R. Chaudhry, A. Ravichandran, G.D. Hager, R. Vidal, Histograms of oriented optical flow and Binet-Cauchy kernels on...
  • N. Buch, J. Orwell, S.A. Velastin, 3D extended histogram of oriented gradients (3DHOG) for classification of road users...
  • G. Willems et al.

    An efficient dense and scale-invariant spatio-temporal interest point detector

  • J. Sivic, A. Zisserman, Video Google: a text retrieval approach to object matching in videos, in: ICCV’03: Proceedings...
  • J. Niebles et al.

    Unsupervised learning of human action categories using spatial-temporal words

    Int. J. Comput. Vis.

    (2008)
  • G. Csurka, C. Dance, L. Fan, J. Willamowski, C. Bray, Visual categorization with bags of keypoints, in: Workshop...
  • R. Duda et al.

    Use of the Hough transformation to detect lines and curves in pictures

    Commun. ACM

    (1972)
  • J. Gall et al.

    Hough forests for object detection, tracking, and action recognition

    IEEE Trans. Pattern Anal. Mach. Intell.

    (2011)
  • B. Leibe et al.

    Robust object detection with interleaved categorization and segmentation

    Int. J. Comput. Vis.

    (2008)
  • J. Liebelt, C. Schmid, K. Schertler, Viewpoint-independent object class detection using 3D feature maps, in: CVPR’08:...
  • S. Maji, J. Malik, Object detection using a max-margin Hough transform, in: CVPR’09: Proceedings of the IEEE Conference...
  • B. Ommer, J. Malik, Multi-scale object detection by clustering lines, in: ICCV’09: Proceedings of the International...
  • A. Opelt et al.

    Learning an alphabet of shape and appearance for multi-class object detection

    Int. J. Comput. Vis.

    (2009)