Incremental learning of human activity models from videos

https://doi.org/10.1016/j.cviu.2015.10.018

Highlights

  • We incrementally learn human activity models from newly arriving instances using an ensemble of SVM classifiers, which retains already learned information and does not require storing previously seen examples.

  • We reduce the expensive manual labeling of incoming instances from the video stream using active learning, achieving performance comparable to state-of-the-art approaches with a smaller amount of manually labeled data.

  • We propose a framework to incrementally learn the context model of the activities and the object attributes, which we represent using a CRF.

Abstract

Learning human activity models from streaming videos should be a continuous process, as new activities arrive over time. However, recent approaches to human activity recognition are usually batch methods, which assume that all training instances are labeled and available in advance. Among such methods, the exploitation of the inter-relationships between the various objects in the scene (termed context) has proved extremely promising. Other approaches learn human activity models continuously but do not exploit this contextual information. In this paper, we propose a novel framework that continuously learns both the appearance and the context models of complex human activities from streaming videos. We automatically construct a conditional random field (CRF) graphical model to encode the mutual contextual information among the activities and the related object attributes. To reduce the amount of manual labeling of incoming instances, we exploit active learning to select the most informative training instances with respect to both the appearance and the context models, and use them to incrementally update these models. Rigorous experiments on four challenging datasets demonstrate that our framework outperforms state-of-the-art approaches with a significantly smaller amount of manually labeled data.

Introduction

Human activity recognition is a challenging and widely studied problem in computer vision. It has many practical applications, such as video surveillance, video annotation, video indexing, active gaming, human-computer interaction, and assisted living for the elderly. Even though an enormous amount of research has been conducted in this area, it remains a hard problem due to large intra-class variance among activities, large variability in spatio-temporal scale, variability of human pose, periodicity of human actions, low-quality video, clutter, and occlusion.

With few exceptions, most state-of-the-art approaches [1] to human activity recognition in video are based on one or more of the following four assumptions: (a) an intensive training phase is required, during which every training example is assumed to be available; (b) every training example is assumed to be labeled; (c) at least one example of every activity class is assumed to be seen beforehand, i.e., no new activity type arrives after training; and (d) a video clip contains only one activity, whose exact spatio-temporal extent is known. However, these assumptions are too strong to be realistic in many real-world scenarios such as streaming and surveillance videos, where new unlabeled activities arrive continuously and their spatio-temporal extents are usually unknown in advance.

Recent successes in object and activity recognition take advantage of the fact that, in nature, objects tend to co-exist with other objects in a particular environment. This is often termed context, and it plays an important role in the human visual system for object recognition [2]. Similarly, most human activities in the real world are inter-related, and the surroundings of these activities can provide significant visual clues for their recognition. Several research works [3], [4], [5], [6], [7], [8] have used context from different perspectives to recognize complex human activities and showed significant performance improvement over approaches that do not use context. However, these approaches are batch methods that require a large amount of manually labeled data and are unable to continuously update their models to adapt to a dynamic environment. Conversely, the few research works such as [9], [10], [11] that learn human activity models incrementally from streaming videos do not utilize contextual information, which could lead to superior performance.

Motivated by the above, the main goal of this work is twofold: to classify new unknown activities in streaming videos, and to leverage them to continuously improve the existing activity recognition models. To achieve this goal, we develop an incremental activity learning framework that uses new activities identified in the incoming video to incrementally improve the existing models by leveraging relevant machine learning techniques, most notably active learning. The proposed model not only utilizes the appearance features of the individual activity segments but also takes advantage of the interrelationships among the activities in a sequence and their interactions with objects.

The detailed framework of our proposed incremental activity recognition algorithm is shown in Fig. 1. Since we do not have any prior information about the spatio-temporal extent of the activities in continuous video, our approach begins with video segmentation and localization of the activities using a motion segmentation algorithm. Each atomic motion segment is considered an activity segment, from which we collect spatio-temporal interest point (STIP) features [12]. These features are widely used in action recognition and achieve satisfactory performance on challenging state-of-the-art datasets. We construct a single feature vector from these local features using the method described in [13]. Then, we learn a prior model from the few labeled training activities at hand. In this work, we propose to use an ensemble of linear Support Vector Machine (SVM) classifiers as the prior model. Note that we do not assume that the prior model is exhaustive in covering all activity classes or in modeling the variations within a class; it is only used as a starting point for the incremental learning framework.
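As an illustration of this prior model, the sketch below trains the seed of such an ensemble, a calibrated linear SVM, on per-segment feature vectors. The data shapes, class count, and calibration choice are illustrative assumptions, not the paper's exact setup.

```python
# Hypothetical sketch of the prior appearance model: the seed of an
# ensemble of linear SVMs trained on per-segment feature vectors.
# Data shapes and class counts are illustrative, not from the paper.
import numpy as np
from sklearn.svm import LinearSVC
from sklearn.calibration import CalibratedClassifierCV

rng = np.random.default_rng(0)
X_prior = rng.normal(size=(60, 162))   # one STIP-based vector per activity segment
y_prior = rng.integers(0, 3, size=60)  # labels for 3 initial activity classes

# Calibration turns SVM margins into confidence scores, which the
# incremental and active-learning steps later rely on.
prior = CalibratedClassifierCV(LinearSVC(max_iter=5000), cv=3).fit(X_prior, y_prior)
ensemble = [(prior, 1.0)]              # (classifier, weight) pairs

probs = prior.predict_proba(X_prior)   # per-class confidence scores
```

Calibration is one plausible way to obtain the confidence scores the framework needs; raw SVM decision values could serve the same role.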

We start incremental learning with the above-mentioned prior model and update it during each run of incremental training. When a newly segmented activity arrives, we apply the current model to obtain a tentative label with a confidence score. However, it is neither practical nor rational to use all newly segmented activities as training examples for the next run of incremental training: obtaining labels for all of them from a human annotator is costly, and not all of them possess distinguishing properties useful for updating the current model. Instead, we select only a subset of them and rectify their tentative labels through our proposed active learning system. To learn the activity model incrementally, we employ an ensemble of linear SVMs. When we have sufficient new training examples labeled by the active learning system, we train a new set of SVM classifiers and update the current model by adding these new classifiers to the ensemble with appropriate weights.
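A minimal sketch of this ensemble update, under our own illustrative assumptions that each new classifier is weighted by its accuracy on the batch it was trained on and that prediction is a weighted vote over members:

```python
# Sketch of growing an ensemble of linear SVMs from newly labeled
# batches; the accuracy-based weighting is an illustrative choice,
# not necessarily the paper's exact weighting rule.
import numpy as np
from sklearn.svm import LinearSVC

def update_ensemble(ensemble, X_new, y_new):
    clf = LinearSVC(max_iter=5000).fit(X_new, y_new)
    weight = clf.score(X_new, y_new)        # illustrative weight choice
    ensemble.append((clf, weight))
    return ensemble

def predict(ensemble, X, n_classes):
    votes = np.zeros((len(X), n_classes))   # weighted vote per class
    for clf, w in ensemble:
        for i, label in enumerate(clf.predict(X)):
            votes[i, int(label)] += w
    return votes.argmax(axis=1)

rng = np.random.default_rng(1)
ensemble = []
for _ in range(2):                          # two runs of incremental training
    X_b = rng.normal(size=(30, 16))
    y_b = rng.integers(0, 3, size=30)
    ensemble = update_ensemble(ensemble, X_b, y_b)
preds = predict(ensemble, X_b, n_classes=3)
```

Because old members are never retrained, the ensemble retains earlier knowledge without storing previously seen examples, as the contributions list states.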

For incremental learning with context features, we use a conditional random field (CRF) graphical model to represent the interrelationships among the activity segments and the associated object attributes segmented from a video sequence. The nodes of the CRF represent the activities and the object attributes, and the edges represent the interrelationships among them. Confidence scores of the activities from the ensemble of SVM classifiers serve as the activity node potentials, whereas scores obtained from the object detectors serve as the object node potentials. Various spatio-temporal relationships, such as co-occurrences of activities and objects, are used as the edge potentials. We run inference on the CRF to obtain the posterior activity labeling with confidence scores. These confidence scores drive the active learning system, consisting of strong and weak teachers, which rectifies the labels. The rectified labels are then used to update the edge potentials.
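If an edge potential is taken to be a normalized co-occurrence table (an assumed form for this sketch; the paper's exact potentials may differ), updating it from newly rectified label pairs reduces to count accumulation:

```python
# Illustrative update of a co-occurrence edge potential from newly
# labeled pairs of adjacent activity segments. The table form and
# add-one smoothing are assumptions for this sketch.
import numpy as np

n_classes = 3
cooccur = np.ones((n_classes, n_classes))   # smoothed co-occurrence counts

def update_edge_potential(counts, labeled_pairs):
    for a, b in labeled_pairs:              # (label, label) for linked segments
        counts[a, b] += 1
    return counts / counts.sum()            # normalized edge potential

potential = update_edge_potential(cooccur, [(0, 1), (0, 1), (2, 2)])
```

Such count-based updates are cheap and naturally incremental: each newly rectified pair only increments one cell before renormalization.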

In this work we propose a novel framework to incrementally learn activity models from streaming videos, which is achieved through an active learning system. The main contributions are as follows:

  • We incrementally learn human activity models from newly arriving instances using an ensemble of SVM classifiers, which retains already learned information and does not require storing previously seen examples.

  • We reduce the expensive manual labeling of incoming instances from the video stream using active learning, achieving performance comparable to state-of-the-art approaches with a smaller amount of manually labeled data.

  • We propose a framework to incrementally learn the context model of the activities and the object attributes, which we represent using a CRF.

Section snippets

Related works

Activity Recognition. We refer the reader to [1] for a comprehensive review of state-of-the-art approaches to human activity recognition. Based on the level of abstraction used to represent an activity, state-of-the-art approaches can be classified into three general categories: low-level [12], mid-level [10], and high-level [14] feature-based methods. However, as discussed in Section 1, most of these state-of-the-art approaches suffer from the inability to model

Incremental learning of individual activity classes

We now provide a detailed overview of our proposed incremental activity modeling framework for the appearance model. We assume that we have a set of activities segmented from a video sequence and that we have extracted a set of features {x_i : i = 1, 2, …, n} from these activity segments. Details of activity segmentation and feature extraction are discussed in the experiment section. In this section we mainly focus on learning activity models without using the contextual information or the interactions

Incremental learning of contextual relationships

We model the inter-relationships among the activity instances and the object attributes using a CRF graphical model. An illustrative example of the CRF with four activity nodes is shown in Fig. 2. It is an undirected graph G = (V, E), with a set of nodes V = {A, C, X, Z} and a set of edges E = {AA, AC, AX, CZ}, where the As are the activity nodes, the Cs are the object attributes (i.e., context features), P and D are the activity classifier and the object detectors, respectively, and X and Z are the observed
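On a toy version of such a graph, MAP inference can be sketched by brute-force enumeration (feasible only for tiny graphs; the node names, potential values, and binary label space below are made up for illustration):

```python
# Toy CRF MAP inference by exhaustive enumeration. Two activity nodes
# and one object-attribute node; all potentials are illustrative.
import itertools
import numpy as np

node_potentials = {            # e.g. classifier/detector confidence scores
    "A1": np.array([0.7, 0.3]),
    "A2": np.array([0.4, 0.6]),
    "C1": np.array([0.5, 0.5]),
}
edge_potentials = {            # e.g. co-occurrence-based compatibilities
    ("A1", "A2"): np.array([[0.8, 0.2], [0.2, 0.8]]),
    ("A1", "C1"): np.array([[0.6, 0.4], [0.3, 0.7]]),
}

nodes = list(node_potentials)
best, best_score = None, -1.0
for labels in itertools.product([0, 1], repeat=len(nodes)):
    assign = dict(zip(nodes, labels))
    score = float(np.prod([node_potentials[n][assign[n]] for n in nodes]))
    for (u, v), pot in edge_potentials.items():
        score *= float(pot[assign[u], assign[v]])
    if score > best_score:
        best, best_score = assign, score
```

Real video graphs are larger, so practical systems use approximate inference (e.g. belief propagation) rather than enumeration; the potentials' roles are the same.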

Active learning and teacher selection

Previously, we described the appearance and the context models and the approach we use to update them incrementally, under the assumption that we have the labels of the incoming instances. However, in a streaming video scenario, incoming instances are unlabeled. We now describe our active learning system, in which we carefully select the most useful instances to be labeled by a human annotator. The main goal is to reduce the amount of expensive manual labeling while retaining the same level of
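One common way to realize such a selection, sketched here under our own illustrative margin criterion rather than the paper's exact rule, is to route low-margin (ambiguous) instances to the strong teacher (a human annotator) and let the model's confident predictions stand as weak-teacher labels:

```python
# Hedged sketch of teacher selection: low-margin instances go to the
# "strong" teacher (human); confident ones keep the model's label as
# a "weak" teacher. Threshold and scores are illustrative.
import numpy as np

def route_instances(confidences, margin_threshold=0.2):
    """confidences: (n_instances, n_classes) class scores per instance."""
    strong, weak = [], []
    for i, scores in enumerate(confidences):
        top2 = np.sort(scores)[-2:]                 # two highest class scores
        if top2[1] - top2[0] < margin_threshold:
            strong.append(i)                        # ask the human annotator
        else:
            weak.append((i, int(scores.argmax())))  # accept the model's label
    return strong, weak

conf = np.array([[0.9, 0.05, 0.05],   # confident -> weak teacher
                 [0.4, 0.35, 0.25]])  # ambiguous -> strong teacher
strong, weak = route_instances(conf)
```

Only the strong-teacher instances incur manual labeling cost, which is how the framework keeps annotation effort low while still correcting the model where it is most uncertain.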

Experiments

We perform experiments on four challenging datasets to evaluate and compare the performance of our framework: KTH [30], UCF11 [31], VIRAT [32], and UCLA-Office [33]. In the first two datasets, KTH and UCF11, activities are temporally segmented, meaning that each video segment contains only one activity, whereas in the VIRAT and UCLA-Office datasets the video sequences are long and contain more than one activity. Therefore, for the last two datasets, we use a video segmentation

Conclusion and future works

In this work, we proposed a framework for incremental activity modeling. Our framework takes advantage of state-of-the-art machine learning tools and active learning to learn activity models incrementally over time with a reduced amount of manually labeled data. We also exploit the contextual information and learn it incrementally, so that it helps to recognize activities more efficiently over time. We performed rigorous experiments on four challenging datasets. Results show the robustness of our

References (42)

  • R. Poppe, A survey on vision-based human action recognition, Image Vis. Comput. (2010)
  • A. Oliva et al., The role of context in object recognition, Trends Cogn. Sci. (2007)
  • B. Yao et al., Modeling mutual context of object and human pose in human-object interaction activities, Proceedings of the CVPR (2010)
  • Z. Wang et al., Bilinear programming for human activity recognition with unknown MRF graphs, Proceedings of the CVPR (2013)
  • T. Lan et al., Beyond actions: discriminative models for contextual group activities, Proceedings of the NIPS (2010)
  • W. Choi et al., Learning context for collective activity recognition, Proceedings of the CVPR (2011)
  • N. Nayak et al., Exploiting spatio-temporal scene structure for wide-area activity analysis in unconstrained environments, IEEE Trans. Inf. Forens. Sec. (2013)
  • Y. Zhu et al., Context-aware modeling and recognition of activities in video, Proceedings of the CVPR (2013)
  • K. Reddy et al., Incremental action recognition using feature-tree, Proceedings of the ICCV (2009)
  • R. Minhas et al., Incremental learning in human action recognition based on snippets, IEEE Trans. Circ. Syst. Video Technol. (2012)
  • M. Hasan et al., Incremental activity modeling and recognition in streaming videos, Proceedings of the CVPR (2014)
  • I. Laptev, On space-time interest points, Int. J. Comput. Vis. (2005)
  • M. Hasan et al., Continuous learning of human activity models using deep nets, Proceedings of the ECCV (2014)
  • S. Sadanand et al., Action bank: a high-level representation of activity in video, Proceedings of the CVPR (2012)
  • R. Polikar et al., Learn++: an incremental learning algorithm for supervised neural networks, IEEE Trans. Syst. Man Cybern. Part C (2001)
  • H. He et al., Incremental learning from stream data, IEEE Trans. Neural Netw. (2011)
  • W. Brendel et al., Learning spatiotemporal graphs of human activities, Proceedings of the ICCV (2011)
  • Z. Si et al., Unsupervised learning of event and-or grammar and semantics from video, Proceedings of the ICCV (2011)
  • A. Quattoni et al., Hidden-state conditional random fields, IEEE TPAMI (2007)
  • B. Settles, Synthesis Lectures on Artificial Intelligence and Machine Learning, Morgan & Claypool Publishers (2012)
  • J.M. Buhmann et al., Active learning for semantic segmentation with expected change, Proceedings of the CVPR (2012)

    This work was supported in part by ONR grant N00014-15-C-5113 and NSF grant IIS-1316934.
