Recognizing activities in multiple views with fusion of frame judgments

https://doi.org/10.1016/j.imavis.2014.01.006

Highlights

  • Data fusion based method for activity recognition using multiple views

  • Straightforward architecture to incorporate new cameras or new features

  • Performance generally increases when there are more cameras and features

  • Comparable performance with that produced by reconstruction

  • Detailed experiments to answer different system considerations

Abstract

This paper focuses on activity recognition when multiple views are available. In the literature, this is often performed using two different approaches. In the first one, the systems build a 3D reconstruction and match that. However, there are practical disadvantages to this methodology since a sufficient number of overlapping views is needed to reconstruct, and one must calibrate the cameras. A simpler alternative is to match the frames individually. This offers significant advantages in the system architecture (e.g., it is easy to incorporate new features and camera dropouts can be tolerated). In this paper, the second approach is employed and a novel fusion method is proposed. Our fusion method collects the activity labels over frames and cameras, and then fuses activity judgments as the sequence label. It is shown that there is no performance penalty when a straightforward weighted voting scheme is used. In particular, when there are enough overlapping views to generate a volumetric reconstruction, our recognition performance is comparable with that produced by volumetric reconstructions. However, if the overlapping views are not adequate, the performance degrades fairly gracefully, even in cases where test and training views do not overlap.

Introduction

There is a broad range of applications for systems that can recognize human activity in video. Medical applications include monitoring patient activity, for example to track the progress of stroke patients or to keep patients with dementia safe. Safety applications include detecting unusual or suspicious behavior, or detecting pedestrians to avoid accidents. The problem remains difficult for several important reasons. There is no canonical taxonomy of human activities. Changes in illumination direction and viewing direction cause massive changes in what people look like. Individuals can look very different from one another, and the same activity performed by different people can vary widely in appearance.

Generally, we expect that having multiple views makes recognizing human activity easier. There is support for this viewpoint in the literature (e.g., see Section 2). However, these results tend not to take into account various desirable engineering features for distributed multi-camera systems. In such systems, we may not be able to get accurate geometric calibrations of the cameras with respect to one another (e.g., if the cameras are dropped into a terrain). Cameras might drop in or out at any time, and we need a simple architecture that can opportunistically exploit whatever data is available. We will not be able to set cameras at fixed locations with respect to the moving people, meaning that training data might be obtained from different view directions than test data.

In this paper, we describe an architecture to label activities using multiple views. Fig. 1 shows the main structure of our architecture. We assume that there are one or more cameras, and that each camera can compute one or more blocks of features representing each frame. Breaking features into blocks allows us to insert new sets of features without disrupting the overall architecture. In the first step, each block of features for each frame of each camera is used for a nearest neighbor query, independent of all other cameras, frames or blocks (Section 4).
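
As a rough sketch of this first step, the Python fragment below runs an independent nearest-neighbor query for every feature block of every frame of every camera. The Euclidean distance, the function names, and the nested-list data layout are illustrative assumptions, not the exact implementation described in Section 4.

```python
import numpy as np

def nearest_neighbor_block_match(block_feature, exemplar_features, exemplar_labels):
    """Match one feature block of one frame against a labeled exemplar set.

    block_feature     : (d,) feature vector for a single block
    exemplar_features : (n, d) feature vectors of labeled training frames
    exemplar_labels   : length-n list of activity labels, one per exemplar

    Returns the best-matching label and its distance (assumed Euclidean).
    """
    dists = np.linalg.norm(exemplar_features - block_feature, axis=1)
    best = int(np.argmin(dists))
    return exemplar_labels[best], float(dists[best])


def query_all(features, exemplars):
    """Run an independent query for every camera, frame and feature block.

    features  : features[c][t][b] is feature block b of frame t seen by camera c
    exemplars : exemplars[b] = (exemplar_features, exemplar_labels) for block type b

    Returns matches[c][t][b] = (label, distance); no camera, frame or block
    depends on any other.
    """
    matches = []
    for cam in features:                                  # over cameras
        cam_matches = []
        for frame in cam:                                 # over frames
            frame_matches = []
            for b, block in enumerate(frame):             # over feature blocks
                ex_feats, ex_labels = exemplars[b]
                frame_matches.append(
                    nearest_neighbor_block_match(block, ex_feats, ex_labels))
            cam_matches.append(frame_matches)
        matches.append(cam_matches)
    return matches
```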

In the second step, the resulting matches are combined with a weighting scheme. Because the viewing direction of any camera with respect to the body is unknown, some frames (or feature blocks) might be ambiguous. We expect that having a second view should disambiguate some frames, so it makes sense to combine matches over cameras. However, close matches are very likely to be right. This suggests using a scheme that allows (a) several weakly confident matches that share a label to support one another and (b) strongly confident matches to dominate (see Fig. 2). This stage reports a distribution of similarity weights over labels, but conceals the number of cameras or features used to obtain it, so that later decision stages can abstract away these details (Section 4.1). Finally, we use temporal smoothing to estimate the action over a short sequence (Section 4.2).
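
A minimal sketch of one such weighting scheme, continuing the data layout of the previous sketch, is given below. The exponential distance-to-weight mapping and the bandwidth sigma are assumptions chosen for illustration, not necessarily the weighting used in the paper.

```python
import numpy as np
from collections import defaultdict

def frame_label_distribution(frame_matches, sigma=1.0):
    """Fuse the (label, distance) matches from every feature block of every camera
    observing one frame into a normalized distribution of weights over labels.

    The output hides how many cameras or feature blocks contributed to it, so later
    decision stages need not know those details.
    """
    scores = defaultdict(float)
    for label, dist in frame_matches:
        scores[label] += np.exp(-dist / sigma)   # assumed distance-to-similarity mapping
    total = sum(scores.values())
    return {label: s / total for label, s in scores.items()}


def sequence_label(frame_distributions):
    """Temporal smoothing: pool per-frame label distributions over a short sequence
    and report the label with the highest accumulated weight."""
    pooled = defaultdict(float)
    for dist in frame_distributions:
        for label, w in dist.items():
            pooled[label] += w
    return max(pooled, key=pooled.get)
```

Under this form of weighting, one very close match contributes a weight near 1 and can dominate a frame's decision, while several moderately confident matches that agree on a label can still outvote a single weak outlier, which is the behavior (a)-(b) above calls for.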

Our architecture requires no volumetric reconstruction and makes it easy to engineer in new sets of features. When a set of features in a camera is confident, it dominates the labeling process for that frame. Similarly, the frames in a sequence that are confident dominate the decision for that sequence. Our experiments (Section 5) demonstrate that our method performs at the state of the art. We show results for several types of features. It is straightforward to incorporate new cameras or new features into our method. Performance generally improves when there are more cameras and more features. Our method is robust to differences in view direction; training and test cameras do not need to overlap. Discriminative views can be exploited opportunistically. Performance degrades gracefully when the test and training data do not share viewing directions. Camera drop in or drop out is handled easily, with little penalty. There is no need to synchronize or calibrate cameras.

The main point of our paper is to show that, when one has multiple views of a person, straightforward data fusion methods give recognition performance comparable with that produced by 3D reconstruction, within a radically simpler system architecture that has significant advantages.

Section snippets

Background

The activity recognition literature is rich; broad reviews of the topic appear in [6], [7], [8], [9], [10], [11]. We confine our review to the main trends in feature types, and to methods that recognize activities from viewpoints that are not in the training set.

Video features

Each camera frame can be represented by many different types of features. In this section, we define our choice of features, though many others could also be used, since our architecture can work with multiple blocks of features at a time. In selecting features, we want feature construction to be simple yet robust. For practical reasons, it should not require camera calibration or point correspondence. Selected features should tolerate minor segmentation errors, and ideally they should
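
Purely as an illustration of what one feature block meeting these practical requirements could look like (this example is an assumption, not necessarily one of the feature types defined in this paper), the sketch below computes a coarse grid-occupancy descriptor over a foreground silhouette.

```python
import numpy as np

def grid_occupancy_block(silhouette, grid=(8, 8)):
    """One illustrative per-frame feature block: the fraction of foreground pixels in
    each cell of a coarse grid placed over the silhouette's bounding box.

    silhouette : 2D binary array (1 = foreground).  The descriptor needs no camera
    calibration or point correspondence, and the coarse grid tolerates minor
    segmentation errors.
    """
    ys, xs = np.nonzero(silhouette)
    if len(ys) == 0:                       # empty segmentation: return an all-zero block
        return np.zeros(grid[0] * grid[1])
    crop = silhouette[ys.min():ys.max() + 1, xs.min():xs.max() + 1].astype(float)
    rows = np.array_split(crop, grid[0], axis=0)
    cells = [np.array_split(r, grid[1], axis=1) for r in rows]
    return np.array([cell.mean() for row in cells for cell in row])
```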

Label fusion

Our goal is to design an architecture that works efficiently with many cameras and features while being robust to changes in the system, such as camera drop out or the addition of new features or new cameras, without disrupting the overall system. In our work, labels are pooled as evidence over cameras and over frames to compute a confident label for the sequence. In particular, we fuse information at three levels to produce three kinds of labels. First, different blocks of features in a single
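
Reading this together with the architecture outline in the Introduction, the three levels can be pictured as block-to-camera, camera-to-frame, and frame-to-sequence pooling. The sketch below is a minimal illustration under that assumption, with uniform additive pooling standing in for the paper's actual weighting.

```python
from collections import defaultdict

def pool(score_lists):
    """Pool several lists of (label, weight) pairs into one list with summed weights."""
    scores = defaultdict(float)
    for pairs in score_lists:
        for label, w in pairs:
            scores[label] += w
    return list(scores.items())


def three_level_fusion(sequence):
    """sequence[t][c][b] = (label, weight) for feature block b of camera c at frame t.

    Level 1: fuse the feature blocks within one camera's view of a frame.
    Level 2: fuse the cameras observing that frame.
    Level 3: fuse the frames of the sequence into a single sequence label.
    """
    frame_scores = []
    for frame in sequence:
        per_camera = [pool([camera_blocks]) for camera_blocks in frame]   # level 1
        frame_scores.append(pool(per_camera))                             # level 2
    sequence_scores = pool(frame_scores)                                  # level 3
    return max(sequence_scores, key=lambda kv: kv[1])[0]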

Dataset

To test our approach, we need a dataset consisting of videos recorded from multiple viewpoints, with actors performing in free orientations. We use the publicly available INRIA Xmas Motion Acquisition Sequences (IXMAS) dataset [16], which has been widely used by many others [18], [21], [34], [20], [26]. It contains 13 actions (Nk), each performed 3 times (No) in different orientations by 12 actors (Np). The action sequences are recorded by 5 cameras (Nc). [37] reports 10% average cross camera accuracy if one trains
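
To make the scale of the data concrete, a short calculation using the counts quoted above (Nk actions, No repetitions, Np actors, each performance seen by Nc cameras):

```python
Nk, No, Np, Nc = 13, 3, 12, 5                 # actions, repetitions, actors, cameras
action_instances = Nk * No * Np               # distinct multi-view action performances
single_view_videos = action_instances * Nc    # one video per camera per performance
print(action_instances, single_view_videos)   # 468 2340
```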

Conclusion

Our method combines the confidence estimates of multiple cameras and various appearance features. This helps resolve mislabeled frames and leads to more accurate activity recognition. In our experiments, we show that having a second view outperforms single-view systems by a considerable amount. We have shown that, by not reconstructing in 3D, we gain significant advantages in system architecture without losing much in performance. Views need not be calibrated to one another; training and

Acknowledgments

S. Pehlivan was supported in part by a research fellowship of the Scientific and Technical Research Council of Turkey while studying as a visiting scholar at the University of Illinois at Urbana-Champaign.

References (46)

  • M. Blank et al., Actions as space-time shapes

  • T. Syeda-Mahmood et al., Recognizing action events from multiple viewpoints

  • Q. Cai et al., Human motion analysis: a review, J. Comput. Vision Image Underst. (1999)

  • W. Hu et al., A survey on visual surveillance of object motion and behaviors, IEEE Trans. Syst. Man Cybern. Part C Appl. Rev. (2004)

  • D. Forsyth et al., Computational studies of human motion I: tracking and animation, Found. Trends Comput. Graph. Vis. (2006)

  • P. Turaga et al., Machine recognition of human activities: a survey, IEEE Trans. Circuits Syst. Video Technol. (2008)

  • L.R. Rabiner, A tutorial on hidden Markov models and selected applications in speech recognition, Proc. IEEE (1989)

  • C. Rao et al., View-invariant representation and recognition of actions, Int. J. Comput. Vis. (2002)

  • C. Schuldt et al., Recognizing human actions: a local SVM approach

  • F. Lv et al., Single view human action recognition using key pose matching and Viterbi path searching

  • I. Laptev et al., Retrieving actions in movies

  • J. Liu et al., Learning human actions via information maximization

  • D. Weinland et al., Action recognition from arbitrary views using 3D exemplars