Recognizing activities in multiple views with fusion of frame judgments☆
Introduction
There is a broad range of applications for systems that can recognize human activity in video. Medical applications include monitoring patient activity to track the progress of stroke patients, or to keep patients with dementia safe. Safety applications include detecting unusual or suspicious behavior, or detecting pedestrians to avoid accidents. The problem remains difficult for several reasons. There is no canonical taxonomy of human activities. Changes in illumination direction and viewing direction cause massive changes in what people look like. Individuals can look very different from one another, and the same activity performed by different people can vary widely in appearance.
Generally, we expect that having multiple views makes recognizing human activity easier. There is support for this expectation in the literature (e.g., see Section 2). However, these results tend not to take into account various desirable engineering features for distributed multi-camera systems. In such systems, we may not be able to get accurate geometric calibrations of the cameras with respect to one another (e.g., if the cameras are dropped into terrain). Cameras might drop in or out at any time, and we need a simple architecture that can opportunistically exploit whatever data is available. We will not be able to set cameras at fixed locations with respect to the moving people, meaning that training data might be obtained from different view directions than test data.
In this paper, we describe an architecture to label activities using multiple views. Fig. 1 shows the main structure of our architecture. We assume that there are one or more cameras, and that each camera can compute one or more blocks of features representing each frame. Breaking features into blocks allows us to insert new sets of features without disrupting the overall architecture. In the first step, each block of features for each frame of each camera is used for a nearest neighbor query, independent of all other cameras, frames or blocks (Section 4).
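To make the first step concrete, the sketch below shows one way the per-block nearest-neighbor queries could be organized; the class and function names (FeatureBlockIndex, query_frame) and the use of scikit-learn are illustrative assumptions, not the implementation used in the paper.

```python
# Minimal sketch: each feature block of each camera frame issues its own
# nearest-neighbor query against labeled training exemplars, independently
# of every other block, frame, or camera.
import numpy as np
from sklearn.neighbors import NearestNeighbors

class FeatureBlockIndex:
    """Independent nearest-neighbor index for one block of features."""
    def __init__(self, train_vectors, train_labels, k=1):
        self.nn = NearestNeighbors(n_neighbors=k).fit(np.asarray(train_vectors))
        self.labels = np.asarray(train_labels)

    def query(self, feature_vector):
        # Return (distance, activity label) pairs for the closest exemplars.
        dist, idx = self.nn.kneighbors(np.asarray(feature_vector).reshape(1, -1))
        return list(zip(dist[0], self.labels[idx[0]]))

def query_frame(frame_blocks, indices):
    """Match every block of one camera frame independently of all others."""
    return {name: indices[name].query(vec) for name, vec in frame_blocks.items()}
```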
In the second step, the resulting matches are combined with a weighting scheme. Because the viewing direction of any camera with respect to the body is unknown, some frames (or feature blocks) might be ambiguous. We expect that having a second view should disambiguate some frames, so it makes sense to combine matches over cameras. However, close matches are very likely to be right. This suggests using a scheme that allows (a) several weakly confident matches that share a label to support one another and (b) strongly confident matches to dominate (see Fig. 2). This stage reports a distribution of similarity weights over labels, but conceals the number of cameras or features used to obtain it, so that later decision stages can abstract away these details (Section 4.1). Finally, we use temporal smoothing to estimate the action in a short sequence (Section 4.2).
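As a rough illustration of this weighting scheme, one common choice (an assumption here, not necessarily the paper's exact formula) is to turn each match distance d into a similarity weight exp(-d/σ): a single very close match contributes a large weight and dominates, while several weaker matches that agree on a label can still accumulate support. Pooling these weights over all cameras and feature blocks yields a label distribution that hides how many sources produced it, and averaging the per-frame distributions gives a simple form of temporal smoothing.

```python
# Sketch of the fusion and smoothing stages under the exp(-d / sigma)
# weighting assumption described above; function names are illustrative.
import math
from collections import defaultdict

def fuse_matches(matches, sigma=1.0):
    """Pool (distance, label) matches from all cameras and feature blocks
    into a normalized distribution of similarity weights over labels.
    The result carries no record of how many cameras or blocks were used."""
    scores = defaultdict(float)
    for dist, label in matches:
        scores[label] += math.exp(-dist / sigma)   # close matches dominate
    total = sum(scores.values()) or 1.0
    return {label: s / total for label, s in scores.items()}

def smooth_over_sequence(frame_distributions):
    """Temporal smoothing: average the per-frame label distributions over a
    short sequence and report the label with the largest pooled weight."""
    pooled = defaultdict(float)
    for distribution in frame_distributions:
        for label, weight in distribution.items():
            pooled[label] += weight
    return max(pooled, key=pooled.get)
```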
Our architecture requires no volume reconstruction and makes it easy to engineer in new sets of features. When a set of features in a camera is confident, it dominates the labeling process for that frame. Similarly, the frames in a sequence that are confident dominate the decision for the sequence. Our experiments (Section 5) demonstrate that our method performs at the state of the art. We show results for several types of features. It is straightforward to incorporate new cameras or new features into our method. Performance generally improves with more cameras and more features. Our method is robust to differences in view direction; training and test cameras do not need to overlap, and discriminative views can be exploited opportunistically. Performance degrades when the test and training data do not share viewing directions. Camera drop-in or drop-out is handled easily with little penalty, and there is no need to synchronize or calibrate cameras.
The main point of our paper is to show that, when one has multiple views of a person, straightforward data fusion methods give recognition performance comparable to that produced by 3D reconstruction, in the context of a radically simpler system architecture with significant advantages.
Section snippets
Background
The activity recognition literature is rich; broad reviews of the topic appear in [6], [7], [8], [9], [10], [11]. We confine our review to the main trends in feature types and in methods that recognize activities from viewpoints that are not in the training set.
Video features
Each camera frame can be represented by many different types of features. In this section, we define our choice of features, but many others could also be used, since our architecture is able to work with blocks of features at a time. When selecting features, we want feature construction to be simple yet robust. For practical reasons, it should not require camera calibration or point correspondence. The selected features should tolerate minor segmentation errors, and ideally they should…
Label fusion
Our goal is to design an architecture that works efficiently with many cameras and features while being robust to changes in the system, such as camera drop-out or the addition of new features or new cameras, without disrupting the overall system. In our work, labels serve as evidence that is pooled over cameras and over frames to compute a confident label for a sequence. In particular, we fuse information at three levels to produce three kinds of labels. First, different blocks of features in a single…
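The sketch below ties the three fusion levels together end to end, reusing the illustrative helpers from the earlier sketches (FeatureBlockIndex, query_frame, fuse_matches, smooth_over_sequence); it is an assumption-laden outline of the architecture described in the text, not the authors' code.

```python
# End-to-end outline of the three fusion levels, using the helper sketches
# defined earlier in this document.
def label_sequence(sequence, indices, sigma=1.0):
    """sequence: list of time steps; each step maps a camera id to a dict of
    {block name: feature vector}. Cameras may appear or disappear per step."""
    frame_distributions = []
    for cameras in sequence:
        matches = []
        # Levels 1-2: pool evidence from every feature block of every camera
        # that happens to be available at this time step.
        for frame_blocks in cameras.values():
            for block_matches in query_frame(frame_blocks, indices).values():
                matches.extend(block_matches)
        frame_distributions.append(fuse_matches(matches, sigma))
    # Level 3: smooth the frame judgments over time to label the sequence.
    return smooth_over_sequence(frame_distributions)
```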
Dataset
To test our approach, we need a dataset consisting of videos of multiple viewpoints of actors in free orientations. We use the publicly available INRIA Xmas Motion Acquisition Sequences (IXMAS) dataset [16] (this dataset has been widely used by others [18], [21], [34], [20], [26]). It contains 13 actions (Nk), each performed 3 times (No) in different orientations by 12 actors (Np). The action sequences are recorded by 5 cameras (Nc). [37] reports 10% average cross-camera accuracy if one trains…
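From the counts quoted above, a quick tally of the dataset size (useful, for example, for sizing the nearest-neighbor indices) is:

```python
# Worked count from the figures above: Nk actions x No repetitions x Np actors,
# each instance recorded by Nc cameras.
Nk, No, Np, Nc = 13, 3, 12, 5
sequences_per_view = Nk * No * Np        # 468 action sequences per camera
total_clips = sequences_per_view * Nc    # 2340 single-view video clips
print(sequences_per_view, total_clips)
```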
Conclusion
Our method combines the confidence estimates of multiple cameras and various appearance features. This helps correct mislabeled frames and yields more accurate activity recognition. In our experiments, we show that having a second view outperforms single-view systems by a considerable margin. We have shown that, by not reconstructing in 3D, we have gained significant advantages in system architecture without losing much in performance. Views need not be calibrated to one another; training and…
Acknowledgments
S. Pehlivan was supported in part by a research fellowship from the Scientific and Technical Research Council of Turkey while studying as a visiting scholar at the University of Illinois at Urbana-Champaign.
References (46)
A survey on vision-based human action recognition, Image Vis. Comput. (2010).
A survey of vision-based methods for action representation, segmentation and recognition, Comput. Vis. Image Underst. (2011).
Human action-recognition using mutual invariants, Comput. Vis. Image Underst. (2005).
Free viewpoint action recognition using motion history volumes, Comput. Vis. Image Underst. (2006).
Matching actions in presence of camera motion, Comput. Vis. Image Underst. (2006).
Histogram of oriented rectangles: a new pose descriptor for human action recognition, Image Vis. Comput. (2009).
A new pose-based representation for recognizing actions from multiple cameras, Comput. Vis. Image Underst. (2011).
The recognition of human movement using temporal templates, IEEE Trans. Pattern Anal. Mach. Intell. (2001).
Recognizing action at a distance.
Space-time interest points.
Actions as space-time shapes.
Recognizing action events from multiple viewpoints.
Human motion analysis: a review, Comput. Vis. Image Underst.
A survey on visual surveillance of object motion and behaviors, IEEE Trans. Syst. Man Cybern. Part C Appl. Rev.
Computational studies of human motion I: tracking and animation, Found. Trends Comput. Graph. Vis.
Machine recognition of human activities: a survey, IEEE Trans. Circuits Syst. Video Technol.
A tutorial on hidden Markov models and selected applications in speech recognition, Proc. IEEE.
View-invariant representation and recognition of actions, Int. J. Comput. Vis.
Recognizing human actions: a local SVM approach.
Single view human action recognition using key pose matching and Viterbi path searching.
Retrieving actions in movies.
Learning human actions via information maximization.
Action recognition from arbitrary views using 3D exemplars.
☆ This paper has been recommended for acceptance by Xiaogang Wang.