Temporal segmentation and assignment of successive actions in a long-term video
Highlights
- Characteristic frames are selected in a video instead of using the entire sequence.
- Pairwise-frame representation is employed for action modeling/segmentation.
- Computation time is decreased since a smaller number of frames is used.
- Similar poses appearing in different actions are identified correctly.
Introduction
Human activity analysis has been an attractive and popular research topic over the recent two decades. Most previous works concentrate on recognizing the classes/categories of actions performed in an input video, independently of the background. Significant progress with satisfactory experimental results has been reported, but the experiments are mostly carried out under well-controlled conditions, as in the WEIZMANN (Blank et al., 2005) and KTH (Schuldt et al., 2004) datasets, where short-term clips of a single action (manually segmented/aligned) are provided. In real-world applications, however, human activity is observed as a continuous flow of multiple actions. Moreover, in general we cannot assume any prior knowledge of the categories or the temporal or spatial extents of the performed action(s). A typical human activity looks like the following: a person steps into a room, picks up something to drink from a refrigerator, sits down on a sofa for a short break, and stands up. Given such a video containing a variety of successive actions (walking, picking up, sitting down, standing up, etc.), we have to segment it into individual actions, a natural demand in action-based video indexing/classification, event recording, and vision-surveillance management.
One common and standard approach is as follows. First, in the training phase, a set of features (e.g., interest points (Kovashka and Grauman, 2011), HoG (Thurau and Hlavác, 2008), optical flow (Fathi and Mori, 2008)) is extracted from each frame of the training sequences, and individual actions are then modeled from these features by statistical or geometrical methods, e.g., HMM (Ahmad and Lee, 2008) or SVM (Hoai et al., 2011). When a newly observed sequence appears in the evaluation phase, the probabilities of all its frames are first evaluated against the learned action models, and the segmentation result is obtained by solving a global optimization problem (Hoai et al., 2011; Lv and Nevatia, 2007) or a local optimization problem (Ogale et al., 2007; Jia and Yeung, 2008). This approach has succeeded on some practical problems (e.g., view invariance (Weinland et al., 2007), activity modeling (Wang and Suter, 2007), fast matching (Shakhnarovich et al., 2003)). However, some issues must still be addressed to increase its practical value (Poppe, 2010). In this study, we consider the two aspects below:
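As a rough illustration of the "global optimization" step in this pipeline, the following toy dynamic program labels each frame with the action model that maximizes the total log-likelihood while paying a fixed cost per label switch. It is a generic sketch of the idea, not the algorithm of any cited work; the scores and penalty are hypothetical.

```python
def segment(scores, switch_penalty=1.0):
    """Frame labeling by dynamic programming: `scores[t][a]` is the
    (hypothetical) log-likelihood of frame t under action model a;
    a fixed penalty discourages spurious label switches.
    Returns one label index per frame."""
    T, A = len(scores), len(scores[0])
    best = [list(scores[0])]   # best[t][a]: best total score ending at t with label a
    back = []                  # backpointers for recovering the labeling
    for t in range(1, T):
        row, brow = [], []
        for a in range(A):
            prev = max(range(A),
                       key=lambda b: best[-1][b] - (switch_penalty if b != a else 0.0))
            row.append(scores[t][a] + best[-1][prev]
                       - (switch_penalty if prev != a else 0.0))
            brow.append(prev)
        best.append(row)
        back.append(brow)
    a = max(range(A), key=lambda b: best[-1][b])
    labels = [a]
    for brow in reversed(back):  # trace backpointers from the last frame
        a = brow[a]
        labels.append(a)
    return labels[::-1]
```

On a six-frame toy input whose first half favors model 0 and second half favors model 1, the program recovers two clean segments rather than flickering between labels.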
- (a)
It is redundant to use every frame of a video sequence, because neighboring frames are highly correlated (very similar) in the temporal domain. Moreover, such frame-by-frame comparison is computationally expensive. In fact, it has been reported that only a few frames of an input video are sufficient for action discrimination (Schindler and Van-Gool, 2008; Weinland and Boyer, 2008).
- (b)
Single-frame-based representation is sufficient for modeling human actions only when each video contains a single action. For videos containing more than one action, this approach does not work well, since different actions can share very similar frames (Fig. 1).
To address these issues, the proposed framework has the following two features:
- (a)
Given a long-term video sequence, just a few frames are selected by the martingale framework proposed in our prior work (Lu et al., 2012), which runs without requiring any prior knowledge of the action(s) possibly performed. Such frames are called characteristic frames here.
- (b)
For modeling/segmenting actions, a pairwise-frame representation built on the characteristic frames is employed to describe the given video sequence.
The rest of this paper is organized as follows: Section 2 reviews related work. Section 3 presents the selection procedure for characteristic frames. Section 4 describes how to model a human activity using characteristic frames. A detailed description of the temporal segmentation of successive actions is given in Section 5. Section 6 presents experimental results on the IXMAS dataset, followed by a discussion in Section 7. Section 8 concludes the paper and outlines future work.
Related work
Recent efforts on segmentation of successive actions fall roughly into five approaches (Hoai et al., 2011). The first, change-point-detection-based segmentation (Xuan and Murphy, 2007; Harchaoui et al., 2009), is the most popular and rests on change-point analysis with a sliding window along the temporal extent. Xuan and Murphy (2007) modeled the joint density of vector-valued observations using undirected Gaussian graphical models, by which the locations of change points are
Selection of characteristic frames in a video
An efficient way of selecting characteristic frames in a given video sequence has been proposed by the authors (Lu et al., 2012). This selection method rests on two basic ideas. The first is that an observed video sequence can be sufficiently characterized by a few characteristic frames describing the basic actions; the other is that, by regarding the input video sequence as a set of data streams in which successive frames are almost identical, the characteristic frames can be detected as
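Such martingale-based change detection can be sketched roughly as follows. This is a simplified, deterministic variant of a power-martingale test; the strangeness measure (distance to the running mean), the rank-based p-value with tie splitting, and the toy threshold are illustrative assumptions tuned to the small example below, not the exact settings of Lu et al. (2012).

```python
def select_characteristic_frames(feats, eps=0.92, lam=1.05):
    """Martingale-style change detection over a stream of per-frame
    feature values (scalars here for simplicity). Returns the indices
    of frames where the martingale value exceeds `lam`, taken as the
    characteristic frames."""
    changes = []
    history = []      # feature values since the last detected change
    strangeness = []  # strangeness score of each frame in `history`
    M = 1.0           # current martingale value
    for i, x in enumerate(feats):
        history.append(x)
        mean = sum(history) / len(history)
        s = abs(x - mean)  # strangeness: how atypical this frame is
        greater = sum(1 for t in strangeness if t > s)
        equal = sum(1 for t in strangeness if t == s) + 1  # +1 for s itself
        strangeness.append(s)
        p = (greater + 0.5 * equal) / len(strangeness)  # rank-based p-value
        M *= eps * p ** (eps - 1)  # power-martingale update
        if M > lam:  # exchangeability violated: a new action likely started
            changes.append(i)
            history, strangeness, M = [], [], 1.0
    return changes
```

On a toy stream of near-identical frames followed by frames of a clearly different pose value, the martingale stays low over the homogeneous stretch and crosses the threshold shortly after the change, so only a handful of frames are flagged.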
Supervised training for human activity model
We will describe a way to learn models for individual actions and for transitions between two successive actions from a collection of videos containing many successive actions. Given a video sequence of n frames with correct action labels, we extract m characteristic frames by the above martingale test; for simplicity, we denote this set of frames by F. Then we couple the characteristic frames pairwise, as well as the
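A minimal sketch of the pairing and per-class training step, assuming simple per-frame feature vectors. A single diagonal Gaussian stands in for the paper's GMMs to keep the code short, and the "a->b" transition-label convention is our illustrative naming, not the paper's notation.

```python
import numpy as np

def make_pair_samples(frames, labels):
    """Couple consecutive characteristic frames into pair vectors.
    A pair keeps the action label when both frames share it, and
    gets a transition label such as "walk->sit" otherwise."""
    X, y = [], []
    for i in range(len(frames) - 1):
        X.append(np.concatenate([frames[i], frames[i + 1]]))
        a, b = labels[i], labels[i + 1]
        y.append(a if a == b else f"{a}->{b}")
    return np.array(X), y

def fit_models(X, y):
    """Fit one diagonal Gaussian (mean, variance) per label; a small
    variance floor keeps degenerate classes usable."""
    models = {}
    for label in set(y):
        Z = X[[i for i, t in enumerate(y) if t == label]]
        models[label] = (Z.mean(axis=0), Z.var(axis=0) + 1e-6)
    return models
```

Because transitions between actions get their own labels, a pair that straddles an action boundary is modeled explicitly instead of being forced into one of the two neighboring action classes.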
Probability computation
With the GMMs learned from the training sequences, a newly observed video sequence X is processed as follows. The characteristic frames are selected first, and a series of pairs is then generated by the same procedure as in the training phase. The posterior probability of each GMM given a pair is calculated by Bayes' rule, normalizing over the number of trained models of possible T
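The Bayes-rule step can be sketched as follows, again with a diagonal Gaussian standing in for each trained GMM and uniform priors assumed unless given otherwise; function names and the prior handling are illustrative.

```python
import numpy as np

def log_gauss(x, mean, var):
    """Log density of a diagonal Gaussian."""
    return -0.5 * np.sum(np.log(2 * np.pi * var) + (x - mean) ** 2 / var)

def posterior(x, models, priors=None):
    """Posterior P(model | pair vector x) via Bayes' rule, i.e. each
    model's likelihood times its prior, renormalized over all models."""
    names = sorted(models)
    logp = np.array([log_gauss(x, *models[m]) for m in names])
    if priors is not None:
        logp = logp + np.log(np.array([priors[m] for m in names]))
    w = np.exp(logp - logp.max())  # subtract max for numerical stability
    return dict(zip(names, w / w.sum()))
```

Working in log space and subtracting the maximum before exponentiating avoids underflow when pair vectors are far from most models, which is the common case with many action and transition classes.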
Database
The proposed framework was validated on the publicly available multi-view IXMAS database (Weinland et al., 2007), containing 180 video sequences in total. In each sequence, one of 12 actors performed 15 actions in succession. The database includes 2D data (image sequences and 2D silhouette sequences at a resolution of 160 × 120 pixels).
Experimental setup and implementation
In the following two subsections we first evaluate the efficiency of the method for selecting characteristic frames (Section
Discussion
In the proposed framework for temporal segmentation of successive actions, one difficulty lies in determining the martingale threshold for selecting characteristic frames. As seen in Fig. 6 and Fig. 8, it is difficult to choose a universally effective threshold value. From Fig. 8(a) and (b), we recommend the martingale threshold and block size B reported there for practical usage, considering the noise possibly present in a video sequence.
Many approaches proposed so far
Conclusion and future work
In this study, we have proposed a novel framework for temporal segmentation of successive actions, summarized as follows: (1) given a long-term video sequence, a small number of characteristic frames is first selected by a change-detection algorithm exploiting a martingale property; (2) a pairwise-frame representation of consecutive characteristic frames is then employed to compute the likelihood with respect to trained action models constructed for individual actions and transition actions, and
References (39)
- et al., 2008. Human action recognition using shape and CLG-motion flow from multi-view image sequences. Pattern Recognit.
- 2010. A survey on vision-based human action recognition. Image Vision Comput.
- et al., 2003. Recent developments in human motion analysis. Pattern Recognit.
- et al., 2006. Free viewpoint action recognition using motion history volumes. Comput. Vision Image Understanding.
- Blank, M., Gorelick, L., Shechtman, E., Irani, M., Basri, R., 2005. Actions as space–time shapes. In: Proc. ICCV, ...
- et al., 2000. Machine learning of event segmentation for news on demand. Comm. ACM.
- et al., 2000. Robust real-time periodic motion detection, analysis, and applications. Pattern Anal. Machine Intell.
- Fathi, A., Mori, G., 2008. Action recognition by learning mid-level motion features. In: Proc. CVPR, pp. ...
- Harchaoui, Z., Bach, F., Moulines, E., 2009. Kernel change point analysis. In: Neural Inf. Process. ...
- A martingale framework for detecting changes in data streams by testing exchangeability. Pattern Anal. Machine Intell.
- Multi-view human movement recognition based on fuzzy distances and linear discriminant analysis. Comput. Vision Image Understanding, Special Issue on Semantic Understanding of Human Behaviors in Image Sequences.
- View-independent action recognition from temporal self-similarities. Pattern Anal. Machine Intell.