Pattern Recognition Letters

Volume 34, Issue 15, 1 November 2013, Pages 1936-1944

Temporal segmentation and assignment of successive actions in a long-term video

https://doi.org/10.1016/j.patrec.2012.10.023

Abstract

Temporal segmentation of successive actions in a long-term video sequence has been a long-standing problem in computer vision. In this paper, we propose a novel learning-based framework. Given a video sequence, only a few characteristic frames are selected by the proposed selection algorithm; the likelihoods with respect to trained models are then calculated in a pairwise fashion, and finally the segmentation is obtained as the model sequence that maximizes the likelihood. The average frame-level accuracy on the IXMAS dataset reaches 80.5%, using only 16.5% of all frames and a computation time of 1.57 s per video (1160 frames on average).

Highlights

• Characteristic frames are selected in a video instead of the entire sequence.
• A pairwise-frame representation is employed for action modeling/segmentation.
• Computation time is decreased since a smaller number of frames is used.
• Similar poses appearing in different actions are identified correctly.

Introduction

Human activity analysis has been an attractive and popular research topic over the past two decades. Most previous works concentrate on recognizing the classes/categories of actions performed in an input video, independently of background. Significant progress with satisfactory experimental results has been reported, but the experiments are mostly carried out under well-controlled conditions, as in the WEIZMANN (Blank et al., 2005) and KTH (Schuldt et al., 2004) datasets, where short clips of a single action (manually segmented/aligned) are provided. In real-world applications, however, human activity is observed as a continuous flow of multiple actions. Moreover, in general we cannot assume any prior knowledge of the categories or the temporal or spatial extents of the performed action(s). A typical human activity looks like the following: a person steps into a room, picks up something to drink from a refrigerator, sits down on a sofa for a little break and stands up. Given such a video containing a variety of successive actions (walking, picking up, sitting down, standing up, etc.), we have to segment it into individual actions, a natural demand in applications such as action-based video indexing/classification, event recording and vision-surveillance management.

One common and standard approach is as follows. First, in the training phase, a set of features (e.g., interest points (Kovashka and Grauman, 2011), HoG (Thurau and Hlavác, 2008), optical flow (Fathi and Mori, 2008)) is extracted from each frame of the training sequences, and then individual actions are modeled with these features by statistical or geometrical methods, e.g., HMM (Ahmad and Lee, 2008) or SVM (Hoai et al., 2011). When a newly observed sequence arrives in the evaluation phase, the probabilities of all its frames are first evaluated against the learned action models, and the segmentation result is obtained by solving a global optimization problem (Hoai et al., 2011, Lv and Nevatia, 2007) or a local optimization problem (Ogale et al., 2007, Jia and Yeung, 2008). This approach has succeeded in some practical problems (e.g., view invariance (Weinland et al., 2007), activity modeling (Wang and Suter, 2007), fast matching (Shakhnarovich et al., 2003)). However, some issues still need to be addressed to increase its practical value (Poppe, 2010). In this study, we consider the following two aspects:

  • (a)

    It is redundant to use every frame of a video sequence, because neighboring frames are highly correlated (very similar) in the temporal domain. Moreover, such a frame-by-frame comparison is computationally expensive. In fact, it has been reported that only a few frames of an input video are sufficient for action discrimination (Schindler and Van-Gool, 2008, Weinland and Boyer, 2008).

  • (b)

    A single-frame based representation is sufficient for modeling human actions only when a video contains a single action. For videos containing more than one action, this approach does not work, since different actions can share very similar frames (Fig. 1).

To cope with these problems, the following two techniques are proposed in this study (Fig. 2):
  • (a)

    Given a long-term video sequence, just a few frames are selected by the martingale framework proposed in our prior work (Lu et al., 2012), which runs without requiring any prior knowledge of the action(s) possibly performed. We call such frames characteristic frames.

  • (b)

    For modeling/segmenting actions, a pairwise-frame representation built on characteristic frames is employed to describe the given video sequence.

Since we use a pairwise-frame representation instead of a single-frame based one, the time-differentiated information carried by two neighboring characteristic frames provides a higher level of discriminative information among actions. In addition, selecting a smaller number of frames from the whole video sequence improves efficiency.

The rest of this paper is organized as follows: Section 2 reviews related work. Section 3 presents the selection procedure for characteristic frames. Section 4 describes how to model a human activity using characteristic frames. A detailed description of the temporal segmentation of successive actions is given in Section 5. Section 6 presents the experimental results on the IXMAS dataset, followed by a discussion in Section 7. Section 8 concludes the paper and outlines future work.

Section snippets

Related work

Recent efforts on the segmentation of successive actions fall roughly into five approaches (Hoai et al., 2011). The first approach, change-point based action segmentation (Xuan and Murphy, 2007, Harchaoui et al., 2009), is the most popular and relies on change-point analysis with a sliding window along the temporal extent. Xuan and Murphy (2007) modeled the joint density of vector-valued observations using undirected Gaussian graphical models, by which the locations of change points are

Selection of characteristic frames in a video

An efficient way of selecting characteristic frames in a given video sequence has been proposed by the authors (Lu et al., 2012). The selection method rests on two basic ideas. The first is that an observed video sequence can be sufficiently characterized by a few characteristic frames describing the basic actions; the other is that, by considering the input video sequence as a set of data streams in which successive frames are almost identical, the characteristic frames can be detected as
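
As a concrete illustration, the following is a minimal Python sketch of a randomized power-martingale change-detection test in the spirit of the cited martingale framework (Ho et al., 2010; Lu et al., 2012). The per-frame descriptors, the strangeness measure (distance to the running mean of the current segment), the default parameter values and all function names are illustrative assumptions, not the paper's exact formulation.

```python
import numpy as np

def select_characteristic_frames(features, lam=1.3, eps=0.92, rng=None):
    """Select characteristic frames with a randomized power martingale.

    `features` is an (n_frames, d) array of per-frame descriptors
    (hypothetical input; the paper's descriptors follow Lu et al., 2012).
    A frame is flagged as characteristic when the martingale value
    exceeds the threshold `lam`, i.e. the stream of frames stops
    looking exchangeable and a change of pose/action is suspected.
    """
    rng = np.random.default_rng() if rng is None else rng
    selected = [0]                     # always keep the first frame
    window, strangeness = [], []
    M = 1.0                            # current martingale value
    for i, f in enumerate(features):
        window.append(f)
        # strangeness: distance to the running mean (illustrative choice)
        s = float(np.linalg.norm(f - np.mean(window, axis=0)))
        strangeness.append(s)
        s_arr = np.asarray(strangeness)
        # randomized conformal p-value of the current strangeness score
        theta = rng.uniform()
        p = (np.sum(s_arr > s) + theta * np.sum(s_arr == s)) / len(s_arr)
        M *= eps * p ** (eps - 1.0)    # power-martingale update
        if M > lam:                    # exchangeability rejected: change point
            selected.append(i)
            window, strangeness, M = [f], [s], 1.0   # restart the test
    return selected
```

Whenever the martingale value exceeds the threshold, exchangeability is rejected, the current frame is kept as a characteristic frame, and the test restarts from that frame.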

Supervised training for human activity model

We describe a way to learn models for individual actions and for transitions between two successive actions from a collection of videos containing many successive actions. Given a video sequence $F=\{f_1,f_2,\ldots,f_n\}$ of $n$ frames with correct action labels $\{y_1,y_2,\ldots,y_n\}$, we extract $m$ characteristic frames $F_c=\{f_{c_1},f_{c_2},\ldots,f_{c_m}\}\subset F$ ($m\ll n$) by the above martingale test. For simplicity, we regard $F_c$ as $F$. Then we couple the characteristic frames pairwise, such as
$$G=\{(f_1,f_2),(f_2,f_3),\ldots,(f_{m-1},f_m)\},$$
as well as the
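
Although the snippet above is cut off, the pairing and model-training step it describes can be sketched as follows. The use of scikit-learn's GaussianMixture, the diagonal covariance, the number of components and the data layout are assumptions made for illustration; transition models would be trained analogously on pairs straddling two action labels.

```python
import numpy as np
from sklearn.mixture import GaussianMixture

def make_pairs(char_feats):
    """Couple consecutive characteristic frames: G = {(f1,f2), (f2,f3), ...}."""
    return [np.concatenate([a, b]) for a, b in zip(char_feats[:-1], char_feats[1:])]

def train_action_models(train_data, n_components=3):
    """Fit one GMM per action over pairwise-frame descriptors.

    `train_data` (hypothetical layout) maps an action label to a list of
    (m, d) arrays of characteristic-frame descriptors from training videos.
    """
    models = {}
    for label, feats_list in train_data.items():
        X = np.vstack([make_pairs(f) for f in feats_list if len(f) >= 2])
        models[label] = GaussianMixture(n_components=n_components,
                                        covariance_type='diag').fit(X)
    return models
```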

Probability computation

With GMMs learned from the training sequences, a newly observed video sequence $X$ is processed as follows. The characteristic frames $\{h_1,h_2,\ldots,h_m\}$ are selected first, and then a series of pairs $H=\{c_1=(h_1,h_2),c_2=(h_2,h_3),\ldots,c_{m-1}=(h_{m-1},h_m)\}$ is generated by the same procedure as in the training phase. Then the posterior probability of GMM $\varphi_t$ given $c_i$ ($i=1,2,\ldots,m-1$) is calculated by Bayes' rule as
$$p(\varphi_t\mid c_i)=\frac{p(c_i\mid\varphi_t)\,p(\varphi_t)}{\sum_{t'=1}^{T}p(c_i\mid\varphi_{t'})\,p(\varphi_{t'})},$$
where $T$ is the number of trained models of the possible
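
A minimal sketch of this evaluation step, under the same illustrative assumptions as above, is given below. It evaluates the Bayes-rule posterior for every pair via the log-densities returned by scikit-learn's score_samples, then merges per-pair arg-max assignments into segments; this greedy assignment is a simplification of the paper's search for the globally optimal model sequence.

```python
import numpy as np
from scipy.special import logsumexp

def segment(models, pairs, priors=None):
    """Assign each pair of characteristic frames to its most probable model.

    `models` maps a label to a fitted GaussianMixture; `pairs` is the
    (m-1, 2d) array of pairwise descriptors of the observed sequence.
    Returns (label, first_pair, last_pair) segments.
    """
    labels = list(models)
    P = np.asarray(pairs)
    priors = (np.full(len(labels), 1.0 / len(labels))
              if priors is None else np.asarray(priors))
    # log p(c_i | phi_t) for every pair i and model t
    loglik = np.stack([models[t].score_samples(P) for t in labels], axis=1)
    logpost = loglik + np.log(priors)
    logpost -= logsumexp(logpost, axis=1, keepdims=True)  # Bayes' rule (normalization)
    best = logpost.argmax(axis=1)
    # merge runs of identical assignments into segments
    segments, start = [], 0
    for i in range(1, len(best) + 1):
        if i == len(best) or best[i] != best[start]:
            segments.append((labels[best[start]], start, i - 1))
            start = i
    return segments
```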

Database

The proposed framework was validated on the publicly available multi-view IXMAS database (Weinland et al., 2007), containing 180 video sequences (36 shots × 5 views) in total. In each sequence, one of 12 actors performs 15 actions in succession. The database includes 2D data (at a resolution of 160 × 120 pixels) consisting of image sequences and 2D silhouette sequences.

Experimental setup and implementation

In the following two subsections we first evaluate the efficiency of the characteristic-frame selection method (Section

Discussion

In the proposed framework for temporal segmentation of successive actions, one difficulty lies in determining the martingale threshold λ for selecting characteristic frames. As seen in Figs. 6 and 8, it is difficult to choose a universally effective value of λ. From Fig. 8(a) and (b), we recommend a martingale threshold in the range 1.25 ≤ λ ≤ 1.35 and a block size B of 10 × 10 pixels for practical usage, considering the noise possibly present in a video sequence.

Many approaches proposed so far

Conclusion and future work

In this study, we have proposed a novel framework for temporal segmentation of successive actions, summarized as follows: (1) given a long-term video sequence, a small number of characteristic frames is first selected by a change-detection algorithm exploiting a martingale property; (2) a pairwise-frame representation of consecutive characteristic frames is then employed to calculate the likelihood with respect to trained action models constructed for individual actions and transitive actions, and

References (39)

  • Hoai, M., Lan, Z.-Z., De la Torre, F., 2011. Joint segmentation and classification of human actions in video. In: ...
  • Ho, S.S., et al., 2010. A martingale framework for detecting changes in data streams by testing exchangeability. Pattern Anal. Machine Intell.
  • Iosifidis, A., et al., 2011. Multi-view human movement recognition based on fuzzy distances and linear discriminant analysis. Comput. Vision Image Understanding, Special Issue on Semantic Understanding of Human Behaviors in Image Sequences.
  • Jia, K., Yeung, D.Y., 2008. Human action recognition using local spatio-temporal discriminant embedding. In: Proc. ...
  • Junejo, I., Dexter, E., Laptev, I., Perez, P., 2008. Cross-view action recognition from temporal self-similarities. In: ...
  • Junejo, I., et al., 2011. View-independent action recognition from temporal self-similarities. Pattern Anal. Machine Intell.
  • Kovashka, A., Grauman, K., 2011. Learning a hierarchy of discriminative space-time neighborhood features for human ...
  • Laptev, I., Belongie, S.J., Perez, P., Wills, J., 2005. Periodic motion detection and segmentation via approximate ...
  • Lewandowski, M., Makris, D., Nebel, J.C., 2010. View and style-independent action manifolds for human activity ...