Pattern Recognition Letters

Volume 34, Issue 15, 1 November 2013, Pages 1936-1944

Temporal segmentation and assignment of successive actions in a long-term video

https://doi.org/10.1016/j.patrec.2012.10.023

Abstract

Temporal segmentation of successive actions in a long-term video sequence has been a long-standing problem in computer vision. In this paper, we propose a novel learning-based framework. Given a video sequence, only a few characteristic frames are selected by the proposed selection algorithm; the likelihoods with respect to trained models are then calculated in a pairwise fashion, and finally the segmentation is obtained as the model sequence that maximizes the likelihood. The average frame-level accuracy on the IXMAS dataset reaches 80.5%, using only 16.5% of all frames and a computation time of 1.57 s per video (1160 frames on average).

Highlights

• Characteristic frames are selected in a video instead of the entire sequence.
• A pairwise-frame representation is employed for action modeling/segmentation.
• Computation time is decreased since a smaller number of frames is used.
• Similar poses appearing in different actions are identified correctly.

Introduction

Human activity analysis has been an attractive and popular research topic over the past two decades. Most previous works concentrate on recognizing the classes/categories of actions performed in an input video, independently of background. Significant progress with satisfactory experimental results has been reported, but the experiments are mostly carried out under well-controlled conditions, as in the WEIZMANN (Blank et al., 2005) and KTH (Schuldt et al., 2004) datasets, where short clips of a single action (manually segmented/aligned) are provided. In real-world applications, however, human activity is observed as a continuous flow of multiple actions. Moreover, in general we cannot assume any prior knowledge of the categories or the temporal or spatial extents of the performed action(s). A typical human activity looks like the following: a person steps into a room, picks up something to drink from a refrigerator, sits down on a sofa for a little break and stands up. Given such a video containing a variety of successive actions (walking, picking up, sitting down, standing up, etc.), we have to segment it into individual actions, a natural demand in applications such as action-based video indexing/classification, event recording and vision-surveillance management.

One common and standard approach is as follows. First, in the training phase, a set of features (e.g., interest points (Kovashka and Grauman, 2011), HoG (Thurau and Hlavác, 2008), optical flow (Fathi and Mori, 2008)) is extracted from each frame of the training sequences, and then individual actions are modeled with these features by statistical or geometrical methods, e.g., HMM (Ahmad and Lee, 2008) or SVM (Hoai et al., 2011). When a newly observed sequence arrives in the evaluation phase, the probabilities of all its frames are first evaluated against the learned action models, and the segmentation result is obtained by solving a global optimization problem (Hoai et al., 2011, Lv and Nevatia, 2007) or a local optimization problem (Ogale et al., 2007, Jia and Yeung, 2008). This approach has succeeded in some practical problems (e.g., view invariance (Weinland et al., 2007), activity modeling (Wang and Suter, 2007), fast matching (Shakhnarovich et al., 2003)). However, some issues still need to be addressed to increase its practical value (Poppe, 2010). In this study, we consider the following two aspects:

  • (a)

    It is redundant to use every frame of a video sequence, because neighboring frames are highly correlated (very similar) in the temporal domain. Moreover, such a frame-by-frame comparison is computationally expensive. In fact, it has been reported that only a few frames of an input video are sufficient for action discrimination (Schindler and Van-Gool, 2008, Weinland and Boyer, 2008).

  • (b)

    A single-frame based representation is sufficient for modeling human actions only when a video contains a single action. For videos containing more than one action, this approach does not work, since different actions can share very similar frames (Fig. 1).

To cope with these problems, the following two techniques are proposed in this study (Fig. 2):
  • (a)

    Given a long-term video sequence, just a few frames are selected by the martingale framework proposed in our prior work (Lu et al., 2012), which runs without requiring any prior knowledge of the action(s) possibly performed. We call such frames characteristic frames.

  • (b)

    For modeling/segmenting actions, a pairwise-frame representation built on characteristic frames is employed to describe the given video sequence.

Since we use a pairwise-frame representation instead of a single-frame based one, the time-differentiated information carried by two neighboring characteristic frames provides a higher level of discriminative information among actions. In addition, selecting a smaller number of frames from the whole video sequence improves efficiency.

The rest of this paper is organized as follows: Section 2 reviews related work. Section 3 presents the selection procedure for characteristic frames. Section 4 describes how to model a human activity using characteristic frames. A detailed description of the temporal segmentation of successive actions is given in Section 5. Section 6 presents the experimental results on the IXMAS dataset, followed by a discussion in Section 7. Section 8 concludes the paper and outlines future work.

Section snippets

Related work

Recent efforts on the segmentation of successive actions fall roughly into five approaches (Hoai et al., 2011). The first approach, change-point based action segmentation (Xuan and Murphy, 2007, Harchaoui et al., 2009), is the most popular and relies on change-point analysis with a sliding window along the temporal extent. Xuan and Murphy (2007) modeled the joint density of vector-valued observations using undirected Gaussian graphical models, by which the locations of change points are

Selection of characteristic frames in a video

An efficient way of selecting characteristic frames in a given video sequence has been proposed by the authors (Lu et al., 2012). The selection method rests on two basic ideas. The first is that an observed video sequence can be sufficiently characterized by a few characteristic frames describing the basic actions; the other is that, by considering the input video sequence as a set of data streams in which successive frames are almost identical, the characteristic frames can be detected as
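
As a concrete illustration, the following is a minimal Python sketch of a randomized power-martingale change-detection test in the spirit of the cited martingale framework (Ho et al., 2010; Lu et al., 2012). The per-frame descriptors, the strangeness measure (distance to the running mean of the current segment), the default parameter values and all function names are illustrative assumptions, not the paper's exact formulation.

```python
import numpy as np

def select_characteristic_frames(features, lam=1.3, eps=0.92, rng=None):
    """Select characteristic frames with a randomized power martingale.

    `features` is an (n_frames, d) array of per-frame descriptors
    (hypothetical input; the paper's descriptors follow Lu et al., 2012).
    A frame is flagged as characteristic when the martingale value
    exceeds the threshold `lam`, i.e. the stream of frames stops
    looking exchangeable and a change of pose/action is suspected.
    """
    rng = np.random.default_rng() if rng is None else rng
    selected = [0]                     # always keep the first frame
    window, strangeness = [], []
    M = 1.0                            # current martingale value
    for i, f in enumerate(features):
        window.append(f)
        # strangeness: distance to the running mean (illustrative choice)
        s = float(np.linalg.norm(f - np.mean(window, axis=0)))
        strangeness.append(s)
        s_arr = np.asarray(strangeness)
        # randomized conformal p-value of the current strangeness score
        theta = rng.uniform()
        p = (np.sum(s_arr > s) + theta * np.sum(s_arr == s)) / len(s_arr)
        M *= eps * p ** (eps - 1.0)    # power-martingale update
        if M > lam:                    # exchangeability rejected: change point
            selected.append(i)
            window, strangeness, M = [f], [s], 1.0   # restart the test
    return selected
```

Whenever the martingale value exceeds the threshold, exchangeability is rejected, the current frame is kept as a characteristic frame, and the test restarts from that frame.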

Supervised training for human activity model

We describe a way to learn models for individual actions and for transitions between two successive actions from a collection of videos containing many successive actions. Given a video sequence $F=\{f_1,f_2,\ldots,f_n\}$ of $n$ frames with correct action labels $\{y_1,y_2,\ldots,y_n\}$, we extract $m$ characteristic frames $F_c=\{f_{c_1},f_{c_2},\ldots,f_{c_m}\}\subset F$ ($m\ll n$) by the above martingale test. For simplicity, we regard $F_c$ as $F$. Then we couple the characteristic frames pairwise, such as
$$G=\{(f_1,f_2),(f_2,f_3),\ldots,(f_{m-1},f_m)\},$$
as well as the
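
Although the snippet above is cut off, the pairing and model-training step it describes can be sketched as follows. The use of scikit-learn's GaussianMixture, the diagonal covariance, the number of components and the data layout are assumptions made for illustration; transition models would be trained analogously on pairs straddling two action labels.

```python
import numpy as np
from sklearn.mixture import GaussianMixture

def make_pairs(char_feats):
    """Couple consecutive characteristic frames: G = {(f1,f2), (f2,f3), ...}."""
    return [np.concatenate([a, b]) for a, b in zip(char_feats[:-1], char_feats[1:])]

def train_action_models(train_data, n_components=3):
    """Fit one GMM per action over pairwise-frame descriptors.

    `train_data` (hypothetical layout) maps an action label to a list of
    (m, d) arrays of characteristic-frame descriptors from training videos.
    """
    models = {}
    for label, feats_list in train_data.items():
        X = np.vstack([make_pairs(f) for f in feats_list if len(f) >= 2])
        models[label] = GaussianMixture(n_components=n_components,
                                        covariance_type='diag').fit(X)
    return models
```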

Probability computation

With GMMs learned from the training sequences, a newly observed video sequence $X$ is processed as follows. The characteristic frames $\{h_1,h_2,\ldots,h_m\}$ are selected first, and then a series of pairs $H=\{c_1=(h_1,h_2),c_2=(h_2,h_3),\ldots,c_{m-1}=(h_{m-1},h_m)\}$ is generated by the same procedure as in the training phase. Then the posterior probability of GMM $\varphi_t$ given $c_i$ ($i=1,2,\ldots,m-1$) is calculated by Bayes' rule as
$$p(\varphi_t\mid c_i)=\frac{p(c_i\mid\varphi_t)\,p(\varphi_t)}{\sum_{t'=1}^{T}p(c_i\mid\varphi_{t'})\,p(\varphi_{t'})},$$
where $T$ is the number of trained models of the possible
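
A minimal sketch of this evaluation step, under the same illustrative assumptions as above, is given below. It evaluates the Bayes-rule posterior for every pair via the log-densities returned by scikit-learn's score_samples, then merges per-pair arg-max assignments into segments; this greedy assignment is a simplification of the paper's search for the globally optimal model sequence.

```python
import numpy as np
from scipy.special import logsumexp

def segment(models, pairs, priors=None):
    """Assign each pair of characteristic frames to its most probable model.

    `models` maps a label to a fitted GaussianMixture; `pairs` is the
    (m-1, 2d) array of pairwise descriptors of the observed sequence.
    Returns (label, first_pair, last_pair) segments.
    """
    labels = list(models)
    P = np.asarray(pairs)
    priors = (np.full(len(labels), 1.0 / len(labels))
              if priors is None else np.asarray(priors))
    # log p(c_i | phi_t) for every pair i and model t
    loglik = np.stack([models[t].score_samples(P) for t in labels], axis=1)
    logpost = loglik + np.log(priors)
    logpost -= logsumexp(logpost, axis=1, keepdims=True)  # Bayes' rule (normalization)
    best = logpost.argmax(axis=1)
    # merge runs of identical assignments into segments
    segments, start = [], 0
    for i in range(1, len(best) + 1):
        if i == len(best) or best[i] != best[start]:
            segments.append((labels[best[start]], start, i - 1))
            start = i
    return segments
```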

Database

The proposed framework was validated on the publicly available multi-view IXMAS database (Weinland et al., 2007), containing 180 video sequences (36 shots × 5 views) in total. In each sequence, one of 12 actors performs 15 actions in succession. The database includes 2D data (at a resolution of 160 × 120 pixels) consisting of image sequences and 2D silhouette sequences.

Experimental setup and implementation

In the following two subsections we first evaluate the efficiency of the characteristic-frame selection method (Section

Discussion

In the proposed framework for temporal segmentation of successive actions, one difficulty lies in determining the martingale threshold λ for selecting characteristic frames. As seen in Figs. 6 and 8, it is difficult to choose a universally effective value of λ. From Fig. 8(a) and (b), we recommend a martingale threshold in the range 1.25 ≤ λ ≤ 1.35 and a block size B of 10 × 10 pixels for practical usage, considering the noise possibly present in a video sequence.

Many approaches proposed so far

Conclusion and future work

In this study, we have proposed a novel framework for temporal segmentation of successive actions, summarized as follows: (1) given a long-term video sequence, a small number of characteristic frames is first selected by a change-detection algorithm exploiting a martingale property; (2) a pairwise-frame representation of consecutive characteristic frames is then employed to calculate the likelihood with respect to trained action models constructed for individual actions and transitive actions, and

References (39)

  • Hoai, M., Lan, Z.-Z., De la Torre, F., 2011. Joint segmentation and classification of human actions in video. In: ...
  • Ho, S.S., et al., 2010. A martingale framework for detecting changes in data streams by testing exchangeability. Pattern Anal. Machine Intell.
  • Iosifidis, A., et al., 2011. Multi-view human movement recognition based on fuzzy distances and linear discriminant analysis. Comput. Vision Image Understanding, Special Issue on Semantic Understanding of Human Behaviors in Image Sequences.
  • Jia, K., Yeung, D.Y., 2008. Human action recognition using local spatio-temporal discriminant embedding. In: Proc. ...
  • Junejo, I., Dexter, E., Laptev, I., Perez, P., 2008. Cross-view action recognition from temporal self-similarities. In: ...
  • Junejo, I., et al., 2011. View-independent action recognition from temporal self-similarities. Pattern Anal. Machine Intell.
  • Kovashka, A., Grauman, K., 2011. Learning a hierarchy of discriminative space-time neighborhood features for human ...
  • Laptev, I., Belongie, S.J., Perez, P., Wills, J., 2005. Periodic motion detection and segmentation via approximate ...
  • Lewandowski, M., Makris, D., Nebel, J.C., 2010. View and style-independent action manifolds for human activity ...