1 Introduction

Human action recognition is an important research task in various fields, such as patient care, human-computer interaction, and smart surveillance [9, 10]. Within this task, cross-view action recognition has become a key problem because recognition performance is unpredictable when actions must be recognized from unseen viewpoints. According to [2, 3], actions are best defined as patterns in four-dimensional space. In practice, however, video recordings capture actions only as patterns in the three-dimensional space of image coordinates and time. This gap makes cross-view action recognition more challenging.

To address this challenge, the majority of recent research builds on the idea of knowledge transfer and achieves good results [8, 13–19]. The aim of knowledge transfer is to find a view-independent latent space in which action representations mapped from different viewpoints are directly comparable. In practice, the performance of this approach therefore depends largely on the discriminative power of the local features.

Fig. 1. Illustration of our multi-projection-based framework for human action recognition.

In this study, we approach cross-view action recognition from another perspective: augmenting a sufficiently large number of viewpoints from the single viewpoint in which an action is captured. To this end, we exploit the advantages of depth data over intensity data, namely its lower sensitivity to variations in illumination, appearance, and texture. We first obtain 3D actions from depth sequences recorded by depth cameras, e.g., the Kinect camera. We then decompose each 3D action into a set of 2D actions corresponding to augmented viewpoints in 3D space. The decomposed actions are then used to build dedicated features and classifiers. With this approach, we do not rely entirely on the discriminative power of local features. In addition, we can exploit state-of-the-art 2D techniques (e.g., spatio-temporal interest points [6], motion patterns [4, 5]) to effectively recognize 3D actions across different viewpoints.

Figure 1 illustrates our multi-projection-based framework. In the training phase, we extract local motion features from the augmented viewpoints. Inspired by the success of the bag-of-words (BoW) model, we build a codebook for each augmented viewpoint by clustering dense-trajectory motion features with K-means. We then describe an action sample in that viewpoint by a histogram of codewords and build classifiers for the viewpoint. In the testing phase, 3D action decomposition and feature extraction proceed as in the training phase. The codebooks built during training are used to generate action representations for the augmented viewpoints, and the trained classifiers are used to predict action labels, as sketched below.
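At a high level, the training phase can be summarized by the following Python sketch. The callables passed in (decompose, extract_features, build_codebook, encode, train_classifiers) are placeholders for the concrete steps of Sect. 3 and are not part of the original implementation; this is an orientation aid, not the authors' code.

```python
import numpy as np

def train_framework(actions_3d, labels, phis, decompose, extract_features,
                    build_codebook, encode, train_classifiers):
    """High-level training loop of Fig. 1: one codebook and one classifier set
    per augmented viewpoint. The callables stand in for the steps of Sect. 3."""
    models = {}
    for phi in phis:                                          # augmented viewpoints
        videos_2d = [decompose(a, phi) for a in actions_3d]   # Sect. 3.1
        feats = [extract_features(v) for v in videos_2d]      # Sect. 3.2 (dense trajectories)
        codebook = build_codebook(np.vstack(feats))           # Sect. 3.3 (K-means)
        X = np.array([encode(f, codebook) for f in feats])    # BoW histograms
        models[phi] = (codebook, train_classifiers(X, labels))
    return models
```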

We evaluate the proposed framework on the benchmark dataset, the N-UCLA3D dataset. Experimental results show two key points:

  1. Augmented viewpoints provide additional useful information that improves action recognition performance.

  2. The discriminative performance of our method surpasses that of the state-of-the-art methods for cross-view action recognition with depth sequences.

The remainder of this paper is organized as follows. Section 2 provides a brief review of related work, and Sect. 3 describes our multi-projection-based framework. Experimental results and evaluation are presented in Sect. 4, and conclusions are given in Sect. 5.

2 Related Work

Based on the data type, the literature on action recognition can be divided into two categories: 2D video-based and 3D video-based methods. For 2D videos, the majority of existing work focuses on single-view action recognition, where actions in the training and testing datasets are captured from the same view. One possible approach to dealing with viewpoint changes is knowledge transfer, which builds an intermediate domain in which features extracted from different viewpoints are directly comparable. The works [13, 14] treat each viewpoint as a language and build corresponding vocabularies; actions in different viewpoints are modeled by these vocabularies and then translated into an “action view interlingua”. Another method learns a transferable dictionary pair, consisting of a source dictionary and a target dictionary, using action videos shared across the source and target views [15, 16]. Unlike the previous works, [17, 18] seek a set of linear transformations connecting the source and target views, called “virtual views”. In a similar manner, [19] learns a non-linear knowledge transfer model that transfers knowledge from multiple views to a canonical view. In general, however, such methods are either not adaptable or not sufficiently effective when target actions come from unseen viewpoints.

To overcome this problem, [20] relies on unlabeled 3D human motion examples to learn a probabilistic model of feature transformations under viewpoint changes. Although this method can be applied to action recognition from unseen viewpoints, its data, captured by motion capture systems, is too idealized to be applicable in realistic applications.

More recently, for 3D videos, [11] proposed a hierarchical compositional model to effectively express the geometry, appearance, and motion variations across multiple viewpoints. However, the 3D skeleton data required for training is not always available. Another method [12] extended the spatio-temporal interest point-based approach to 3D video, proposing a Histogram of Oriented Principal Components descriptor that is well integrated with their spatio-temporal keypoint detection algorithm. However, interest point-based methods are often sensitive to changes in the surroundings, and their effectiveness is directly affected because depth data is inherently noisy. In this paper, our proposed method uses trajectory-based features extracted from selected viewpoints to generate dedicated representations and classifiers for action recognition.

Fig. 2. Illustration of the geometric mapping. (a) The mapping model with a mapping angle \(\phi \). (b) The resulting mappings of a human pose for the augmented viewpoints.

In comparison with the literature, this paper makes the following contributions:

  • We give a novel view of human action recognition based on augmenting data from 3D actions to enrich the information available in the training and testing phases.

  • In addition, we show how state-of-the-art 2D techniques can be applied to 3D data and achieve good results.

  • The experimental results are strong and indicate that the method is applicable to real-world applications.

Fig. 3. Visualization of dense trajectories for the action Stand up, captured by camera 2. The examples correspond to four different viewpoints.

3 Multi-projection-Based Framework

In this section, we describe our framework in detail. We first present the mapping procedure that decomposes each 3D action into a set of 2D actions. Second, we review dense trajectory feature extraction. We then present the action representation for each augmented viewpoint using the BoW model. Finally, we provide an effective evaluation procedure to accurately predict action labels.

3.1 3D Action Decomposition

Recognizing arbitrary human actions is challenging because it must account for variations in executing the same action across different viewpoints. We use 2D motion pattern-based features directly with classifiers such as Support Vector Machines (SVM) to recognize actions from different viewpoints. The complementary property of 2D motion pattern-based features suggests that a 3D action can be recognized by recognizing a subset of its 2D actions. In this section, we propose a method for decomposing an arbitrary input 3D action into a set of 2D actions. With this method, we can leverage the effectiveness of state-of-the-art techniques in 2D action recognition for cross-view action recognition with 3D videos.

Given a camera viewpoint, we generate a mapping model defined by a single parameter, the azimuthal angle \(\phi \). Figure 2 presents the mapping model and example mapping results. In this model, we define \(\phi = \pi /2\) as the camera viewpoint and only consider viewpoints in \([0, \pi ]\). After decomposing a 3D action, we apply a trajectory-based feature extraction method to the resulting 2D actions, as sketched below.
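As an illustration, the decomposition of a single depth frame (already converted to a 3D point cloud) for a given azimuthal angle might be sketched as follows. The rotation about the vertical axis and the orthographic projection are our assumptions, since the paper does not spell out the camera model, and the helper names are hypothetical.

```python
import numpy as np

def project_points(points_3d, phi):
    """Map a 3D point cloud (N x 3, camera coordinates) to the 2D image plane
    of an augmented viewpoint with azimuthal angle phi.

    phi = pi/2 is defined as the original camera viewpoint, so the scene is
    rotated about the vertical (y) axis by (phi - pi/2) and then projected
    orthographically by dropping the depth coordinate (an assumed camera model).
    """
    theta = phi - np.pi / 2.0
    rot_y = np.array([[ np.cos(theta), 0.0, np.sin(theta)],
                      [ 0.0,           1.0, 0.0          ],
                      [-np.sin(theta), 0.0, np.cos(theta)]])
    rotated = points_3d @ rot_y.T
    return rotated[:, :2]   # 2D coordinates in the augmented view

# Example: decompose one depth frame into its 2D views
# frame_points = depth_to_pointcloud(depth_frame)            # hypothetical helper
# views_2d = {phi: project_points(frame_points, phi) for phi in phis}
```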

3.2 Feature Extraction

Dense trajectories [21] have been shown to be effective for action recognition. Our motivation for using dense trajectories is to capture discriminative motion patterns from the 2D actions decomposed from 3D actions. To extract trajectories from videos, Wang et al. [21] proposed sampling points on a dense grid at multiple scales and tracking them using displacement information from a dense optical flow field. The tracking is implemented as:

$$\begin{aligned} P_\mathrm {t+1} = P_\mathrm {t} + (\mathcal {M}*\omega )|_\mathrm {(\bar{x}_t,\bar{y}_t)}, \end{aligned}$$
(1)

where:

  • \(P_\mathrm {t+1}\): the point tracked from \(P_\mathrm {t}\).

  • \(\mathcal {M}\): the median filtering kernel.

  • \(\omega \): the dense optical flow field.

  • \((\bar{x}_\mathrm {t},\bar{y}_\mathrm {t})\): the rounded position of \(P_\mathrm {t}\).

Figure 3 visualizes the dense trajectories of the action Stand up from specific viewpoints. Once the trajectories have been extracted, two kinds of descriptors can be used: a trajectory shape descriptor and trajectory-aligned descriptors. In our experiments, we only used the Motion Boundary Histogram (MBH), a trajectory-aligned descriptor, due to its effectiveness [21].
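A minimal Python sketch of the tracking step in Eq. (1) is given below. The dense flow field is left abstract (it may come from any dense optical flow estimator), and the median filter kernel size is an illustrative assumption, not a value taken from [21].

```python
import numpy as np
from scipy.ndimage import median_filter

def track_points(points, flow, ksize=3):
    """One tracking step of Eq. (1): P_{t+1} = P_t + (M * omega)|_{(x_t, y_t)}.

    points : (N, 2) array of (x, y) point positions at frame t.
    flow   : (H, W, 2) dense optical flow field omega between frames t and t+1.
    ksize  : size of the median filter kernel M (illustrative value).
    """
    # Median-filter each flow component (the kernel M in Eq. (1)).
    fx = median_filter(flow[..., 0], size=ksize)
    fy = median_filter(flow[..., 1], size=ksize)

    # Sample the smoothed flow at the rounded point positions (x_t, y_t).
    xr = np.clip(np.round(points[:, 0]).astype(int), 0, flow.shape[1] - 1)
    yr = np.clip(np.round(points[:, 1]).astype(int), 0, flow.shape[0] - 1)
    displacement = np.stack([fx[yr, xr], fy[yr, xr]], axis=1)
    return points + displacement
```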

3.3 Decomposed Action Representation

So far we have augmented \(\mathrm {M}\) viewpoints \(\mathcal {V}_\mathrm {j}\), each with its corresponding decomposed actions. We represent these actions using the BoW model. We first extract dense trajectory-based features (i.e., MBH descriptors) [21] from the actions. We then build a codebook \(\mathcal {CB}_\mathrm {j}\) for each viewpoint \(\mathcal {V}_\mathrm {j}\), using only the development data, by clustering all action descriptors with the K-means method. Each cluster is treated as a codeword that represents a specific motion pattern shared by the MBH descriptors in that cluster. An action is represented by the histogram encoding method [1]: each MBH descriptor is assigned to the codeword with the minimum Euclidean distance. For classification, for each viewpoint \(\mathcal {V}_\mathrm {j}\), we train \(\mathrm {N}\) binary Support Vector Machine (SVM) classifiers to perform multi-class classification over the \(\mathrm {N}\) action classes.
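A minimal sketch of the per-viewpoint codebook construction, histogram encoding, and classifier training is given below. Scikit-learn's KMeans and LinearSVC are used as stand-ins, and the function names are hypothetical; the paper does not prescribe a particular implementation or SVM kernel.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.svm import LinearSVC

def build_codebook(descriptors, k=2000, seed=0):
    """Cluster all MBH descriptors of one augmented viewpoint into k codewords."""
    return KMeans(n_clusters=k, random_state=seed).fit(descriptors)

def encode_action(descriptors, codebook):
    """Histogram encoding: assign each descriptor to its nearest codeword
    (minimum Euclidean distance) and L1-normalize the resulting histogram."""
    words = codebook.predict(descriptors)
    hist = np.bincount(words, minlength=codebook.n_clusters).astype(float)
    return hist / max(hist.sum(), 1.0)

def train_viewpoint(train_descs, labels, k=2000):
    """Per-viewpoint training: one codebook CB_j and one multi-class SVM.
    train_descs is a list of per-action MBH descriptor matrices."""
    codebook = build_codebook(np.vstack(train_descs), k)
    X = np.array([encode_action(d, codebook) for d in train_descs])
    clf = LinearSVC().fit(X, labels)   # one-vs-rest over the N action classes
    return codebook, clf
```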

Fig. 4. The performance of each augmented viewpoint in action recognition from the seen viewpoint.

4 Experiments

4.1 Prediction Procedure

For a given testing sample (i.e., a sample from the target view), we obtain its BoW representations for all pairs of training and testing viewpoints, \(\mathcal {X} = \{x^\mathrm {(k,l)}\},\ k = 1..{M_\mathrm {tr}},\ l = 1..{M_\mathrm {ts}}\), where \({M_\mathrm {tr}}\) and \({M_\mathrm {ts}}\) are the numbers of augmented viewpoints from the training and testing data, respectively. Through the trained classifiers, each \(x^\mathrm {(k,l)}\) yields a corresponding score vector \(y(x^\mathrm {(k,l)})\) as follows:

$$\begin{aligned} y(x^\mathrm {(k,l)}) = [s_\mathrm {1}^\mathrm {(k,l)}, s_\mathrm {2}^\mathrm {(k,l)}, ..., s_\mathrm {N}^\mathrm {(k,l)}], \quad k = 1..{M_\mathrm {tr}},\ l = 1..{M_\mathrm {ts}}, \end{aligned}$$
(2)

where:

  • \(s_\mathrm {j}^\mathrm {(k,l)}\): the response score from the \(j^\mathrm {th}\) classifier at the \(k^\mathrm {th}\) training viewpoint and the \(l^\mathrm {th}\) testing viewpoint.

  • N: the number of action classes.

Finally, we take the max-max score to determine the action label using (3).

$$\begin{aligned} j^{\mathrm {*}} = \displaystyle \mathop {{{\mathrm{arg\,max}}}}_{\mathrm {j}} s_{\mathrm {j}}^{\mathrm {*}}, \quad j = 1..N, \end{aligned}$$
(3)

where \(s_\mathrm {j}^\mathrm {*} = \max (s_\mathrm {j}^\mathrm {(k,l)}),\ k = 1..{M_\mathrm {tr}},\ l = 1..{M_\mathrm {ts}}\), is the maximal response score of the \(j^\mathrm {th}\) classifier over all pairs of training and testing viewpoints.
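The max-max decision rule of Eqs. (2) and (3) can be summarized by the short sketch below; the layout of the score array and the function name are illustrative assumptions.

```python
import numpy as np

def predict_label(scores):
    """Max-max prediction of Eqs. (2)-(3).

    scores[k][l] is the N-dimensional response vector y(x^{(k,l)}) obtained by
    feeding the BoW representation for training viewpoint k and testing
    viewpoint l into the classifiers trained for viewpoint k.
    """
    S = np.asarray(scores)            # shape (M_tr, M_ts, N)
    s_star = S.max(axis=(0, 1))       # s_j^* : maximal response per class j
    return int(np.argmax(s_star))     # j^* = argmax_j s_j^*
```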

Our proposed method was evaluated on the benchmark N-UCLA3D dataset [11]. We compare our method to the state-of-the-art cross-view action recognition methods including Domain Adaptation [8], Discriminative Virtual Views [17], And-Or Graph [11], and Histogram of Oriented Principal Components [12].

Table 1. Comparison for six combinations of training and testing cameras
Fig. 5. The viewpoint variation in action “Pick Up With One Hand” from 3 different views.

4.2 Implementation Details

Viewpoint Augmentation. The first step is to sample the viewpoints used to decompose a 3D action. We empirically sample viewpoints with a step of \(\pi /6\), obtaining \(n = 7\) viewpoints in total (azimuthal angles \(\phi \in \mathrm {\Phi } = \{0, \pi /6, \pi /3, \pi /2, 2\pi /3, 5\pi /6, \pi \}\)). We define \(\phi = \pi /2\) as the camera viewpoint.

Dense Trajectory Extraction. We use a trajectory length of 15 frames and a dense sampling step size of 5 pixels. The sampling is performed at multiple scales with a scale factor of \(1/\sqrt{2}\). In our experiments, we only used the trajectory-aligned descriptor MBH [7], computed within a \(32 \times 32\) space-time volume around each trajectory. We then adopt the BoW model to represent actions: at each viewpoint, we cluster the corresponding dense trajectory descriptors into \(k = 2000\) clusters using K-means to build viewpoint-specific codebooks. These parameter values are collected in the configuration sketch below.
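For convenience, a hypothetical configuration collecting the parameter values above could look as follows; the field names are ours, not from the original implementation.

```python
from math import pi, sqrt

# Illustrative configuration for the parameters reported in this subsection.
CONFIG = {
    "azimuthal_angles": [k * pi / 6 for k in range(7)],  # Phi: step pi/6 over [0, pi]
    "trajectory_length": 15,        # frames per trajectory
    "sampling_step": 5,             # pixels, dense grid
    "scale_factor": 1 / sqrt(2),    # multi-scale sampling factor
    "descriptor": "MBH",            # trajectory-aligned descriptor
    "volume_size": (32, 32),        # space-time volume around each trajectory
    "codebook_size": 2000,          # K-means clusters per viewpoint
}
```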

Table 2. Comparison to the state-of-the-art methods
Fig. 6. Action samples captured by camera 2 in the N-UCLA Multiview Action 3D dataset, shown as depth data.

4.3 Northwestern-UCLA Multiview Action 3D Dataset [11]

The N-UCLA3D dataset contains three data modalities, RGB, depth, and skeleton information, captured simultaneously by three Kinect cameras (see Fig. 5). The dataset consists of 10 action classes: pick up with one hand, pick up with two hands, drop trash, walk around, sit down, stand up, donning, doffing, throw, and carry (see the action samples in Fig. 6). Each action is performed by 10 subjects. We only use the depth data in our experiments. To evaluate our proposed framework, we consider two evaluation settings:

  • Action recognition from the seen viewpoint

  • Action recognition across different viewpoints

4.4 Action Recognition from the Seen Viewpoint

In this setting, we employ cross-validation for each camera. The steps are as follows:

Step 1: Generate subsets

For a given video dataset \(\mathcal {D}\), we first divide it into \(\mathrm {K}\) parts:

$$\begin{aligned} \mathcal {D} = \displaystyle \bigcup _{i=1}^{\mathrm {K}} \mathsf {D}_i, \end{aligned}$$
(4)

where \(\mathsf {D}_i\) is the \(i^{th}\) subset and \(\displaystyle \bigcap _{i=1}^{\mathrm {K}} \mathsf {D}_i = \varnothing \). The subsets are split across subjects, i.e., if a subject \(S_k\) belongs to \(\mathsf {D}_i\), then \(\mathsf {D}_j\) does not contain \(S_k\) for all \(j \ne i\).

Step 2: Create development and validation data

We define \(\mathcal {P}\) as a set of pairs of development and validation data:

$$\begin{aligned} \mathcal {P} = \{\mathcal {P}_i\} = \{(\mathsf {Dev}_i, \mathsf {Val}_i)\}, \quad i = 1..\mathrm {K}, \end{aligned}$$
(5)

where:

  • \(\mathsf {Val}_i = \{\mathsf {D}_i\}\): the validation data.

  • \(\mathsf {Dev}_i = \mathcal {D} \backslash \mathsf {Val}_i\): the development data.

With such development and validation data, our framework is evaluated in the cross-subject setting.

Step 3: Compute average accuracy for each augmented viewpoint

Given \(\mathrm {M}\) augmented viewpoints, we calculate the average accuracy \(\overline{A}_j\) using classifiers trained on the \(j^{th}\) augmented viewpoint and test samples from all \(\mathrm {M}\) viewpoints. Since we have \(\mathrm {K}\) combinations \(\mathcal {P}_i\), we evaluate the role of the \(j^{th}\) viewpoint via the mean average accuracy \(\mathcal {A}_j\), as follows:

$$\begin{aligned} \mathcal {A}_j = {1 \over \mathrm {K}} \displaystyle \sum _{i=1}^{\mathrm {K}}{\overline{A}_j^{(i)}}, \end{aligned}$$
(6)
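A sketch of this cross-subject evaluation (Eqs. (4)–(6)) is shown below, using scikit-learn's GroupKFold as a stand-in for the subject-wise split; train_eval_fn is a hypothetical callable wrapping the training and testing of the per-viewpoint classifiers.

```python
import numpy as np
from sklearn.model_selection import GroupKFold

def mean_average_accuracy(samples, labels, subjects, n_splits, train_eval_fn):
    """Cross-subject evaluation of Eqs. (4)-(6).

    Each fold keeps all samples of a subject in either the development or the
    validation split. train_eval_fn(dev_idx, val_idx) is assumed to train the
    viewpoint-j classifiers on the development indices and return the average
    accuracy A_bar_j on the validation indices.
    """
    accs = []
    splitter = GroupKFold(n_splits=n_splits)
    for dev_idx, val_idx in splitter.split(samples, labels, groups=subjects):
        accs.append(train_eval_fn(dev_idx, val_idx))   # A_bar_j^{(i)}
    return float(np.mean(accs))                        # A_j, Eq. (6)
```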

Figure 4 shows the role of each viewpoint in action recognition. The experimental results indicate that the original camera viewpoints are not the best for action recognition. The best results are \(49.8\,\%\), \(49.8\,\%\), and \(50.7\,\%\), achieved at \(\phi = \pi \), \(\phi = \pi /6\), and \(\phi = 2\pi /3\) for cameras 1, 2, and 3, respectively. These results can inform the choice of camera mounting locations in realistic surveillance systems. We also take advantage of them for action recognition across different viewpoints.

Table 3. Comparison for six combinations of training and testing cameras

4.5 Action Recognition Across Different Viewpoints

In this section, we evaluate our framework for action recognition in the cross-view setting. As described in [12], we use samples from one camera for the training phase and samples from the two remaining cameras for the testing phase. Table 1 shows the recognition accuracy of our method for the six possible combinations of training and testing cameras. We report the recognition accuracy in the following four settings:

  • The first setting: We use all \(\mathrm {M}\) viewpoints for both training and testing phases.

  • The second setting: We use the \(\mathrm {K}\) best viewpoints obtained in Sect. 4.4 in the training phase, and predict action labels over all \(\mathrm {M}\) viewpoints.

  • The third setting: We use the \(\mathrm {K}\) best viewpoints for both the training and testing phases.

  • The fourth setting: We follow the \(3^{rd}\) setting but perform recognition from the same viewpoints seen in the training data.

The experimental results in Table 1 show that our framework not only reduces the computational cost (we use a smaller number of viewpoints in the \(2^{nd}\), \(3^{rd}\) and \(4^{th}\) settings) but also achieves better performance. We achieve the best result in the \(3^{rd}\) setting. Since the final prediction depends on the maximal response over pairs of training and testing viewpoints, using only the same viewpoints, as in the \(4^{th}\) setting, restricts the recognition performance. The \(2^{nd}\) setting, in contrast, is a mixture of the \(1^{st}\) and \(3^{rd}\) settings: it keeps the computational cost of the training phase reasonable, but easily causes confusion in the testing phase.

Table 2 presents the performance of our method in comparison with the state-of-the-art methods. The results show that our method outperforms the other methods, with recognition accuracy significantly higher than that of [8, 11, 17]. Unlike [12], our method does not depend on human body segmentation, which is not a trivial task. Our method has thus proved effective for cross-view action recognition in noisy environments.

Table 4. Confusion matrices of our method on the N-UCLA3D dataset in the fourth setting.

In addition, we conduct an evaluation in another training/testing setting, in which we use samples from two cameras for training and samples from the remaining camera for testing. The experimental results, shown in Table 3, indicate that providing more information from other viewpoints leads to an improvement in recognition.

In Table 4, we show the confusion matrices of our method on the N-UCLA3D dataset in the \(4^{th}\) setting. Table 4a shows the confusion matrix when using training samples from camera 2 and testing samples from camera 3. Action (4) walk around and action (10) carry are confused with each other most often, because the majority of the movement within carry is walking. Action pairs such as (6,3) and (8,7) (i.e., (throw, drop trash) and (doffing, donning)) show some confusion due to similarities in motion and appearance. Table 4b shows the reduced confusion for the action pairs mentioned above: with the additional information provided by camera 1, we obtain better results. The effect is easy to see for action (6) stand up, where accuracy improves significantly from \(68.1\,\%\) to \(95.7\,\%\).

5 Conclusion

This paper presents a study on human action recognition with depth sequences. We discuss the role of viewpoints in action recognition and evaluate it through our multi-projection-based framework. Our method exploits the diversity in action execution to enrich useful information and takes advantage of state-of-the-art 2D techniques to build dedicated features and classifiers. In addition, evaluating the information from a few best viewpoints makes it possible to effectively recognize actions from unseen camera viewpoints. Therefore, our method can be applied to various realistic camera-based systems. The experimental results clearly show the strong performance of the proposed method.

In this study, we only investigated cross-view action recognition using depth data obtained from a single camera. Self-occlusion therefore remains a problem that can affect the discriminative power. One possible way forward is to use multiple cameras [22, 23], which would allow us to collect much more discriminative motion information and should lead to improved recognition accuracy. In future work, we will explore this idea within our viewpoint augmentation-based framework.