1 Introduction

Human action recognition is an important research task in various fields, such as patient care, human-computer interaction, and smart surveillance [9, 10]. Within this task, cross-view action recognition has become a key problem because recognition performance is unpredictable when actions must be recognized from unseen viewpoints. According to [2, 3], actions are best defined as patterns in four-dimensional space. In practice, however, video recordings capture actions only as patterns in the three-dimensional space of image coordinates and time. This gap makes cross-view action recognition more challenging.

To address this challenge, the majority of recent research builds on the idea of knowledge transfer and achieves good results [8, 13–19]. The aim of knowledge transfer is to find a view-independent latent space in which action representations mapped from different viewpoints are directly comparable. In practice, the performance of this approach therefore depends largely on the discriminative power of the local features.

Fig. 1. Illustration of our multi-projection-based framework for human action recognition.

In this study, we approach cross-view action recognition from another perspective: augmenting a sufficiently large number of viewpoints from the single viewpoint in which an action is captured. To this end, we exploit the advantages of depth data over intensity data, namely its lower sensitivity to variations in illumination, appearance, and texture. We first obtain 3D actions from depth sequences recorded by depth cameras, e.g., the Kinect camera. We then decompose each 3D action into a set of 2D actions corresponding to augmented viewpoints in 3D space. The decomposed actions are then used to build dedicated features and classifiers. With this approach, we do not rely entirely on the discriminative power of local features. In addition, we can exploit state-of-the-art 2D techniques (e.g., spatio-temporal interest points [6], motion patterns [4, 5]) to effectively recognize 3D actions across different viewpoints.

Figure 1 illustrates our multi-projection-based framework. In the training phase, we extract local motion features from the augmented viewpoints. Inspired by the success of the bag-of-words (BoW) model, we build a codebook for each augmented viewpoint by clustering dense-trajectory motion features with K-means. We then describe an action sample in that viewpoint by a histogram of codewords and build classifiers for the viewpoint. In the testing phase, 3D action decomposition and feature extraction proceed as in the training phase. The codebooks built during training are used to generate action representations for the augmented viewpoints, and the trained classifiers are used to predict action labels, as sketched below.
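At a high level, the training phase can be summarized by the following Python sketch. The callables passed in (decompose, extract_features, build_codebook, encode, train_classifiers) are placeholders for the concrete steps of Sect. 3 and are not part of the original implementation; this is an orientation aid, not the authors' code.

```python
import numpy as np

def train_framework(actions_3d, labels, phis, decompose, extract_features,
                    build_codebook, encode, train_classifiers):
    """High-level training loop of Fig. 1: one codebook and one classifier set
    per augmented viewpoint. The callables stand in for the steps of Sect. 3."""
    models = {}
    for phi in phis:                                          # augmented viewpoints
        videos_2d = [decompose(a, phi) for a in actions_3d]   # Sect. 3.1
        feats = [extract_features(v) for v in videos_2d]      # Sect. 3.2 (dense trajectories)
        codebook = build_codebook(np.vstack(feats))           # Sect. 3.3 (K-means)
        X = np.array([encode(f, codebook) for f in feats])    # BoW histograms
        models[phi] = (codebook, train_classifiers(X, labels))
    return models
```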

We evaluate the proposed framework on the benchmark dataset, the N-UCLA3D dataset. Experimental results show two key points:

  1. Augmented viewpoints provide additional useful information that improves action recognition performance.

  2. The discriminative performance of our method surpasses that of the state-of-the-art methods for cross-view action recognition with depth sequences.

The remainder of this paper is organized as follows. Section 2 provides a brief review of related work, and Sect. 3 describes our multi-projection-based framework. Experimental results and evaluation are presented in Sect. 4, and conclusions are given in Sect. 5.

2 Related Work

Based on the data type, the literature on action recognition can be divided into two categories: 2D video-based and 3D video-based methods. For 2D videos, the majority of existing work focuses on single-view action recognition, where actions in the training and testing datasets are captured from the same view. One possible approach to dealing with viewpoint changes is knowledge transfer, which builds an intermediate domain in which features extracted from different viewpoints are directly comparable. The works [13, 14] treat each viewpoint as a language and build corresponding vocabularies; actions in different viewpoints are modeled by these vocabularies and then translated into an “action view interlingua”. Another method learns a transferable dictionary pair, consisting of a source dictionary and a target dictionary, using action videos shared across the source and target views [15, 16]. Unlike the previous works, [17, 18] seek a set of linear transformations connecting the source and target views, called “virtual views”. In a similar manner, [19] learns a non-linear knowledge transfer model that transfers knowledge from multiple views to a canonical view. In general, however, such methods are either not adaptable or not sufficiently effective when target actions come from unseen viewpoints.

To overcome this problem, [20] relies on unlabeled 3D human motion examples to learn a probabilistic model of feature transformations under viewpoint changes. Although this method can be applied to action recognition from unseen viewpoints, its data, captured by motion capture systems, is too idealized to be applicable in realistic applications.

More recently, for 3D videos, [11] proposed a hierarchical compositional model to effectively express the geometry, appearance, and motion variations across multiple viewpoints. However, the 3D skeleton data required for training is not always available. Another method [12] extended the spatio-temporal interest point-based approach to 3D video, proposing a Histogram of Oriented Principal Components descriptor that is well integrated with their spatio-temporal keypoint detection algorithm. However, interest point-based methods are often sensitive to changes in the surroundings, and their effectiveness is directly affected because depth data is inherently noisy. In this paper, our proposed method uses trajectory-based features extracted from selected viewpoints to generate dedicated representations and classifiers for action recognition.

Fig. 2. Illustration of the geometric mapping. (a) The mapping model with a mapping angle \(\phi \). (b) The resulting mappings of a human pose for the augmented viewpoints.

In comparison with the literature, this paper makes the following contributions:

  • We give a novel view of human action recognition based on augmenting data from 3D actions to enrich the information available in the training and testing phases.

  • In addition, we show how state-of-the-art 2D techniques can be applied to 3D data and achieve good results.

  • The experimental results are strong and indicate that the method is applicable to real-world applications.

Fig. 3. Visualization of dense trajectories for the action Stand up, captured by camera 2. The examples correspond to four different viewpoints.

3 Multi-projection-Based Framework

In this section, we describe our framework in detail. We first present the mapping procedure that decomposes each 3D action into a set of 2D actions. Second, we review dense trajectory feature extraction. We then present the action representation for each augmented viewpoint using the BoW model. Finally, we provide an effective evaluation procedure to accurately predict action labels.

3.1 3D Action Decomposition

Recognizing arbitrary human actions is challenging because it must account for variations in executing the same action across different viewpoints. We use 2D motion pattern-based features directly with classifiers such as Support Vector Machines (SVM) to recognize actions from different viewpoints. The complementary property of 2D motion pattern-based features suggests that a 3D action can be recognized by recognizing a subset of its 2D actions. In this section, we propose a method for decomposing an arbitrary input 3D action into a set of 2D actions. With this method, we can leverage the effectiveness of state-of-the-art techniques in 2D action recognition for cross-view action recognition with 3D videos.

Given a camera viewpoint, we generate a mapping model defined by a single parameter, the azimuthal angle \(\phi \). Figure 2 presents the mapping model and example mapping results. In this model, we define \(\phi = \pi /2\) as the camera viewpoint and only consider viewpoints in \([0, \pi ]\). After decomposing a 3D action, we apply a trajectory-based feature extraction method to the resulting 2D actions, as sketched below.
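As an illustration, the decomposition of a single depth frame (already converted to a 3D point cloud) for a given azimuthal angle might be sketched as follows. The rotation about the vertical axis and the orthographic projection are our assumptions, since the paper does not spell out the camera model, and the helper names are hypothetical.

```python
import numpy as np

def project_points(points_3d, phi):
    """Map a 3D point cloud (N x 3, camera coordinates) to the 2D image plane
    of an augmented viewpoint with azimuthal angle phi.

    phi = pi/2 is defined as the original camera viewpoint, so the scene is
    rotated about the vertical (y) axis by (phi - pi/2) and then projected
    orthographically by dropping the depth coordinate (an assumed camera model).
    """
    theta = phi - np.pi / 2.0
    rot_y = np.array([[ np.cos(theta), 0.0, np.sin(theta)],
                      [ 0.0,           1.0, 0.0          ],
                      [-np.sin(theta), 0.0, np.cos(theta)]])
    rotated = points_3d @ rot_y.T
    return rotated[:, :2]   # 2D coordinates in the augmented view

# Example: decompose one depth frame into its 2D views
# frame_points = depth_to_pointcloud(depth_frame)            # hypothetical helper
# views_2d = {phi: project_points(frame_points, phi) for phi in phis}
```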

3.2 Feature Extraction

Dense trajectories [21] have been shown to be effective for action recognition. Our motivation for using dense trajectories is to capture discriminative motion patterns from the 2D actions decomposed from 3D actions. To extract trajectories from videos, Wang et al. [21] proposed sampling points on a dense grid at multiple scales and tracking them using displacement information from a dense optical flow field. The tracking is implemented as:

$$\begin{aligned} P_\mathrm {t+1} = P_\mathrm {t} + (\mathcal {M}*\omega )|_\mathrm {(\bar{x}_t,\bar{y}_t)}, \end{aligned}$$
(1)

where:

  • \(P_\mathrm {t+1}\): the point tracked from \(P_\mathrm {t}\).

  • \(\mathcal {M}\): the median filtering kernel.

  • \(\omega \): the dense optical flow field.

  • \((\bar{x}_\mathrm {t},\bar{y}_\mathrm {t})\): the rounded position of \(P_\mathrm {t}\).

Figure 3 visualizes the dense trajectories of the action Stand up from specific viewpoints. Once the trajectories have been extracted, two kinds of descriptors can be used: a trajectory shape descriptor and trajectory-aligned descriptors. In our experiments, we only used the Motion Boundary Histogram (MBH), a trajectory-aligned descriptor, due to its effectiveness [21].
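A minimal Python sketch of the tracking step in Eq. (1) is given below. The dense flow field is left abstract (it may come from any dense optical flow estimator), and the median filter kernel size is an illustrative assumption, not a value taken from [21].

```python
import numpy as np
from scipy.ndimage import median_filter

def track_points(points, flow, ksize=3):
    """One tracking step of Eq. (1): P_{t+1} = P_t + (M * omega)|_{(x_t, y_t)}.

    points : (N, 2) array of (x, y) point positions at frame t.
    flow   : (H, W, 2) dense optical flow field omega between frames t and t+1.
    ksize  : size of the median filter kernel M (illustrative value).
    """
    # Median-filter each flow component (the kernel M in Eq. (1)).
    fx = median_filter(flow[..., 0], size=ksize)
    fy = median_filter(flow[..., 1], size=ksize)

    # Sample the smoothed flow at the rounded point positions (x_t, y_t).
    xr = np.clip(np.round(points[:, 0]).astype(int), 0, flow.shape[1] - 1)
    yr = np.clip(np.round(points[:, 1]).astype(int), 0, flow.shape[0] - 1)
    displacement = np.stack([fx[yr, xr], fy[yr, xr]], axis=1)
    return points + displacement
```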

3.3 Decomposed Action Representation

So far we have augmented \(\mathrm {M}\) viewpoints \(\mathcal {V}_\mathrm {j}\), each with its corresponding decomposed actions. We represent these actions using the BoW model. We first extract dense trajectory-based features (i.e., MBH descriptors) [21] from the actions. We then build a codebook \(\mathcal {CB}_\mathrm {j}\) for each viewpoint \(\mathcal {V}_\mathrm {j}\), using only the development data, by clustering all action descriptors with the K-means method. Each cluster is treated as a codeword that represents a specific motion pattern shared by the MBH descriptors in that cluster. An action is represented by the histogram encoding method [1]: each MBH descriptor is assigned to the codeword with the minimum Euclidean distance. For classification, for each viewpoint \(\mathcal {V}_\mathrm {j}\), we train \(\mathrm {N}\) binary Support Vector Machine (SVM) classifiers to perform multi-class classification over the \(\mathrm {N}\) action classes.
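A minimal sketch of the per-viewpoint codebook construction, histogram encoding, and classifier training is given below. Scikit-learn's KMeans and LinearSVC are used as stand-ins, and the function names are hypothetical; the paper does not prescribe a particular implementation or SVM kernel.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.svm import LinearSVC

def build_codebook(descriptors, k=2000, seed=0):
    """Cluster all MBH descriptors of one augmented viewpoint into k codewords."""
    return KMeans(n_clusters=k, random_state=seed).fit(descriptors)

def encode_action(descriptors, codebook):
    """Histogram encoding: assign each descriptor to its nearest codeword
    (minimum Euclidean distance) and L1-normalize the resulting histogram."""
    words = codebook.predict(descriptors)
    hist = np.bincount(words, minlength=codebook.n_clusters).astype(float)
    return hist / max(hist.sum(), 1.0)

def train_viewpoint(train_descs, labels, k=2000):
    """Per-viewpoint training: one codebook CB_j and one multi-class SVM.
    train_descs is a list of per-action MBH descriptor matrices."""
    codebook = build_codebook(np.vstack(train_descs), k)
    X = np.array([encode_action(d, codebook) for d in train_descs])
    clf = LinearSVC().fit(X, labels)   # one-vs-rest over the N action classes
    return codebook, clf
```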

Fig. 4. The performance of each augmented viewpoint in action recognition from the seen viewpoint.

4 Experiments

4.1 Prediction Procedure

For a given testing sample (i.e., a sample from the target view), we obtain its BoW representations for all pairs of training and testing viewpoints, \(\mathcal {X} = \{x^\mathrm {(k,l)}\},\ k = 1..{M_\mathrm {tr}},\ l = 1..{M_\mathrm {ts}}\), where \({M_\mathrm {tr}}\) and \({M_\mathrm {ts}}\) are the numbers of augmented viewpoints from the training and testing data, respectively. Through the trained classifiers, each \(x^\mathrm {(k,l)}\) yields a corresponding score vector \(y(x^\mathrm {(k,l)})\) as follows:

$$\begin{aligned} y(x^\mathrm {(k,l)}) = [s_\mathrm {1}^\mathrm {(k,l)}, s_\mathrm {2}^\mathrm {(k,l)}, ..., s_\mathrm {N}^\mathrm {(k,l)}], \quad k = 1..{M_\mathrm {tr}},\ l = 1..{M_\mathrm {ts}}, \end{aligned}$$
(2)

where:

  • \(s_\mathrm {j}^\mathrm {(k,l)}\): the response score from the \(j^\mathrm {th}\) classifier at the \(k^\mathrm {th}\) training viewpoint and the \(l^\mathrm {th}\) testing viewpoint.

  • N: the number of action classes.

Finally, we take the max-max score to determine the action label using (3).

$$\begin{aligned} j^{\mathrm {*}} = \displaystyle \mathop {{{\mathrm{arg\,max}}}}_{\mathrm {j}} s_{\mathrm {j}}^{\mathrm {*}}, \quad j = 1..N, \end{aligned}$$
(3)

where \(s_\mathrm {j}^\mathrm {*} = \max (s_\mathrm {j}^\mathrm {(k,l)}),\ k = 1..{M_\mathrm {tr}},\ l = 1..{M_\mathrm {ts}}\), is the maximal response score of the \(j^\mathrm {th}\) classifier over all pairs of training and testing viewpoints.
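The max-max decision rule of Eqs. (2) and (3) can be summarized by the short sketch below; the layout of the score array and the function name are illustrative assumptions.

```python
import numpy as np

def predict_label(scores):
    """Max-max prediction of Eqs. (2)-(3).

    scores[k][l] is the N-dimensional response vector y(x^{(k,l)}) obtained by
    feeding the BoW representation for training viewpoint k and testing
    viewpoint l into the classifiers trained for viewpoint k.
    """
    S = np.asarray(scores)            # shape (M_tr, M_ts, N)
    s_star = S.max(axis=(0, 1))       # s_j^* : maximal response per class j
    return int(np.argmax(s_star))     # j^* = argmax_j s_j^*
```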

Our proposed method was evaluated on the benchmark N-UCLA3D dataset [11]. We compare our method to the state-of-the-art cross-view action recognition methods including Domain Adaptation [8], Discriminative Virtual Views [17], And-Or Graph [11], and Histogram of Oriented Principal Components [12].

Table 1. Comparison for six combinations of training and testing cameras
Fig. 5. The viewpoint variation in action “Pick Up With One Hand” from 3 different views.

4.2 Implementation Details

Viewpoint Augmentation. The first step is to sample the viewpoints used to decompose a 3D action. We empirically sample viewpoints with a step of \(\pi /6\), obtaining \(n = 7\) viewpoints in total (azimuthal angles \(\phi \in \mathrm {\Phi } = \{0, \pi /6, \pi /3, \pi /2, 2\pi /3, 5\pi /6, \pi \}\)). We define \(\phi = \pi /2\) as the camera viewpoint.

Dense Trajectory Extraction. We use a trajectory length of 15 frames and a dense sampling step size of 5 pixels. The sampling is performed at multiple scales with a scale factor of \(1/\sqrt{2}\). In our experiments, we only used the trajectory-aligned descriptor MBH [7], computed within a \(32 \times 32\) space-time volume around each trajectory. We then adopt the BoW model to represent actions: at each viewpoint, we cluster the corresponding dense trajectory descriptors into \(k = 2000\) clusters using K-means to build viewpoint-specific codebooks. These parameter values are collected in the configuration sketch below.
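For convenience, a hypothetical configuration collecting the parameter values above could look as follows; the field names are ours, not from the original implementation.

```python
from math import pi, sqrt

# Illustrative configuration for the parameters reported in this subsection.
CONFIG = {
    "azimuthal_angles": [k * pi / 6 for k in range(7)],  # Phi: step pi/6 over [0, pi]
    "trajectory_length": 15,        # frames per trajectory
    "sampling_step": 5,             # pixels, dense grid
    "scale_factor": 1 / sqrt(2),    # multi-scale sampling factor
    "descriptor": "MBH",            # trajectory-aligned descriptor
    "volume_size": (32, 32),        # space-time volume around each trajectory
    "codebook_size": 2000,          # K-means clusters per viewpoint
}
```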

Table 2. Comparison to the state-of-the-art methods
Fig. 6. Action samples captured by camera 2 in the N-UCLA Multiview Action 3D dataset, shown as depth data.

4.3 Northwestern-UCLA Multiview Action 3D Dataset [11]

The N-UCLA3D dataset contains three data modalities, RGB, depth, and skeleton information, captured simultaneously by three Kinect cameras (see Fig. 5). The dataset consists of 10 action classes: pick up with one hand, pick up with two hands, drop trash, walk around, sit down, stand up, donning, doffing, throw, and carry (see the action samples in Fig. 6). Each action is performed by 10 subjects. We only use the depth data in our experiments. To evaluate our proposed framework, we consider two evaluation settings:

  • Action recognition from the seen viewpoint

  • Action recognition across different viewpoints

4.4 Action Recognition from the Seen Viewpoint

In this setting, we employ cross-validation for each camera. The steps are as follows:

Step 1: Generate subsets

For a given video dataset \(\mathcal {D}\), we first divide it into \(\mathrm {K}\) parts:

$$\begin{aligned} \mathcal {D} = \displaystyle \bigcup _{i=1}^{\mathrm {K}} \mathsf {D}_i, \end{aligned}$$
(4)

where \(\mathsf {D}_i\) is the \(i^{th}\) subset and \(\displaystyle \bigcap _{i=1}^{\mathrm {K}} \mathsf {D}_i = \varnothing \). The subsets are split across subjects, i.e., if a subject \(S_k\) belongs to \(\mathsf {D}_i\), then \(\mathsf {D}_j\) does not contain \(S_k\) for all \(j \ne i\).

Step 2: Create development and validation data

We define \(\mathcal {P}\) as a set of pairs of development and validation data:

$$\begin{aligned} \mathcal {P} = \{\mathcal {P}_i\} = \{(\mathsf {Dev}_i, \mathsf {Val}_i)\}, \quad i = 1..\mathrm {K}, \end{aligned}$$
(5)

where:

  • \(\mathsf {Val}_i = \{\mathsf {D}_i\}\): the validation data.

  • \(\mathsf {Dev}_i = \mathcal {D} \backslash \mathsf {Val}_i\): the development data.

With such development and validation data, our framework is evaluated in the cross-subject setting.

Step 3: Compute average accuracy for each augmented viewpoint

Given \(\mathrm {M}\) augmented viewpoints, we calculate the average accuracy \(\overline{A}_j\) using classifiers trained on the \(j^{th}\) augmented viewpoint and test samples from all \(\mathrm {M}\) viewpoints. Since we have \(\mathrm {K}\) combinations \(\mathcal {P}_i\), we evaluate the role of the \(j^{th}\) viewpoint via the mean average accuracy \(\mathcal {A}_j\), as follows:

$$\begin{aligned} \mathcal {A}_j = {1 \over \mathrm {K}} \displaystyle \sum _{i=1}^{\mathrm {K}}{\overline{A}_j^{(i)}}, \end{aligned}$$
(6)
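A sketch of this cross-subject evaluation (Eqs. (4)–(6)) is shown below, using scikit-learn's GroupKFold as a stand-in for the subject-wise split; train_eval_fn is a hypothetical callable wrapping the training and testing of the per-viewpoint classifiers.

```python
import numpy as np
from sklearn.model_selection import GroupKFold

def mean_average_accuracy(samples, labels, subjects, n_splits, train_eval_fn):
    """Cross-subject evaluation of Eqs. (4)-(6).

    Each fold keeps all samples of a subject in either the development or the
    validation split. train_eval_fn(dev_idx, val_idx) is assumed to train the
    viewpoint-j classifiers on the development indices and return the average
    accuracy A_bar_j on the validation indices.
    """
    accs = []
    splitter = GroupKFold(n_splits=n_splits)
    for dev_idx, val_idx in splitter.split(samples, labels, groups=subjects):
        accs.append(train_eval_fn(dev_idx, val_idx))   # A_bar_j^{(i)}
    return float(np.mean(accs))                        # A_j, Eq. (6)
```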

Figure 4 shows the role of each viewpoint in action recognition. The experimental results indicate that the original camera viewpoints are not the best for action recognition. The best results are \(49.8\,\%\), \(49.8\,\%\), and \(50.7\,\%\), achieved at \(\phi = \pi \), \(\phi = \pi /6\), and \(\phi = 2\pi /3\) for cameras 1, 2, and 3, respectively. These results can inform the choice of camera mounting locations in realistic surveillance systems. We also take advantage of them for action recognition across different viewpoints.

Table 3. Comparison for six combinations of training and testing cameras

4.5 Action Recognition Across Different Viewpoints

In this section, we evaluate our framework for action recognition in the cross-view setting. As described in [12], we use samples from one camera for the training phase and samples from the two remaining cameras for the testing phase. Table 1 shows the recognition accuracy of our method for the six possible combinations of training and testing cameras. We report the recognition accuracy in the following four settings:

  • The first setting: We use all \(\mathrm {M}\) viewpoints for both training and testing phases.

  • The second setting: We use the \(\mathrm {K}\) best viewpoints obtained in Sect. 4.4 in the training phase, and predict action labels over all \(\mathrm {M}\) viewpoints.

  • The third setting: We use the \(\mathrm {K}\) best viewpoints for both the training and testing phases.

  • The fourth setting: We follow the \(3^{rd}\) setting but perform recognition from the same viewpoints seen in the training data.

The experimental results in Table 1 show that our framework not only reduces the computational cost (we use a smaller number of viewpoints in the \(2^{nd}\), \(3^{rd}\) and \(4^{th}\) settings) but also achieves better performance. We achieve the best result in the \(3^{rd}\) setting. Since the final prediction depends on the maximal response over pairs of training and testing viewpoints, using only the same viewpoints, as in the \(4^{th}\) setting, restricts the recognition performance. The \(2^{nd}\) setting, in contrast, is a mixture of the \(1^{st}\) and \(3^{rd}\) settings: it keeps the computational cost of the training phase reasonable, but easily causes confusion in the testing phase.

Table 2 presents the performance of our method in comparison with the state-of-the-art methods. The results show that our method outperforms the other methods, with recognition accuracy significantly higher than that of [8, 11, 17]. Unlike [12], our method does not depend on human body segmentation, which is not a trivial task. Our method has thus proved effective for cross-view action recognition in noisy environments.

Table 4. Confusion matrices of our method on the N-UCLA3D dataset in the fourth setting.

In addition, we conduct an evaluation in another training/testing setting, in which we use samples from two cameras for training and samples from the remaining camera for testing. The experimental results, shown in Table 3, indicate that providing more information from other viewpoints leads to an improvement in recognition.

In Table 4, we show the confusion matrices of our method on the N-UCLA3D dataset in the \(4^{th}\) setting. Table 4a shows the confusion matrix when using training samples from camera 2 and testing samples from camera 3. Action (4) walk around and action (10) carry are confused with each other most often, because the majority of the movement within carry is walking. Action pairs such as (6,3) and (8,7) (i.e., (throw, drop trash) and (doffing, donning)) show some confusion due to similarities in motion and appearance. Table 4b shows the reduced confusion for the action pairs mentioned above: with the additional information provided by camera 1, we obtain better results. The effect is easy to see for action (6) stand up, where accuracy improves significantly from \(68.1\,\%\) to \(95.7\,\%\).

5 Conclusion

This paper presents a study on human action recognition with depth sequences. We discuss the role of viewpoints in action recognition and evaluate it through our multi-projection-based framework. Our method exploits the diversity in action execution to enrich useful information and takes advantage of state-of-the-art 2D techniques to build dedicated features and classifiers. In addition, evaluating the information from a few best viewpoints makes it possible to effectively recognize actions from unseen camera viewpoints. Therefore, our method can be applied to various realistic camera-based systems. The experimental results clearly show the strong performance of the proposed method.

In this study, we only investigated cross-view action recognition using depth data obtained from a single camera. Self-occlusion therefore remains a problem that can affect the discriminative power. One possible way forward is to use multiple cameras [22, 23], which would allow us to collect much more discriminative motion information and should lead to improved recognition accuracy. In future work, we will explore this idea within our viewpoint augmentation-based framework.