Abstract
Learning a good pose representation is important for many applications, such as human pose estimation and action recognition. However, the representations learned by most approaches are not intrinsic, and their transferability across datasets and tasks is limited. In this paper, we introduce a method to learn a versatile representation that can recover unseen corrupted skeletons, support human action recognition, and transfer a pose from one view to another without knowing the relationships between cameras. To this end, a sequential bidirectional recursive network (SeBiReNet) is proposed for modeling the kinematic dependencies between skeleton joints. With the SeBiReNet as its core module, a denoising autoencoder is designed to learn intrinsic pose features through the task of recovering corrupted skeletons. Instead of extracting only the view-invariant feature, as many other methods do, we disentangle the view-invariant feature from the view-variant feature in the latent space and use them together as the representation of a human pose. For better feature disentanglement, an adversarial augmentation strategy is proposed and applied to the denoising autoencoder. Disentangling the view-variant and view-invariant features enables view transfer on 3D poses. Extensive experiments on different datasets and tasks verify the effectiveness and versatility of the learned representation.
Acknowledgements
This work is supported in part by the Hong Kong RGC via project 14202918, the InnoHK programme of the HKSAR government via the Hong Kong Centre for Logistics Robotics, and Shenzhen Science and Technology Innovation Commission via KQTD20140630150243062.
Appendices
Appendix A More Details About the SeBiReNet
A.1 Transformation Between the Human Skeleton and Tree Structure
As shown in Fig. 12, the left panel is a human skeleton model with 25 joints as defined by the Kinect sensor. Joint 0 (the hip center) is generally selected as the root joint, and its coordinates are set to (0, 0, 0). By reordering the joints while keeping the connections between them, we obtain the tree structure of the human body shown in the middle panel (finger joints 7, 10, 21, 22, 23, 24 and foot joints 15, 19 are discarded). By replacing each joint with a neural cell, a conventional recursive neural network corresponding to the human body is obtained, as shown in the right panel.
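This conversion reduces to a parent table plus a pruning step. Below is a minimal sketch, assuming the skeleton is given as a parent array; the helper names and the array convention are ours for illustration, not the paper's code:

```python
import numpy as np

# Joints discarded in A.1: fingers (7, 10, 21, 22, 23, 24) and feet (15, 19).
DISCARDED = {7, 10, 21, 22, 23, 24, 15, 19}

def build_tree(parents, discarded=DISCARDED):
    """Return {joint: [children]} over the kept joints, preserving connectivity.

    parents[j] is the parent joint id of joint j (-1 for the root, joint 0).
    """
    children = {j: [] for j in range(len(parents)) if j not in discarded}
    for j, p in enumerate(parents):
        if j in discarded or p < 0:
            continue
        while p in discarded:        # walk past discarded ancestors
            p = parents[p]
        children[p].append(j)
    return children

def root_center(pose):
    """Place the hip center (joint 0) at the origin, as described above."""
    return pose - pose[0]            # pose: (num_joints, 3) array
```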
A.2 Relationship Between the Root Node and the Leaf Node
In the proposed SeBiReNet, each node corresponds to a joint in the human body. The relationship between the root node and a leaf node is therefore analogous to the relationship between the root joint and an end joint. Human motion usually propagates from the root joint to the end joints; however, to determine the position of a joint, both its parent and its children should be considered. To this end, imitating the physical forward-kinematics and inverse-kinematics processes, we design the SeBiReNet as shown in Fig. 2 in Sect. 3.1.
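The forward direction corresponds to the standard forward-kinematics recursion, in which each joint's pose follows from its parent's. The sketch below is textbook kinematics rather than the paper's exact formulation; the rotation and offset conventions are our assumptions:

```python
import numpy as np

def forward_kinematics(parents, rotations, offsets):
    """Compute joint positions from local rotations and bone offsets.

    parents: parent ids in topological order (root first, parents[0] == -1)
    rotations: (n, 3, 3) local rotation matrices; offsets: (n, 3) bone vectors.
    """
    n = len(parents)
    pos = np.zeros((n, 3))                  # root (hip center) at the origin
    glob = np.zeros((n, 3, 3))              # accumulated global rotations
    for j in range(n):
        p = parents[j]
        if p < 0:
            glob[j] = rotations[j]
        else:
            glob[j] = glob[p] @ rotations[j]
            pos[j] = pos[p] + glob[p] @ offsets[j]
    return pos
```

Inverse kinematics runs the other way, inferring joint states from end-effector positions; this is what the leaf-to-root pass of the SeBiReNet imitates.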
A.3 Differences from Other Tree Structures
The conventional recursive neural network shown in Fig. 12 is one of the most widely used tree-structured networks, especially in text and language analysis. In a conventional recursive neural network, information flows from the leaf nodes to the root node, so it is strong at summarizing the input and extracting a semantic understanding. Compared with the recursive neural network and other tree structures, our SeBiReNet differs in the following aspects (a minimal sketch of the design follows the list):
a. Structure: The proposed SeBiReNet sequentially connects two tree structures with opposite information-flow directions. As shown in Fig. 2, we call the left part (a conventional recursive NN) the recursive subnetwork and the right part the diffuse subnetwork, according to their information-flow directions. The recursive subnetwork imitates the inverse-kinematics process, and the diffuse subnetwork imitates the forward-kinematics process. Compared with other tree structures, the SeBiReNet has an iterative, directed inference process with ordered dependency pairs.
b. Hidden states: We share the hidden states between corresponding nodes in the two subnetworks, since they represent the same joint in the human body. In this way, the result of the recursive pass stored in the shared space can be refined by the diffuse subnetwork, and vice versa, so the whole SeBiReNet can run recurrently to achieve a better result.
c. Outputs: The output of the SeBiReNet is generated from the shared hidden states of all joints. This differs from the recursive network, which considers only the output of the root node, and from the common fusion design that concatenates the results of the two subnetworks.
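A minimal PyTorch sketch of points (a)–(c) is given below. The cell type (GRU), hidden size, and summed child-message aggregation are our assumptions, not the authors' exact design (see Sect. 3.1 for that):

```python
import torch
import torch.nn as nn

class SeBiReNetSketch(nn.Module):
    """Two tree passes over shared per-joint hidden states: a leaf-to-root
    recursive pass (inverse kinematics) followed by a root-to-leaf diffuse
    pass (forward kinematics). A sketch, not the authors' exact cells."""

    def __init__(self, parents, in_dim=3, hid=64):
        super().__init__()
        self.parents = parents                          # topological order assumed
        self.children = [[] for _ in parents]
        for j, p in enumerate(parents):
            if p >= 0:
                self.children[p].append(j)
        self.up_cell = nn.GRUCell(in_dim + hid, hid)    # recursive subnetwork
        self.down_cell = nn.GRUCell(in_dim + hid, hid)  # diffuse subnetwork
        self.hid = hid

    def forward(self, x, n_iters=2):
        # x: (num_joints, in_dim); h is the shared hidden state per joint (b)
        n = len(self.parents)
        h = [x.new_zeros(self.hid) for _ in range(n)]
        for _ in range(n_iters):                        # recurrent refinement (b)
            for j in reversed(range(n)):                # leaves -> root (a)
                msg = sum((h[c] for c in self.children[j]), x.new_zeros(self.hid))
                inp = torch.cat([x[j], msg]).unsqueeze(0)
                h[j] = self.up_cell(inp, h[j].unsqueeze(0))[0]
            for j in range(n):                          # root -> leaves (a)
                p = self.parents[j]
                msg = h[p] if p >= 0 else x.new_zeros(self.hid)
                inp = torch.cat([x[j], msg]).unsqueeze(0)
                h[j] = self.down_cell(inp, h[j].unsqueeze(0))[0]
        return torch.cat(h)                             # read out all joints (c)
```

Because both passes write into the same per-joint states, each iteration lets the diffuse pass refine what the recursive pass produced, and vice versa, which is the recurrent behavior described in (b).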
A.4 Inference Process of the SeBiReNet
![figure a](http://media.springernature.com/lw685/springer-static/image/art%3A10.1007%2Fs11263-020-01354-7/MediaObjects/11263_2020_1354_Figa_HTML.png)
A.5 Network Used for the Ablation Study in Sect. 5.3.1
![figure b](http://media.springernature.com/lw685/springer-static/image/art%3A10.1007%2Fs11263-020-01354-7/MediaObjects/11263_2020_1354_Figb_HTML.png)
Appendix B Pose Recovery Results on Unseen Action Datasets
![figure c](http://media.springernature.com/lw685/springer-static/image/art%3A10.1007%2Fs11263-020-01354-7/MediaObjects/11263_2020_1354_Figc_HTML.png)
Appendix C Pose View Transfer Based on the Feature Disentanglement in the Latent Space
![figure d](http://media.springernature.com/lw685/springer-static/image/art%3A10.1007%2Fs11263-020-01354-7/MediaObjects/11263_2020_1354_Figd_HTML.png)
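Figure d shows qualitative view-transfer results. Conceptually, view transfer reduces to recombining the disentangled latent features; the sketch below uses stand-in encoder/decoder callables for the trained denoising autoencoder (names and signatures are ours, not the paper's API):

```python
def view_transfer(encoder, decoder, pose_a, pose_b):
    """Render pose_a under pose_b's viewpoint by swapping the view-variant
    part of the latent code. A sketch of the idea; the trained denoising
    autoencoder performs the actual feature disentanglement."""
    inv_a, _var_a = encoder(pose_a)   # (view-invariant, view-variant) features
    _inv_b, var_b = encoder(pose_b)
    return decoder(inv_a, var_b)      # pose_a's pose content, pose_b's view
```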
Appendix D Unsupervised Classification with Clustering Methods
To demonstrate the effectiveness of the learned representation more rigorously, we compare the clustering results obtained from the learned pose representation with those obtained from raw joint coordinates, using several clustering methods, as shown in Table 7. Unlike supervised methods, the output of clustering methods is difficult to evaluate intuitively. We therefore adopt purity and the adjusted Rand index (ARI) as measures of clustering quality. Purity measures the extent to which each cluster contains a single class. The ARI (\(\in [-1,1]\)) measures how well the clustering coincides with the true classes. Although both reflect clustering quality, ARI is the better measure of the correctness of unsupervised classification.
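Both measures are standard and can be computed as follows, with purity derived from the class-by-cluster contingency matrix and ARI taken directly from scikit-learn:

```python
from sklearn.metrics import adjusted_rand_score
from sklearn.metrics.cluster import contingency_matrix

def purity(labels_true, labels_pred):
    """Fraction of samples assigned to the majority class of their cluster."""
    m = contingency_matrix(labels_true, labels_pred)  # classes x clusters
    return m.max(axis=0).sum() / m.sum()

# ari = adjusted_rand_score(labels_true, labels_pred)   # value in [-1, 1]
```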
In Table 7, we compare four clustering methods: K-means, Mean Shift, hierarchical clustering, and the Gaussian mixture model (GMM). For K-means, hierarchical clustering, and the GMM, we set the number of clusters k equal to the true number of classes; Mean Shift determines the number of clusters automatically, and we denote the resulting number by \(k^*\). Because K-means and the GMM are sensitive to parameter initialization, we report the average results and their standard deviations. The GMM is ill-suited to high-dimensional data because it requires a large amount of memory; to make it tractable, we use principal component analysis (PCA) to reduce the dimension of the action representation from \(TimeSteps*f_{vi}\) to 100.
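A possible scikit-learn setup for this comparison is sketched below; any hyperparameters beyond k and the 100-dimensional PCA are our defaults, not values from the paper:

```python
from sklearn.cluster import AgglomerativeClustering, KMeans, MeanShift
from sklearn.decomposition import PCA
from sklearn.mixture import GaussianMixture

def cluster_all(X, k):
    """X: (num_sequences, TimeSteps * f_vi) flattened representations."""
    preds = {
        "kmeans": KMeans(n_clusters=k, n_init=10).fit_predict(X),
        "hierarchical": AgglomerativeClustering(n_clusters=k).fit_predict(X),
        "meanshift": MeanShift().fit_predict(X),      # finds k* on its own
    }
    X_100 = PCA(n_components=100).fit_transform(X)    # keep the GMM tractable
    preds["gmm"] = GaussianMixture(n_components=k).fit_predict(X_100)
    return preds
```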
On the Northwestern-UCLA dataset, the purity and ARI obtained from the learned representation are much higher than those obtained from joint coordinates with K-means, hierarchical clustering, and the GMM. Because Mean Shift yields far more clusters than there are true classes, its purity and ARI values are not meaningful; however, the number of clusters it finds automatically from the learned representation is far smaller than the 234 clusters it finds from joint coordinates. Overall, the learned representation performs much better than joint coordinates for unsupervised action clustering.
Unlike LSTM-based learning methods, clustering methods cannot handle temporal variation within an action, so clustering quality degrades when action lengths vary widely, as in a large dataset such as NTU RGB+D. As Table 7 shows, neither the learned representation nor the joint coordinates yield good clustering results on NTU RGB+D. Nevertheless, the ARI obtained from the learned representation remains higher than that obtained from joint coordinates, indicating that the learned representation still leads to better action classification. With Mean Shift, the number of clusters obtained from the learned representation is also much smaller than that obtained from joint coordinates.
Cite this article

Nie, Q., & Liu, Y. (2021). View transfer on human skeleton pose: Automatically disentangle the view-variant and view-invariant information for pose representation learning. International Journal of Computer Vision, 129, 1–22. https://doi.org/10.1007/s11263-020-01354-7