Abstract
Skeleton-based human action recognition has recently drawn increasing attention thanks to the availability of low-cost motion capture devices and the accessibility of large-scale 3D skeleton datasets. One of the key challenges in action recognition lies in the high dimensionality of the captured data. In recent works, researchers have drawn inspiration from the success of deep learning in computer vision to improve the performance of action recognition systems. Unfortunately, most of these studies develop new architectures instead of leveraging the wide range of deep architectures already available, many of which achieve very high accuracy on various image classification problems. In this paper, we reuse such architectures, already pre-trained on image classification tasks. Skeleton sequences are first transformed into an image-like data representation. The resulting images are then used to train different state-of-the-art CNN architectures following different training procedures. The experimental results obtained on the popular NTU RGB+D dataset are very promising and outperform most state-of-the-art results.
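To make the pipeline described above concrete, the following is a minimal sketch, not the authors' exact implementation: a skeleton sequence of shape (frames, joints, 3) is min-max normalised per coordinate axis and interpreted as an RGB image whose rows index joints and columns index frames, and an ImageNet-pretrained CNN is then fine-tuned on such images. The choice of ResNet-50 as backbone, the normalisation scheme, and the 224x224 input size are assumptions made for illustration; only the overall idea, an image-like skeleton encoding fed to a pre-trained CNN, comes from the abstract.

```python
# Sketch only: image-like encoding of a skeleton sequence + fine-tuning a
# pre-trained CNN. Backbone, normalisation and input size are assumptions.
import numpy as np
import torch.nn as nn
from torchvision import models, transforms
from PIL import Image

def skeleton_to_image(sequence: np.ndarray) -> Image.Image:
    """Map a (frames, joints, 3) skeleton sequence to an RGB image.

    Each coordinate axis (x, y, z) is min-max normalised over the whole
    sequence and used as one colour channel; rows index joints, columns frames.
    """
    seq = sequence.astype(np.float32)
    mins = seq.min(axis=(0, 1), keepdims=True)            # per-axis minimum
    maxs = seq.max(axis=(0, 1), keepdims=True)            # per-axis maximum
    norm = (seq - mins) / (maxs - mins + 1e-8)             # values in [0, 1]
    pixels = (255.0 * norm).astype(np.uint8)               # (frames, joints, 3)
    return Image.fromarray(pixels.transpose(1, 0, 2))      # (joints, frames, 3)

# Fine-tuning an ImageNet-pretrained ResNet-50 on the resulting images
# (num_classes = 60 for NTU RGB+D; newer torchvision versions use the
# `weights=` argument instead of `pretrained=True`).
num_classes = 60
model = models.resnet50(pretrained=True)
model.fc = nn.Linear(model.fc.in_features, num_classes)    # replace classifier head

preprocess = transforms.Compose([
    transforms.Resize((224, 224)),                          # fixed CNN input size
    transforms.ToTensor(),
    transforms.Normalize(mean=[0.485, 0.456, 0.406],        # ImageNet statistics
                         std=[0.229, 0.224, 0.225]),
])

# Example forward pass on one random sequence (75 frames, 25 Kinect joints).
dummy_sequence = np.random.rand(75, 25, 3)
x = preprocess(skeleton_to_image(dummy_sequence)).unsqueeze(0)  # (1, 3, 224, 224)
logits = model(x)                                               # (1, num_classes)
```

Freezing or fine-tuning different subsets of the backbone layers would correspond to the "different training procedures" mentioned in the abstract; which subsets the authors actually use is not stated here.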
Copyright information
© 2019 Springer Nature Switzerland AG
About this paper
Cite this paper
Laraba, S., Tilmanne, J., Dutoit, T. (2019). Leveraging Pre-trained CNN Models for Skeleton-Based Action Recognition. In: Tzovaras, D., Giakoumis, D., Vincze, M., Argyros, A. (eds) Computer Vision Systems. ICVS 2019. Lecture Notes in Computer Science, vol. 11754. Springer, Cham. https://doi.org/10.1007/978-3-030-34995-0_56
DOI: https://doi.org/10.1007/978-3-030-34995-0_56
Publisher Name: Springer, Cham
Print ISBN: 978-3-030-34994-3
Online ISBN: 978-3-030-34995-0
eBook Packages: Computer Science, Computer Science (R0)