Abstract
Convolutional neural networks (CNNs) are the state-of-the-art method for action recognition across many kinds of datasets. However, most existing CNN models rely on lower-level handcrafted features extracted from gray or RGB image sequences in small datasets, and therefore generalize poorly to realistic scenarios. We therefore propose a new deep learning network for action recognition, called QST-CNN-LSTM, which integrates a quaternion spatial-temporal convolutional neural network (QST-CNN) with a Long Short-Term Memory (LSTM) network. Unlike a traditional CNN, the QST-CNN takes a quaternion expression of an RGB image as input, so the values of the red, green, and blue channels are treated jointly as a whole in the spatial convolutional layer, avoiding the loss of spatial features. Because the raw frames in video datasets are large and contain redundant background, we pre-extract key motion regions from RGB videos using an improved codebook algorithm. Furthermore, the QST-CNN is combined with an LSTM to capture dependencies between different video clips. Experiments demonstrate that QST-CNN-LSTM improves recognition rates on the Weizmann, UCF Sports, and UCF11 datasets.
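The quaternion encoding of color mentioned in the abstract can be illustrated concretely: an RGB pixel is written as a pure quaternion 0 + R·i + G·j + B·k, and any quaternion-valued weight then acts on all three channels at once rather than channel by channel. The sketch below is only an illustration of that algebraic idea (the `qmul` helper and the rotation-style weight are our own minimal example, not the paper's actual QST-CNN layer).

```python
import numpy as np

def qmul(p, q):
    """Hamilton product of two quaternions given as (w, x, y, z)."""
    pw, px, py, pz = p
    qw, qx, qy, qz = q
    return np.array([
        pw*qw - px*qx - py*qy - pz*qz,
        pw*qx + px*qw + py*qz - pz*qy,
        pw*qy - px*qz + py*qw + pz*qx,
        pw*qz + px*qy - py*qx + pz*qw,
    ])

# An RGB pixel becomes a pure quaternion: 0 + R*i + G*j + B*k
pixel = np.array([0.0, 0.8, 0.4, 0.1])  # (w, R, G, B)

# A unit quaternion acts as a rotation in RGB space: the sandwich
# product w * p * conj(w) mixes R, G, and B jointly, which is the
# sense in which color is processed "as a whole".
theta = np.pi / 6
axis = np.array([1.0, 1.0, 1.0]) / np.sqrt(3)
w = np.concatenate([[np.cos(theta / 2)], np.sin(theta / 2) * axis])
w_conj = w * np.array([1.0, -1.0, -1.0, -1.0])

rotated = qmul(qmul(w, pixel), w_conj)
# rotated is still a pure quaternion (scalar part ~0), with the same
# magnitude as the input but jointly transformed R, G, B components.
```

Treating the three channels as one algebraic object is what distinguishes this from a conventional convolution, which would apply an independent real-valued filter to each channel and sum the results.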
Acknowledgments
This work was supported by the National Natural Science Foundation of China (Grant 61602108), the Jilin Science and Technology Innovation Development Scheme (20166016), and the Electric Power Intelligent Robot Collaborative Innovation Group.
Ethics declarations
Conflict of interest
The authors declare that they have no conflict of interest.
Cite this article
Meng, B., Liu, X. & Wang, X. Human action recognition based on quaternion spatial-temporal convolutional neural network and LSTM in RGB videos. Multimed Tools Appl 77, 26901–26918 (2018). https://doi.org/10.1007/s11042-018-5893-9