Human action recognition based on quaternion spatial-temporal convolutional neural network and LSTM in RGB videos

Multimedia Tools and Applications

Abstract

Convolutional neural networks (CNNs) are the state-of-the-art method for action recognition across many kinds of datasets. However, most existing CNN models rely on low-level handcrafted features computed from gray-scale or RGB image sequences in small datasets, and therefore generalize poorly to realistic scenarios. We propose a new deep network for action recognition, called QST-CNN-LSTM, that integrates a quaternion spatial-temporal convolutional neural network (QST-CNN) with a long short-term memory (LSTM) network. Unlike a traditional CNN, the QST-CNN takes a quaternion representation of each RGB image as input, so the red, green, and blue channel values are processed jointly as a whole in the spatial convolutional layer, which avoids the loss of spatial color structure. Because raw video frames are large and contain redundant background, we first extract key motion regions from the RGB videos using an improved codebook algorithm. The QST-CNN is then combined with an LSTM to capture dependencies between video clips. Experiments demonstrate that QST-CNN-LSTM improves recognition rates on the Weizmann, UCF Sports, and UCF11 datasets.
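
To make the quaternion input concrete: in quaternion color-image processing, an RGB pixel is conventionally encoded as a pure quaternion 0 + R·i + G·j + B·k, and a quaternion convolution replaces the usual scalar multiply with the Hamilton product, so the three channels are filtered jointly rather than as independent planes. The sketch below illustrates this convention in NumPy; it is a reconstruction under that assumption, not the authors' implementation, and the function names (rgb_to_quaternion, quaternion_conv2d) are hypothetical.

import numpy as np

def rgb_to_quaternion(img):
    # Encode an RGB image (H, W, 3) as pure quaternions (H, W, 4):
    # real part 0; the i, j, k parts carry R, G, B, so the three
    # channels form a single algebraic object per pixel.
    h, w, _ = img.shape
    q = np.zeros((h, w, 4), dtype=np.float64)
    q[..., 1:] = img
    return q

def hamilton_product(p, q):
    # Hamilton product of quaternion arrays with shape (..., 4).
    a1, b1, c1, d1 = np.moveaxis(p, -1, 0)
    a2, b2, c2, d2 = np.moveaxis(q, -1, 0)
    return np.stack([
        a1 * a2 - b1 * b2 - c1 * c2 - d1 * d2,
        a1 * b2 + b1 * a2 + c1 * d2 - d1 * c2,
        a1 * c2 - b1 * d2 + c1 * a2 + d1 * b2,
        a1 * d2 + b1 * c2 - c1 * b2 + d1 * a2,
    ], axis=-1)

def quaternion_conv2d(q_img, kernel):
    # Valid 2-D filtering (cross-correlation, as is usual in CNNs)
    # in which every multiply is a Hamilton product.
    # q_img: (H, W, 4); kernel: (kh, kw, 4) quaternion weights.
    h, w, _ = q_img.shape
    kh, kw, _ = kernel.shape
    out = np.zeros((h - kh + 1, w - kw + 1, 4))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            window = q_img[i:i + kh, j:j + kw]
            out[i, j] = hamilton_product(window, kernel).sum(axis=(0, 1))
    return out

# Toy usage: one 8x8 RGB frame filtered by one 3x3 quaternion kernel.
frame = np.random.rand(8, 8, 3)
kernel = np.random.randn(3, 3, 4) * 0.1
feature_map = quaternion_conv2d(rgb_to_quaternion(frame), kernel)
print(feature_map.shape)  # (6, 6, 4)

In the full pipeline described in the abstract, per-clip feature maps from stacked spatial and temporal quaternion convolutions would be flattened and fed to an LSTM so that dependencies across successive clips can be modeled.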

Acknowledgments

This work was supported by the National Natural Science Foundation of China (61602108), the Jilin Science and Technology Innovation Developing Scheme (20166016), and the Electric Power Intelligent Robot Collaborative Innovation Group.

Author information

Corresponding author

Correspondence to XueJun Liu.

Ethics declarations

Conflict of interest

The authors declare that they have no conflict of interest.

About this article

Cite this article

Meng, B., Liu, X. & Wang, X. Human action recognition based on quaternion spatial-temporal convolutional neural network and LSTM in RGB videos. Multimed Tools Appl 77, 26901–26918 (2018). https://doi.org/10.1007/s11042-018-5893-9
