Two-stream spatiotemporal feature fusion for human action recognition

  • Original article

The Visual Computer

Abstract

Human action recognition remains a challenging topic in computer vision that has attracted a large number of researchers. It is important in a variety of applications such as intelligent video surveillance, sports analysis, and human–computer interaction. Recent works attempt to exploit progress in deep learning architectures to learn spatial and temporal features from action videos; however, it remains unclear how best to combine spatial and temporal information within a convolutional neural network. In this paper, we propose a novel human action recognition method that fuses spatial and temporal features learned by a simple unsupervised convolutional neural network, the principal component analysis network (PCANet), in combination with bag-of-features (BoF) and vector of locally aggregated descriptors (VLAD) encoding schemes. First, spatial and temporal features are learned via PCANet from a subset of frames and from temporal templates of each video, and their dimensionality is reduced using a whitening transformation (WT). The temporal templates are computed as short-time motion energy images (ST-MEI) based on frame differencing. An encoding scheme is then applied to represent the final dual spatiotemporal PCANet features through feature fusion. Finally, a support vector machine (SVM) classifier is employed for action recognition. Extensive experiments have been performed on two popular datasets, KTH and UCF Sports, to evaluate the performance of the proposed method. Experimental results under a leave-one-out evaluation strategy demonstrate that the proposed method yields satisfactory and comparable results on both datasets.
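To make the pipeline concrete, here is a minimal sketch of the temporal-template step, assuming grayscale frames; the window length and differencing threshold below are illustrative placeholders, not the parameters used in the paper:

```python
import numpy as np

def st_mei(frames, window=5, threshold=30):
    """Short-time motion energy images (ST-MEI) via frame differencing.

    `frames` is a sequence of grayscale frames (H x W, uint8). Each
    template accumulates thresholded frame differences over a short
    temporal window; `window` and `threshold` are illustrative values.
    """
    templates = []
    for start in range(0, len(frames) - window + 1, window):
        energy = np.zeros(frames[0].shape, dtype=np.float32)
        for t in range(start + 1, start + window):
            diff = np.abs(frames[t].astype(np.int16) - frames[t - 1].astype(np.int16))
            energy += (diff > threshold).astype(np.float32)  # binary motion mask
        templates.append(energy / (window - 1))  # normalize energy to [0, 1]
    return templates
```

And a hedged sketch of the VLAD encoding stage, assuming the local PCANet descriptors are stacked as rows and the codebook is learned with k-means; the power- and L2-normalization steps follow common VLAD practice and may differ from the paper's exact configuration:

```python
import numpy as np
from sklearn.cluster import KMeans

def vlad_encode(descriptors, codebook):
    """VLAD: sum the residuals of local descriptors to their nearest
    codebook centers, then power- and L2-normalize the concatenation.

    `descriptors`: (N, D) array of local features (e.g., PCANet features);
    `codebook`: a fitted sklearn KMeans model with K cluster centers.
    """
    centers = codebook.cluster_centers_            # (K, D)
    labels = codebook.predict(descriptors)         # nearest center per descriptor
    vlad = np.zeros_like(centers, dtype=np.float32)
    for k in range(centers.shape[0]):
        mask = labels == k
        if mask.any():
            vlad[k] = (descriptors[mask] - centers[k]).sum(axis=0)  # residual sum
    vlad = vlad.ravel()
    vlad = np.sign(vlad) * np.sqrt(np.abs(vlad))   # power normalization
    return vlad / (np.linalg.norm(vlad) + 1e-12)   # L2 normalization

# Usage sketch: codebook = KMeans(n_clusters=64, n_init=10).fit(train_descriptors)
```

The remaining stages have standard counterparts: whitening can be performed with scikit-learn's PCA(whiten=True), and the fused BoF/VLAD vectors classified with a linear SVM such as LinearSVC. These are stand-ins for illustration, not necessarily the authors' implementation.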



Acknowledgements

The author Saleh Aly extends his appreciation to the Deanship of Scientific Research at Majmaah University for funding this work under project number RGP-2019-24.

Author information

Corresponding author

Correspondence to Saleh Aly.

Ethics declarations

Conflict of interest

The authors declare that they have no conflict of interest.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.


About this article

Cite this article

Abdelbaky, A., Aly, S. Two-stream spatiotemporal feature fusion for human action recognition. Vis Comput 37, 1821–1835 (2021). https://doi.org/10.1007/s00371-020-01940-3
