Binary dense sift flow based two stream CNN for human action recognition

Park, Sang Kyoo; Chung, Jun Ho; Kang, Tae Koo; Lim, Myo Taeg

doi:10.1007/s11042-021-10795-2

1166: Advances of machine learning in data analytics and visual information processing
Published: 10 June 2021

Volume 80, pages 35697–35720, (2021)
Multimedia Tools and Applications

Sang Kyoo Park¹, Jun Ho Chung¹, Tae Koo Kang² & Myo Taeg Lim¹ (ORCID: orcid.org/0000-0003-2990-8066)

Abstract

The two-stream CNN is a widely used network for human action recognition. It consists of a spatial stream and a temporal stream. The spatial stream, which takes RGB images as input, extracts the shape features of human motion. The temporal stream, which takes optical flow images as input, extracts features of the motion sequence. However, optical flow rests on constraints such as brightness constancy and piecewise smoothness, and these constraints limit the performance of the two-stream CNN. One effective way to address this problem is to expand the network model to a three-stream network, fuse it with an LSTM, and add a modified pooling layer. This approach improves the performance of the model but increases the computational cost, and the limitations of optical flow remain. In this paper, without extending the network model, binary dense SIFT flow is used in place of optical flow as the input to the temporal stream of a two-stream CNN. Unlike optical flow, binary dense SIFT flow, which is a feature-based matching flow field, is robust with respect to brightness constancy and piecewise smoothness. To evaluate the binary dense SIFT flow-based two-stream CNN, the UCF-101 dataset was selected for human action recognition. Furthermore, to evaluate its robustness with respect to brightness constancy and piecewise smoothness, a custom dataset was built from classes extracted from UCF-101. Finally, the proposed method was compared with state-of-the-art methods that use an optical flow-based two-stream CNN.
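The brightness constancy constraint mentioned in the abstract can be illustrated with a small numerical sketch. The code below is not the authors' method and does not implement binary dense SIFT flow; it only shows, under simplified assumptions, why an intensity-based match breaks under a uniform brightness change while a gradient-based feature does not. The function name `matching_residuals`, the circular shift, and the horizontal finite-difference gradient (used here as a crude stand-in for a gradient-based descriptor such as SIFT) are all hypothetical choices made for illustration.

```python
import numpy as np


def matching_residuals(frame1, shift, brightness_offset):
    """Compare intensity-based and gradient-based matching errors.

    frame2 is synthesized from frame1 by a circular horizontal shift
    plus a uniform brightness change, so the true displacement is known.
    """
    # Second frame: same content moved `shift` pixels and brightened.
    frame2 = np.roll(frame1, shift, axis=1) + brightness_offset

    # Warp frame2 back using the true displacement.
    warped = np.roll(frame2, -shift, axis=1)

    # Brightness constancy assumes warped == frame1; the offset violates it.
    intensity_residual = float(np.abs(warped - frame1).mean())

    def horizontal_gradient(img):
        # Circular forward difference (illustrative stand-in for a
        # gradient-based descriptor; NOT an actual SIFT descriptor).
        return np.roll(img, -1, axis=1) - img

    # A uniform offset cancels in the difference, so gradients still match.
    gradient_residual = float(
        np.abs(horizontal_gradient(warped) - horizontal_gradient(frame1)).mean()
    )
    return intensity_residual, gradient_residual


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    frame = rng.integers(0, 200, size=(8, 8)).astype(float)
    print(matching_residuals(frame, shift=2, brightness_offset=30.0))
```

At the true displacement, the intensity residual equals the brightness offset while the gradient residual is zero. This is consistent with the abstract's argument that a feature-based flow field is less sensitive to brightness changes than a flow that assumes constant pixel intensity.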

Acknowledgements

This research was supported by the Basic Science Research Program through the National Research Foundation of Korea (NRF) funded by the Ministry of Education (Grants No. NRF-2016R1D1A1B01016071 and NRF-2019R1A2C108974211).

Author information

Authors and Affiliations

School of Electrical Engineering, Korea University, Seoul, Republic of Korea
Sang Kyoo Park, Jun Ho Chung & Myo Taeg Lim
Department of Human Intelligence and Robot Engineering, Sangmyung University, Cheonan, Republic of Korea
Tae Koo Kang

Corresponding authors

Correspondence to Tae Koo Kang or Myo Taeg Lim.

Additional information

Publisher’s note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

About this article

Cite this article

Park, S.K., Chung, J.H., Kang, T.K. et al. Binary dense sift flow based two stream CNN for human action recognition. Multimed Tools Appl 80, 35697–35720 (2021). https://doi.org/10.1007/s11042-021-10795-2

Received: 03 April 2020
Revised: 10 February 2021
Accepted: 10 March 2021
Published: 10 June 2021
Issue Date: November 2021
DOI: https://doi.org/10.1007/s11042-021-10795-2
