Abstract
The two-stream CNN is a widely used network for human action recognition. It consists of a spatial stream and a temporal stream. The spatial stream, which takes RGB images as input, extracts the shape features of human motion. The temporal stream, which takes optical flow images as input, extracts the sequential features of the motion. However, optical flow rests on constraints such as brightness constancy and piecewise smoothness, and these constraints limit the performance of the two-stream CNN. One effective way to address this problem is to expand the model to a three-stream network, fuse it with an LSTM, and add a modified pooling layer. This approach improves the performance of the model but increases the computational cost, and the limitations of optical flow remain. In this paper, without extending the network model, the optical flow in the two-stream CNN is replaced with binary dense SIFT flow. Unlike optical flow, binary dense SIFT flow, which is a feature-based matching flow field, is robust to violations of the brightness constancy and piecewise smoothness assumptions. To evaluate the binary dense SIFT flow-based two-stream CNN, the UCF-101 dataset was selected for human action recognition. Furthermore, to evaluate robustness with respect to brightness constancy and piecewise smoothness, a custom dataset was built from classes extracted from UCF-101. Finally, the proposed method was compared with the state of the art, which uses an optical flow-based two-stream CNN.
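The brightness constancy issue mentioned above can be illustrated with a small sketch. It is not the binary dense SIFT descriptor used in the paper: the gradient-sign descriptor, the function names, and the synthetic frames below are all assumptions chosen for illustration. The sketch compares an intensity-based residual, which is the kind of quantity optical flow assumes stays near zero for matching pixels, with a binary feature-based distance when the same static scene is uniformly brightened.

```python
import numpy as np


def brightness_residual(frame_a, frame_b):
    """Mean absolute intensity difference between two frames.

    Under brightness constancy this should be near zero when nothing moves.
    """
    return float(np.mean(np.abs(frame_a - frame_b)))


def binary_gradient_descriptor(frame):
    """Toy binary descriptor (hypothetical stand-in, not binary dense SIFT):
    the sign of horizontal and vertical intensity differences."""
    dx = frame[:, 1:] - frame[:, :-1]
    dy = frame[1:, :] - frame[:-1, :]
    return dx > 0, dy > 0


def hamming_distance(desc_a, desc_b):
    """Number of differing bits across both descriptor planes."""
    return int(sum(np.count_nonzero(a != b) for a, b in zip(desc_a, desc_b)))


# Synthetic frame and a uniformly brightened copy: same scene, no motion.
rng = np.random.default_rng(0)
frame = rng.integers(0, 200, size=(32, 32)).astype(np.float64)
brighter = frame + 40.0

residual = brightness_residual(frame, brighter)
distance = hamming_distance(
    binary_gradient_descriptor(frame),
    binary_gradient_descriptor(brighter),
)

print(residual)  # intensity residual is nonzero although nothing moved
print(distance)  # binary gradient descriptor is unchanged by the offset
```

In this toy setting the intensity residual equals the brightness offset while the binary descriptor distance is zero, which is the intuition behind matching on features rather than raw intensities. Real illumination changes are not uniform, so this only sketches the idea.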
Acknowledgements
This research was supported by the Basic Science Research Program through the National Research Foundation of Korea (NRF) funded by the Ministry of Education (Grants No. NRF-2016R1D1A1B01016071 and NRF-2019R1A2C108974211).
Cite this article
Park, S.K., Chung, J.H., Kang, T.K. et al. Binary dense sift flow based two stream CNN for human action recognition. Multimed Tools Appl 80, 35697–35720 (2021). https://doi.org/10.1007/s11042-021-10795-2
DOI: https://doi.org/10.1007/s11042-021-10795-2