
Binary dense sift flow based two stream CNN for human action recognition

  • 1166: Advances of machine learning in data analytics and visual information processing
Multimedia Tools and Applications

Abstract

The two-stream CNN is a widely used network for human action recognition. It consists of a spatial stream and a temporal stream. The spatial stream, which takes RGB images as input, extracts the shape features of human motion; the temporal stream, which takes optical-flow images as input, extracts the sequential features of the motion. However, because the optical flow relies on assumptions such as brightness constancy and piecewise smoothness, the performance of the two-stream CNN is limited. One common remedy is to expand the model to a three-stream network, fuse it with an LSTM, and add a modified pooling layer. This improves performance but increases the computational cost, and the limitations of the optical flow remain. In this paper, without extending the network model, a binary dense SIFT flow is used in place of the optical flow in the two-stream CNN. Unlike the optical flow, the binary dense SIFT flow, a feature-based matching flow field, is robust to violations of brightness constancy and piecewise smoothness. To evaluate the binary dense SIFT flow-based two-stream CNN, the UCF-101 dataset was selected for human action recognition. Furthermore, to evaluate its robustness to violations of brightness constancy and piecewise smoothness, a custom dataset was built from classes extracted from UCF-101. Finally, the proposed method was compared with the state of the art, which uses an optical flow-based two-stream CNN.
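The two-stream architecture summarized above combines the class scores of a spatial (RGB) stream and a temporal (flow) stream. As an illustration only, a minimal NumPy sketch of this late score fusion is shown below; the function names, fusion weight, and toy logits are hypothetical and not taken from the paper:

```python
import numpy as np

def softmax(x):
    # Numerically stable softmax over the last axis.
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def fuse_two_stream(spatial_logits, temporal_logits, w_spatial=0.5):
    # Late fusion: average the class probabilities produced by the
    # spatial (RGB) stream and the temporal (flow) stream.
    p_spatial = softmax(spatial_logits)
    p_temporal = softmax(temporal_logits)
    return w_spatial * p_spatial + (1.0 - w_spatial) * p_temporal

# Toy logits for three action classes from each stream.
spatial = np.array([2.0, 0.5, 0.1])
temporal = np.array([0.2, 1.8, 0.3])
fused = fuse_two_stream(spatial, temporal)
pred = int(np.argmax(fused))  # index of the predicted action class
```

The fused vector remains a valid probability distribution, and the predicted class is simply its argmax; in practice each stream would be a trained CNN rather than fixed logits.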




Acknowledgements

This research was supported by the Basic Science Research Program through the National Research Foundation of Korea (NRF), funded by the Ministry of Education (Grant Nos. NRF-2016R1D1A1B01016071 and NRF-2019R1A2C108974211).

Author information


Correspondence to Tae Koo Kang or Myo Taeg Lim.

Publisher’s note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.


Cite this article

Park, S.K., Chung, J.H., Kang, T.K. et al. Binary dense SIFT flow based two-stream CNN for human action recognition. Multimed Tools Appl 80, 35697–35720 (2021). https://doi.org/10.1007/s11042-021-10795-2
