
Action representation and recognition through temporal co-occurrence of flow fields and convolutional neural networks

Published in: Multimedia Tools and Applications

Abstract

Many applications, from human-machine interaction to intelligent video surveillance, require action recognition capabilities. Action recognition in video sequences cannot rely on simply processing raw color images or optical flow fields. Color images convey the appearance of moving objects but lack motion features, and they are highly sensitive to variations in clothing and camera pose that degrade recognition accuracy. In turn, raw optical flow measures instantaneous motion rather than the overall dynamics of an action, and is sensitive to noise. More robust and meaningful motion features and classifiers are thus required for action recognition to be reliable. This paper proposes a new action recognition technique based on a deep convolutional neural network (CNN) fed with Histograms of Optical Flow Co-Occurrence (HOF-CO) motion features. HOF-CO is a robust motion representation previously proposed by the authors that encodes the relative frequency of pairs of optical flow directions computed at each image pixel. Experimental results show that this approach outperforms state-of-the-art action recognition methods on three public datasets: KTH, UCF-11 YouTube and Hollywood2.
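To make the descriptor concrete, the sketch below illustrates one plausible reading of the HOF-CO idea under stated assumptions: flow directions are quantized into a fixed number of angular bins, near-static pixels are discarded with a magnitude threshold, and the relative frequencies of (direction at frame t, direction at frame t + lag) pairs are accumulated into a co-occurrence matrix that is flattened into a feature vector for the CNN. The function name `hof_co` and the parameters `n_bins` and `mag_thresh` are illustrative assumptions, not the authors' published parameterization.

```python
import numpy as np

def hof_co(flow_t, flow_t_lag, n_bins=8, mag_thresh=0.5):
    """Minimal sketch of a Histogram of Optical Flow Co-Occurrence feature.

    flow_t, flow_t_lag : (H, W, 2) dense optical flow fields (u, v) at two
    time instants separated by a fixed temporal lag. The bin count,
    magnitude threshold and normalization are illustrative choices only.
    """
    def quantize(flow):
        u, v = flow[..., 0], flow[..., 1]
        mag = np.hypot(u, v)                  # flow magnitude per pixel
        ang = np.arctan2(v, u)                # flow direction in [-pi, pi]
        bins = ((ang + np.pi) / (2 * np.pi) * n_bins).astype(int) % n_bins
        return bins, mag > mag_thresh         # ignore near-static pixels

    b1, m1 = quantize(flow_t)
    b2, m2 = quantize(flow_t_lag)
    valid = m1 & m2                           # pixels moving at both instants

    # Accumulate the co-occurrence matrix of direction pairs.
    co = np.zeros((n_bins, n_bins), dtype=np.float64)
    np.add.at(co, (b1[valid], b2[valid]), 1.0)

    total = co.sum()
    if total > 0:
        co /= total                           # relative frequencies
    return co.ravel()                         # flattened descriptor for the CNN
```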






Author information


Corresponding author

Correspondence to Hatem A. Rashwan.

Additional information

Publisher’s note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.


About this article


Cite this article

Rashwan, H.A., Garcia, M.A., Abdulwahab, S. et al. Action representation and recognition through temporal co-occurrence of flow fields and convolutional neural networks. Multimed Tools Appl 79, 34141–34158 (2020). https://doi.org/10.1007/s11042-020-09194-w
