
Deep Packet Flow: Action Recognition via Multiresolution Deep Wavelet Packet of Local Dense Optical Flows

Published in: Journal of Signal Processing Systems

Abstract

Action recognition with dynamic actors and scenes has long been an active research topic. Recently, spatio-temporal features such as optical flow have been used to represent motion over time. To increase accuracy, however, deeper decomposition is needed to enrich the information carried by location- or time-varying actions. To this end, we propose an algorithm whose feature vectors are obtained by multiresolution analysis of motion using the Haar Wavelet Packet (HWP) over time. The computational efficiency and robustness of HWP have made it popular in texture analysis, but its applicability to motion analysis remains largely unexplored. To extract the representation, each bin of the Histogram of Flow (HOF), tracked over a sequence of frames, is treated as a signal channel. Deep decomposition is then applied to many levels via wavelet packet decomposition, which we call Packet Flow. This allows us to represent an action's motions at various speeds and ranges, capturing not only the HOF within one frame or one cuboid but also its temporal evolution. HWP, however, is translation covariant, which degrades performance because actions occur at arbitrary times and at various sampling locations. To gain translation invariance, we pool the respective decomposition coefficients at each level. We find that, with proper packet selection, the method gives comparable results on the KTH and Hollywood datasets using the standard train-test split without localization. Even though our spatio-temporal cuboid sampling is less dense than that of the baseline method, we achieve lower complexity and comparable performance on camera-motion-heavy datasets such as UCF Sports, where motion features such as HOF do not perform well.
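The pipeline in the abstract — treat a HOF bin over time as a 1-D signal, decompose it with a Haar wavelet packet tree, then pool coefficients per packet for approximate translation invariance — can be sketched as follows. This is a minimal illustration, not the paper's implementation; the example signal, the number of levels, and the mean-of-absolute-values pooling are assumptions made for demonstration.

```python
import numpy as np

def haar_step(x):
    # One Haar analysis step: pairwise averages (low-pass)
    # and pairwise differences (high-pass), orthonormal scaling.
    x = np.asarray(x, dtype=float)
    if len(x) % 2:                     # pad odd-length signals
        x = np.append(x, x[-1])
    lo = (x[0::2] + x[1::2]) / np.sqrt(2.0)
    hi = (x[0::2] - x[1::2]) / np.sqrt(2.0)
    return lo, hi

def wavelet_packet(x, levels):
    # Full wavelet packet tree: unlike the plain wavelet transform,
    # both the low- and high-pass branches are decomposed further,
    # yielding 2**levels packets at the final level.
    nodes = [np.asarray(x, dtype=float)]
    for _ in range(levels):
        nxt = []
        for node in nodes:
            lo, hi = haar_step(node)
            nxt.extend([lo, hi])
        nodes = nxt
    return nodes

def pooled_descriptor(x, levels):
    # Pool each packet's coefficients (mean of absolute values here)
    # so the descriptor is insensitive to where in time the motion occurs.
    return np.array([np.mean(np.abs(n)) for n in wavelet_packet(x, levels)])

# Example: one HOF bin tracked over 8 frames, treated as a signal channel.
signal = [0.2, 0.4, 0.1, 0.3, 0.9, 0.8, 0.7, 0.6]
print(pooled_descriptor(signal, levels=2))  # 4 pooled packet responses
```

Packets at deeper levels isolate increasingly narrow temporal frequency bands, so slow, sustained motions and fast, abrupt ones end up in different packets; pooling then discards the exact onset time while keeping the per-band energy.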




Acknowledgments

This work was partially supported by KAKENHI (23500211).

Author information

Corresponding author

Correspondence to Novanto Yudistira.

About this article

Cite this article

Yudistira, N., Kurita, T. Deep Packet Flow: Action Recognition via Multiresolution Deep Wavelet Packet of Local Dense Optical Flows. J Sign Process Syst 91, 609–625 (2019). https://doi.org/10.1007/s11265-018-1363-x
