
Learning correlations for human action recognition in videos

Published in: Multimedia Tools and Applications

Abstract

Human action recognition in realistic videos is an important and challenging task. Recent studies demonstrate that multi-feature fusion can significantly improve classification performance for human action recognition, and a number of studies therefore employ fusion strategies to combine multiple features, achieving promising results. Nevertheless, previous fusion strategies ignore the correlations of different action categories. To address this issue, we propose a novel multi-feature fusion framework that exploits the correlations among action categories as well as among multiple features. To describe human actions, the framework combines several classical features extracted with deep convolutional neural networks and improved dense trajectories. Extensive experiments are conducted on two challenging datasets to evaluate the effectiveness of our approach, which obtains state-of-the-art classification accuracies of 68.1% and 93.3% on the HMDB51 and UCF101 datasets, respectively. Furthermore, because the correlations are used to combine multiple features, the proposed approach achieves better performance than five classical fusion schemes. To the best of our knowledge, this work is the first attempt to learn the correlations of different action categories for multi-feature fusion.
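The paper's exact fusion formulation is not reproduced on this page, but the general idea of correlation-aware late fusion of classifier scores can be sketched as follows. All names and values here (`fuse_scores`, the per-feature weights, the class-correlation matrix) are illustrative assumptions, not the authors' implementation:

```python
import numpy as np

def fuse_scores(score_list, weights, corr):
    """Late-fusion sketch: combine per-feature classifier scores with
    per-feature weights, then re-weight the fused scores by a learned
    inter-class correlation matrix.

    score_list : list of (n_samples, n_classes) score arrays, one per feature
    weights    : per-feature fusion weights
    corr       : (n_classes, n_classes) class-correlation matrix
    """
    # Weighted sum over features -> (n_samples, n_classes)
    fused = sum(w * s for w, s in zip(weights, score_list))
    # Propagate correlated evidence across action categories
    return fused @ corr

# Toy example: 2 features, 3 classes, 4 samples (values are hypothetical)
rng = np.random.default_rng(0)
scores = [rng.random((4, 3)) for _ in range(2)]
corr = np.eye(3)  # identity matrix reduces this to plain weighted fusion
predictions = fuse_scores(scores, [0.6, 0.4], corr).argmax(axis=1)
```

With `corr` set to the identity matrix this collapses to a standard weighted late-fusion scheme; the contribution described in the abstract is, roughly, learning a non-trivial `corr` so that evidence for related action categories reinforces the correct class.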


Notes

  1. http://lear.inrialpes.fr/people/wang/improved_trajectories

  2. https://github.com/yjxiong/caffe/tree/action_recog

  3. https://www.csie.ntu.edu.tw/~cjlin/libsvmtools/multicore-liblinear/

  4. http://serre-lab.clps.brown.edu/resource/hmdb-a-large-human-motion-database

  5. http://crcv.ucf.edu/data/UCF101.php


Acknowledgments

This work was supported in part by the National Natural Science Foundation of China under Grants 61622115 and 61472281, the Program for Professor of Special Appointment (Eastern Scholar) at Shanghai Institutions of Higher Learning (No. GZ2015005), and the Science and Technology Project of the Education Bureau of Jiangxi Province, China (No. GJJ151001).

Author information

Correspondence to Hanli Wang.

About this article

Cite this article

Yi, Y., Wang, H. & Zhang, B. Learning correlations for human action recognition in videos. Multimed Tools Appl 76, 18891–18913 (2017). https://doi.org/10.1007/s11042-017-4416-4
