Abstract
Human action recognition in realistic videos is an important and challenging task. Recent studies demonstrate that multi-feature fusion can significantly improve classification performance for human action recognition, and a number of works therefore employ fusion strategies to combine multiple features, achieving promising results. Nevertheless, previous fusion strategies ignore the correlations among different action categories. To address this issue, we propose a novel multi-feature fusion framework that exploits both the correlations among action categories and multiple features. To describe human actions, the framework combines several classical features extracted with deep convolutional neural networks and improved dense trajectories. Extensive experiments on two challenging datasets evaluate the effectiveness of our approach, which obtains state-of-the-art classification accuracies of 68.1% and 93.3% on the HMDB51 and UCF101 datasets, respectively. Furthermore, because the correlations are used to combine multiple features, the proposed approach outperforms five classical fusion schemes. To the best of our knowledge, this work is the first attempt to learn the correlations among different action categories for multi-feature fusion.
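To give a concrete sense of the kind of per-class, per-feature weighted late fusion the abstract describes, the following is a minimal sketch. It is an illustration under assumptions, not the paper's actual method: the function name `fuse_scores`, the weight matrix `W`, and the uniform initialization are hypothetical, and the paper learns its weights from the correlations of action categories rather than taking them as given.

```python
import numpy as np

def fuse_scores(score_list, W):
    """Late fusion of per-feature classifier scores with per-class weights.

    score_list : list of (n_samples, n_classes) score arrays, one per feature
                 channel (e.g. CNN-based scores, improved-dense-trajectory scores).
    W          : (n_features, n_classes) weight matrix; W[f, c] is the importance
                 of feature channel f for action class c.  Illustrative only: in
                 the paper such weights would be learned, not supplied directly.
    Returns    : (n_samples, n_classes) fused score matrix.
    """
    fused = np.zeros_like(score_list[0], dtype=float)
    for f, scores in enumerate(score_list):
        fused += W[f] * scores  # broadcast per-class weights over the samples
    return fused

# Toy usage: 3 feature channels, 5 video clips, 4 action classes.
rng = np.random.default_rng(0)
scores = [rng.random((5, 4)) for _ in range(3)]
W = np.full((3, 4), 1.0 / 3)  # uniform weights reduce to simple score averaging
predictions = fuse_scores(scores, W).argmax(axis=1)
print(predictions)
```

With uniform weights this reduces to averaging the channels' scores; the benefit argued for in the abstract comes from letting the weights differ per action class.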
Acknowledgments
This work was supported in part by the National Natural Science Foundation of China under Grant 61622115 and Grant 61472281, the Program for Professor of Special Appointment (Eastern Scholar) at Shanghai Institutions of Higher Learning (No. GZ2015005), and the Science and Technology Projects of the Education Bureau of Jiangxi Province, China (No. GJJ151001).
About this article
Cite this article
Yi, Y., Wang, H. & Zhang, B. Learning correlations for human action recognition in videos. Multimed Tools Appl 76, 18891–18913 (2017). https://doi.org/10.1007/s11042-017-4416-4