Abstract
Human action recognition in realistic videos is an important and challenging task. Recent studies demonstrate that multi-feature fusion can significantly improve classification performance for human action recognition, and a number of works therefore employ fusion strategies to combine multiple features, achieving promising results. Nevertheless, previous fusion strategies ignore the correlations among different action categories. To address this issue, we propose a novel multi-feature fusion framework that exploits both the correlations among action categories and multiple features. To describe human actions, the framework combines several classical features extracted with deep convolutional neural networks and improved dense trajectories. Extensive experiments on two challenging datasets evaluate the effectiveness of our approach, which obtains state-of-the-art classification accuracies of 68.1% and 93.3% on the HMDB51 and UCF101 datasets, respectively. Furthermore, because the correlations are used to combine multiple features, the proposed approach outperforms five classical fusion schemes. To the best of our knowledge, this work is the first attempt to learn the correlations among different action categories for multi-feature fusion.
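To give a concrete sense of the kind of per-class, per-feature weighted late fusion the abstract describes, the following is a minimal sketch. It is an illustration under assumptions, not the paper's actual method: the function name `fuse_scores`, the weight matrix `W`, and the uniform initialization are hypothetical, and the paper learns its weights from the correlations of action categories rather than taking them as given.

```python
import numpy as np

def fuse_scores(score_list, W):
    """Late fusion of per-feature classifier scores with per-class weights.

    score_list : list of (n_samples, n_classes) score arrays, one per feature
                 channel (e.g. CNN-based scores, improved-dense-trajectory scores).
    W          : (n_features, n_classes) weight matrix; W[f, c] is the importance
                 of feature channel f for action class c.  Illustrative only: in
                 the paper such weights would be learned, not supplied directly.
    Returns    : (n_samples, n_classes) fused score matrix.
    """
    fused = np.zeros_like(score_list[0], dtype=float)
    for f, scores in enumerate(score_list):
        fused += W[f] * scores  # broadcast per-class weights over the samples
    return fused

# Toy usage: 3 feature channels, 5 video clips, 4 action classes.
rng = np.random.default_rng(0)
scores = [rng.random((5, 4)) for _ in range(3)]
W = np.full((3, 4), 1.0 / 3)  # uniform weights reduce to simple score averaging
predictions = fuse_scores(scores, W).argmax(axis=1)
print(predictions)
```

With uniform weights this reduces to averaging the channels' scores; the benefit argued for in the abstract comes from letting the weights differ per action class.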
Acknowledgments
This work was supported in part by the National Natural Science Foundation of China under Grant 61622115 and Grant 61472281, the Program for Professor of Special Appointment (Eastern Scholar) at Shanghai Institutions of Higher Learning (No. GZ2015005), and the Science and Technology Projects of the Education Bureau of Jiangxi Province, China (No. GJJ151001).
About this article
Cite this article
Yi, Y., Wang, H. & Zhang, B. Learning correlations for human action recognition in videos. Multimed Tools Appl 76, 18891–18913 (2017). https://doi.org/10.1007/s11042-017-4416-4