Abstract
Human action recognition is an active research topic in both the computer vision and machine learning communities, with broad applications including surveillance, biometrics, and human-computer interaction. Although several well-known action datasets have been released over the past decades, they still suffer from limitations, including restricted numbers of action categories and samples, few camera views, and little variety of scenarios. Moreover, most of them are designed for only a subset of the relevant learning problems: single-view learning, cross-view learning, and multi-task learning. In this paper, we introduce a multi-view, multi-modality benchmark dataset for human action recognition (abbreviated as MMA). MMA consists of 7080 action samples from 25 action categories, comprising 15 single-subject actions and 10 double-subject interactive actions, captured from three views in two different scenarios. Further, we systematically benchmark state-of-the-art approaches on MMA with respect to all three learning problems, using different temporal-spatial feature representations. Experimental results demonstrate that MMA is challenging on all three learning problems due to significant intra-class variations, occlusions, view and scene variations, and multiple similar action categories. Meanwhile, we provide baselines for the evaluation of existing state-of-the-art algorithms.
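The abstract distinguishes single-view and cross-view evaluation protocols. A minimal sketch of how the two kinds of train/test splits differ, using invented sample metadata (the actual MMA file layout and split definitions are not specified here):

```python
# Hypothetical sample metadata: each sample has a camera view (0-2) and a
# category label (0-24), loosely mirroring MMA's 3 views and 25 categories.
samples = [
    {"id": i, "view": v, "label": i % 25}
    for i in range(90)
    for v in (0, 1, 2)
]

def single_view_split(samples, view, test_ratio=0.5):
    """Single-view protocol: train and test on the same camera view."""
    pool = [s for s in samples if s["view"] == view]
    cut = int(len(pool) * (1 - test_ratio))
    return pool[:cut], pool[cut:]

def cross_view_split(samples, train_view, test_view):
    """Cross-view protocol: train on one view, test on an unseen view."""
    train = [s for s in samples if s["view"] == train_view]
    test = [s for s in samples if s["view"] == test_view]
    return train, test

# In the cross-view setting, no test-view sample is ever seen in training,
# which is what makes the view variations in MMA challenging.
tr, te = cross_view_split(samples, train_view=0, test_view=2)
assert all(s["view"] == 0 for s in tr) and all(s["view"] == 2 for s in te)
```

The function names and metadata fields are illustrative only; they are not part of the MMA release.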
This work was supported in part by the National Natural Science Foundation of China (No.61572357, No.61202168), Tianjin Municipal Natural Science Foundation (No.14JCZDJC31700, No.13JCQNJC0040).
Cite this article
Gao, Z., Han, Tt., Zhang, H. et al. MMA: a multi-view and multi-modality benchmark dataset for human action recognition. Multimed Tools Appl 77, 29383–29404 (2018). https://doi.org/10.1007/s11042-018-5833-8