Abstract
Human action recognition is an active research topic in both the computer vision and machine learning communities, with broad applications including surveillance, biometrics, and human-computer interaction. Although several well-known action datasets have been released over the past decades, they still suffer from limitations, including restricted numbers of action categories and samples, few camera views, and little variety of scenarios. Moreover, most of them are designed for only a subset of the relevant learning problems: single-view learning, cross-view learning, and multi-task learning. In this paper, we introduce a multi-view, multi-modality benchmark dataset for human action recognition (abbreviated as MMA). MMA consists of 7080 action samples from 25 action categories, comprising 15 single-subject actions and 10 double-subject interactive actions, captured from three views in two different scenarios. Further, we systematically benchmark state-of-the-art approaches on MMA with respect to all three learning problems, using different temporal-spatial feature representations. Experimental results demonstrate that MMA is challenging on all three learning problems due to significant intra-class variations, occlusions, view and scene variations, and multiple similar action categories. Meanwhile, we provide baselines for the evaluation of existing state-of-the-art algorithms.
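The abstract distinguishes single-view and cross-view evaluation protocols. A minimal sketch of how the two kinds of train/test splits differ, using invented sample metadata (the actual MMA file layout and split definitions are not specified here):

```python
# Hypothetical sample metadata: each sample has a camera view (0-2) and a
# category label (0-24), loosely mirroring MMA's 3 views and 25 categories.
samples = [
    {"id": i, "view": v, "label": i % 25}
    for i in range(90)
    for v in (0, 1, 2)
]

def single_view_split(samples, view, test_ratio=0.5):
    """Single-view protocol: train and test on the same camera view."""
    pool = [s for s in samples if s["view"] == view]
    cut = int(len(pool) * (1 - test_ratio))
    return pool[:cut], pool[cut:]

def cross_view_split(samples, train_view, test_view):
    """Cross-view protocol: train on one view, test on an unseen view."""
    train = [s for s in samples if s["view"] == train_view]
    test = [s for s in samples if s["view"] == test_view]
    return train, test

# In the cross-view setting, no test-view sample is ever seen in training,
# which is what makes the view variations in MMA challenging.
tr, te = cross_view_split(samples, train_view=0, test_view=2)
assert all(s["view"] == 0 for s in tr) and all(s["view"] == 2 for s in te)
```

The function names and metadata fields are illustrative only; they are not part of the MMA release.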
This work was supported in part by the National Natural Science Foundation of China (No.61572357, No.61202168), Tianjin Municipal Natural Science Foundation (No.14JCZDJC31700, No.13JCQNJC0040).
Cite this article
Gao, Z., Han, Tt., Zhang, H. et al. MMA: a multi-view and multi-modality benchmark dataset for human action recognition. Multimed Tools Appl 77, 29383–29404 (2018). https://doi.org/10.1007/s11042-018-5833-8