Abstract
Human activity recognition has been a central goal of computer vision since the field's inception and has advanced considerably in recent years. Recent approaches to this problem increasingly favour data-driven deep learning methods. To facilitate comparison of these methods, numerous datasets of labelled human activity have been created, varying greatly in content and methodology. As the field has matured, the datasets in use have undergone considerable evolution as well. In this paper, we classify and describe a variety of datasets to help researchers choose the most suitable benchmark for their domain. To this end, we propose a set of characteristics by which datasets may be compared. We also describe the progress of recent years that sets modern datasets apart from those used in the past.
References
Asadi-Aghbolaghi, M. et al.: A survey on deep learning based approaches for action and gesture recognition in image sequences. In: Proc.—12th IEEE Int. Conf. Autom. Face Gesture Recognition, FG 2017—1st Int. Work. Adapt. Shot Learn. Gesture Underst. Prod. ASL4GUP 2017, Biometrics Wild, Bwild 2017, Heteroge, pp. 476–483 (2017)
Ramasamy Ramamurthy, S., Roy, N.: Recent trends in machine learning for human activity recognition-A survey. Wiley Interdisc. Rev. Data Min. Knowl. Disc. (2018). https://doi.org/10.1002/widm.1254
Herath, S., Harandi, M., Porikli, F.: Going deeper into action recognition: a survey. Image Vis. Comput. 60, 4–21 (2017)
Ramezani, M., Yaghmaee, F.: A review on human action analysis in videos for retrieval applications. Artif. Intell. Rev. 46(4), 485–514 (2016)
Zhang, N., Wang, Y., Yu, P.: A review of human action recognition in video. In: Proceedings—17th IEEE/ACIS International Conference on Computer and Information Science, ICIS 2018, pp. 57–62 (2018)
Liu, Y., Nie, L., Han, L., Zhang, L., Rosenblum, D.S.: Action2Activity: Recognizing complex activities from sensor data. In: IJCAI International Joint Conference on Artificial Intelligence, vol. 2015, pp. 1617–1623 (2015)
Liu, Y., Nie, L., Liu, L., Rosenblum, D.S.: From action to activity: sensor-based activity recognition. Neurocomputing 181, 108–115 (2016)
Dai, X., Singh, B., Zhang, G., Davis, L.S. and Chen, Y.Q.: Temporal context network for activity localization in videos. In: Proceedings of the IEEE International Conference on Computer Vision, vol. 2017, pp. 5727–5736 (2017)
Zelnik-Manor, L., Irani, M.: Event-based analysis of video. WWW, 2005. [Online]. http://www.wisdom.weizmann.ac.il/~vision/VideoAnalysis/Demos/EventDetection/EventDetection.html. Accessed 25 Mar 2019
Chaquet, J.M., Carmona, E.J., Fernández-Caballero, A.: A survey of video datasets for human action and activity recognition. Comput Vis. Image Underst 117(6), 633–659 (2013)
Fouhey, D.F., Kuo, W.C., Efros, A.A., Malik, J.: From lifestyle Vlogs to everyday interactions. In: Proceedings of the IEEE Computer Society Conference on Computer Vision and Pattern Recognition, pp. 4991–5000 (2018)
Heilbron, F.C., Escorcia, V., Ghanem, B., Niebles, J.C.: ActivityNet: a large-scale video benchmark for human activity understanding. In: Proceedings of the IEEE Computer Society Conference on Computer Vision and Pattern Recognition, vol. 07, pp. 961–970 (2015)
Soomro, K., Zamir, A.R., Shah, M.: UCF101: a dataset of 101 human actions classes from videos in the wild (2012). arXiv:1212.0402
Kay, W. et al.: The kinetics human action video dataset (2017) arXiv:1705.06950 [cs.CV]
Kuehne, H., Jhuang, H., Garrote, E., Poggio, T., Serre, T.: HMDB: a large video database for human motion recognition. In: Proceedings of the IEEE International Conference on Computer Vision, pp. 2556–2563 (2011)
Yoshikawa, Y., Lin, J., Takeuchi, A.: STAIR actions: a video dataset of everyday home actions (2018). arXiv:1804.04326
Chao, Y.W., Wang, Z., He, Y., Wang, J., Deng, J.: HICO: A benchmark for recognizing human-object interactions in images. In: Proceedings of the IEEE International Conference on Computer Vision, vol. 2015 Inter, pp. 1017–1025 (2015)
Le, D.-T., Uijlings, J., Bernardi, R.: TUHOI: Trento universal human object interaction dataset. In: Proceedings of the Third Workshop on Vision and Language (2014)
Blunsden, S., Fisher, R.B.: The BEHAVE video dataset: ground truthed video for multi-person behavior classification. Ann. BMVA 2010(4), 1–11 (2010)
Gu, C. et al.: AVA: a video dataset of spatio-temporally localized atomic visual actions. In: Proceedings of the IEEE Computer Society Conference on Computer Vision and Pattern Recognition, pp. 6047–6056 (2018)
Monfort, M. et al.: Moments in time dataset: one million videos for event understanding. In: IEEE Transactions on Pattern Analysis and Machine Intelligence, pp. 1–1 (2019)
Moltisanti, D. et al.: Scaling egocentric vision: the dataset. In: Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics), vol. 11208, pp. 753–771. LNCS (2018)
Kuehne, H., Iqbal, A., Richard, A., Gall, J.: Mining YouTube—a dataset for learning fine-grained action concepts from Webly supervised video data (2019). arXiv:1906.01012
Gella, S., Keller, F.: An analysis of action recognition datasets for language and vision tasks. In: ACL 2017—55th Annual Meeting of the Association for Computational Linguistics Proceedings of the Conference (Long Paper), vol. 2, no. c, pp. 64–71 (2017)
Soomro, K., Zamir, A.R.: Computer Vision in Sports. Springer, Berlin (2014)
Rodriguez, M.D., Ahmed, J., Shah, M.: Action MACH: a spatio-temporal maximum average correlation height filter for action recognition. In: 26th IEEE IEEE Conference on Computer Visions Pattern Recognition, CVPR (2008)
Niebles, J.C., Chen, C.W., Fei-Fei, L.: Modeling temporal structure of decomposable motion segments for activity classification. In: Lecture Notes in Computer Science (Including Subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics), vol. 6312, no. PART 2, pp. 392–405. LNCS (2010)
Karpathy, A. et al.: Large-scale video classification with convolutional neural networks. WWW, 2014. [Online]. https://cs.stanford.edu/people/karpathy/deepvideo/. Accessed 25 Mar 2019
Ibrahim, M.S., Muralidharan, S., Deng, Z., Vahdat, A., Mori, G.: A hierarchical deep temporal model for group activity recognition. In: Proceedings of the IEEE Computer Society Conference on Computer Vision and Pattern Recognition, vol. 2016, pp. 1971–1980 (2016)
Giancola, S., Amine, M., Dghaily, T., Ghanem, B.: SoccerNet: a scalable dataset for action spotting in soccer videos. In: IEEE Computer Society Conference on Computer Vision and Pattern Recognition Work, vol. 2018, pp. 1792–1802 (2018)
Bloom, V., Makris, D., Argyriou, V.: G3D: a gaming action dataset and real time action recognition evaluation framework. IEEE Computer Society Conference on Computer Vision and Pattern Recognition Work, pp. 7–12 (2012)
Bloom, V., Argyriou, V., Makris, D.: G3di: a gaming interaction dataset with a real time detection and evaluation framework. In: Lecture Notes in Computer Science (Including Subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics), vol. 8925, pp. 698–712 (2015)
Li, W., Zhang, Z., Liu, Z.: Action recognition based on a bag of 3D points. In: 2010 IEEE Computer Society Conference on Computer Vision and Pattern Recognition Work. CVPRW 2010, vol. 2010, pp. 9–14 (2010)
Rybok, L., Friedberger, S., Hanebeck, U.D., Stiefelhagen, R.: The KIT Robo-kitchen data set for the evaluation of view-based activity recognition systems. In: IEEE-RAS International Conference on Humanoid Robots, pp. 128–133 (2011)
De La Torre, F. et al.: Guide to the Carnegie Mellon University Multimodal Activity (CMU-MMAC) Database. In: C. Robot. Inst., p. 19 (2008)
Tenorth, M., Bandouch, J., Beetz, M.: The TUM kitchen data set of everyday manipulation activities for motion tracking and action recognition. In: 2009 IEEE 12th International Conference on Computer Vision Workshops, ICCV Workshops 2009, pp. 1089–1096 (2009)
Li, Y., Liu, M., Rehg, J.M.: In the eye of beholder: joint learning of gaze and actions in first person video. In: Lecture Notes in Computer Science (Including Subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics), vol. 11209, pp. 639–655. LNCS (2018)
Fontana, V., Singh, G., Akrigg, S., Di Maio, M., Saha, S., Cuzzolin, F.: Action detection from a robot-car perspective (2018). arXiv:1807.11332
Oxford Mobile Robotics Group, Oxford RobotCar Dataset. WWW (2016). [Online]. http://robotcar-dataset.robots.ox.ac.uk. Accessed 25 Mar 2019
Fisher, R.B.: The PETS04 surveillance ground-truth data sets. In: Evaluation of Tracking and Surveillance, pp. 1–5, 2004
List, T., Bins, J., Vazquez, J., Fisher, R.B.: Performance evaluating the evaluator. In: Proc.—2nd Jt. IEEE Int. Work. Vis. Surveill. Perform. Eval. Track. Surveillance, VS-PETS, vol. 2005, pp. 129–136, 2005
Nghiem, A., et al.: ETISEO, performance evaluation for video surveillance systems. In: IEEE International Conference on Advanced Video and Signal based Surveillance. IEEE, London, UK, 5-7 Sept. 2007. https://doi.org/10.1109/AVSS.2007.4425357
Zhang, J., Li, W., Ogunbona, P.O., Wang, P., Tang, C.: RGB-D-based action recognition datasets: a survey. Pattern Recognit. 60, 86–105 (2016)
Chavarriaga, R., et al.: The Opportunity challenge: a benchmark database for on-body sensor-based activity recognition. Pattern Recognit. Lett. 34(15), 2033–2042 (2013)
De-La-Hoz-Franco, E., Ariza-Colpas, P., Quero, J.M., Espinilla, M.: Sensor-based datasets for human activity recognition—a systematic review of literature. IEEE Access 6, 59192–59210 (2018)
Ji, S., Xu, W., Yang, M., Yu, K.: 3D Convolutional neural networks for human action recognition. IEEE Trans. Pattern Anal. Mach. Intell. 35(1), 221–231 (2013)
Hassner, T.: A critical review of action recognition benchmarks. In: IEEE Computer Society Conference on Computer Vision and Pattern Recognition Workshops, pp. 245–250 (2013)
Rohrbach M., Amin, S., Andriluka, M., Schiele, B.: A database for fine grained activity detection of cooking activities. In: Proceedings of the IEEE Computer Society Conference on Computer Vision and Pattern Recognition, pp. 1194–1201 (2012)
Pirsiavash, H., Ramanan, D.: Detecting activities of daily living in first-person camera views. In: Proceedings of the IEEE Computer Society Conference on Computer Vision and Pattern Recognition, 2012. [Online]. https://www.csee.umbc.edu/~hpirsiav/papers/ADLdataset/. Accessed 10 Aug 2019
Weinzaepfel, P., Martin, X., Schmid, C.: Human action localization with sparse spatial supervision (2016). arXiv:1605.05197
Sigurdsson, G.A., Varol, G., Wang, X., Farhadi, A., Laptev, I., Gupta, A.: Hollywood in homes: crowdsourcing data collection for activity understanding. In: Lecture Notes in Computer Science (Including Subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics), vol. 9905, pp. 510–526. LNCS (2016)
Goyal, R. et al.: The ‘something something’ video database for learning and evaluating visual common sense. In: Proceedings of the IEEE International Conference on Computer Vision, 2017, vol. 2017, pp. 5843–5851 (2018)
de Souza, C.R., Gaidon, A., Cabon, Y., Peña, A.M.L.: Procedural generation of videos to train deep action recognition networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 4757–4767 (2017)
Paolacci, G., Chandler, J., Mueller, P.: Online experimentation: amazon mechanical turk. Four Essays Consum. Decis. 5(5), 59 (2012)
Laptev, I., Marszałek, M., Schmid, C., Rozenfeld, B.: Learning realistic human actions from movies. In: 26th IEEE Conference Computer Vision Pattern Recognition, pp. 1–8. CVPR (2008)
Ikizler-cinbis, N., Cinbis, R.G., Sclaroff, S.: Learning actions from the web. In: Proceedings of the International Conference On Computer Vision (ICCV’09), pp. 1–8, Kyoto, Japan (2009)
Torralba, A., Efros, A.A.: Unbiased look at dataset bias. In: Proceedings of the IEEE Computer Society Conference on Computer Vision and Pattern Recognition, vol. 2016, no. 28, pp. 1521–1528 (2011)
Sigurdsson, G.A., Russakovsky, O., Gupta, A.: What actions are needed for understanding human actions in videos? In: Proceedings of the IEEE International Conference on Computer Vision, vol. 2017, pp. 2156–2165 (2017)
Nowozin, S., Shotton, J.: Action points: a representation for low-latency online human action recognition. In: Microsoft Res. Cambridge Technical Rep., no. MSR-TR-2012-68, pp. 1–18 (2012)
Heidarivincheh, F., Mirmehdi, M., Damen, D.: Action completion: a temporal model for moment detection (2018). arXiv:1805.06749
Idrees, H., et al.: The THUMOS challenge on action recognition for videos ‘in the wild’. Comput. Vis. Image Underst. 155, 1–23 (2017)
Liu, C., Hu, Y., Li, Y., Song, S., Liu, J.: PKU-MMD: a large scale benchmark for continuous multi-modal human action understanding (2017). arXiv:1703.07475
Bojanowski, P. et al.: Weakly supervised action labeling in videos under ordering constraints lecture notes in computer science. In: Computer Vision—ECCV 2014, vol. 8693, no. Chapter 41, pp. 628–643 (2014)
Minnen, D., Westeyn, T.L., Starner, T., Ward, J.A., Lukowicz, P.: Performance metrics and evaluation issues for continuous activity recognition. In: Proceedings of International Work Performance Metrics for Intelligent Systems, pp. 141–148 (2006)
Krishna, R., Hata, K., Ren, F., Fei-Fei, L., Niebles, J.C.: Dense-captioning events in videos. In: Proceedings of the IEEE International Conference on Computer Vision, vol. 2017, pp. 706–715 (2017)
Kliper-Gross, O., Hassner, T., Wolf, L.: The action similarity labeling challenge. IEEE Trans. Pattern Anal. Mach. Intell. 34(3), 615–621 (2012)
Cao, L., Liu, Z., Huang, T.S.: Cross-dataset action detection. In: Proceedings of the IEEE Computer Society Conference on Computer Vision and Pattern Recognition, pp. 1998–2005 (2010)
Schüldt, C., Laptev, I., Caputo, B.: Recognizing human actions: a local SVM approach. Proc. Int. Conf. Pattern Recognit. 3, 32–36 (2004)
Gorelick, L., Blank, M., Shechtman, E., Irani, M., Basri, R.: Actions as space-time shapes. In: IEEE Transactions on Pattern Analysis and Machine Intelligence (2007)
Piergiovanni, A.J., Ryoo, M.S.: Fine-grained activity recognition in baseball videos. In: IEEE Computer Society Conference on Computer Vision and Pattern Recognition Work, vol. 2018, pp. 1821–1830 (2018)
Zelnik-Manor, L., Irani, M.: Event-based analysis of video. In: Proceedings 2001 IEEE Computer Society Conference on Computer Vision and Pattern Recognition. CVPR 2001, no. 1229, pp. II-123–II-130 (2005)
Ivan L., Barbara C.: Recognition of human actions. (2005). [Online]. Available: http://www.nada.kth.se/cvap/actions/. Accessed 25 Mar 2019
Blank, M., Gorelick, L., Shechtman, E., Irani, M., Basri, R.: Actions as space-time shapes (2007). [Online]. http://www.wisdom.weizmann.ac.il/~vision/SpaceTimeActions.html. Accessed 25 Mar 2019
CAVIAR,: CAVIAR: Context Aware Vision using Image-based Active Recognition, European Commission project IST 2001 37540. Www, 2001. [Online]. http://groups.inf.ed.ac.uk/vision/CAVIAR/. Accessed 25 Mar 2019
Nghiem, A.T., Bremond, F., Thonnat, M., Ma, R.: A new evaluation approach for video processing algorithms. In: 2007 IEEE Work. Motion Video Comput. WMVC 2007 (2007)
Nghiem, A.T., Bremond, F., Thonnat, M., Ma, R.: ETISEO dataset. Www, 2007. [Online]. https://www-sop.inria.fr/orion/ETISEO/. Accessed: 25 Mar 2019
Vezzani, R., Cucchiara, R., Dipartimento, I.: ViSOR : Video Surveillance Online Repository. vol. 2010, no. 2, pp. 1–13 (2010)
Vezzani, R., Cucchiara, R.: Video surveillance online repository (ViSOR), Www, 2013. [Online]. http://www.openvisor.org/index.asp. Accessed 25 Mar 2019
INRIA Xmas Motion Acquisition Sequences (IXMAS), Www, 2006. [Online]. http://4drepository.inrialpes.fr/public/viewgroup/. Accessed 25 Mar 2019
Tieniu Tan, D.Z.S.: Center for biometrics and security research. Www. [Online]. http://www.cbsr.ia.ac.cn/china/Iris Databases CH.asp. Accessed 25 Mar 2019
“UCF Aerial Action Data Set,” Www (2009). [Online]. Available: http://crcv.ucf.edu/data/UCF_Aerial_Action.php. Accessed 25 Mar 2019
Rodriguez, M., Mikel D.A., Javed, S.: UCF-ARG Data Set. Www (2011). [Online]. http://crcv.ucf.edu/data/UCF-ARG.php. Accessed: 25 Mar 2019
Khurram Soomro, A.R.Z.: UCF sports action data set, Www (2008). [Online]. http://crcv.ucf.edu/data/UCF_Sports_Action.php
Fu, R., Song, Y., Zhao, W.: Human activity recognition with smartpnones. In: Proceedings of 10th European Conference on Computer Vision—ECCV’08, no. Section 2, pp. 548–561 (2008)
Tran, D, Sorokin, A.: Human activity recognition with metric learning, www, 2008. [Online]. http://vision.cs.uiuc.edu/projects/activity/. Accessed 25 Mar 2019
Gkalelis, N., Kim, H., Hilton, A., Nikolaidis, N., Pitas, I.: The i3DPost multi-view and 3D human action/interaction database. In: CVMP 2009 - 6th Eur. Conf. Vis. Media Prod., pp. 159–168 (2009)
Gkalelis, N., Kim, H., Hilton, A., Nikolaidis, N., Pitas, I.: i3DPost multi-view human action datasets. Www (2009). [Online]. http://kahlan.eps.surrey.ac.uk/i3dpost_action/. Accessed 25 Mar 2019
Messing, R., Pal, C., Kautz, H.: Activity recognition using the velocity histories of tracked keypoints. In: Proceedings of the IEEE International Conference on Computer Vision, pp. 104–111 (2009)
Messing, R., Pal, C., Kautz, H.: University of Rochester activities of daily living dataset, Www, 2009. [Online]. http://www.cs.rochester.edu/~rmessing/uradl/. Accessed 25 Mar 2019
Choi, W., Shahid, K., Savarese, S.: What are they doing?: Collective activity classification using spatio-temporal relationship among people. In: 2009 IEEE 12th International Conference on Computer Vision Work ICCV Work. 2009, vol. 24, pp. 1282–1289 (2009)
Choi, W., Shahid, K., Savarese, S.: Collective activity dataset, Www (2009). [Online]. http://vhosts.eecs.umich.edu/vision/activity-dataset.html. Accessed 25 Mar 2019
Project, B.: Computer-assisted prescreening of video streams for unusual activities. Www, 2004. [Online]. http://homepages.inf.ed.ac.uk/rbf/BEHAVE/. Accessed 25 Mar 2019
Singh, S., Velastin, S.A., Ragheb, H.: MuHAVi: A multicamera human action video dataset for the evaluation of action recognition methods. In: Proceedings—IEEE International Conference on Advanced Video and Signal Based Surveillance, AVSS 2010, pp. 48–55 (2010)
Murtaza, F., Yousaf, M.H., Velastin, S.A.: Multi-view human action recognition using 2D motion templates based on MHIs and their HOG description. IET Comput. Vis. 10(7), 758–767 (2016)
Singh, S., Velastin, S.A., Ragheb, H.: MuHAVi: multicamera human action video data. [Online]. http://dipersec.king.ac.uk/MuHAVi-MAS/. Accessed 25 Mar 2019
Ryoo, M.S., Aggarwal, J.K., Chen, C., Roy-chowdhury, A.: An overview of contest on semantic description of human activities (SDHA). In: ICPR contests (2010)
Ryoo, M.S., Aggarwal, J.K.: Spatio-temporal relationship match: video structure comparison for recognition of complex human activities. In: Proceedings of the IEEE International Conference on Computer Vision, no. Iccv, pp. 1593–1600 (2009)
Ryoo, M.S., Aggarwal, J.K., Chen, C., Roy-chowdhury, A.: ICPR 2010 contest on semantic description of human activities (SDHA 2010), Www (2010). [Online]. http://cvrc.ece.utexas.edu/SDHA2010/. Accessed 25 Mar 2019
Chen, C., Aggarwal, J.K.: Recognizing human action from a far field of view the University of Texas at Austin. In: Analysis (2009)
Chen, C.-C., Ryoo, M.S., Aggarwal, J.K.: UT-tower dataset: aerial view activity classification challenge (2010). http://cvrc.ece.utexas.edu/SDHA2010/Aerial_View_Activity.html
Oh, S. et al., AVSS 2011 demo session: a large-scale benchmark dataset for event recognition in surveillance video. In: 2011 8th IEEE Int. Conf. Adv. Video Signal Based Surveillance, AVSS 2011, no. 2, pp. 527–528, 2011
Oh, S. et al.: VIRAT video dataset, Www (2011). [Online]. http://www.viratdata.org/. Accessed 25 Mar 2019
Gehrig, D., et al.: Combined intention, activity, and motion recognition for a humanoid household robot. In: IEEE International Conference on Intelligent Robots and Systems, pp. 4819–4825 (2011)
Gehrig, D., et al.: The Karlsruhe Motion, Intention, and Activity Data set (MINTA), Www (2011). [Online]. https://cvhci.anthropomatik.kit.edu/~lrybok/projects/minta/. Accessed 25 Mar 2019
Rybok, L., Friedberger, S., Hanebeck, U.D., Stiefelhagen, R.: The KIT Robo-Kitchen Activity Data Set, Www (2011). [Online]. https://cvhci.anthropomatik.kit.edu/~lrybok/projects/kitchen/. Accessed 25 Mar 2019
Ivan, L., Marszałek, M., Schmid, C., Rozenfeld, B.: IRISA/INRIA Rennes France: learning human actions from movies, Www (2009). [Online]. http://www.irisa.fr/vista/actions/hollywood. Accessed 25 Mar 2019
Marszałek, M., Laptev, I., Schmid, C.: Actions in context. In: 2009 IEEE Computer Society Conference on Computer Vision and Pattern Recognition Work CVPR Work. 2009, vol. 2009 IEEE, no. i, pp. 2929–2936 (2009)
Ivan, L., Marszałek, M., Schmid, C., Rozenfeld, B.: IRISA/INRIA Rennes France: learning human actions from movies, Www (2008). [Online]. http://www.irisa.fr/vista/actions/hollywood2. Accessed 25 Mar 2019
Liu, J., Luo, J., Shah, M.: Recognizing realistic actions from videos ́in the Wild. In: 2009 IEEE Computer Society Conference on Computer Vision and Pattern Recognition Work CVPR Work. 2009, vol. 2009 IEEE, pp. 1996–2003 (2009)
Liu, J., Jingen Liu, M.S.: UCF YouTube action data set, Www (2009). [Online]. http://crcv.ucf.edu/data/UCF_YouTube_Action.php. [Accessed: 25-Mar-2019]
Patron-Perez, A., Marszalek, M., Zisserman, A., Reid, I.: High five: recognising human interactions in TV shows. In: Br. Mach. Vis. Conf. BMVC 2010—Proc., pp. 50.1–50.11 (2010)
Patron-Perez, A.: TV human interaction dataset. Www (2010). [Online]. http://www.robots.ox.ac.uk/~alonso/tv_human_interactions.html. Accessed 25 Mar 2019
“Olympic sports dataset,” Standford University (2010). [Online]. http://vision.stanford.edu/Datasets/OlympicSports/
Reddy, K.K., Shah, M.: Recognizing 50 human action categories of web videos. Mach. Vis. Appl. 24(5), 971–981 (2013)
Reddy, K.K., Shah, M.: UCF50—action recognition data set, Www (2010). [Online]. https://www.crcv.ucf.edu/data/UCF50.php. Accessed 25 Mar 2019
Kuehne, H., Jhuang, H., Garrote, E., Poggio, T., Serre, T.: HMDB : a large human motion database, Www (2014). [Online]. http://serre-lab.clps.brown.edu/resource/hmdb-a-large-human-motion-database/. Accessed 25 Mar 2019
Soomro, K., Zamir, A.R. and Shah, M.: UCF101—action recognition data set, Www (2012). [Online]. http://crcv.ucf.edu/data/UCF101.php. Accessed 25 Mar 2019
Rohrbach, M., Amin, S., Andriluka, M., Schiele, B.: MPII cooking activities dataset, Www (2013). [Online]. https://www.mpi-inf.mpg.de/departments/computer-vision-and-machine-learning/research/human-activity-recognition/mpii-cooking-activities-dataset/. Accessed 25 Mar 2019
Rohrbach, M., Regneri, M., Andriluka, M., Amin, S., Pinkal, M., Schiele, B.: Script data for attribute-based recognition of composite activities. In: Lecture Notes in Computer Science (Including Subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics), vol. 7572, no. PART 1, pp. 144–157. LNCS (2012)
Rohrbach, M., Regneri, M., Andriluka, M., Amin, S., Pinkal, M., Schiele, B.: MPII cooking composite activities dataset, Www (2012). [Online]. https://www.mpi-inf.mpg.de/departments/computer-vision-and-machine-learning/research/human-activity-recognition/mpii-cooking-activities-dataset/. Accessed 10 Aug 2019
Kliper-gross, O., Hassner, T.: The action similarity labeling challenge. Www (2012). [Online]. http://www.openu.ac.il/home/hassner/data/ASLAN/ASLAN.html. Accessed 25 Mar 2019
Mettes, P., van Gemert, J.C., Snoek, C.G.M.: Spot on: action localization from pointly-supervised proposals. In: Lecture Notes in Computer Science (Including Subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics), vol. 9909 LNCS, no. Mil, pp. 437–453 (2016)
Mettes, P., van Gemert, J.C., Snoek, C.G.M.: “Hollywood2Tubes,” Www (2016). [Online]. https://staff.fnwi.uva.nl/p.s.m.mettes/codedata.html. Accessed 25 Mar 2019
Das, P., Xu, C., Doell, R.F., Corso, J.J.: A thousand frames in just a few words: Lingual description of videos through latent topics and sparse object stitching. In: Proceedings of the IEEE Computer Society Conference on Computer Vision and Pattern Recognition, pp. 2634–2641 (2013)
Das, P., Xu, C., Doell, R.F., Corso, J.J.: U of Michigan, YouCook: an annotated data set of unconstrained third-person cooking videos, Www (2013) [Online]. http://web.eecs.umich.edu/~jjcorso/r/youcook/. Accessed 25 Mar 2019
Jiang, Y.-G., et al.: THUMOS challenge: action recognition with a large number of classes. http://crcv.ucf.edu/THUMOS14/
Jiang, Y.-G. et al.: THUMOS 13: the first international workshop on action recognition with a large number of classes, in conjunction with ICCV’13, Sydney, Australia, Www (2013). [Online]. http://crcv.ucf.edu/ICCV13-Action-Workshop/. Accessed 25 Mar 2019
Jhuang, H., Gall, J., Zuffi, S., Schmid, C., Black, M.J.: Towards understanding action recognition. In: Proceedings of the IEEE international conference on computer vision, pp. 3192–3199 (2013)
Jhuang, H., Gall, J., Zuffi, S., Schmid, C., Black, M.J.: Joint-annotated human motion data base, Www (2013). [Online]. http://jhmdb.is.tue.mpg.de. Accessed 25 Mar 2019
Kuehne, H., Arslan, A., Serre, T.: The language of actions: recovering the syntax and semantics of goal-directed human activities. In: Proceedings of the IEEE Computer Society Conference on Computer Vision and Pattern Recognition, pp. 780–787 (2014)
Kuehne, H., Arslan, A., Serre, T.: The breakfast actions dataset, Www (2014). [Online]. http://serre-lab.clps.brown.edu/resource/breakfast-actions-dataset/. Accessed 10 Aug 2019
Jiang, Y.-G., et al.: THUMOS challenge 2014, Www (2014). [Online]. http://crcv.ucf.edu/THUMOS14/. Accessed 25 Mar 2019
Karpathy, A., Leung, T.: Large-scale video classification with convolutional neural networks, Www (2014). [Online]. https://cs.stanford.edu/people/karpathy/deepvideo/. Accessed 25 Mar 2019
Idrees, H. et al., THUMOS Challenge 2015 In conjunction with CVPR’15, Www (2015). [Online]. http://www.thumos.info/home.html. Accessed 25 Mar 2019
Lee, K., Ognibene, D., Chang, H.J., Kim, T.K., Demiris, Y.: STARE: spatiooral attention relocation for multiple structured activities detection. IEEE Trans. Image Process. 24(12), 5916–5927 (2015)
Lee, K., Ognibene, D., Chang, H.J., Kim, T.K., Demiris, Y.: Crêpe Dataset, Www, 2013. [Online]. https://osf.io/d5k38/wiki/home/. Accessed 25 Mar 2019
Safdarnejad, S.M., Liu, X., Udpa, L., Andrus, B., Wood, J., Craven, D.: Sports videos in the wild (SVW): a video dataset for sports analysis. In: 2015 11th IEEE Int, p. 2015. FG, Conf. Work. Autom. Face Gesture Recognition (2015)
Safdarnejad, S.M., Liu, X., Udpa, L., Andrus, B., Wood, J., Craven, D.: Sports videos in the wild (SVW), Www (2015). [Online]. http://cvlab.cse.msu.edu/project-svw.html. Accessed 25 Mar 2019
Heilbron, F.C., Escorcia, V., Ghanem, B., Niebles, J.C.: ActivityNet: a large-scale video benchmark for human activity understanding, Www (2015). [Online]. http://www.activity-net.org. Accessed 25 Mar 2019
Rohrbach, M., et al.: Recognizing fine-grained and composite activities using hand-centric features and script data. Int. J. Comput. Vis. 119(3), 346–373 (2016)
Rohrbach, M. et al.: MPII cooking 2 dataset, Www (2015). [Online]. https://www.mpi-inf.mpg.de/departments/computer-vision-and-machine-learning/research/human-activity-recognition/mpii-cooking-2-dataset/. Accessed 10 Aug 2019
Singh, B., Marks, T.K., Jones, M., Tuzel, O., Shao, M.: A multi-stream bi-directional recurrent neural network for fine-grained action detection. In: Proceedings of the IEEE Computer Society Conference on Computer Vision and Pattern Recognition, vol. 2016, pp. 1961–1970 (2016)
Singh, B., Marks, T.K., Jones, M., Tuzel, O., Shao, M.: MERL shopping dataset, Www (2016). [Online]. http://www.merl.com/demos/merl-shopping-dataset. Accessed 25 Mar 2019
Sigurdsson, G.A., Varol, G., Wang, X., Farhadi, A., Laptev, I., Gupta, A.: Charades and charades-ego datasets, Www (2016). [Online]. http://allenai.org/plato/charades/. Accessed 25 Mar 2019
Mori, G., Andriluka, M., Russakovsky, O., Jin, N., Fei-Fei, L., Yeung, S.: Every moment counts: dense detailed labeling of actions in complex videos. Int. J. Comput. Vis. 126(2–4), 375–389 (2017)
Mori, G., Andriluka, M., Russakovsky, O., Jin, N., Fei-Fei, L., Yeung, S.: Every moment counts: dense detailed labeling of actions in complex videos, Www (2018). [Online]. http://ai.stanford.edu/~syyeung/everymoment.html. Accessed 25 Mar 2019
Goyal, R. et al.: The 20BN something something dataset, Www (2017). [Online]. https://20bn.com/datasets/something-something. Accessed 10 Aug 2019
Weinzaepfel, P., Martin, X., Schmid, C.: Daily action localization in Youtube videos, Www (2017). [Online]. http://pascal.inrialpes.fr/data/daly/. Accessed 25 Mar 2019
Gu, C. et al.: AVA actions dataset, Www (2017). [Online]. https://research.google.com/ava/. Accessed 25 Mar 2019
Xu, C., Hsieh, S.H., Xiong, C., Corso, J.J.: Can humans fly? Action understanding with multiple classes of actors. In; Proceedings of the IEEE Computer Society Conference on Computer Vision and Pattern Recognition, vol. 07, pp. 2264–2273 (2015)
Xu, C., Hsieh, S.H., Xiong, C., Corso, J.J.:, A2D: a dataset and benchmark for action recognition and segmentation with multiple classes of actors, Www (2015). [Online]. http://web.eecs.umich.edu/~jjcorso/r/a2d/. Accessed 25 Mar 2019
Kay, W. et al.: Kinetics datasets, Www (2017). [Online]. http://deepmind.com/kinetics. Accessed 25 Mar 2019
Carreira, J., Noland, E., Banki-Horvath, A., Hillier, C., Zisserman, A.: A short note about kinetics-600 (2018). arXiv:1808.01340
Fouhey, D.F., Kuo, W., Efros, A.A., Malik, J.: From lifestyle VLOGs to everyday interactions: the VLOG dataset, Www (2017). [Online]. http://web.eecs.umich.edu/~fouhey/2017/VLOG/. Accessed 10 Aug 2019
Zhou, L., Xu, C., Corso, J.J.: Towards automatic learning of procedures from web instructional videos. In: 32nd AAAI conference on artificial intelligence, AAAI 2018, pp. 7590–7598 (2018)
Zhou, L., Xu, C., Corso, J.J.: YouCook2 dataset, Www, 2018. [Online]. http://youcook2.eecs.umich.edu/
Giancola, S., Amine, M., Dghaily, T., Ghanem, B., SoccerNet: a scalable dataset for action spotting in soccer videos, Www (2018). [Online]. https://silviogiancola.github.io/SoccerNet. Accessed 25 Mar 2019
Piergiovanni, A.J., Ryoo, M.S.: MLB-YouTube dataset, Www (2018). [Online]. https://github.com/piergiaj/mlb-youtube/. Accessed 25 Mar 2019
Yoshikawa, Y., Lin, J., Takeuchi, A.: STAIR actions: a large-scale video dataset of everyday human actions, Www (2018). [Online]. http://actions.stair.center. Accessed 25 Mar 2019
Shim, M., Kim, Y.H., Kim, K., Kim, S.J.: Teaching machines to understand baseball games: large-scale baseball video database for multiple video understanding tasks. In: Lecture Notes in Computer Science (Including Subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics), vol. 11219, pp. 420–437. LNCS (2018)
Shim, M., Kim, Y.H., Kim, K., Kim, S.J.: Teaching machines to understand baseball games : large-scale baseball video database for multiple video understanding tasks, Www (2018) [Online]. https://sites.google.com/site/eccv2018bbdb/. Accessed 25 Mar 2019
Monfort, M. et al.: Moments in time dataset : a large-scale dataset for recognizing and understanding action in videos, Www (2019). [Online]. http://moments.csail.mit.edu. Accessed 25 Mar 2019
Zhao, H., Yan, Z., Torresani, L., Torralba, A.: HACS: human action clips and segments dataset for recognition and temporal localization (2017). arXiv:1712.09374
Tang, Y. et al.: COIN: a large-scale dataset for comprehensive instructional video analysis (2019). arXiv:1903.02874
Fathi, A., Ren, X., Rehg, J.: Georgia Tech egocentric activity datasets (2011). [Online]. http://www.cbi.gatech.edu/fpv/. Accessed 10 Aug 2019
Fathi, A., Ren, X., Rehg, J.: Learning to recognize daily actions using gaze. In: Lecture Notes in Computer Science (Including Subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics), vol. 7572, no. PART 1, pp. 314–327. LNCS (2012)
Calway, A., Mayol-Cuevas, W., Damen, D., Haines, O., Leelasawassuk, T.: Discovering task relevant objects and their modes of interaction from multi-user egocentric video. In: BMVC, pp. 30.1–30.13 (2015)
Calway, A., Mayol-Cuevas, W., Damen, D., Leelasawassuk, T., Haines, O., Wray, M., Moltisanti, D.: Bristol egocentric object interactions dataset (2014). [Online]. http://data.bris.ac.uk/data/dataset/o4hx7jnmfqt01lyzf2n4rchg6. Accessed 10 Aug 2019
Iwashita, Y., Takamine, A., Kurazume, R., Ryoo, M.S.: First-person animal activity recognition from egocentric videos. In: Proc.—Int. Conf. Pattern Recognit., pp. 4310–4315 (2014)
Iwashita, Y., Takamine, A., Kurazume, R., Ryoo, M.S.: DogCentric activity dataset (2014). [Online]. http://robotics.ait.kyushu-u.ac.jp/~yumi/db/first_dog.html. Accessed 10 Aug 2019
Li, Y., Ye, Z., Rehg, J.M.: Delving into egocentric actions. In: Proceedings of the IEEE Computer Society Conference on Computer Vision and Pattern Recognition, vol. 07, no. 1, pp. 287–295 (2015)
Ohnishi, K., Kanehira, A., Kanezaki, A., Harada, T.: Recognizing activities of daily living with a wrist-mounted camera. In: Proceedings of the IEEE Computer Society Conference on Computer Vision and Pattern Recognition, vol. 2016, pp. 3103–3111 (2016)
Ohnishi, K., Kanehira, A., Kanezaki, A., Harada, T.: Recognizing activities of daily living with a wrist-mounted camera (2015). [Online]. https://www.mi.t.u-tokyo.ac.jp/static/projects/miladl/. Accessed 25 Mar 2019
Singh, S., Arora, C., Jawahar, C.V.: Trajectory aligned features for first person action recognition. Pattern Recognit. 62, 45–55 (2017)
Singh, S., Arora, C., Jawahar, C.V.: Trajectory aligned features for first person action recognition (2017). [Online]. http://cvit.iiit.ac.in/research/projects/cvit-projects/first-person-action-recognition. Accessed 10 Aug 2019
Sigurdsson, G.A., Gupta, A., Schmid, C., Farhadi, A., Alahari, K.: Charades-Ego: a large-scale dataset of paired third and first person videos, pp. 1–3 (2018)
Moltisanti, D., et al.: EPIC-Kitchens 2019 (2019). [Online]. https://epic-kitchens.github.io/2018. Accessed 25 Mar 2019
Abebe, G., Catala, A., Cavallaro, A.: A first-person vision dataset of office activities. In: Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics), vol. 11377 LNAI, pp. 27–37 (2019)
Abebe, G., Catala, A., Cavallaro, A.: A first-person vision dataset of office activities (2019). [Online]. http://www.eecs.qmul.ac.uk/~andrea/fpvo.html. Accessed 10 Aug 2019
Feichtenhofer, C., Pinz, A., Zisserman, A.: Convolutional two-stream network fusion for video action recognition. In: Proceedings of the IEEE Computer Society Conference on Computer Vision and Pattern Recognition, vol. 2016, pp. 1933–1941 (2016)
Qiu, Z., Yao, T., Mei, T.: Learning spatio-temporal representation with pseudo-3D residual networks. In: Proceedings of the IEEE International Conference on Computer Vision, vol. 2017, pp. 5534–5542 (2017)
Wang, L. et al.: Temporal segment networks: towards good practices for deep action recognition. In: Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics), vol. 9912, pp. 20–36. LNCS (2016)
Bilen, H., Fernando, B., Gavves, E., Vedaldi, A., Gould, S.: Dynamic image networks for action recognition. In: Proceedings of the IEEE Computer Society Conference on Computer Vision and Pattern Recognition, vol. 2016, pp. 3034–3042 (2016)
Sun, C., Shrivastava, A., Vondrick, C., Murphy, K., Sukthankar, R., Schmid, C.: Actor-centric relation network. In: Lecture Notes in Computer Science (Including Subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics), vol. 11215 LNCS, pp. 335–351 (2018)
ViPER: the video performance evaluation resource. [Online]. http://viper-toolkit.sourceforge.net/. Accessed 25 Mar 2019
Ma, S., Sigal, L., Sclaroff, S.: Learning activity progression in LSTMs for activity detection and early detection. In: Proceedings of the IEEE Computer Society Conference on Computer Vision and Pattern Recognition, vol. 2016, pp. 1942–1950 (2016)
Zhu, C. et al.: Fine-grained video categorization with redundancy reduction attention. In: Lecture Notes in Computer Science (Including Subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics), vol. 11209, pp. 139–155. LNCS (2018)
Feichtenhofer, C., Fan, H., Malik, J., He, K.: Slowfast networks for video recognition (2018). arXiv:1812.03982
Baradel, F., Neverova, N., Wolf, C., Mille, J., Mori, G.: Object level visual reasoning in videos. In: Lecture Notes in Computer Science (Including Subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics), vol. 11217 LNCS, pp. 106–122 (2018)
Piergiovanni, A., Ryoo, M.S.: Unseen action recognition with multimodal learning, pp. 1–17 (2018). arXiv:1806.08251
Delaitre, V., Laptev, I., Sivic, J.: Recognizing human actions in still images: a study of bag-of-features and part-based representations. In: British Machine Vision Conference, BMVC 2010—Proceedings, pp. 97.1–97.11 (2010)
Guo, G., Lai, A.: A survey on still image based human action recognition. Pattern Recognit. 47(10), 3343–3361 (2014)
Delaitre, V., Laptev, I., Sivic, J.: Human action classification in still images (2010). [Online]. https://www.di.ens.fr/willow/research/stillactions. Accessed 25 Mar 2019
Yao, B., Fei-Fei, L.: Grouplet: a structured image representation for recognizing human and object interactions. In: Proceedings of the IEEE Computer Vision and Pattern Recognition, pp. 9–16 (2010)
Yao, B., Fei-Fei, L., Khosla, A.: People playing musical instrument (PPMI) (2010). [Online]. http://ai.stanford.edu/~bangpeng/ppmi.html. Accessed 25 Mar 2019
Yao, B., Jiang, X., Khosla, A., Lin, A.L., Guibas, L., Fei-Fei, L.: Human action recognition by learning bases of action attributes and parts. In: Proceedings of IEEE International Conference on Computer Vision, pp. 1331–1338 (2011)
Yao, B., Jiang, X., Khosla, A., Lin, A.L., Guibas, L., Fei-Fei, L.: Stanford 40 actions dataset (2011). [Online]. http://vision.stanford.edu/Datasets/40actions.html. Accessed 25 Mar 2019
Chao, Y.W., Wang, Z., He, Y., Wang, J., Deng, J.: HICO & HICO-DET benchmarks for recognizing human-object interactions in images (2015). [Online]. http://www-personal.umich.edu/~ywchao/hico/. Accessed 25 Mar 2019
Ma, S., Bargal, S.A., Zhang, J., Sigal, L., Sclaroff, S.: Do less and achieve more: training CNNs for action recognition utilizing action images from the web. Pattern Recognit. 68, 334–345 (2017)
Ma, S., Bargal, S.A., Zhang, J., Sigal, L., Sclaroff, S.: BU-action datasets (2017). [Online]. http://cs-people.bu.edu/sbargal/BU-action/. Accessed 25 Mar 2019
Chao, Y.W., Liu, Y., Liu, X., Zeng, H., Deng, J.: Learning to detect human-object interactions. In: Proceedings—2018 IEEE Winter Conference on Applications of Computer Vision, WACV 2018, vol. 2018, pp. 381–389 (2018)
Deng, J., Dong, W., Socher, R., Li, L.-J., Li, K., Fei-Fei, L.: ImageNet: a large-scale hierarchical image database. In: 2009 IEEE Conference on Computer Vision and Pattern Recognition, pp. 248–255 (2009)
Communicated by L. Zhang.
Singh, R., Sonawane, A. & Srivastava, R. Recent evolution of modern datasets for human activity recognition: a deep survey. Multimedia Systems 26, 83–106 (2020). https://doi.org/10.1007/s00530-019-00635-7