Abstract
Multimodal data is increasingly used for human action recognition, owing to advances in machine learning methods and the development of new types of sensors. Acquiring the data required by such solutions is often troublesome, and suitable tools for the process are hard to find. In this paper, we present a new toolkit for multimodal data acquisition. We address and discuss issues concerning the synchronization of data from multiple sensors, the optimization of the initial processing of raw data, and the design of a user interface for efficiently recording large databases. The system was verified in a setup consisting of three types of sensors: a Kinect 2, two PS3Eye cameras, and an accelerometer glove. The accuracy of the synchronization and the performance of the initial processing proved suitable for human action acquisition and recognition. The system was used to record an extensive database of sign language gestures. User feedback, which is also evaluated in the paper, indicated that the recording process is efficient. The system is publicly available, both as a standalone application and as source code, and can easily be customized to any sensor setup.
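To make the synchronization problem concrete, the sketch below aligns samples from independently clocked sensor streams by nearest timestamp, a common approach to multi-sensor synchronization. This is a minimal illustration under assumed conditions, not the toolkit's implementation; the stream names, sampling rates, and the max_offset tolerance are hypothetical.

    # Illustrative sketch only (not the toolkit's implementation):
    # align independently clocked sensor streams by nearest timestamp.
    import bisect

    def align_streams(reference, others, max_offset=0.02):
        """For each (timestamp, sample) in the reference stream, pick the
        nearest-in-time sample from every other stream; drop reference
        frames whose best match exceeds max_offset seconds."""
        aligned = []
        for ts, ref_sample in reference:
            row = {"t": ts, "ref": ref_sample}
            ok = True
            for name, stream in others.items():
                # Rebuilding the timestamp list per frame keeps the sketch
                # simple; a real recorder would index streams once.
                times = [t for t, _ in stream]
                i = bisect.bisect_left(times, ts)
                candidates = [j for j in (i - 1, i) if 0 <= j < len(stream)]
                j = min(candidates, key=lambda k: abs(stream[k][0] - ts))
                if abs(stream[j][0] - ts) > max_offset:
                    ok = False
                    break
                row[name] = stream[j][1]
            if ok:
                aligned.append(row)
        return aligned

    # Hypothetical streams: 30 fps depth frames as the reference, with a
    # 60 fps RGB camera and a 100 Hz accelerometer as secondary streams.
    depth = [(k / 30.0, f"depth_{k}") for k in range(90)]
    rgb = [(k / 60.0, f"rgb_{k}") for k in range(180)]
    accel = [(k / 100.0, (0.0, 0.0, 9.81)) for k in range(300)]
    rows = align_streams(depth, {"rgb": rgb, "accel": accel})
    print(len(rows), rows[0])

Choosing the lowest-rate stream as the reference, as in this example, guarantees each aligned row is built from the densest available neighbors in the faster streams.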
Notes
The multimodal data acquisition toolkit is available at https://github.com/fmal-pl/MultiSourceAcquisition.
CL-Eye Platform SDK Homepage: https://codelaboratories.com/products/eye/sdk/
Acknowledgments
This work was supported by the Polish National Centre for Research and Development - Applied Research Program under Grant PBS2/B3/21/2013 titled: “Virtual sign language translator.”
Cite this article
Malawski, F., Gałka, J. System for multimodal data acquisition for human action recognition. Multimed Tools Appl 77, 23825–23850 (2018). https://doi.org/10.1007/s11042-018-5696-z