DOI: 10.1145/2522848.2532593

A multi-modal gesture recognition system using audio, video, and skeletal joint data

Published: 09 December 2013

Abstract

This paper describes the gesture recognition system developed by the Institute for Infocomm Research (I2R) for the 2013 ICMI CHALEARN Multi-modal Gesture Recognition Challenge. The proposed system adopts a multi-modal approach to both detecting and recognizing gestures. Automated gesture detection uses audio signals together with hand-joint information from the Kinect sensor to segment each sample into individual gestures. Once the gestures are detected and segmented, features extracted from three modalities, namely audio, 2-dimensional video (RGB), and skeletal joints (Kinect), are used to classify a given sequence of frames as one of the 20 known gestures or as an unrecognized gesture. Mel frequency cepstral coefficients (MFCC) are extracted from the audio signals and a Hidden Markov Model (HMM) is used for classification. Space-Time Interest Points (STIP) represent the RGB modality, while a covariance descriptor is extracted from the skeletal joint data; for both the RGB and Kinect modalities, Support Vector Machines (SVM) are used for gesture classification. Finally, a fusion scheme accumulates evidence from all three modalities and predicts the sequence of gestures in each test sample. The proposed system achieves an average edit distance of 0.2074 over the 275 test samples containing 2,742 unlabeled gestures. While it recognizes the known gestures with high accuracy, most of the errors are insertions, which occur when an unrecognized gesture is misclassified as one of the 20 known gestures.




      Published In

      ICMI '13: Proceedings of the 15th ACM International Conference on Multimodal Interaction
      December 2013
      630 pages
      ISBN:9781450321297
      DOI:10.1145/2522848

      Publisher

      Association for Computing Machinery

      New York, NY, United States


      Author Tags

      1. covariance descriptor
      2. fusion
      3. hidden Markov model (HMM)
      4. log-energy features
      5. mel frequency cepstral coefficients (MFCC)
      6. multi-modal gesture recognition
      7. space-time interest points (STIP)
      8. support vector machine (SVM)

      Qualifiers

      • Research-article

      Conference

      ICMI '13

      Acceptance Rates

      ICMI '13 paper acceptance rate: 49 of 133 submissions (37%).
      Overall acceptance rate: 453 of 1,080 submissions (42%).


