DOI: 10.1145/2522848.2532593

A multi-modal gesture recognition system using audio, video, and skeletal joint data

Published: 09 December 2013

Abstract

This paper describes the gesture recognition system developed by the Institute for Infocomm Research (I2R) for the 2013 ICMI CHALEARN Multi-modal Gesture Recognition Challenge. The proposed system adopts a multi-modal approach to both detecting and recognizing gestures. Automated gesture detection uses audio signals together with hand-joint information from the Kinect sensor to segment each sample into individual gestures. Once the gestures are detected and segmented, features extracted from three modalities, namely audio, 2-dimensional video (RGB), and skeletal joints (Kinect), are used to classify a given sequence of frames as one of the 20 known gestures or as an unrecognized gesture. Mel frequency cepstral coefficients (MFCC) are extracted from the audio signals and a Hidden Markov Model (HMM) is used for classification. Space-Time Interest Points (STIP) represent the RGB modality, while a covariance descriptor is extracted from the skeletal joint data; for both the RGB and Kinect modalities, Support Vector Machines (SVM) are used for gesture classification. Finally, a fusion scheme accumulates evidence from all three modalities and predicts the sequence of gestures in each test sample. The proposed system achieves an average edit distance of 0.2074 over the 275 test samples containing 2,742 unlabeled gestures. While it recognizes the known gestures with high accuracy, most of the errors are insertions, which occur when an unrecognized gesture is misclassified as one of the 20 known gestures.




      Published In

      ICMI '13: Proceedings of the 15th ACM International Conference on Multimodal Interaction
      December 2013
      630 pages
      ISBN:9781450321297
      DOI:10.1145/2522848

      Publisher

      Association for Computing Machinery

      New York, NY, United States


      Author Tags

      1. covariance descriptor
      2. fusion
      3. hidden Markov model (HMM)
      4. log-energy features
      5. mel frequency cepstral coefficients (MFCC)
      6. multi-modal gesture recognition
      7. space-time interest points (STIP)
      8. support vector machine (SVM)

      Qualifiers

      • Research-article

      Conference

      ICMI '13

      Acceptance Rates

      ICMI '13 paper acceptance rate: 49 of 133 submissions (37%).
      Overall acceptance rate: 453 of 1,080 submissions (42%).


