Abstract
Humans combine gesture and speech to interact with objects, and usually do so more naturally when they are not holding a device or pointer. We present a system that integrates body-pose estimation, gesture recognition and speech recognition for interaction in virtual reality environments. We describe a vision-based method for tracking the pose of a user in real time and introduce a technique that provides parameterized gesture recognition. More precisely, we train a support vector classifier to model the boundary of the space of possible gestures, and train hidden Markov models (HMMs) on specific gestures. Given a sequence, we find the start and end of each gesture using the support vector classifier, and estimate gesture likelihoods and parameters with an HMM. Multimodal recognition is performed using rank-order fusion to merge speech and vision hypotheses. Finally, we describe the use of our multimodal framework in a virtual world application that allows users to interact using gestures and speech.
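The rank-order fusion step mentioned above can be illustrated with a minimal sketch. The abstract does not give the details of the fusion rule, so the code below assumes a simple rank-sum scheme: each recognizer returns a best-first list of candidate commands, ranks are summed across modalities, and a candidate missing from a list receives a penalty rank. The function name `rank_order_fusion` and the command labels are hypothetical and serve only as an illustration.

```python
from collections import defaultdict


def rank_order_fusion(*ranked_lists):
    """Fuse best-first hypothesis lists by summing rank positions.

    Assumption (not taken from the paper): a hypothesis missing from a
    list is assigned a penalty rank of len(list) + 1.
    """
    # Every hypothesis proposed by at least one recognizer.
    candidates = {h for ranked in ranked_lists for h in ranked}
    totals = defaultdict(int)
    for ranked in ranked_lists:
        # Rank 1 is the recognizer's best hypothesis.
        position = {h: i + 1 for i, h in enumerate(ranked)}
        penalty = len(ranked) + 1
        for h in candidates:
            totals[h] += position.get(h, penalty)
    # Lowest total rank wins; ties are broken alphabetically.
    return sorted(candidates, key=lambda h: (totals[h], h))


# Hypothetical best-first hypotheses from each modality.
speech = ["move", "remove", "rotate"]
vision = ["rotate", "move", "resize"]

fused = rank_order_fusion(speech, vision)
print(fused)
```

In this sketch a hypothesis that both recognizers rank highly outranks one that only a single recognizer prefers, which is the intuition behind merging speech and vision hypotheses by rank rather than by raw scores.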






Cite this article
Demirdjian, D., Ko, T. & Darrell, T. Untethered gesture acquisition and recognition for virtual world manipulation. Virtual Reality 8, 222–230 (2005). https://doi.org/10.1007/s10055-005-0155-3
DOI: https://doi.org/10.1007/s10055-005-0155-3