Multimodal human–computer interaction: A survey
Introduction
Multimodal human–computer interaction (MMHCI) lies at the crossroads of several research areas, including computer vision, psychology, artificial intelligence, and many others. We study MMHCI to determine how to make computer technology more usable by people, which invariably requires understanding at least three things: the user who interacts with the system, the system itself (the computer technology and its usability), and the interaction between the two. These considerations make it clear that MMHCI is a multidisciplinary subject, since the designer of an interactive system needs expertise in a range of topics: psychology and cognitive science to understand the user’s perceptual, cognitive, and problem-solving skills; sociology to understand the wider context of interaction; ergonomics to understand the user’s physical capabilities; graphic design to produce effective interface presentation; computer science and engineering to build the necessary technology; and so on.
The multidisciplinary nature of MMHCI motivates our approach to this survey. Instead of focusing only on Computer Vision techniques for MMHCI, we give a general overview of the field, discussing the major approaches and issues in MMHCI from a computer vision perspective. Our contribution, therefore, is giving researchers in Computer Vision or any other area who are interested in MMHCI a broad view of the state of the art and outlining opportunities and challenges in this exciting area.
In human–human communication, interpreting the mix of audio–visual signals is essential. Researchers in many fields recognize this, and thanks to advances in unimodal techniques (in speech and audio processing, computer vision, etc.) and in hardware (inexpensive cameras and other types of sensors), MMHCI research has grown significantly. Unlike traditional HCI applications (a single user facing a computer and interacting with it via a mouse or a keyboard), the new applications (e.g., intelligent homes [105], remote collaboration, arts, etc.) involve interactions that are not always explicit commands and that often include multiple users. This shift is due in part to the remarkable progress in the last few years in computer processor speed, memory, and storage capabilities, matched by the availability of many new input and output devices that are making ubiquitous computing [185], [67], [66] a reality. These devices include phones, embedded systems, PDAs, laptops, wall-size displays, and many others. Given the wide range of computing devices available, with differing computational power and input/output capabilities, the future of computing is likely to include novel ways of interaction, such as gestures [136], speech [143], haptics [9], eye blinks [58], and many others. Glove-mounted devices [19] and graspable user interfaces [48], for example, now seem ripe for exploration, and pointing devices with haptic feedback, eye tracking, and gaze detection [69] are also emerging. As in human–human communication, however, effective communication is likely to take place when different input devices are used in combination.
Multimodal interfaces have been shown to have many advantages [34]: they prevent errors, bring robustness to the interface, help the user correct errors or recover from them more easily, increase the communication bandwidth, and add alternative communication methods for different situations and environments. Disambiguation of error-prone modalities is one important motivation for using multiple modalities in many systems. As shown by Oviatt [123], error-prone technologies can compensate for each other rather than merely adding redundancy, thereby reducing the need for error correction. It should be noted, however, that multiple modalities alone do not bring benefits to the interface: their use may be ineffective or even disadvantageous. In this context, Oviatt [124] has presented the common misconceptions (myths) about multimodal interfaces, most of them related to the use of speech as an input modality.
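As a concrete illustration of how one modality can compensate for another, the sketch below shows a simple decision-level (late) fusion step in which an ambiguous speech hypothesis is resolved by a concurrent pointing gesture. The object names, confidence scores, and equal weighting are hypothetical assumptions made for illustration, not taken from any system surveyed here.

```python
# Minimal sketch of decision-level (late) fusion: an ambiguous speech
# hypothesis ("delete it" -- but which object?) is disambiguated by a
# concurrent pointing gesture. All names and scores are hypothetical.

def fuse(speech_hyps, gesture_hyps, w_speech=0.5, w_gesture=0.5):
    """Combine per-object confidences from two recognizers by weighted sum."""
    objects = set(speech_hyps) | set(gesture_hyps)
    scores = {
        obj: w_speech * speech_hyps.get(obj, 0.0)
             + w_gesture * gesture_hyps.get(obj, 0.0)
        for obj in objects
    }
    return max(scores, key=scores.get), scores

# Speech alone cannot decide between two on-screen objects...
speech = {"photo_3": 0.48, "photo_7": 0.46}
# ...but the pointing gesture strongly favors one of them.
gesture = {"photo_7": 0.85, "photo_3": 0.10}

target, fused = fuse(speech, gesture)
print(target, fused)   # photo_7 is selected after fusion
```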
In this paper, we review the research areas we consider essential for MMHCI, give an overview of the state of the art, and, based on the results of our survey, identify major trends and open issues in MMHCI. We group vision techniques according to the human body (Fig. 1): large-scale body movement, gesture (e.g., hands), and gaze analysis are used for tasks such as emotion recognition in affective interaction and for a variety of other applications. We discuss affective computer interaction; issues in multimodal fusion, modeling, and data collection; and a variety of emerging MMHCI applications. Since MMHCI is a very dynamic and broad research area, we do not intend to present a complete survey. The main contribution of this paper, therefore, is an overview of the principal computer vision techniques used in MMHCI, together with the main research areas, applications, and open issues in the field.
Extensive surveys have been previously published in several areas such as face detection [190], [63], face recognition [196], facial expression analysis [47], [131], vocal emotion [119], [109], gesture recognition [96], [174], [136], human motion analysis [65], [182], [56], [3], [46], [107], audio–visual automatic speech recognition [143], and eye tracking [41], [36]. Reviews of vision-based HCI are presented in [142], [73], with a focus on head tracking, face and facial expression recognition, eye tracking, and gesture recognition. Adaptive and intelligent HCI is discussed in [40], with a review of computer vision for human motion analysis and a discussion of techniques for lower-arm movement detection, face processing, and gaze analysis. Multimodal interfaces are discussed in [125], [126], [127], [128], [144], [158], [135], [171]. Real-time vision for HCI (gestures, object tracking, hand posture, gaze, face pose) is discussed in [84], [77]. Here, we discuss work not included in previous surveys, expand the discussion to areas not covered previously (e.g., in [84], [40], [142], [126], [115]), and examine new applications in emerging areas while highlighting the main research issues.
Related conferences and workshops include the following: ACM CHI, IFIP Interact, IEEE CVPR, IEEE ICCV, ACM Multimedia, International Workshop on Human-Centered Multimedia (HCM) in conjunction with ACM Multimedia, International Workshops on Human–Computer Interaction in conjunction with ICCV and ECCV, Intelligent User Interfaces (IUI) conference, and International Conference on Multimodal Interfaces (ICMI), among others.
The rest of the paper is organized as follows. In Section 2, we give an overview of MMHCI. Section 3 covers core computer vision techniques. Section 4 surveys affective HCI, and Section 5 deals with modeling, fusion, and data collection, while Section 6 discusses relevant application areas for MMHCI. We conclude with Section 7.
Section snippets
Overview of multimodal interaction
The term multimodal has been used in many contexts and across several disciplines (see [10], [11], [12] for a taxonomy of modalities). For our purposes, a multimodal HCI system is simply one that responds to inputs in more than one modality or communication channel (e.g., speech, gesture, writing, and others). We use a human-centered approach, and by modality we mean a mode of communication according to human senses and computer input devices activated by humans or measuring human qualities
Human-centered vision
We classify vision techniques for MMHCI using a human-centered approach, dividing them according to the human body: (1) large-scale body movements, (2) hand gestures, and (3) gaze. We make a distinction between command interfaces (actions used to explicitly execute commands: selecting menus, etc.) and non-command interfaces (actions or events used to indirectly tune the system to the user’s needs) [111], [23].
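The command/non-command distinction can be made concrete with a small sketch: in the hypothetical dispatcher below, some vision-derived events trigger explicit commands while others only tune internal state used to adapt the interface. All event names, commands, and adaptation rules are invented for illustration.

```python
# Hypothetical sketch of the command vs. non-command distinction:
# some vision-derived events map to explicit commands, while others
# only adjust internal state used to adapt the interface.

COMMANDS = {
    "hand_swipe_left": "next_slide",
    "hand_swipe_right": "previous_slide",
}

state = {"user_attentive": True, "font_scale": 1.0}

def handle_event(event):
    if event in COMMANDS:                 # command interface: explicit action
        return COMMANDS[event]
    if event == "gaze_off_screen":        # non-command: adapt silently
        state["user_attentive"] = False
    elif event == "user_leans_forward":   # e.g., user may be struggling to read
        state["font_scale"] = 1.2
    return None

for e in ["hand_swipe_left", "gaze_off_screen", "user_leans_forward"]:
    print(e, "->", handle_event(e), state)
```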
In general, vision-based human motion analysis systems used for MMHCI can be thought of
Affective human–computer interaction
Most current MMHCI systems do not account for the fact that human–human communication is always socially situated and that we use emotion to enhance our communication. However, since emotion is often expressed in a multimodal way, it is an important area for MMHCI and we will discuss it in some detail. HCI systems that can sense the affective states of the human (e.g., stress, inattention, anger, boredom, etc.) and are capable of adapting and responding to these affective states are likely to
Modeling, fusion, and data collection
Multimodal interface design [146] is important because the principles and techniques used in traditional GUI-based interaction do not necessarily apply in MMHCI systems. Issues to consider, as identified in Section 2, include the design of inputs and outputs, adaptability, consistency, and error handling, among others. In addition, one must consider the dependency of a person’s behavior on his/her personality, cultural and social vicinity, current mood, and the context in which the observed
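As one illustration of the fusion design choices involved, the following sketch contrasts feature-level (early) fusion, which concatenates per-modality features before classification, with decision-level (late) fusion, which combines per-modality decisions. The toy linear classifiers, feature values, weights, and class labels are hypothetical; real systems would use classifiers trained on labeled multimodal data.

```python
# Sketch contrasting feature-level (early) and decision-level (late) fusion
# for two modalities. Features, classes, and weights are hypothetical.

import numpy as np

CLASSES = ["neutral", "stressed"]

def classify(features, weights):
    """Toy linear classifier returning a probability per class (softmax)."""
    scores = weights @ features
    exp = np.exp(scores - scores.max())
    return exp / exp.sum()

# Per-modality features (e.g., prosody statistics and facial-motion statistics).
audio_feat = np.array([0.7, 0.2])
video_feat = np.array([0.1, 0.9])

# Early fusion: concatenate features, classify once with a joint model.
w_joint = np.array([[0.2, 0.1, 0.3, -0.2],
                    [0.5, 0.4, -0.1, 0.6]])
p_early = classify(np.concatenate([audio_feat, video_feat]), w_joint)

# Late fusion: classify each modality separately, then average the decisions.
w_audio = np.array([[0.2, 0.1], [0.5, 0.4]])
w_video = np.array([[0.3, -0.2], [-0.1, 0.6]])
p_late = 0.5 * classify(audio_feat, w_audio) + 0.5 * classify(video_feat, w_video)

print(dict(zip(CLASSES, p_early)), dict(zip(CLASSES, p_late)))
```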
Applications
Throughout the paper we have discussed techniques in a wide variety of application scenarios, including video conferencing and remote collaboration, intelligent homes, and driver monitoring. The types of modalities used, as well as the integration models, vary widely from application to application. The literature on applications that use MMHCI is vast and could well deserve a survey of its own [74]. Therefore, we do not attempt a complete survey of MMHCI applications. Instead, we give a general
Conclusion
We have highlighted the major vision approaches for multimodal human–computer interaction, discussing techniques for large-scale body movement, gesture recognition, and gaze detection, as well as facial expression recognition, emotion analysis from audio, user and task modeling, multimodal fusion, and a variety of emerging applications.
One of the major conclusions of this survey is that most researchers process each channel (visual, audio) independently, and that multimodal fusion is still in its infancy.
Acknowledgment
The work of Nicu Sebe was partially supported by the Muscle NoE, MIAUCE, and CHORUS FP6 EU projects.
References (193)
- et al., Human motion analysis: a review, CVIU (1999)
- Foundations of multimodal representations: a taxonomy of representational modalities, Interacting with Computers (1994)
- Defining a taxonomy of output modalities from an HCI perspective, Computer Standards and Interfaces, Special Double Issue (1997)
- et al., Facial expression recognition from video sequences: temporal and static modeling, CVIU (2003)
- et al., Automatic facial expression analysis: a survey, Pattern Recognition (2003)
- et al., A survey of socially interactive robots, Robotics and Autonomous Systems (2003)
- The visual analysis of human movement: a survey, CVIU (1999)
- et al., Face detection: a survey, CVIU (2001)
- et al., Real-time eye, gaze, and face pose tracking for monitoring driver vigilance, Real-Time Imaging (2002)
- et al., Electronic tongue for quality assessment of ethanol, vodka and eau-de-vie, Analytica Chimica Acta (2005)
- On the integration of auditory and visual parameters in an HMM-based ASR
- Building the design studio of the future, AAAI Fall Symposium on Making Pen-Based Interaction Intelligent and Natural
- Thin slices of expressive behavior as predictors of interpersonal consequences: a meta-analysis, Psychological Bulletin
- Ambient intelligence: a multimedia perspective, IEEE Multimedia
- Exploring versus exploiting when learning user models for text recommendations, User Modeling and User-Adapted Interaction
- Emotion analysis in man–machine interaction systems, Workshop on Machine Learning for Multimodal Interaction
- Human activity recognition using multidimensional indexing, IEEE Transactions on PAMI
- Tactile interfaces: a state-of-the-art survey, International Symposium on Robotics
- A reference model for output information in intelligent multimedia presentation systems, European Conference on Artificial Intelligence
- Multimodality in language and speech systems—from theory to design support tool
- Modeling multimodal expression of user’s affective subjective experience, User Modeling and User-Adapted Interaction
- Tracking and recognizing rigid and non-rigid facial motions using local parametric models of image motion, International Conference on Computer Vision
- Multimodal integration, IEEE Multimedia
- The recognition of human movement using temporal templates, IEEE Transactions on PAMI
- Put-That-There: voice and gesture at the graphics interface, Computer Graphics
- Evaluation of a haptic mixed reality system for interactions with a virtual control panel, Presence: Teleoperators and Virtual Environments
- Hands on cooking: towards an attentive kitchen, ACM Conference on Human Factors in Computing Systems (CHI)
- Eigenlips for robust speech recognition, International Conference on Acoustics, Speech, and Signal Processing
- Multimodal ‘Eyes-Free’ interaction techniques for wearable devices, ACM Conference on Human Factors in Computing Systems (CHI)
- A robust algorithm for reading detection, ACM Workshop on Perceptive User Interfaces
- The Psychology of Human–Computer Interaction
- The Logic of Typed Feature Structures
- VACE multimodal meeting corpus, MLMI
- Multimodal detection of human interaction events in a nursing home environment, International Conference on Multimodal Interfaces (ICMI)
- MVIEWS: multimodal tools for the video analyst, Conference on Intelligent User Interfaces (IUI)
- Semi-supervised learning of classifiers: theory, algorithms, and their applications to human–computer interaction, IEEE Transactions on PAMI
- QuickSet: multimodal interaction for distributed applications, ACM Multimedia
- Tangible multimodal interfaces for safety-critical applications, Communications of the ACM
- The role of voice in human–machine communication
- Multimodal input fusion in human–computer interaction, NATO-ASI Conference on Data Fusion for Situation Monitoring, Incident Detection, Alert, and Response Management
- Augmenting and sharing memory with eyeBlog, CARPE
- Human–Computer Interaction
- Integrating perceptual and cognitive modeling for adaptive and intelligent human–computer interaction, Proceedings of the IEEE
- A breadth-first survey of eye tracking applications, Behavior Research Methods, Instruments, and Computers
- Multimodal interaction on PDA’s integrating speech and pen inputs, Eurospeech
- Architecture and implementation of multimodal plug and play, ICMI
- Coding, analysis, interpretation, and recognition of facial expressions, IEEE Transactions on PAMI
1. This work was performed while Alejandro Jaimes was with FXPAL Japan, Fuji Xerox Co., Ltd.