Multimodal human–computer interaction: A survey

https://doi.org/10.1016/j.cviu.2006.10.019

Abstract

In this paper, we review the major approaches to multimodal human–computer interaction, giving an overview of the field from a computer vision perspective. In particular, we focus on body, gesture, gaze, and affective interaction (facial expression recognition and emotion in audio). We discuss user and task modeling, and multimodal fusion, highlighting challenges, open issues, and emerging applications for multimodal human–computer interaction (MMHCI) research.

Introduction

Multimodal human–computer interaction (MMHCI) lies at the crossroads of several research areas including computer vision, psychology, artificial intelligence, and many others. We study MMHCI to determine how we can make computer technology more usable by people, which requires understanding at least three things: the user who interacts with it, the system (the computer technology and its usability), and the interaction between the user and the system. Considering these aspects makes clear that MMHCI is a multi-disciplinary subject, since the designer of an interactive system should have expertise in a range of topics: psychology and cognitive science to understand the user’s perceptual, cognitive, and problem-solving skills; sociology to understand the wider context of interaction; ergonomics to understand the user’s physical capabilities; graphic design to produce effective interface presentation; computer science and engineering to be able to build the necessary technology; and so on.

The multidisciplinary nature of MMHCI motivates our approach to this survey. Instead of focusing only on computer vision techniques for MMHCI, we give a general overview of the field, discussing the major approaches and issues in MMHCI from a computer vision perspective. Our contribution, therefore, is to give researchers in computer vision, or in any other area, who are interested in MMHCI a broad view of the state of the art, and to outline opportunities and challenges in this exciting area.

In human–human communication, interpreting the mix of audio–visual signals is essential. Researchers in many fields recognize this, and thanks to advances in the development of unimodal techniques (in speech and audio processing, computer vision, etc.) and in hardware technologies (inexpensive cameras and other types of sensors), there has been significant growth in MMHCI research. Unlike traditional HCI applications (a single user facing a computer and interacting with it via a mouse or a keyboard), the new applications (e.g., intelligent homes [105], remote collaboration, arts, etc.) involve interactions that are not always explicit commands and often include multiple users. This is due in part to the remarkable progress in the last few years in computer processor speed, memory, and storage capabilities, matched by the availability of many new input and output devices that are making ubiquitous computing [185], [67], [66] a reality. Devices include phones, embedded systems, PDAs, laptops, wall-size displays, and many others. The wide range of computing devices available, with differing computational power and input/output capabilities, means that the future of computing is likely to include novel ways of interaction, such as gestures [136], speech [143], haptics [9], eye blinks [58], and many others. Glove-mounted devices [19] and graspable user interfaces [48], for example, seem ripe for exploration, while pointing devices with haptic feedback, eye tracking, and gaze detection [69] are also emerging. As in human–human communication, however, effective communication is likely to take place when different input devices are used in combination.

Multimodal interfaces have been shown to have many advantages [34]: they prevent errors, bring robustness to the interface, help the user correct errors or recover from them more easily, bring more bandwidth to the communication, and add alternative communication methods for different situations and environments. Disambiguation of error-prone modalities is one important motivation for using multiple modalities in many systems. As shown by Oviatt [123], error-prone technologies can compensate for each other rather than merely adding redundancy, thereby reducing the need for error correction. It should be noted, however, that multiple modalities alone do not bring benefits to the interface: the use of multiple modalities may be ineffective or even disadvantageous. In this context, Oviatt [124] has presented the common misconceptions (myths) of multimodal interfaces, most of them related to the use of speech as an input modality.
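To make the mutual-disambiguation idea concrete, the following sketch (ours, not a system described in this survey; the hypothesis lists, scores, and compatibility rule are illustrative assumptions) combines n-best outputs from two error-prone recognizers so that one channel resolves ambiguity in the other:

```python
# Hedged sketch of mutual disambiguation across modalities. The speech and
# gesture hypotheses, their scores, and the compatibility rule below are
# invented for illustration; real systems would use recognizer confidences.

def fuse_nbest(speech_hyps, gesture_hyps, compatible):
    """Pick the best jointly compatible (command, target) pair from two
    n-best lists of (interpretation, confidence) tuples."""
    best, best_score = None, float("-inf")
    for cmd, p_cmd in speech_hyps:
        for tgt, p_tgt in gesture_hyps:
            if compatible(cmd, tgt):
                score = p_cmd * p_tgt  # naive independence assumption
                if score > best_score:
                    best, best_score = (cmd, tgt), score
    return best, best_score

# Noisy speech cannot decide between "select" and "delete"; the pointing
# gesture (a trash icon) rules out the "select" reading.
speech = [("select", 0.48), ("delete", 0.45)]
gesture = [("trash_icon", 0.70), ("file_icon", 0.30)]
compatible = lambda cmd, tgt: not (cmd == "select" and tgt == "trash_icon")

print(fuse_nbest(speech, gesture, compatible))  # -> (('delete', 'trash_icon'), ~0.315)
```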

In this paper, we review the research areas we consider essential for MMHCI, give an overview of the state of the art, and, based on the results of our survey, identify major trends and open issues in MMHCI. We group vision techniques according to the human body (Fig. 1): large-scale body movement, gesture (e.g., of the hands), and gaze analysis are used for tasks such as emotion recognition in affective interaction and for a variety of other applications. We discuss affective computer interaction; issues in multimodal fusion, modeling, and data collection; and a variety of emerging MMHCI applications. Since MMHCI is a very dynamic and broad research area, we do not intend to present a complete survey. The main contribution of this paper, therefore, is to provide an overview of the main computer vision techniques used in the context of MMHCI, together with the main research areas, applications, and open issues in the field.

Extensive surveys have been previously published in several areas such as face detection [190], [63], face recognition [196], facial expression analysis [47], [131], vocal emotion [119], [109], gesture recognition [96], [174], [136], human motion analysis [65], [182], [56], [3], [46], [107], audio–visual automatic speech recognition [143], and eye tracking [41], [36]. Reviews of vision-based HCI are presented in [142], [73] with a focus on head tracking, face and facial expression recognition, eye tracking, and gesture recognition. Adaptive and intelligent HCI is discussed in [40], with a review of computer vision for human motion analysis and a discussion of techniques for lower arm movement detection, face processing, and gaze analysis. Multimodal interfaces are discussed in [125], [126], [127], [128], [144], [158], [135], [171]. Real-time vision for HCI (gestures, object tracking, hand posture, gaze, face pose) is discussed in [84], [77]. Here, we discuss work not included in previous surveys, expand the discussion to areas not covered previously (e.g., in [84], [40], [142], [126], [115]), and present new applications in emerging areas while highlighting the main research issues.

Related conferences and workshops include the following: ACM CHI, IFIP Interact, IEEE CVPR, IEEE ICCV, ACM Multimedia, International Workshop on Human-Centered Multimedia (HCM) in conjunction with ACM Multimedia, International Workshops on Human–Computer Interaction in conjunction with ICCV and ECCV, Intelligent User Interfaces (IUI) conference, and International Conference on Multimodal Interfaces (ICMI), among others.

The rest of the paper is organized as follows. In Section 2, we give an overview of MMHCI. Section 3 covers core computer vision techniques. Section 4 surveys affective HCI, and Section 5 deals with modeling, fusion, and data collection, while Section 6 discusses relevant application areas for MMHCI. We conclude with Section 7.


Overview of multimodal interaction

The term multimodal has been used in many contexts and across several disciplines (see [10], [11], [12] for a taxonomy of modalities). For our purposes, a multimodal HCI system is simply one that responds to inputs in more than one modality or communication channel (e.g., speech, gesture, writing, and others). We take a human-centered approach: by modality we mean a mode of communication according to the human senses, or a computer input device activated by humans or measuring human qualities
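As a minimal illustration of this definition (a toy sketch of our own; the modality names, event fields, and dispatcher are assumptions, not an architecture proposed in the survey), a multimodal system can be viewed as one that accepts and routes input events from more than one channel:

```python
# Toy illustration of "responds to inputs in more than one modality":
# events carry a modality tag and are routed to per-channel handlers.
from dataclasses import dataclass
from typing import Any, Callable, Dict, List

@dataclass
class InputEvent:
    modality: str     # e.g. "speech", "gesture", "gaze", "pen"
    payload: Any      # whatever the channel's recognizer produced
    timestamp: float  # useful later if events are fused by co-occurrence

class MultimodalDispatcher:
    def __init__(self) -> None:
        self._handlers: Dict[str, List[Callable[[InputEvent], None]]] = {}

    def register(self, modality: str, handler: Callable[[InputEvent], None]) -> None:
        self._handlers.setdefault(modality, []).append(handler)

    def dispatch(self, event: InputEvent) -> None:
        for handler in self._handlers.get(event.modality, []):
            handler(event)

dispatcher = MultimodalDispatcher()
dispatcher.register("speech", lambda e: print("speech:", e.payload))
dispatcher.register("gaze", lambda e: print("gaze at:", e.payload))
dispatcher.dispatch(InputEvent("speech", "put that there", 0.00))
dispatcher.dispatch(InputEvent("gaze", (320, 240), 0.05))
```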

Human-centered vision

We classify vision techniques for MMHCI using a human-centered approach, dividing them according to the human body: (1) large-scale body movements, (2) hand gestures, and (3) gaze. We distinguish between command interfaces (in which actions explicitly execute commands: selecting menus, etc.) and non-command interfaces (in which actions or events indirectly tune the system to the user’s needs) [111], [23].
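The command versus non-command distinction can be illustrated with a small routing sketch (ours; the event names, handler strings, and the notion of a user-model update are illustrative assumptions):

```python
# Toy routing rule for the command vs. non-command distinction above.
COMMAND_EVENTS = {"menu_select", "button_press", "spoken_command"}

def route(event_type: str, data: dict, user_model: dict) -> str:
    if event_type in COMMAND_EVENTS:
        # Command interface: the action explicitly executes something.
        return "execute:" + data.get("action", "unknown")
    # Non-command interface: the event only tunes the system to the user,
    # e.g. gaze dwell or posture changes update an attention/context model.
    user_model.setdefault("cues", []).append((event_type, data))
    return "update_user_model"

model: dict = {}
print(route("spoken_command", {"action": "open_mail"}, model))       # execute:open_mail
print(route("gaze_dwell", {"region": "sidebar", "ms": 800}, model))  # update_user_model
```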

In general, vision-based human motion analysis systems used for MMHCI can be thought of

Affective human–computer interaction

Most current MMHCI systems do not account for the fact that human–human communication is always socially situated and that we use emotion to enhance our communication. However, since emotion is often expressed in a multimodal way, it is an important area for MMHCI and we discuss it in some detail. HCI systems that can sense the affective states of the human (e.g., stress, inattention, anger, boredom, etc.) and are capable of adapting and responding to these affective states are likely to be perceived as more natural, efficacious, and trustworthy.
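As a concrete, deliberately simplified illustration of sensing affect from more than one channel, the sketch below fuses posteriors from a hypothetical facial-expression classifier and a hypothetical vocal-emotion classifier at the decision level; the emotion labels, probabilities, and weights are all assumptions, not results from any system reviewed here:

```python
# Hedged sketch: decision-level fusion of per-modality emotion posteriors.
# The labels, example probabilities, and weights are illustrative only.

EMOTIONS = ["neutral", "joy", "anger", "boredom"]

def fuse_affect(face_probs, voice_probs, w_face=0.6, w_voice=0.4):
    """Weighted sum of per-modality posteriors; returns the top emotion
    and the renormalized fused distribution."""
    fused = {e: w_face * face_probs.get(e, 0.0) + w_voice * voice_probs.get(e, 0.0)
             for e in EMOTIONS}
    total = sum(fused.values()) or 1.0
    fused = {e: p / total for e, p in fused.items()}
    return max(fused, key=fused.get), fused

# Facial analysis is unsure between neutral and boredom; prosody tips it.
face = {"neutral": 0.40, "boredom": 0.38, "joy": 0.12, "anger": 0.10}
voice = {"boredom": 0.55, "neutral": 0.25, "anger": 0.15, "joy": 0.05}
print(fuse_affect(face, voice))  # -> ('boredom', {...})
```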

Modeling, fusion, and data collection

Multimodal interface design [146] is important because the principles and techniques used in traditional GUI-based interaction do not necessarily apply in MMHCI systems. Issues to consider, as identified in Section 2, include the design of inputs and outputs, adaptability, consistency, and error handling, among others. In addition, one must consider the dependency of a person’s behavior on his/her personality, cultural and social vicinity, current mood, and the context in which the observed
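A recurring design choice in this part of the survey is where to fuse the channels. The sketch below (our own, with random stand-in features and classifiers; nothing here comes from a specific system in the paper) contrasts feature-level ("early") fusion with decision-level ("late") fusion:

```python
# Hedged sketch contrasting the two standard fusion shapes for MMHCI.
# Features, classifier weights, and class counts are random stand-ins.
import numpy as np

rng = np.random.default_rng(0)
audio_feat = rng.normal(size=8)    # stand-in audio feature vector
video_feat = rng.normal(size=12)   # stand-in visual feature vector

# Feature-level ("early") fusion: concatenate per-modality features and
# classify the joint vector with a single model (a random linear one here).
joint = np.concatenate([audio_feat, video_feat])
W_joint = rng.normal(size=(3, joint.size))   # 3 hypothetical classes
print("early fusion decision:", int(np.argmax(W_joint @ joint)))

# Decision-level ("late") fusion: classify each channel separately and
# combine the per-modality posteriors afterwards (weighted sum here).
def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

p_audio = softmax(rng.normal(size=3))   # stand-in audio posterior
p_video = softmax(rng.normal(size=3))   # stand-in visual posterior
w = 0.5                                 # equal channel weights
print("late fusion decision:", int(np.argmax(w * p_audio + (1 - w) * p_video)))
```

Late fusion is easier to assemble from existing unimodal recognizers and degrades more gracefully when a channel is missing, while early fusion can exploit cross-channel correlations but typically requires jointly recorded training data.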

Applications

Throughout the paper we have discussed techniques in a wide variety of application scenarios, including video conferencing and remote collaboration, intelligent homes, and driver monitoring. The types of modalities used, as well as the integration models, vary widely from application to application. The literature on applications that use MMHCI is vast and could well deserve a survey of its own [74]. Therefore, we do not attempt a complete survey of MMHCI applications. Instead, we give a general

Conclusion

We have highlighted the major vision approaches to multimodal human–computer interaction, covering techniques for large-scale body movement, gesture recognition, and gaze detection, as well as facial expression recognition, emotion analysis from audio, user and task modeling, multimodal fusion, and a variety of emerging applications.

One of the major conclusions of this survey is that most researchers process each channel (visual, audio) independently, and that multimodal fusion is still in its infancy.

Acknowledgment

The work of Nicu Sebe was partially supported by the Muscle NoE and MIAUCE and CHORUS FP6 EU projects.

References (193)

  • A. Adjoudani et al., On the integration of auditory and visual parameters in an HMM-based ASR.
  • A. Adler et al., Building the design studio of the future, AAAI Fall Symposium on Making Pen-Based Interaction Intelligent and Natural, 2004.
  • N. Ambady et al., Thin slices of expressive behavior as predictors of interpersonal consequences: a meta-analysis, Psychological Bulletin, 1992.
  • E. Aarts, Ambient intelligence: a multimedia perspective, IEEE Multimedia, 2004.
  • M. Balabanovic, Exploring versus exploiting when learning user models for text recommendations, User Modeling and User-Adapted Interaction, 1998.
  • T. Balomenos et al., Emotion analysis in man–machine interaction systems, Workshop on Machine Learning for Multimodal Interaction, 2005.
  • J. Ben-Arie et al., Human activity recognition using multidimensional indexing, IEEE Transactions on PAMI, 2002.
  • M. Benali-Khoudja et al., Tactile interfaces: a state-of-the-art survey, International Symposium on Robotics, 2004.
  • N.O. Bernsen, A reference model for output information in intelligent multimedia presentation systems, European Conference on Artificial Intelligence, 1996.
  • N.O. Bernsen, Multimodality in language and speech systems—from theory to design support tool.
  • N. Bianchi-Berthouze et al., Modeling multimodal expression of user’s affective subjective experience, User Modeling and User-Adapted Interaction, 2002.
  • M. Black et al., Tracking and recognizing rigid and non-rigid facial motions using local parametric models of image motion, International Conference on Computer Vision, 1995.
  • M. Blattner et al., Multimodal integration, IEEE Multimedia, 1996.
  • A.F. Bobick et al., The recognition of human movement using temporal templates, IEEE Transactions on PAMI, 2001.
  • R. Bolt, Put-That-There: voice and gesture at the graphics interface, Computer Graphics, 1980.
  • C. Borst et al., Evaluation of a haptic mixed reality system for interactions with a virtual control panel, Presence: Teleoperators and Virtual Environments, 2005.
  • J.S. Bradbury et al., Hands on cooking: towards an attentive kitchen, ACM Conference on Human Factors in Computing Systems (CHI), 2003.
  • C. Bregler et al., Eigenlips for robust speech recognition, International Conference on Acoustics, Speech, and Signal Processing, 1994.
  • S.A. Brewster et al., Multimodal ‘Eyes-Free’ interaction techniques for wearable devices, ACM Conference on Human Factors in Computing Systems (CHI), 2003.
  • C.S. Campbell et al., A robust algorithm for reading detection, ACM Workshop on Perceptive User Interfaces, 2001.
  • S.K. Card et al., The Psychology of Human–Computer Interaction, 1983.
  • R. Carpenter, The Logic of Typed Feature Structures, 1992.
  • L.S. Chen, Joint Processing of Audio–visual Information for the Recognition of Emotional Expressions in Human–computer...
  • L.S. Chen et al., VACE multimodal meeting corpus, MLMI, 2005.
  • D. Chen et al., Multimodal detection of human interaction events in a nursing home environment, Conference on Multimodal Interfaces (ICMI), 2004.
  • A. Cheyer et al., MVIEWS: multimodal tools for the video analyst, Conference on Intelligent User Interfaces (IUI), 1998.
  • I. Cohen et al., Semi-supervised learning of classifiers: theory, algorithms, and their applications to human–computer interaction, IEEE Transactions on PAMI, 2004.
  • P.R. Cohen et al., QuickSet: multimodal interaction for distributed applications, ACM Multimedia, 1997.
  • P.R. Cohen et al., Tangible multimodal interfaces for safety-critical applications, Communications of the ACM, 2004.
  • P.R. Cohen et al., The role of voice in human–machine communication.
  • Computer Vision and Image Understanding, Special Issue on Eye Detection and Tracking 98 (1)...
  • A. Corradini et al., Multimodal input fusion in human–computer interaction, NATO-ASI Conference on Data Fusion for Situation Monitoring, Incident Detection, Alert, and Response Management, 2003.
  • C. Dickie et al., Augmenting and sharing memory with eyeBlog, CARPE, 2004.
  • A. Dix et al., Human–Computer Interaction, 2003.
  • Z. Duric et al., Integrating perceptual and cognitive modeling for adaptive and intelligent human–computer interaction, Proceedings of the IEEE, 2002.
  • A.T. Duchowski, A breadth-first survey of eye tracking applications, Behavior Research Methods, Instruments, and Computers, 2002.
  • S. Dusan et al., Multimodal interaction on PDA’s integrating speech and pen inputs, Eurospeech, 2003.
  • C. Elting et al., Architecture and implementation of multimodal plug and play, ICMI, 2003.
  • I. Essa et al., Coding, analysis, interpretation, and recognition of facial expressions, IEEE Transactions on PAMI, 1997.

¹ This work was performed while Alejandro Jaimes was with FXPAL Japan, Fuji Xerox Co., Ltd.
