DOI: 10.1145/1452392.1452446

A realtime multimodal system for analyzing group meetings by combining face pose tracking and speaker diarization

Published: 20 October 2008

Abstract

This paper presents a realtime system for analyzing group meetings that uses a novel omnidirectional camera-microphone system. The goal is to automatically discover the visual focus of attention (VFOA), i.e., "who is looking at whom", in addition to performing speaker diarization, i.e., "who is speaking and when". First, a novel tabletop sensing device for round-table meetings is presented; it consists of two cameras with fisheye lenses and a triangular microphone array. Second, the position and pose of people's faces are estimated from the high-resolution omnidirectional images captured by the cameras using STCTracker (Sparse Template Condensation Tracker), which achieves robust realtime tracking of multiple faces by utilizing GPUs (Graphics Processing Units). The face position and pose data output by the tracker are used to estimate the focus of attention in the group. Using the microphone array, robust speaker diarization is carried out by voice activity detection (VAD) and direction-of-arrival (DOA) estimation, followed by sound source clustering. The paper also presents new 3-D visualization schemes for meeting scenes and the results of an analysis. Using two PCs, one for vision and one for audio processing, the system runs at about 20 frames per second for 5-person meetings.
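The step from face position and pose to "who is looking at whom" can be illustrated with a minimal sketch. The abstract does not say how the system maps pose to focus of attention, so the following is only a simple nearest-direction heuristic and not the authors' method. The function name `estimate_vfoa`, the 2-D table-plane coordinates, the use of head yaw alone, and the 30-degree tolerance are all assumptions made for illustration.

```python
import math


def estimate_vfoa(positions, yaws_deg, max_dev_deg=30.0):
    """Return, for each person, the index of the person they most plausibly
    look at, or None if nobody lies within max_dev_deg of their head yaw.

    positions   -- list of (x, y) head positions on the table plane (assumed)
    yaws_deg    -- list of head yaw angles in degrees, measured like atan2
    max_dev_deg -- hypothetical tolerance between head direction and target
    """
    targets = []
    for i, (xi, yi) in enumerate(positions):
        best_j, best_dev = None, max_dev_deg
        for j, (xj, yj) in enumerate(positions):
            if i == j:
                continue
            # Bearing from person i to person j, in degrees.
            bearing = math.degrees(math.atan2(yj - yi, xj - xi))
            # Absolute angular difference, wrapped into [0, 180].
            dev = abs((yaws_deg[i] - bearing + 180.0) % 360.0 - 180.0)
            if dev <= best_dev:
                best_j, best_dev = j, dev
        targets.append(best_j)
    return targets


# Three people: 0 and 1 face each other, 2 looks away from both.
positions = [(1.0, 0.0), (-1.0, 0.0), (0.0, 1.0)]
yaws_deg = [180.0, 0.0, 0.0]
print(estimate_vfoa(positions, yaws_deg))  # [1, 0, None]
```

A real system would likely need to account for the gap between head direction and actual gaze and to smooth decisions over time; this sketch only shows the geometric idea.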




      Published In

      ICMI '08: Proceedings of the 10th international conference on Multimodal interfaces
      October 2008
      322 pages
      ISBN: 9781605581989
      DOI: 10.1145/1452392
      Publisher

      Association for Computing Machinery

      New York, NY, United States

      Author Tags

      1. face tracking
      2. fisheye lens
      3. focus of attention
      4. meeting analysis
      5. microphone array
      6. omnidirectional cameras
      7. realtime system
      8. speaker diarization

      Qualifiers

      • Poster

      Conference

      ICMI '08: International Conference on Multimodal Interfaces
      October 20-22, 2008
      Chania, Crete, Greece

      Acceptance Rates

      Overall Acceptance Rate 453 of 1,080 submissions, 42%

      Cited By

      • (2021) Improved Gazing Transition Patterns for Predicting Turn-Taking in Multiparty Conversation. Proceedings of the 2021 5th International Conference on Video and Image Processing, pages 215-219. DOI: 10.1145/3511176.3511208. Online publication date: 22-Dec-2021
      • (2020) Direction-of-Voice (DoV) Estimation for Intuitive Speech Interaction with Smart Devices Ecosystems. Proceedings of the 33rd Annual ACM Symposium on User Interface Software and Technology, pages 1121-1131. DOI: 10.1145/3379337.3415588. Online publication date: 20-Oct-2020
      • (2020) Semi-supervised audio-driven TV-news speaker diarization using deep neural embeddings. The Journal of the Acoustical Society of America, 148:6, pages 3751-3761. DOI: 10.1121/10.0002924. Online publication date: Dec-2020
      • (2020) Modeling the Behaviors of Participants in Meetings for Decision Making Using OpenPose. The Impact of the 4th Industrial Revolution on Engineering Education, pages 27-38. DOI: 10.1007/978-3-030-40274-7_3. Online publication date: 18-Mar-2020
      • (2019) Modeling of Non-verbal Behaviors of Students in Cooperative Learning by Using OpenPose. Collaboration Technologies and Social Computing, pages 191-201. DOI: 10.1007/978-3-030-28011-6_13. Online publication date: 8-Aug-2019
      • (2018) Unobtrusive Analysis of Group Interactions without Cameras. Proceedings of the 20th ACM International Conference on Multimodal Interaction, pages 501-505. DOI: 10.1145/3242969.3264973. Online publication date: 2-Oct-2018
      • (2018) Estimating Visual Focus of Attention in Multiparty Meetings using Deep Convolutional Neural Networks. Proceedings of the 20th ACM International Conference on Multimodal Interaction, pages 191-199. DOI: 10.1145/3242969.3242973. Online publication date: 2-Oct-2018
      • (2018) Image-based Attention Level Estimation of Interaction Scene by Head Pose and Gaze Information. 2018 IEEE/ACIS 17th International Conference on Computer and Information Science (ICIS), pages 497-501. DOI: 10.1109/ICIS.2018.8466462. Online publication date: Jun-2018
      • (2018) Experimental Observation of Nodding Motion in Remote Communication Using ARM-COMS. Human Interface and the Management of Information. Interaction, Visualization, and Analytics, pages 194-203. DOI: 10.1007/978-3-319-92043-6_17. Online publication date: 7-Jun-2018
      • (2018) Predicting Turn-Taking by Compact Gazing Transition Patterns in Multiparty Conversation. Image and Video Technology, pages 437-447. DOI: 10.1007/978-3-319-75786-5_35. Online publication date: 15-Feb-2018
