DOI: 10.1145/1180995.1181037
Article

EM detection of common origin of multi-modal cues

Published: 02 November 2006

Abstract

Content analysis of clips containing people speaking involves processing informative cues coming from different modalities. These cues are usually the words extracted from the audio modality and the identities of the persons appearing in the video modality of the clip. To assign these cues efficiently to the person who created them, we propose a Bayesian network model that exploits the extracted feature characteristics, their relations, and their temporal patterns. We use the EM algorithm, in which the E-step estimates the expectation of the complete-data log-likelihood with respect to the hidden variables, namely the identities of the speakers and of the visible persons. In the M-step, the person models that maximize this expectation are computed. This framework produces excellent results and exhibits exceptional robustness on low-quality data.
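
The abstract describes an EM alternation in which the hidden variable behind each cue is the identity of the person who produced it, and the M-step re-estimates the person models. The following minimal sketch illustrates that alternation on one-dimensional cue features with Gaussian person models; the feature representation and all names are illustrative assumptions, not taken from the paper.

```python
import math

def em_assign(cues, k, iters=50):
    """Assign each cue to the person model most likely to have produced it.

    Hypothetical sketch of the E/M alternation the abstract describes:
    hidden variables = which person produced each cue; person models =
    1-D Gaussians with a mixing prior. Purely illustrative.
    """
    lo, hi = min(cues), max(cues)
    means = [lo + i * (hi - lo) / (k - 1) for i in range(k)]  # spread init
    var = [1.0] * k
    prior = [1.0 / k] * k
    for _ in range(iters):
        # E-step: posterior responsibility P(person j | cue x) for every cue
        resp = []
        for x in cues:
            w = [prior[j]
                 * math.exp(-(x - means[j]) ** 2 / (2.0 * var[j]))
                 / math.sqrt(2.0 * math.pi * var[j])
                 for j in range(k)]
            s = sum(w)
            resp.append([wj / s for wj in w])
        # M-step: person models that maximize the expected
        # complete-data log-likelihood under those responsibilities
        for j in range(k):
            nj = sum(r[j] for r in resp)
            means[j] = sum(r[j] * x for r, x in zip(resp, cues)) / nj
            var[j] = max(1e-6, sum(r[j] * (x - means[j]) ** 2
                                   for r, x in zip(resp, cues)) / nj)
            prior[j] = nj / len(cues)
    # Hard assignment: each cue goes to its most probable person.
    labels = [max(range(k), key=lambda j: r[j]) for r in resp]
    return labels, means

# Two well-separated "persons" in cue-feature space
cues = [0.1, 0.2, -0.1, 5.0, 5.2, 4.9]
labels, means = em_assign(cues, k=2)
```

In the paper's setting the cues are richer (words from the audio track, face identities from the video track) and the model is a Bayesian network over their relations and temporal patterns, but the E-step/M-step division of labor is the same.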




    Published In

    ICMI '06: Proceedings of the 8th international conference on Multimodal interfaces
    November 2006
    404 pages
    ISBN:159593541X
    DOI:10.1145/1180995

    Publisher

Association for Computing Machinery, New York, NY, United States


    Author Tags

    1. audio-visual synchrony
    2. content extraction
    3. multi-modal
    4. multi-modal cue assignment
    5. speaker detection


    Acceptance Rates

    Overall acceptance rate: 453 of 1,080 submissions (42%)


    Cited By

    • (2024) Role of human physiology and facial biomechanics towards building robust deepfake detectors: A comprehensive survey and analysis. Computer Science Review, 54:100677. DOI: 10.1016/j.cosrev.2024.100677. Nov 2024.
    • (2023) Analysis of vital signs using remote photoplethysmography (RPPG). Journal of Ambient Intelligence and Humanized Computing, 14(12):16729-16736. DOI: 10.1007/s12652-023-04683-w. 8 Sep 2023.
    • (2022) Heart Rate Estimation from Facial Video Sequences using Fast Independent Component Analysis. 2022 National Conference on Communications (NCC), pp. 88-93. DOI: 10.1109/NCC55593.2022.9806810. 24 May 2022.
    • (2021) Camera-based heart rate estimation for hospitalized newborns in the presence of motion artifacts. BioMedical Engineering OnLine, 20(1). DOI: 10.1186/s12938-021-00958-5. 4 Dec 2021.
    • (2021) Lip-Reading Driven Deep Learning Approach for Speech Enhancement. IEEE Transactions on Emerging Topics in Computational Intelligence, 5(3):481-490. DOI: 10.1109/TETCI.2019.2917039. Jun 2021.
    • (2019) Non-contact Measurement of Heart Rate Based on Facial Video. 2019 Photonics & Electromagnetics Research Symposium - Fall (PIERS - Fall), pp. 2269-2275. DOI: 10.1109/PIERS-Fall48861.2019.9021402. Dec 2019.
    • (2018) Free-Form Deformation Approach for Registration of Visible and Infrared Facial Images in Fever Screening. Sensors, 18(1):125. DOI: 10.3390/s18010125. 4 Jan 2018.
    • (2017) Heart Rate Variability Extraction From Videos Signals: ICA vs. EVM Comparison. IEEE Access, 5:4711-4719. DOI: 10.1109/ACCESS.2017.2678521. 2017.
    • (2015) Audiovisual Fusion: Challenges and New Approaches. Proceedings of the IEEE, 103(9):1635-1653. DOI: 10.1109/JPROC.2015.2459017. Sep 2015.
    • (2013) Emotion recognition from multi-modal information. 2013 Asia-Pacific Signal and Information Processing Association Annual Summit and Conference, pp. 1-8. DOI: 10.1109/APSIPA.2013.6694347. Oct 2013.
