Abstract
This work presents the acoustic- and visual-based tracking system in operation at the Harvard Intelligent Multi-Media Environments Laboratory (HIMMEL). The environment is populated with a number of microphones and steerable video cameras. Acoustic source localization, video-based face tracking and pose estimation, and multi-channel speech enhancement methods are applied in combination to detect and track individuals in a practical environment while also providing an enhanced audio signal to accompany the video stream. The video portion of the system tracks talkers using source motion, contour geometry, color data, and simple facial features. Decisions about which camera to use are based on an estimate of the head's gaze angle, which is obtained from a very general head model employing hairline features and a learned network classification procedure. Finally, a beamforming and postfiltering microphone-array technique is used to produce an enhanced speech waveform to accompany the recorded video signal. The system presented in this paper is robust to both visual clutter (e.g., ovals in the scene of interest that are not faces) and acoustic noise (e.g., reverberation and background noise).
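The paper itself does not include source code, but two of the audio-processing steps mentioned above, time-delay-based acoustic source localization and beamforming for speech enhancement, can be illustrated with a short sketch. The Python example below is only a minimal illustration under simplifying assumptions: it estimates the delay between a single microphone pair with a GCC-PHAT (crosspower-spectrum phase) correlation and then aligns and averages the channels with a basic delay-and-sum beamformer. All function names and parameters here are hypothetical and are not taken from the paper, whose localization, beamforming, and postfiltering methods are considerably more elaborate.

```python
import numpy as np

def gcc_phat(x1, x2, fs, max_tau=None):
    """Estimate the delay of x2 relative to x1 (positive when x2 lags x1)
    using the generalized cross-correlation with phase transform (GCC-PHAT)."""
    n = len(x1) + len(x2)                        # zero-pad to avoid circular wrap-around
    X1 = np.fft.rfft(x1, n=n)
    X2 = np.fft.rfft(x2, n=n)
    cross = X2 * np.conj(X1)
    cross /= np.abs(cross) + 1e-12               # PHAT weighting: keep phase, discard magnitude
    cc = np.fft.irfft(cross, n=n)
    max_shift = n // 2
    if max_tau is not None:
        max_shift = min(int(fs * max_tau), max_shift)
    cc = np.concatenate((cc[-max_shift:], cc[:max_shift + 1]))
    shift = np.argmax(np.abs(cc)) - max_shift    # lag (in samples) of the correlation peak
    return shift / fs                            # delay in seconds

def delay_and_sum(signals, delays, fs):
    """Time-align each channel by its estimated delay (in seconds) and average
    the aligned channels: a simple delay-and-sum beamformer."""
    shifts = np.round(np.asarray(delays) * fs).astype(int)
    shifts -= shifts.min()                       # make all shifts non-negative
    length = min(len(s) - d for s, d in zip(signals, shifts))
    aligned = np.stack([s[d:d + length] for s, d in zip(signals, shifts)])
    return aligned.mean(axis=0)

if __name__ == "__main__":
    fs = 16000
    rng = np.random.default_rng(0)
    clean = rng.standard_normal(fs)              # stand-in for one second of speech
    true_delay = 12                              # samples of inter-microphone delay
    mic1 = clean + 0.1 * rng.standard_normal(fs)
    mic2 = np.concatenate((np.zeros(true_delay), clean[:-true_delay])) \
           + 0.1 * rng.standard_normal(fs)
    tau = gcc_phat(mic1, mic2, fs, max_tau=0.01)
    print(f"estimated delay: {tau * fs:.1f} samples (true: {true_delay})")
    enhanced = delay_and_sum([mic1, mic2], [0.0, tau], fs)
```

In the full system described by the abstract, delay estimates of this kind would feed a multi-microphone source-localization stage, and the beamformer output would additionally be postfiltered; the sketch above omits both of those steps.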
Cite this article
Wang, C., Griebel, S., Brandstein, M. et al. Real-Time Automated Video and Audio Capture with Multiple Cameras and Microphones. The Journal of VLSI Signal Processing-Systems for Signal, Image, and Video Technology 29, 81–99 (2001). https://doi.org/10.1023/A:1011127615679