Abstract
This work presents the acoustic- and visual-based tracking system in operation at the Harvard Intelligent Multi-Media Environments Laboratory (HIMMEL). The environment is populated with a number of microphones and steerable video cameras. Acoustic source localization, video-based face tracking and pose estimation, and multi-channel speech enhancement methods are applied in combination to detect and track individuals in a practical environment while also providing an enhanced audio signal to accompany the video stream. The video portion of the system tracks talkers using source motion, contour geometry, color data, and simple facial features. Decisions about which camera to use are based on an estimate of the head's gaze angle, which is obtained from a very general head model employing hairline features and a learned network classification procedure. Finally, a beamforming and postfiltering microphone-array technique is used to produce an enhanced speech waveform to accompany the recorded video signal. The system presented in this paper is robust to both visual clutter (e.g., ovals in the scene of interest that are not faces) and acoustic noise (e.g., reverberation and background noise).
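The paper itself does not include source code, but two of the audio-processing steps mentioned above, time-delay-based acoustic source localization and beamforming for speech enhancement, can be illustrated with a short sketch. The Python example below is only a minimal illustration under simplifying assumptions: it estimates the delay between a single microphone pair with a GCC-PHAT (crosspower-spectrum phase) correlation and then aligns and averages the channels with a basic delay-and-sum beamformer. All function names and parameters here are hypothetical and are not taken from the paper, whose localization, beamforming, and postfiltering methods are considerably more elaborate.

```python
import numpy as np

def gcc_phat(x1, x2, fs, max_tau=None):
    """Estimate the delay of x2 relative to x1 (positive when x2 lags x1)
    using the generalized cross-correlation with phase transform (GCC-PHAT)."""
    n = len(x1) + len(x2)                        # zero-pad to avoid circular wrap-around
    X1 = np.fft.rfft(x1, n=n)
    X2 = np.fft.rfft(x2, n=n)
    cross = X2 * np.conj(X1)
    cross /= np.abs(cross) + 1e-12               # PHAT weighting: keep phase, discard magnitude
    cc = np.fft.irfft(cross, n=n)
    max_shift = n // 2
    if max_tau is not None:
        max_shift = min(int(fs * max_tau), max_shift)
    cc = np.concatenate((cc[-max_shift:], cc[:max_shift + 1]))
    shift = np.argmax(np.abs(cc)) - max_shift    # lag (in samples) of the correlation peak
    return shift / fs                            # delay in seconds

def delay_and_sum(signals, delays, fs):
    """Time-align each channel by its estimated delay (in seconds) and average
    the aligned channels: a simple delay-and-sum beamformer."""
    shifts = np.round(np.asarray(delays) * fs).astype(int)
    shifts -= shifts.min()                       # make all shifts non-negative
    length = min(len(s) - d for s, d in zip(signals, shifts))
    aligned = np.stack([s[d:d + length] for s, d in zip(signals, shifts)])
    return aligned.mean(axis=0)

if __name__ == "__main__":
    fs = 16000
    rng = np.random.default_rng(0)
    clean = rng.standard_normal(fs)              # stand-in for one second of speech
    true_delay = 12                              # samples of inter-microphone delay
    mic1 = clean + 0.1 * rng.standard_normal(fs)
    mic2 = np.concatenate((np.zeros(true_delay), clean[:-true_delay])) \
           + 0.1 * rng.standard_normal(fs)
    tau = gcc_phat(mic1, mic2, fs, max_tau=0.01)
    print(f"estimated delay: {tau * fs:.1f} samples (true: {true_delay})")
    enhanced = delay_and_sum([mic1, mic2], [0.0, tau], fs)
```

In the full system described by the abstract, delay estimates of this kind would feed a multi-microphone source-localization stage, and the beamformer output would additionally be postfiltered; the sketch above omits both of those steps.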
Cite this article
Wang, C., Griebel, S., Brandstein, M. et al. Real-Time Automated Video and Audio Capture with Multiple Cameras and Microphones. The Journal of VLSI Signal Processing-Systems for Signal, Image, and Video Technology 29, 81–99 (2001). https://doi.org/10.1023/A:1011127615679