Real-Time Automated Video and Audio Capture with Multiple Cameras and Microphones

Published in: Journal of VLSI Signal Processing Systems for Signal, Image and Video Technology

Abstract

This work presents the acoustic- and visual-based tracking system in operation at the Harvard Intelligent Multi-Media Environments Laboratory (HIMMEL). The environment is populated with a number of microphones and steerable video cameras. Acoustic source localization, video-based face tracking and pose estimation, and multi-channel speech enhancement methods are applied in combination to detect and track individuals in a practical environment while also providing an enhanced audio signal to accompany the video stream. The video portion of the system tracks talkers using source motion, contour geometry, color data, and simple facial features. Decisions about which camera to use are based on an estimate of the head's gaze angle. This head-pose estimate is obtained with a very general head model that employs hairline features and a learned network classification procedure. Finally, a beamforming and postfiltering microphone-array technique is used to create an enhanced speech waveform to accompany the recorded video signal. The system presented in this paper is robust to both visual clutter (e.g. ovals in the scene of interest which are not faces) and audible noise (e.g. reverberation and background noise).
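Acoustic source localization of the kind described above typically starts from time-delay estimates between microphone pairs, commonly computed with the PHAT-weighted generalized cross-correlation (crosspower-spectrum phase). The sketch below is not the authors' implementation; it is a minimal NumPy illustration of the idea, and the function name and arguments are purely illustrative.

```python
# Minimal sketch of GCC-PHAT time-delay estimation between two microphones.
import numpy as np

def gcc_phat(x, y, fs, max_tau=None):
    """Estimate the delay of y relative to x (in seconds) using the
    PHAT-weighted generalized cross-correlation."""
    n = len(x) + len(y)                      # zero-pad to avoid circular wrap
    X = np.fft.rfft(x, n=n)
    Y = np.fft.rfft(y, n=n)
    R = Y * np.conj(X)                       # cross-power spectrum
    R /= np.abs(R) + 1e-12                   # PHAT weighting: keep phase only
    cc = np.fft.irfft(R, n=n)
    max_shift = n // 2
    if max_tau is not None:
        max_shift = min(int(fs * max_tau), max_shift)
    # Re-order so lags run from -max_shift ... +max_shift
    cc = np.concatenate((cc[-max_shift:], cc[:max_shift + 1]))
    shift = np.argmax(np.abs(cc)) - max_shift
    return shift / fs

# Synthetic check: y is x delayed by 25 samples at 16 kHz.
fs = 16000
rng = np.random.default_rng(0)
s = rng.standard_normal(4096)
d = 25
x = s
y = np.concatenate((np.zeros(d), s[:-d]))
tau = gcc_phat(x, y, fs, max_tau=0.01)
print(round(tau * fs))                       # prints 25 (recovered delay)
```

In a multi-microphone setup such as the one described here, delays from several pairs would then be combined (e.g. by a closed-form or search-based estimator) to triangulate the talker's position; the PHAT weighting is popular precisely because it is comparatively robust to room reverberation.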



Cite this article

Wang, C., Griebel, S., Brandstein, M. et al. Real-Time Automated Video and Audio Capture with Multiple Cameras and Microphones. The Journal of VLSI Signal Processing-Systems for Signal, Image, and Video Technology 29, 81–99 (2001). https://doi.org/10.1023/A:1011127615679
