Abstract
Applications such as videoconferencing, automatic scene analysis, or security surveillance involving acoustic sources can benefit from object localization within a complex scene. Many single-sensor techniques already exist for this purpose, based, for example, on microphone arrays, video cameras, or range sensors. Since each of these sensors has its specific strengths and weaknesses, it is often advantageous to combine information from several sensor modalities to arrive at more robust position estimates.
This chapter presents a joint audio-video signal processing methodology for object localization and tracking. The approach is based on a decentralized Kalman filter structure, modified so that different sensor measurement models can be incorporated. This situation is typical of combined audio-video sensing, since the camera system and the microphone array usually use different coordinate systems.
First, the decentralized estimation algorithm is presented. Then a speaker localization example is discussed. Finally, some estimation results are shown.
Copyright information
© 2001 Springer-Verlag Berlin Heidelberg
About this chapter
Cite this chapter
Strobel, N., Spors, S., Rabenstein, R. (2001). Joint Audio-Video Signal Processing for Object Localization and Tracking. In: Brandstein, M., Ward, D. (eds) Microphone Arrays. Digital Signal Processing. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-662-04619-7_10
DOI: https://doi.org/10.1007/978-3-662-04619-7_10
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-642-07547-6
Online ISBN: 978-3-662-04619-7
eBook Packages: Springer Book Archive