Joint Audio-Video Signal Processing for Object Localization and Tracking

Strobel, Norbert; Spors, Sascha; Rabenstein, Rudolf

doi:10.1007/978-3-662-04619-7_10

Part of the book series: Digital Signal Processing (DIGSIGNAL)

Abstract

Applications such as videoconferencing, automatic scene analysis, or security surveillance involving acoustic sources can benefit from object localization within a complex scene. Many single-sensor techniques already exist for this purpose, based for example on microphone arrays, video cameras, or range sensors. Since each of these sensors has its specific strengths and weaknesses, it is often advantageous to combine information from several sensor modalities to arrive at more robust position estimates.

This chapter presents a joint audio-video signal processing methodology for object localization and tracking. The approach is based on a decentralized Kalman filter structure, modified so that different sensor measurement models can be incorporated. Such a situation is typical for combined audio-video sensing, since the camera system and the microphone array usually use different coordinate systems.

First, the decentralized estimation algorithm is presented. Then a speaker localization example is discussed. Finally, some estimation results are shown.
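
As a rough illustration of the idea of fusing sensors with different measurement models, the following minimal sketch uses the information (inverse-covariance) form of a scalar Kalman filter. It is not the chapter's algorithm. The state (a speaker azimuth in radians), the random-walk state model, the names `Sensor` and `fuse_step`, and all numeric values are assumptions made for illustration only. Each sensor node converts its own measurement into an information contribution using its own model `z = h * x + noise`, and the fusion step simply adds the contributions.

```python
from dataclasses import dataclass


@dataclass
class Sensor:
    """Hypothetical local node with a linear measurement model z = h * x + noise."""
    h: float  # measurement gain (e.g. 1 rad/rad for audio, pixels/rad for video)
    r: float  # measurement noise variance, in the sensor's own units squared

    def information(self, z):
        """Return this node's (information state, information matrix) contribution."""
        return self.h * z / self.r, self.h * self.h / self.r


def fuse_step(x_prev, p_prev, q, sensors, measurements):
    """One predict/update cycle; a measurement of None means the sensor had no detection."""
    # Time update for an assumed random-walk state model: x_k = x_{k-1} + w, var(w) = q.
    x_pred = x_prev
    p_pred = p_prev + q
    # Express the prediction in information form.
    info_matrix = 1.0 / p_pred
    info_state = x_pred / p_pred
    # Fusion: add the contribution of every sensor that delivered a measurement.
    for sensor, z in zip(sensors, measurements):
        if z is None:
            continue
        i_state, i_matrix = sensor.information(z)
        info_state += i_state
        info_matrix += i_matrix
    # Convert back to a state estimate and its variance.
    p_new = 1.0 / info_matrix
    return info_state * p_new, p_new


if __name__ == "__main__":
    audio = Sensor(h=1.0, r=0.04)    # assumed: azimuth measured directly, in radians
    video = Sensor(h=500.0, r=25.0)  # assumed: pixel offset = 500 px/rad * azimuth
    x, p = fuse_step(0.0, 1.0, 0.0, [audio, video], [0.2, 100.0])
    print(f"fused azimuth = {x:.4f} rad, variance = {p:.6f}")
```

Because the update is additive in information form, a sensor that produces no measurement in a given frame is skipped without changing the rest of the computation, and each node only needs to know its own measurement model.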

Author information

Authors and Affiliations

Siemens Medical Solutions, Erlangen, Germany
Norbert Strobel
University Erlangen-Nuremberg, Erlangen, Germany
Sascha Spors & Rudolf Rabenstein

Editor information

Editors and Affiliations

Div. of Eng. and Applied Sciences, Harvard University, 33 Oxford Street, 02138, Cambridge, MA, USA
Michael Brandstein
Dept. of Electrical Engineering, Imperial College, Exhibition Road, SW7 2AZ, London, GB
Darren Ward

About this chapter

Cite this chapter

Strobel, N., Spors, S., Rabenstein, R. (2001). Joint Audio-Video Signal Processing for Object Localization and Tracking. In: Brandstein, M., Ward, D. (eds) Microphone Arrays. Digital Signal Processing. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-662-04619-7_10

DOI: https://doi.org/10.1007/978-3-662-04619-7_10
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-642-07547-6
Online ISBN: 978-3-662-04619-7
eBook Packages: Springer Book Archive
