
Joint Audio-Video Signal Processing for Object Localization and Tracking

  • Chapter
Microphone Arrays

Part of the book series: Digital Signal Processing ((DIGSIGNAL))

Abstract

Applications such as videoconferencing, automatic scene analysis, or security surveillance involving acoustic sources can benefit from object localization within a complex scene. Many single-sensor techniques already exist for this purpose, based for example on microphone arrays, video cameras, or range sensors. Since each of these sensors has its specific strengths and weaknesses, it is often advantageous to combine information from several sensor modalities to arrive at more robust position estimates.

This chapter presents a joint audio-video signal processing methodology for object localization and tracking. The approach is based on a decentralized Kalman filter structure, modified such that different sensor measurement models can be incorporated. This situation is typical for combined audio-video sensing, since different coordinate systems are usually used for the camera system and the microphone array.

First, the decentralized estimation algorithm is presented. Then a speaker localization example is discussed. Finally, some estimation results are shown.
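
The idea of a decentralized Kalman filter whose sensors use different measurement models can be illustrated with a small sketch. The code below is not the chapter's algorithm; it is a generic information-form fusion step under simplifying assumptions: a one-dimensional position, a random-walk motion model, and two linear sensors whose readings live in different coordinate systems (metres for the audio node, pixel columns for the video node). All names and numbers (`Sensor`, `fuse`, `track`, the scale of 400 pixels per metre, the noise variances) are hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass
from typing import List, Optional, Sequence, Tuple


@dataclass
class Sensor:
    """Hypothetical linear model z = h * x + b + noise, noise variance r."""
    h: float  # scale from world coordinate to sensor coordinate
    b: float  # offset of the sensor's coordinate system
    r: float  # measurement noise variance (sensor units squared)

    def information(self, z: float) -> Tuple[float, float]:
        """Local node: information-state and information-matrix contribution."""
        return self.h * (z - self.b) / self.r, self.h * self.h / self.r


def predict(x: float, p: float, q: float) -> Tuple[float, float]:
    """Random-walk motion model: estimate unchanged, variance grows by q."""
    return x, p + q


def fuse(x_pred: float, p_pred: float,
         contributions: Sequence[Tuple[float, float]]) -> Tuple[float, float]:
    """Fusion node: add the local contributions to the predicted information."""
    info_matrix = 1.0 / p_pred
    info_state = x_pred / p_pred
    for i_k, big_i_k in contributions:
        info_state += i_k
        info_matrix += big_i_k
    p = 1.0 / info_matrix
    return info_state * p, p


def track(frames: Sequence[Sequence[Optional[float]]], sensors: Sequence[Sensor],
          x0: float = 0.0, p0: float = 1e6, q: float = 0.01) -> List[Tuple[float, float]]:
    """Run predict/fuse per frame; a reading of None means the sensor dropped out."""
    x, p = x0, p0
    history = []
    for frame in frames:
        x, p = predict(x, p, q)
        contributions = [s.information(z) for s, z in zip(sensors, frame)
                         if z is not None]
        x, p = fuse(x, p, contributions)
        history.append((x, p))
    return history


# Assumed example values: a source at 1.5 m seen by both sensors without noise.
audio = Sensor(h=1.0, b=0.0, r=0.04)      # position in metres
video = Sensor(h=400.0, b=320.0, r=25.0)  # pixel column
frames = [(1.5, 920.0), (1.5, None), (None, 920.0)]
estimates = track(frames, [audio, video])
```

Because each node only hands over its information contribution, the fusion step does not need to know which coordinate system a sensor uses, and a missing reading simply contributes nothing for that frame. A nonlinear measurement model, such as a bearing angle from a microphone array, would additionally require linearization around the predicted state.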

Copyright information

© 2001 Springer-Verlag Berlin Heidelberg

About this chapter

Cite this chapter

Strobel, N., Spors, S., Rabenstein, R. (2001). Joint Audio-Video Signal Processing for Object Localization and Tracking. In: Brandstein, M., Ward, D. (eds) Microphone Arrays. Digital Signal Processing. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-662-04619-7_10

  • DOI: https://doi.org/10.1007/978-3-662-04619-7_10

  • Publisher Name: Springer, Berlin, Heidelberg

  • Print ISBN: 978-3-642-07547-6

  • Online ISBN: 978-3-662-04619-7

  • eBook Packages: Springer Book Archive
