
Robot Audition: Missing Feature Theory Approach and Active Audition

  • Conference paper
Robotics Research

Part of the book series: Springer Tracts in Advanced Robotics (STAR, volume 70)

Abstract

The capability of a robot to listen to several things at once with its own ears, that is, robot audition, is important for improving interaction and symbiosis between humans and robots. The critical issues in robot audition are real-time processing and robustness against noisy environments, together with high flexibility to support various kinds of robots and hardware configurations. This paper presents two important aspects of robot audition: the Missing-Feature-Theory (MFT) approach and active audition. HARK, open-source robot audition software, incorporates the MFT approach to recognize speech signals that are localized and separated from a mixture of sounds captured by an 8-channel microphone array. HARK has been ported to four robots, Honda ASIMO, SIG2, Robovie-R2 and HRP-2, with different microphone configurations, and recognizes three simultaneous utterances with a latency of 1.9 seconds. In binaural hearing, the best-known problem is front-back confusion of sound sources. Active binaural robot audition implemented on SIG2 resolves this ambiguity by rotating its head with pitching, and this active audition also improves localization of sources in the periphery.
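The MFT approach can be illustrated with a minimal sketch. This is not HARK's actual implementation; the SNR threshold, the noise-subtraction reliability test, and the diagonal-Gaussian scorer are all illustrative assumptions. The idea is that spectral features dominated by noise are marked unreliable, and the recognizer scores only the reliable ones.

```python
import numpy as np

# Minimal Missing-Feature-Theory sketch (illustrative, not HARK's code):
# features whose estimated local SNR falls below a threshold are marked
# unreliable; recognition then marginalizes the unreliable dimensions away.

def missing_feature_mask(observed, noise_estimate, snr_threshold_db=0.0):
    """Boolean mask: True where a spectral feature is considered reliable."""
    eps = 1e-12
    signal = np.maximum(observed - noise_estimate, eps)
    snr_db = 10.0 * np.log10(signal / (noise_estimate + eps))
    return snr_db > snr_threshold_db

def marginal_log_likelihood(features, mask, means, variances):
    """Diagonal-Gaussian log-likelihood over the reliable features only."""
    diff = features[mask] - means[mask]
    var = variances[mask]
    return -0.5 * np.sum(np.log(2.0 * np.pi * var) + diff ** 2 / var)

# Toy spectrum: bins 2 and 4 are dominated by noise.
clean = np.array([5.0, 4.0, 3.0, 2.0, 1.0])
noise = np.array([0.1, 0.1, 8.0, 0.1, 6.0])
observed = clean + noise
mask = missing_feature_mask(observed, noise)
print(mask.tolist())  # → [True, True, False, True, False]
```

Scoring acoustic models on only the unmasked dimensions is what lets recognition degrade gracefully instead of failing outright when separation leaves residual noise in some frequency bins.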
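The front-back disambiguation by head rotation can likewise be sketched under a simple far-field interaural-time-difference (ITD) model. The microphone spacing `D`, the sinusoidal ITD formula, and the hypothesis test below are illustrative assumptions, not SIG2's actual geometry or algorithm: a single ITD is consistent with one front direction and one mirrored rear direction, but after a known head rotation only one of the two hypotheses predicts the newly measured ITD.

```python
import numpy as np

# Sketch of active-audition front-back disambiguation (illustrative
# parameters and a simple far-field sine ITD model, not SIG2's geometry).
D = 0.18   # assumed inter-microphone distance [m]
C = 343.0  # speed of sound [m/s]

def itd(azimuth_deg):
    """Interaural time difference; azimuth 0 = straight ahead, 90 = left."""
    return (D / C) * np.sin(np.radians(azimuth_deg))

def disambiguate(itd_before, itd_after, rotation_deg):
    """Pick the front or mirrored rear hypothesis that best predicts the
    ITD measured after a known head rotation (positive = turn left)."""
    theta = np.degrees(np.arcsin(np.clip(itd_before * C / D, -1.0, 1.0)))
    front_hyp, back_hyp = theta, 180.0 - theta  # both yield itd_before
    pred_front = itd(front_hyp - rotation_deg)
    pred_back = itd(back_hyp - rotation_deg)
    if abs(itd_after - pred_front) <= abs(itd_after - pred_back):
        return front_hyp
    return back_hyp

# A source behind-left at 150 deg gives the same ITD as one in front at
# 30 deg; a 20-deg head turn makes the hypotheses predict different ITDs.
before = itd(150.0)
after = itd(150.0 - 20.0)
print(round(disambiguate(before, after, 20.0)))  # → 150
```

The same comparison applied to a source genuinely in front would favor the front hypothesis, which is why the small ego-motion resolves the ambiguity that a static binaural measurement cannot.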




Copyright information

© 2011 Springer-Verlag Berlin Heidelberg

About this paper

Cite this paper

Okuno, H.G., Nakadai, K., Kim, H.D. (2011). Robot Audition: Missing Feature Theory Approach and Active Audition. In: Pradalier, C., Siegwart, R., Hirzinger, G. (eds) Robotics Research. Springer Tracts in Advanced Robotics, vol 70. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-642-19457-3_14


  • DOI: https://doi.org/10.1007/978-3-642-19457-3_14

  • Publisher Name: Springer, Berlin, Heidelberg

  • Print ISBN: 978-3-642-19456-6

  • Online ISBN: 978-3-642-19457-3

  • eBook Packages: Engineering (R0)
