Abstract
The capability of a robot to listen to several things at once with its own ears, that is, robot audition, is important for improving interaction and symbiosis between humans and robots. The critical issues in robot audition are real-time processing and robustness against noisy environments, with enough flexibility to support various kinds of robots and hardware configurations. This paper presents two important aspects of robot audition: the Missing-Feature-Theory (MFT) approach and active audition. HARK, an open-source robot audition system, incorporates the MFT approach to recognize speech signals that are localized and separated from a mixture of sounds captured by an 8-channel microphone array. HARK has been ported to four robots, Honda ASIMO, SIG2, Robovie-R2 and HRP-2, with different microphone configurations, and recognizes three simultaneous utterances with a latency of 1.9 sec. In binaural hearing, the best-known problem is the front-back confusion of sound sources. Active binaural robot audition implemented on SIG2 resolves this ambiguity by rotating its head while pitching. This active audition also improves localization at the periphery.
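To make the MFT approach concrete, the following is a minimal sketch (not the HARK implementation) of the two core ideas: marking time-frequency bins as reliable or unreliable based on an estimated local SNR, and evaluating an acoustic model only over the reliable dimensions (the marginalization variant of missing-feature recognition). The threshold value, the diagonal-Gaussian model, and all function names here are illustrative assumptions.

```python
import numpy as np

def missing_feature_mask(spec, noise_est, snr_thresh_db=0.0):
    """Mark time-frequency bins as reliable (1) where the estimated
    local SNR exceeds a threshold; unreliable bins (0) are ignored
    by the recognizer instead of being trusted as clean speech."""
    snr_db = 10.0 * np.log10(np.maximum(spec, 1e-12) /
                             np.maximum(noise_est, 1e-12))
    return (snr_db > snr_thresh_db).astype(float)

def marginalized_log_likelihood(frame, mask, mean, var):
    """Per-frame log-likelihood under a diagonal Gaussian, computed
    only over reliable dimensions (unreliable ones are marginalized
    out by simply omitting them)."""
    reliable = mask.astype(bool)
    d = frame[reliable] - mean[reliable]
    v = var[reliable]
    return float(-0.5 * np.sum(np.log(2 * np.pi * v) + d * d / v))

# Toy usage: one frame, two frequency bins; the second bin is noisy.
spec = np.array([10.0, 0.1])      # observed power
noise = np.array([1.0, 1.0])      # estimated noise power
mask = missing_feature_mask(spec, noise)          # -> [1., 0.]
ll = marginalized_log_likelihood(spec, mask,
                                 mean=np.array([10.0, 0.0]),
                                 var=np.array([1.0, 1.0]))
```

In a real system such as HARK, the mask would be derived from the sound source separation stage (separated vs. residual energy per bin) and fed to an MFT-capable decoder; the principle of discounting unreliable spectral evidence is the same.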
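The front-back confusion arises because, under a simple far-field model, the interaural time difference (ITD) depends on sin(azimuth), so a source at azimuth th and one at 180 - th produce identical ITDs. The sketch below illustrates how a head rotation disambiguates the two candidates: after the head turns, only the correct candidate predicts the newly observed ITD. The sine ITD model, the inter-microphone distance, and the function names are simplifying assumptions for illustration, not SIG2's actual processing.

```python
import numpy as np

SOUND_SPEED = 343.0   # m/s
EAR_DISTANCE = 0.18   # m, assumed distance between the two microphones

def itd(azimuth_deg):
    """ITD under a far-field sine model: identical for azimuth th and
    180 - th, which is exactly the front-back confusion."""
    return EAR_DISTANCE / SOUND_SPEED * np.sin(np.radians(azimuth_deg))

def disambiguate(itd_before, itd_after, rotation_deg):
    """Given ITDs observed before and after a head rotation, pick the
    front or back azimuth candidate whose predicted post-rotation ITD
    best matches the second observation."""
    s = np.clip(itd_before * SOUND_SPEED / EAR_DISTANCE, -1.0, 1.0)
    front = float(np.degrees(np.arcsin(s)))   # front-hemisphere candidate
    back = 180.0 - front                      # back-hemisphere candidate
    # Turning the head by rotation_deg shifts the source's azimuth
    # relative to the head by -rotation_deg.
    errs = [abs(itd(c - rotation_deg) - itd_after) for c in (front, back)]
    return (front, back)[int(np.argmin(errs))]

# Toy usage: a source actually behind at 150 deg looks identical to one
# at 30 deg until the head rotates by 20 deg.
before = itd(150.0)
after = itd(150.0 - 20.0)
resolved = disambiguate(before, after, 20.0)   # -> 150.0
```

The same reasoning extends to continuous tracking: integrating observations over a head motion trajectory keeps the correct hypothesis consistent while the mirrored one drifts, which is why active motion also sharpens localization at the periphery, where the ITD curve is flattest.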
© 2011 Springer-Verlag Berlin Heidelberg
Cite this paper
Okuno, H.G., Nakadai, K., Kim, HD. (2011). Robot Audition: Missing Feature Theory Approach and Active Audition. In: Pradalier, C., Siegwart, R., Hirzinger, G. (eds) Robotics Research. Springer Tracts in Advanced Robotics, vol 70. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-642-19457-3_14
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-642-19456-6
Online ISBN: 978-3-642-19457-3