Abstract
We describe an architecture that enables a robot to recognize speech while it is moving by cancelling its own ego noise. The system consists of three blocks: (1) a multi-channel noise reduction block, comprising consecutive stages of microphone-array-based sound localization, geometric source separation and post-filtering; (2) a single-channel noise reduction block based on template subtraction; and (3) an automatic speech recognition block. In this work, we specifically investigate a missing feature theory-based automatic speech recognition (MFT-ASR) approach in block (3). This approach uses spectro-temporal elements derived from blocks (1) and (2) to measure the reliability of the acoustic features, and generates masks that filter out unreliable features. We evaluate the system on a robot using word correct rates, and present a detailed analysis of recognition accuracy to determine optimal parameters. The proposed MFT-ASR approach yields significantly higher recognition performance than single-channel or multi-channel noise reduction methods.
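The mask-generation idea in the abstract can be illustrated with a minimal sketch. The abstract does not give the reliability measure, so everything below is an assumption made for illustration: reliability is taken as the share of each time-frequency bin's energy attributed to speech, and a hard mask keeps bins whose reliability reaches a threshold. The function names, the energy-ratio formula and the default threshold of 0.5 are hypothetical and are not taken from the paper.

```python
from typing import List

Matrix = List[List[float]]


def reliability(speech_energy: Matrix, noise_energy: Matrix,
                eps: float = 1e-12) -> Matrix:
    """Per-bin reliability in [0, 1] (assumed measure, not the paper's).

    speech_energy: energy of the enhanced speech estimate per (frame, band).
    noise_energy:  energy of the estimated ego noise per (frame, band).
    eps avoids division by zero when both energies are zero.
    """
    return [
        [s / (s + n + eps) for s, n in zip(s_row, n_row)]
        for s_row, n_row in zip(speech_energy, noise_energy)
    ]


def hard_mask(rel: Matrix, threshold: float = 0.5) -> List[List[int]]:
    """Binary missing feature mask: 1 = reliable, 0 = unreliable.

    The threshold is a free parameter chosen here for illustration.
    """
    return [[1 if r >= threshold else 0 for r in row] for row in rel]


if __name__ == "__main__":
    # Two frames, two frequency bands (made-up numbers).
    speech = [[4.0, 1.0], [0.5, 9.0]]
    noise = [[1.0, 3.0], [0.25, 1.0]]
    mask = hard_mask(reliability(speech, noise))
    print(mask)
```

A recognizer built on missing feature theory would then down-weight or ignore the features marked 0 when scoring acoustic models. A soft mask could be obtained by passing the reliability values through a smooth function instead of a threshold, which is also an illustrative choice rather than something stated here.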
Abbreviations
- ANN: Artificial Neural Network
- ASR: Automatic Speech Recognition
- BSS: Blind Source Separation
- DoA: Direction of Arrival
- GSS: Geometric Source Separation
- HMM: Hidden Markov Model
- MCRA: Minima Controlled Recursive Averaging
- MFCC: Mel-Frequency Cepstral Coefficients
- MFM: Missing Feature Mask
- MFT: Missing Feature Theory
- MMSE: Minimum Mean Square Estimation
- MSLS: Mel-Scale Log Spectrum
- MUSIC: MUltiple SIgnal Classification
- NN: Nearest Neighbour
- PF: Post-Filtering
- SE: Speech Enhancement
- SS: Spectral Subtraction
- SSL: Sound Source Localization
- SSS: Sound Source Separation
- TS: Template Subtraction
- WCR: Word Correct Rate
- WF: Wiener Filtering
Cite this article
Ince, G., Nakadai, K., Rodemann, T. et al. Ego noise cancellation of a robot using missing feature masks. Appl Intell 34, 360–371 (2011). https://doi.org/10.1007/s10489-011-0285-0
DOI: https://doi.org/10.1007/s10489-011-0285-0