Abstract
Smartphones increasingly include two or more microphones. These were originally intended for noise reduction, and little benefit is yet being drawn from them for noise-robust automatic speech recognition (ASR). In this paper we propose a novel system to estimate missing-data masks for robust ASR on dual-microphone smartphones. The system is based on deep neural networks (DNNs), which have proven to be a powerful tool in ASR in several ways. To assess the performance of the proposed technique, spectral reconstruction experiments are carried out on a dual-channel database derived from Aurora-2. Our results demonstrate that the DNN is better able to exploit the dual-channel information and yields an improvement in word accuracy of more than 6% over state-of-the-art single-channel mask estimation techniques.
This work has been supported by the MICINN TEC2013-46690-P project.
Copyright information
© 2014 Springer International Publishing Switzerland
About this paper
Cite this paper
López-Espejo, I., González, J.A., Gómez, Á.M., Peinado, A.M. (2014). A Deep Neural Network Approach for Missing-Data Mask Estimation on Dual-Microphone Smartphones: Application to Noise-Robust Speech Recognition. In: Navarro Mesa, J.L., et al. Advances in Speech and Language Technologies for Iberian Languages. Lecture Notes in Computer Science, vol 8854. Springer, Cham. https://doi.org/10.1007/978-3-319-13623-3_13
DOI: https://doi.org/10.1007/978-3-319-13623-3_13
Publisher Name: Springer, Cham
Print ISBN: 978-3-319-13622-6
Online ISBN: 978-3-319-13623-3
eBook Packages: Computer Science, Computer Science (R0)