Abstract
Spectral imputation and classifier modification are the two main categories of missing-data techniques for robust speech recognition. In this paper, we introduce a novel technique that combines methods from these two categories while improving the accuracy of each combined method. Methods from the two categories are rarely employed together because their structures are incompatible. Building on our previous work, we propose a technique that resolves this incompatibility. It is based on the idea of partial restoration of the log-spectrum: for each missing component, we decide whether to restore it or to estimate a possible range for it. We also propose a method that employs dynamic features more effectively. The combined techniques are a classic spectral imputation method and our previously proposed classifier modification technique, spectral variance learning. Experiments show that the proposed technique significantly improves the accuracy of both combined techniques, with gains in recognition accuracy of nearly four percent on Aurora 2.0 data and more than two percent on a noisy version of TIMIT data.
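The restore-or-bound decision described above can be illustrated with a minimal sketch. It is not the method of the paper: the decision rule (restore when a prior standard deviation is small), the names `partial_imputation`, `prior_mean`, `prior_std`, `max_std` and `floor`, and the use of a clean-speech prior are all assumptions made for illustration. The only property taken from the general missing-data literature is that, under additive noise, the observed log-energy serves as an upper bound on the clean value.

```python
import numpy as np

def partial_imputation(noisy_logspec, reliable_mask, prior_mean, prior_std,
                       max_std=1.0, floor=-10.0):
    """Illustrative sketch (hypothetical rule, not the paper's algorithm).

    Returns per-component bounds (lower, upper) on the clean log-spectrum.
    Reliable and restored components get lower == upper (a point value);
    the remaining unreliable components get the range [floor, observed].
    """
    noisy = np.asarray(noisy_logspec, dtype=float)
    mask = np.asarray(reliable_mask, dtype=bool)
    mean = np.asarray(prior_mean, dtype=float)
    std = np.asarray(prior_std, dtype=float)

    # Point estimate, capped by the observation (additive-noise upper bound).
    estimate = np.minimum(mean, noisy)

    # Assumed decision rule: restore only where the prior is confident.
    restore = ~mask & (std <= max_std)
    bounded = ~mask & (std > max_std)

    lower = noisy.copy()
    upper = noisy.copy()
    lower[restore] = estimate[restore]
    upper[restore] = estimate[restore]
    lower[bounded] = floor            # assumed log-energy floor
    upper[bounded] = noisy[bounded]   # observation bounds the clean value
    return lower, upper, restore, bounded
```

A classifier that accepts ranges could then treat components with `lower == upper` as ordinary observations and integrate over `[lower, upper]` for the rest.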
Notes
It is possible to employ spectral imputation (SI) in the spectral domain, but performance drops drastically.
Soft mask estimation techniques assign each part of the spectrum a number that indicates its reliability.
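As a minimal sketch of what such a reliability number can look like, one common choice in the missing-data literature maps an estimate of the local signal-to-noise ratio through a sigmoid. The function name `soft_mask` and the parameters `slope` and `midpoint_db` are assumptions for illustration; the note above does not specify any particular estimator.

```python
import numpy as np

def soft_mask(local_snr_db, slope=0.5, midpoint_db=0.0):
    """Map local SNR estimates (dB) to reliability scores in (0, 1).

    Illustrative sigmoid mapping; slope and midpoint_db are assumed values.
    """
    snr = np.asarray(local_snr_db, dtype=float)
    return 1.0 / (1.0 + np.exp(-slope * (snr - midpoint_db)))
```

Thresholding these scores at 0.5 would recover a binary mask, so the soft mask can be read as a graded version of the reliable/unreliable decision.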
About this article
Cite this article
Ebrahim Kafoori, K., Ahadi, S.M. Robust Recognition of Noisy Speech Through Partial Imputation of Missing Data. Circuits Syst Signal Process 37, 1625–1648 (2018). https://doi.org/10.1007/s00034-017-0616-4
DOI: https://doi.org/10.1007/s00034-017-0616-4