Abstract
Distant speech capture in lecture halls and auditoriums poses unique challenges for algorithm development in automatic speech recognition (ASR). This study introduces a new adaptation strategy for distant noisy speech based on phoneme classes. Unlike previous approaches, which adapt the acoustic model to the features, the proposed phoneme-class based feature adaptation (PCBFA) strategy adapts the features of the distant data to an existing acoustic model previously trained on close-talk microphone speech. The essence of PCBFA is a transformation that makes the distributions of the phoneme classes of distant noisy speech similar to those of the close-talk microphone acoustic model in a multidimensional MFCC space. To this end, the phoneme classes of distant noisy speech are recognized with artificial neural networks. The main idea behind PCBFA is illustrated with a conventional Gaussian mixture model–hidden Markov model (GMM–HMM) system, although it can be extended to newer ASR structures. The adapted features, together with the improved acoustic models produced by PCBFA, are shown to outperform those obtained by acoustic model adaptation alone for ASR and keyword spotting. PCBFA offers a new and powerful perspective on acoustic modeling of distant speech.
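The abstract does not state the exact form of the transformation, so the following is only a minimal sketch of the general idea of class-wise feature adaptation. It assumes a per-class diagonal mean and variance matching, which is an illustrative choice and not necessarily the authors' method. The function names `fit_class_stats` and `adapt_features`, the synthetic data, and the use of given class labels (in the paper, the phoneme classes are recognized by artificial neural networks) are all hypothetical.

```python
import numpy as np


def fit_class_stats(features, labels, n_classes):
    """Return per-class mean and standard deviation of the feature frames.

    features: (n_frames, dim) array, e.g. MFCC vectors.
    labels:   (n_frames,) integer phoneme-class index per frame.
    """
    dim = features.shape[1]
    means = np.zeros((n_classes, dim))
    stds = np.ones((n_classes, dim))
    for c in range(n_classes):
        frames = features[labels == c]
        if len(frames) > 1:  # leave defaults for empty or single-frame classes
            means[c] = frames.mean(axis=0)
            stds[c] = frames.std(axis=0) + 1e-8  # guard against zero variance
    return means, stds


def adapt_features(distant, labels, src_stats, tgt_stats):
    """Map distant frames so each class matches the target class statistics.

    Assumption: a diagonal affine transform per phoneme class.
    """
    src_mean, src_std = src_stats
    tgt_mean, tgt_std = tgt_stats
    # Standardize each frame with the statistics of its own class ...
    z = (distant - src_mean[labels]) / src_std[labels]
    # ... then rescale to the close-talk statistics of that class.
    return z * tgt_std[labels] + tgt_mean[labels]


# Synthetic stand-in data (hypothetical): 3 phoneme classes, 13-dim features.
rng = np.random.default_rng(0)
n_classes, dim, n_frames = 3, 13, 600
labels = rng.integers(0, n_classes, size=n_frames)
class_centers = rng.normal(size=(n_classes, dim))
close_talk = class_centers[labels] + 0.5 * rng.normal(size=(n_frames, dim))
# "Distant" features: a scaled, shifted, noisier version of close-talk frames.
distant = 0.6 * close_talk + 1.5 + 0.8 * rng.normal(size=(n_frames, dim))

tgt_stats = fit_class_stats(close_talk, labels, n_classes)
src_stats = fit_class_stats(distant, labels, n_classes)
adapted = adapt_features(distant, labels, src_stats, tgt_stats)
```

In this sketch the adapted frames of each class share the mean and variance of the corresponding close-talk class, which is the sense in which the two sets of class distributions are made "similar". A real system would estimate the target statistics from the close-talk acoustic model and would have to cope with errors in the recognized class labels.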
Acknowledgements
This project was funded by AFRL under contract FA8750-12-1-0188, and partially by the University of Texas at Dallas from the Distinguished University Chair in Telecommunications Engineering held by J.H.L. Hansen.
Cite this article
Uluskan, S., Sangwan, A. & Hansen, J.H.L. Phoneme class based feature adaptation for mismatch acoustic modeling and recognition of distant noisy speech. Int J Speech Technol 20, 799–811 (2017). https://doi.org/10.1007/s10772-017-9449-6
DOI: https://doi.org/10.1007/s10772-017-9449-6