Phoneme class based feature adaptation for mismatch acoustic modeling and recognition of distant noisy speech

Abstract

Distant speech capture in lecture halls and auditoriums poses unique challenges for algorithm development in automatic speech recognition. In this study, a new adaptation strategy for distant noisy speech is created by means of phoneme classes. Unlike previous approaches, which adapt the acoustic model to the features, the proposed phoneme-class based feature adaptation (PCBFA) strategy adapts the distant-data features to an existing acoustic model previously trained on close-microphone speech. The essence of PCBFA is a transformation strategy that makes the phoneme-class distributions of distant noisy speech similar to those of a close-talk microphone acoustic model in a multidimensional MFCC space. To achieve this, the phoneme classes of distant noisy speech are recognized via artificial neural networks. PCBFA is thus an adaptation of features rather than of acoustic models. The main idea behind PCBFA is illustrated with a conventional Gaussian mixture model–hidden Markov model (GMM–HMM) system, although it can be extended to newer architectures in automatic speech recognition (ASR). The adapted features, together with the new and improved acoustic models produced by PCBFA, are shown to outperform those created by acoustic-model adaptation alone for ASR and keyword spotting. PCBFA offers a powerful new perspective on acoustic modeling of distant speech.
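
As a concrete illustration of the two stages the abstract describes, the minimal Python sketch below first assigns each distant MFCC frame to a phoneme class with a frame-level artificial neural network, then shifts and scales the frame toward the close-talk statistics of that class. All function names here are hypothetical, and the per-class mean/variance alignment is only an assumed stand-in for the paper's actual multidimensional transformation strategy.

    # Illustrative sketch only: (1) recognize the phoneme class of each
    # distant-speech frame with an artificial neural network, then
    # (2) transform the frame's MFCCs so that per-class distributions
    # resemble those of the close-talk acoustic model.
    import numpy as np
    from sklearn.neural_network import MLPClassifier

    def train_class_ann(mfcc_frames, class_labels):
        """Train a frame-level phoneme-class classifier (hypothetical setup)."""
        ann = MLPClassifier(hidden_layer_sizes=(256, 256), max_iter=200)
        ann.fit(mfcc_frames, class_labels)
        return ann

    def per_class_stats(mfcc_frames, class_labels, classes):
        """Per-class mean and diagonal standard deviation in MFCC space."""
        return {c: (mfcc_frames[class_labels == c].mean(axis=0),
                    mfcc_frames[class_labels == c].std(axis=0) + 1e-8)
                for c in classes}

    def pcbfa_adapt(distant_mfcc, ann, distant_stats, close_talk_stats):
        """Shift and scale each distant frame toward the close-talk
        statistics of its predicted phoneme class."""
        predicted = ann.predict(distant_mfcc)
        adapted = np.empty_like(distant_mfcc)
        for i, (frame, c) in enumerate(zip(distant_mfcc, predicted)):
            mu_d, sd_d = distant_stats[c]      # distant-speech class stats
            mu_t, sd_t = close_talk_stats[c]   # close-talk model class stats
            adapted[i] = (frame - mu_d) / sd_d * sd_t + mu_t
        return adapted

The adapted frames would then be decoded with the unchanged close-microphone acoustic model; replacing the diagonal alignment with a full-covariance or learned per-class linear mapping would be a natural refinement of this sketch.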

Acknowledgements

This project was funded by AFRL under contract FA8750-12-1-0188, and partially by the University of Texas at Dallas through the Distinguished University Chair in Telecommunications Engineering held by J. H. L. Hansen.

Author information

Correspondence to Seçkin Uluskan.

Cite this article

Uluskan, S., Sangwan, A. & Hansen, J.H.L. Phoneme class based feature adaptation for mismatch acoustic modeling and recognition of distant noisy speech. Int J Speech Technol 20, 799–811 (2017). https://doi.org/10.1007/s10772-017-9449-6
