Combining formant frequency based on variable order LPC coding with acoustic features for TIMIT phone recognition

International Journal of Speech Technology

Abstract

Combining multiple acoustic features has great potential to improve Automatic Speech Recognition (ASR) accuracy. Our contribution in this research is a novel method that combines voiced formant features with standard MFCC features in order to enhance TIMIT phone recognition. These voiced features provide accurate formant frequencies, estimated with a Variable Order LPC Coding (VO-LPC) algorithm combined with continuity constraints. The estimated formants were concatenated with the MFCC features whenever a voiced frame was detected. For feature-level combination, a Heteroscedastic Linear Discriminant Analysis (HLDA) based approach was used to find an optimal linear combination of successive vectors of a single feature stream.
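As a rough illustration of the idea of estimating formants from LPC and appending them to MFCC vectors on voiced frames, the following minimal sketch may help. It is not the VO-LPC algorithm of the paper: it assumes a fixed LPC order, the plain autocorrelation method, and no continuity constraints. The function names, the bandwidth and frequency thresholds, and the zero-filling of the formant slots on unvoiced frames are all illustrative assumptions, not details taken from the article.

```python
import numpy as np

def lpc_coefficients(frame, order):
    """Fixed-order LPC (autocorrelation method); returns [1, a1, ..., a_order].

    Assumption: the caller has already pre-emphasised/windowed the frame as desired.
    """
    n = len(frame)
    # Autocorrelation lags r[0..order]
    r = np.correlate(frame, frame, mode="full")[n - 1:n + order]
    # Toeplitz normal equations R a = -r[1..order]
    idx = np.abs(np.arange(order)[:, None] - np.arange(order)[None, :])
    a = np.linalg.solve(r[idx], -r[1:order + 1])
    return np.concatenate([[1.0], a])

def estimate_formants(frame, fs, order=10, min_freq=90.0, max_bw=400.0):
    """Formant candidates (Hz) from the roots of the LPC polynomial.

    min_freq and max_bw are hypothetical thresholds chosen for illustration.
    """
    roots = np.roots(lpc_coefficients(frame, order))
    roots = roots[np.imag(roots) > 0]               # one root per conjugate pair
    freqs = np.angle(roots) * fs / (2.0 * np.pi)    # pole angle -> frequency
    bws = -(fs / np.pi) * np.log(np.abs(roots))     # pole radius -> bandwidth
    keep = (freqs > min_freq) & (bws < max_bw)
    return np.sort(freqs[keep])

def combine_features(mfcc, formants, voiced, n_formants=3):
    """Append formant slots to an MFCC vector.

    Assumption: unvoiced frames (and missing formants) are zero-filled so that
    every frame has the same dimension; the article does not specify this.
    """
    slots = np.zeros(n_formants)
    if voiced:
        k = min(n_formants, len(formants))
        slots[:k] = formants[:k]
    return np.concatenate([mfcc, slots])
```

A variable-order scheme would, in principle, repeat the estimation step with several values of `order` and select among the resulting candidates; how the paper performs that selection is not described in the abstract.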

A series of speaker-independent, continuous-speech phone recognition experiments was carried out on a subset of the large read-speech TIMIT phone corpus, using the Hidden Markov Model Toolkit (HTK) throughout. Using this feature-level combination, optimized mixture splitting and a bigram language model, the performance of the combination is analyzed in detail for Context-Independent (CI) and Context-Dependent (CD) Hidden Markov Models (HMMs). Experimental results show that the proposed procedure decreased the phone error rate by about 3%. At the phonetic group level, an increase of 8% and 10% was observed for the vowel and liquid groups, respectively. These results indicate a clear improvement in phone recognition relative to the existing state of the art.
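For readers unfamiliar with the metric, phone error rate is commonly computed as the number of substitutions, deletions and insertions in a minimum-edit alignment, divided by the number of reference phones. The sketch below assumes that standard definition; the abstract does not state the exact scoring setup used, and the function name and the example phone labels are illustrative only.

```python
def phone_error_rate(reference, hypothesis):
    """(substitutions + deletions + insertions) / number of reference phones."""
    n, m = len(reference), len(hypothesis)
    if n == 0:
        raise ValueError("reference must contain at least one phone")
    # prev[j] = edit distance between the reference prefix so far and hypothesis[:j]
    prev = list(range(m + 1))
    for i in range(1, n + 1):
        curr = [i] + [0] * m
        for j in range(1, m + 1):
            cost = 0 if reference[i - 1] == hypothesis[j - 1] else 1
            curr[j] = min(prev[j] + 1,         # deletion
                          curr[j - 1] + 1,     # insertion
                          prev[j - 1] + cost)  # match or substitution
        prev = curr
    return prev[m] / n

# Hypothetical example: one substitution (ax -> ah) and one deletion (ae)
ref = ["sil", "dh", "ax", "k", "ae", "t", "sil"]
hyp = ["sil", "dh", "ah", "k", "t", "sil"]
print(phone_error_rate(ref, hyp))  # 2 errors over 7 reference phones
```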

Author information

Correspondence to Zaineb Ben Messaoud.

About this article

Cite this article

Ben Messaoud, Z., Ben Hamida, A. Combining formant frequency based on variable order LPC coding with acoustic features for TIMIT phone recognition. Int J Speech Technol 14, 393–403 (2011). https://doi.org/10.1007/s10772-011-9119-z

  • DOI: https://doi.org/10.1007/s10772-011-9119-z