Combining formant frequency based on variable order LPC coding with acoustic features for TIMIT phone recognition

Ben Messaoud, Zaineb; Ben Hamida, Ahmed

doi:10.1007/s10772-011-9119-z

Published: 12 October 2011

Volume 14, pages 393–403, (2011)

International Journal of Speech Technology

Zaineb Ben Messaoud¹ &
Ahmed Ben Hamida¹

Abstract

Combining multiple acoustic features has great potential to improve Automatic Speech Recognition (ASR) accuracy. Our contribution in this research was to investigate a novel method that combines voiced formant features with standard MFCC features in order to enhance TIMIT phone recognition. These voiced features provide accurate formant frequencies, estimated with a Variable Order LPC Coding (VO-LPC) algorithm combined with continuity constraints. The estimated formants were concatenated with the MFCC features whenever a voiced frame was detected. For feature-level combination, a Heteroscedastic Linear Discriminant Analysis (HLDA) based approach was used to find an optimal linear combination of successive vectors of a single feature stream.
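The front end described above can be illustrated with a minimal sketch. The abstract does not give the details of VO-LPC or of the continuity constraints, so everything below is an assumption made for illustration: the autocorrelation-method LPC, the root-picking thresholds (`min_hz`, `max_bw_hz`), the simple order search in `estimate_formants`, and the zero-filling of unvoiced frames in `combine` are hypothetical stand-ins, not the authors' algorithm. Pre-emphasis, windowing, voicing detection, and the HLDA projection are omitted.

```python
import numpy as np

def lpc_autocorr(frame, order):
    """Autocorrelation-method LPC. Returns [1, a1, ..., ap]."""
    n = len(frame)
    r = np.correlate(frame, frame, mode="full")[n - 1 : n + order]
    # Toeplitz normal equations: sum_k a_k r[|i-k|] = -r[i], i = 1..p
    R = np.array([[r[abs(i - j)] for j in range(order)] for i in range(order)])
    a = np.linalg.solve(R, -r[1 : order + 1])
    return np.concatenate(([1.0], a))

def formants_from_lpc(a, fs, min_hz=90.0, max_bw_hz=400.0):
    """Formant candidates from the LPC pole angles (thresholds are assumed)."""
    roots = np.roots(a)
    roots = roots[np.imag(roots) > 0]            # one pole per conjugate pair
    freqs = np.angle(roots) * fs / (2 * np.pi)   # pole angle -> frequency in Hz
    bws = -np.log(np.abs(roots)) * fs / np.pi    # pole radius -> bandwidth in Hz
    keep = (freqs > min_hz) & (bws < max_bw_hz)
    return np.sort(freqs[keep])

def estimate_formants(frame, fs, orders=(4, 6, 8, 10, 12), n_formants=2):
    """Hypothetical order search: use the lowest order giving enough candidates."""
    for order in orders:
        cand = formants_from_lpc(lpc_autocorr(frame, order), fs)
        if len(cand) >= n_formants:
            return cand[:n_formants]
    return None

def combine(mfcc, formants, voiced):
    """Append formants to each MFCC row; zeros on unvoiced frames (assumption)."""
    extra = np.where(voiced[:, None], formants, 0.0)
    return np.hstack([mfcc, extra])

def all_pole_impulse_response(freqs_hz, fs, radius=0.97, n=1024):
    """Synthetic test signal: impulse response of an all-pole resonator bank."""
    poles = radius * np.exp(2j * np.pi * np.asarray(freqs_hz) / fs)
    a = np.real(np.poly(np.concatenate([poles, poles.conj()])))
    h = np.zeros(n)
    for t in range(n):
        acc = 1.0 if t == 0 else 0.0
        for k in range(1, len(a)):
            if t - k >= 0:
                acc -= a[k] * h[t - k]
        h[t] = acc
    return h

# Two synthetic resonances at 700 Hz and 1800 Hz, sampled at 8 kHz.
frame = all_pole_impulse_response([700.0, 1800.0], fs=8000)
est = estimate_formants(frame, fs=8000)
print(np.round(est, 1))
```

On real speech, each frame would normally be pre-emphasized and windowed before the LPC step, and the per-frame candidates would be smoothed across time, which is the role the continuity constraints play in the abstract.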

A series of speaker-independent, continuous-speech phone recognition experiments was carried out on a subset of the large read-speech TIMIT corpus, using the Hidden Markov Model Toolkit (HTK) throughout. Using this feature-level combination, optimized mixture splitting, and a bigram language model, the performance of the combination was analyzed in detail for Context-Independent (CI) and Context-Dependent (CD) Hidden Markov Models (HMMs). Experimental results showed that the proposed procedure decreased the phone error rate by about 3%. At the phonetic group level, increases of 8% and 10% were observed for the vowel and liquid groups, respectively. These results demonstrate a clear improvement in phone recognition over the existing state of the art.
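For readers unfamiliar with the metric, phone error rate is conventionally the number of substitutions, deletions, and insertions in the best alignment between the reference and recognized phone strings, divided by the number of reference phones. The sketch below computes it with a plain edit distance; the function names and the example phone strings are illustrative only and do not come from the article or from HTK.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two phone sequences."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion of a reference phone
                            curr[j - 1] + 1,      # insertion of a hypothesis phone
                            prev[j - 1] + cost))  # match or substitution
        prev = curr
    return prev[-1]

def phone_error_rate(ref, hyp):
    """(substitutions + deletions + insertions) / number of reference phones."""
    return edit_distance(ref, hyp) / len(ref)

# Made-up example: one deletion ("ax") and one substitution ("t" -> "d").
ref = ["sil", "dh", "ax", "k", "ae", "t", "sil"]
hyp = ["sil", "dh", "k", "ae", "d", "sil"]
per = phone_error_rate(ref, hyp)
print(f"PER = {100 * per:.1f}%")
```

Note that the abstract does not say whether the reported reduction of about 3% is absolute or relative, so the sketch makes no claim about that.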

Author information

Authors and Affiliations

Technologie de l’information et électronique médicale, ATMS–LETI, ENIS, Sfax University, Sfax, Tunisie
Zaineb Ben Messaoud & Ahmed Ben Hamida

Corresponding author

Correspondence to Zaineb Ben Messaoud.

About this article

Cite this article

Ben Messaoud, Z., Ben Hamida, A. Combining formant frequency based on variable order LPC coding with acoustic features for TIMIT phone recognition. Int J Speech Technol 14, 393–403 (2011). https://doi.org/10.1007/s10772-011-9119-z

Received: 29 March 2011
Accepted: 22 September 2011
Published: 12 October 2011
Issue Date: December 2011
DOI: https://doi.org/10.1007/s10772-011-9119-z
