Dual stream speech recognition using articulatory syllable models

Puurula, Antti; Van Compernolle, Dirk

doi:10.1007/s10772-010-9080-2


Published: 04 November 2010

Volume 13, pages 219–230, (2010)

International Journal of Speech Technology


Abstract

Recent theoretical developments in neuroscience suggest that sublexical speech processing occurs via two parallel processing pathways. According to this Dual Stream Model of Speech Processing, speech is processed both as sequences of speech sounds and as articulations. We attempt to revise the “beads-on-a-string” paradigm of Hidden Markov Models (HMMs) in Automatic Speech Recognition (ASR) by implementing a system for dual stream speech recognition. A baseline recognition system is enhanced by modeling articulations as sequences of syllables. An efficient model that is complementary to HMMs is developed by formulating Dynamic Time Warping (DTW) as a probabilistic model. The DTW Model (DTWM) is improved by enriching syllable templates with constrained covariance matrices, data imputation, clustering, and mixture modeling. The resulting dual stream system is evaluated on the N-Best Southern Dutch Broadcast News benchmark. Promising results are obtained for DTWM classification and ASR tests. We provide a discussion of the remaining problems in implementing dual stream speech recognition.
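
To illustrate the general idea of treating DTW probabilistically, the following is a minimal sketch and not the authors' DTWM. It assumes each template frame carries a mean and a diagonal variance vector, and it uses the Gaussian negative log-likelihood as the local alignment cost. The function names `gaussian_nll` and `dtw_score`, the diagonal-covariance choice, and the three allowed path steps are all assumptions made for illustration; the abstract does not specify them.

```python
import math


def gaussian_nll(x, mean, var):
    """Negative log-likelihood of frame x under a diagonal-covariance Gaussian."""
    return 0.5 * sum(
        math.log(2.0 * math.pi * v) + (xi - m) ** 2 / v
        for xi, m, v in zip(x, mean, var)
    )


def dtw_score(frames, template_means, template_vars):
    """Minimum accumulated cost of aligning frames to a template (lower is better).

    Hypothetical illustration: the local cost is a Gaussian negative
    log-likelihood, so the total cost is the negative log-likelihood of the
    best alignment path under this simplified model.
    """
    n, m = len(frames), len(template_means)
    inf = float("inf")
    # cost[i][j]: best cost of aligning the first i frames to the first j template frames
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            local = gaussian_nll(
                frames[i - 1], template_means[j - 1], template_vars[j - 1]
            )
            # Classic DTW steps: repeat a template frame, skip ahead in the
            # template, or advance both sequences together.
            best_prev = min(cost[i - 1][j], cost[i][j - 1], cost[i - 1][j - 1])
            cost[i][j] = local + best_prev
    return cost[n][m]
```

Under these assumptions, classification amounts to scoring an input segment against each syllable template and choosing the template with the lowest cost.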



Author information

Authors and Affiliations

ESAT, Katholieke Universiteit Leuven, Kasteelpark Arenberg 10, 3001, Leuven, Belgium
Antti Puurula & Dirk Van Compernolle


Corresponding author

Correspondence to Antti Puurula.


About this article

Cite this article

Puurula, A., Van Compernolle, D. Dual stream speech recognition using articulatory syllable models. Int J Speech Technol 13, 219–230 (2010). https://doi.org/10.1007/s10772-010-9080-2


Received: 01 July 2010
Accepted: 19 October 2010
Published: 04 November 2010
Issue Date: December 2010
DOI: https://doi.org/10.1007/s10772-010-9080-2
