A New Word Clustering Method for Building N-Gram Language Models in Continuous Speech Recognition Systems

Bahrani, Mohammad; Sameti, Hossein; Hafezi, Nazila; Momtazi, Saeedeh

doi:10.1007/978-3-540-69052-8_30


Part of the book series: Lecture Notes in Computer Science (LNAI, volume 5027)

Included in the following conference series:

International Conference on Industrial, Engineering and Other Applications of Applied Intelligent Systems


Abstract

This paper presents a new method for automatic word clustering, which we use to build n-gram language models for Persian continuous speech recognition (CSR) systems. Each word is represented by a feature vector that captures the statistics of the parts of speech (POS) of that word, and the feature vectors are clustered with the k-means algorithm. The method reduces time complexity, which is a drawback of other automatic clustering methods, and it alleviates the high perplexity associated with manual clustering methods. The experiments are based on the "Persian Text Corpus", which contains about 9 million words. The resulting language models are evaluated by the perplexity criterion, and the results show a considerable reduction in perplexity. The word error rate of the CSR system is also reduced by about 16% compared with a manual clustering method.
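The clustering idea in the abstract can be illustrated with a minimal sketch. The abstract does not give the exact feature definition, distance measure, or tag set, so everything below is an assumption made for illustration: each word's feature vector is taken to be its relative POS-tag frequencies, the distance is squared Euclidean, and the tagged data, tag names, and function names (`pos_vectors`, `kmeans`) are hypothetical.

```python
import random
from collections import Counter

# Hypothetical toy data: (word, POS tag) pairs standing in for a tagged corpus.
TAGGED = [
    ("book", "N"), ("book", "N"), ("book", "N"), ("book", "V"),
    ("table", "N"), ("table", "N"),
    ("run", "V"), ("run", "V"), ("run", "V"), ("run", "N"),
    ("eat", "V"), ("eat", "V"),
]
TAGS = ["N", "V"]  # assumed tag set; a real tag set would be much larger


def pos_vectors(tagged, tags):
    """Map each word to its relative POS-tag frequencies (assumed feature)."""
    counts = {}
    for word, tag in tagged:
        counts.setdefault(word, Counter())[tag] += 1
    return {
        word: [c[t] / sum(c.values()) for t in tags]
        for word, c in counts.items()
    }


def sq_dist(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def kmeans(vectors, k, iters=20, seed=0):
    """Plain k-means over word vectors; returns a word -> cluster index map."""
    rng = random.Random(seed)
    words = sorted(vectors)
    # Initialise centroids from k distinct randomly chosen words.
    centroids = [list(vectors[w]) for w in rng.sample(words, k)]
    assign = {}
    for _ in range(iters):
        # Assignment step: each word goes to its nearest centroid.
        assign = {
            w: min(range(k), key=lambda j: sq_dist(vectors[w], centroids[j]))
            for w in words
        }
        # Update step: each centroid becomes the mean of its members.
        for j in range(k):
            members = [vectors[w] for w in words if assign[w] == j]
            if members:
                centroids[j] = [sum(col) / len(members) for col in zip(*members)]
    return assign


vectors = pos_vectors(TAGGED, TAGS)
clusters = kmeans(vectors, k=2)
print(clusters)
```

On this toy data the noun-dominant words ("book", "table") end up in one cluster and the verb-dominant words ("run", "eat") in the other. In a class-based n-gram model, the resulting cluster indices would then stand in for the words when n-gram statistics are collected.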



Author information

Authors and Affiliations

Speech Processing Lab, Computer Engineering Department, Sharif University of Technology, Tehran, Iran
Mohammad Bahrani, Hossein Sameti, Nazila Hafezi & Saeedeh Momtazi


Editor information

Ngoc Thanh Nguyen Leszek Borzemski Adam Grzech Moonis Ali


Cite this paper

Bahrani, M., Sameti, H., Hafezi, N., Momtazi, S. (2008). A New Word Clustering Method for Building N-Gram Language Models in Continuous Speech Recognition Systems. In: Nguyen, N.T., Borzemski, L., Grzech, A., Ali, M. (eds) New Frontiers in Applied Artificial Intelligence. IEA/AIE 2008. Lecture Notes in Computer Science, vol 5027. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-540-69052-8_30


DOI: https://doi.org/10.1007/978-3-540-69052-8_30
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-540-69045-0
Online ISBN: 978-3-540-69052-8
eBook Packages: Computer Science; Computer Science (R0)
