Abstract
Most current research and applications for Pinyin-to-Chinese word conversion employ hidden Markov models (HMMs), which in turn use a character-based language model, because Chinese text is written without word boundaries. However, in some tasks that involve Pinyin-to-Chinese conversion, such as Chinese text proofreading, the original Chinese text is known. This makes it possible to extract the words and to build a word-based language model. In this paper we compare the two models and conclude that a word-based bigram language model achieves higher conversion accuracy than a character-based bigram language model.
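To make the comparison concrete, the following is a minimal sketch of bigram Viterbi decoding, not the authors' implementation. It assumes the Pinyin input has already been split into units, one per character for the character-based model or one per word for the word-based model (possible when the original segmented text is known). The function name `viterbi`, the toy candidate tables, every probability value, and the constant `floor` used in place of real smoothing are all hypothetical and chosen only for illustration.

```python
import math


def viterbi(units, candidates, bigram, floor=1e-6):
    """Return the most likely token sequence for a list of pinyin units.

    units:      pinyin units, one per character or one per word
    candidates: dict mapping a pinyin unit to its candidate Chinese tokens
    bigram:     dict mapping (previous token, token) to a probability,
                with "<s>" marking the start of the sentence
    floor:      probability for unseen bigrams (crude stand-in for smoothing)
    """
    def logp(prev, tok):
        return math.log(bigram.get((prev, tok), floor))

    # best[tok] = (log-probability of the best path ending in tok, that path)
    best = {"<s>": (0.0, [])}
    for unit in units:
        step = {}
        for tok in candidates[unit]:
            # Keep only the best-scoring predecessor for each candidate token.
            score, path = max(
                ((s + logp(prev, tok), p) for prev, (s, p) in best.items()),
                key=lambda item: item[0],
            )
            step[tok] = (score, path + [tok])
        best = step
    return max(best.values(), key=lambda item: item[0])[1]


# Toy character-based model (all probabilities are invented).
char_candidates = {"wo": ["我", "握"], "zai": ["在", "再"],
                   "bei": ["北", "背"], "jing": ["京", "景"]}
char_bigram = {("<s>", "我"): 0.5, ("我", "在"): 0.4, ("在", "北"): 0.2,
               ("北", "京"): 0.6, ("在", "背"): 0.1, ("背", "景"): 0.5}

# Toy word-based model over the same sentence, with "beijing" as one unit.
word_candidates = {"wo": ["我", "握"], "zai": ["在", "再"],
                   "beijing": ["北京", "背景"]}
word_bigram = {("<s>", "我"): 0.5, ("我", "在"): 0.4,
               ("在", "北京"): 0.3, ("在", "背景"): 0.05}

char_result = viterbi(["wo", "zai", "bei", "jing"], char_candidates, char_bigram)
word_result = viterbi(["wo", "zai", "beijing"], word_candidates, word_bigram)
print("".join(char_result), "".join(word_result))
```

The same decoder serves both models; only the unit granularity and the bigram table change, which is the variable the paper's comparison isolates.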
Copyright information
© 2009 Springer-Verlag Berlin Heidelberg
About this paper
Cite this paper
Liu, W., Guthrie, L. (2009). Chinese Pinyin-Text Conversion on Segmented Text. In: Matoušek, V., Mautner, P. (eds) Text, Speech and Dialogue. TSD 2009. Lecture Notes in Computer Science, vol 5729. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-642-04208-9_19
DOI: https://doi.org/10.1007/978-3-642-04208-9_19
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-642-04207-2
Online ISBN: 978-3-642-04208-9
eBook Packages: Computer Science, Computer Science (R0)