Abstract
Data sparseness is an inherent and severe problem in language modeling, and smoothing techniques are widely used to address it. However, traditional smoothing techniques rest solely on statistical hypotheses and take no account of linguistic knowledge. This paper introduces semantic information into smoothing and proposes a similarity-based smoothing method that rests on both a statistical hypothesis and a linguistic hypothesis. An empirical iterative algorithm is presented to optimize the system parameters. Experimental results show that, compared with traditional smoothing techniques, our method greatly improves the performance of the language model.
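The abstract does not give the method's formulas, so the following is only a minimal sketch of the general idea behind similarity-based smoothing, not the authors' algorithm: a bigram estimate for a sparse history word is interpolated with estimates borrowed from words assumed to be semantically similar to it. The function names, the interpolation weight `lam`, and the hand-written similarity table are all hypothetical; in practice the similarities might come from a thesaurus, and the weight would be tuned rather than fixed.

```python
from collections import Counter, defaultdict


def train_bigrams(sentences):
    """Count bigrams: counts[w1][w2] = number of times w2 follows w1."""
    counts = defaultdict(Counter)
    for sent in sentences:
        for w1, w2 in zip(sent, sent[1:]):
            counts[w1][w2] += 1
    return counts


def p_mle(counts, w1, w2):
    """Unsmoothed maximum-likelihood estimate of P(w2 | w1)."""
    total = sum(counts[w1].values()) if w1 in counts else 0
    return counts[w1][w2] / total if total else 0.0


def p_smoothed(counts, similar, w1, w2, lam=0.7):
    """Interpolate the direct estimate with a similarity-weighted average
    of the estimates of w1's neighbours.

    `similar` maps a word to {neighbour: similarity score}; it is a
    hypothetical stand-in for a real semantic resource. `lam` is an
    assumed, untuned interpolation weight.
    """
    neighbours = {
        v: s for v, s in similar.get(w1, {}).items() if v in counts and s > 0
    }
    direct = p_mle(counts, w1, w2)
    if not neighbours:
        return direct  # nothing to borrow from
    z = sum(neighbours.values())
    borrowed = sum(s / z * p_mle(counts, v, w2) for v, s in neighbours.items())
    if w1 not in counts:
        return borrowed  # unseen history: rely on neighbours only
    return lam * direct + (1 - lam) * borrowed


# Toy data: "consume pear" never occurs, but "eat pear" does.
sentences = [
    ["eat", "apple"],
    ["eat", "apple"],
    ["eat", "pear"],
    ["consume", "apple"],
]
similar = {"consume": {"eat": 1.0}}  # assumed similarity, for illustration
counts = train_bigrams(sentences)

print(p_mle(counts, "consume", "pear"))                # unsmoothed estimate
print(p_smoothed(counts, similar, "consume", "pear"))  # smoothed estimate
```

In this toy setting the unseen bigram receives non-zero probability because a similar history word has been observed with the same successor, and the smoothed distribution for each history still sums to one as long as every neighbour has observed successors.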
Copyright information
© 2005 Springer-Verlag Berlin Heidelberg
About this paper
Cite this paper
Xiao, J., Liu, B., Wang, X., Li, B. (2005). A Similarity-Based Approach to Data Sparseness Problem of Chinese Language Modeling. In: Gelbukh, A., de Albornoz, Á., Terashima-Marín, H. (eds) MICAI 2005: Advances in Artificial Intelligence. MICAI 2005. Lecture Notes in Computer Science, vol 3789. Springer, Berlin, Heidelberg. https://doi.org/10.1007/11579427_77
DOI: https://doi.org/10.1007/11579427_77
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-540-29896-0
Online ISBN: 978-3-540-31653-4
eBook Packages: Computer Science, Computer Science (R0)