Abstract
In this paper, we discuss the statistical behavior and entropies of three smoothing methods, two well-known and one proposed, applied to three language models on Mandarin data sets. Because of data sparseness, smoothing methods are employed to estimate the probability of each event (both seen and unseen) in a language model. We propose a set of properties for analyzing the statistical behavior of the three smoothing methods. Our proposed smoothing method complies with all of these properties. We implement three language models on Mandarin data sets and then discuss their entropies. In general, the entropies of the proposed smoothing method for the three models are lower than those of the other two methods.
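As a rough illustration of the two ideas in the abstract (giving unseen events non-zero probability, then comparing models by entropy), the sketch below uses additive (Laplace-style) smoothing on a unigram model. This is a stand-in chosen for brevity: the abstract does not name the three methods, and this is not the authors' proposed method. The function names, the toy tokens, and the `delta` parameter are all hypothetical.

```python
import math
from collections import Counter


def additive_unigram(train_tokens, vocab, delta=1.0):
    """Unigram probabilities with additive smoothing (assumed stand-in method).

    Every word in `vocab` receives count + delta, so words that never
    appear in the training data still get a non-zero probability.
    """
    counts = Counter(train_tokens)
    total = len(train_tokens) + delta * len(vocab)
    return {w: (counts[w] + delta) / total for w in vocab}


def cross_entropy(model, test_tokens):
    """Average negative log2 probability per test token, in bits."""
    return -sum(math.log2(model[w]) for w in test_tokens) / len(test_tokens)


# Toy data: "d" is in the vocabulary but unseen in training.
train = ["a", "b", "a", "c"]
vocab = {"a", "b", "c", "d"}

model = additive_unigram(train, vocab, delta=1.0)
entropy = cross_entropy(model, ["a", "d"])
```

Under these assumptions the smoothed probabilities still sum to one, the unseen event `"d"` has positive probability, and the cross-entropy on held-out text is finite. A lower value on the same test data indicates a better-fitting model, which is the sense in which the abstract compares the three methods.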
Copyright information
© 2006 Springer-Verlag Berlin Heidelberg
About this paper
Cite this paper
Yu, MS., Huang, FL., Tsai, P. (2006). Statistical Behavior Analysis of Smoothing Methods for Language Models of Mandarin Data Sets. In: Ng, H.T., Leong, MK., Kan, MY., Ji, D. (eds) Information Retrieval Technology. AIRS 2006. Lecture Notes in Computer Science, vol 4182. Springer, Berlin, Heidelberg. https://doi.org/10.1007/11880592_14
DOI: https://doi.org/10.1007/11880592_14
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-540-45780-0
Online ISBN: 978-3-540-46237-8
eBook Packages: Computer Science, Computer Science (R0)