Statistical Behavior Analysis of Smoothing Methods for Language Models of Mandarin Data Sets

Yu, Ming-Shing; Huang, Feng-Long; Tsai, Piyu

doi:10.1007/11880592_14

Part of the book series: Lecture Notes in Computer Science (LNISA, volume 4182)

Included in the following conference series:

Asia Information Retrieval Symposium

Abstract

In this paper, we discuss the statistical behavior and entropies of three smoothing methods, two well-known and one proposed, applied to three language models on Mandarin data sets. Because of data sparseness, smoothing methods are employed to estimate the probability of each event (both seen and unseen) in a language model. We propose a set of properties for analyzing the statistical behavior of the three smoothing methods. Our proposed smoothing method complies with all the properties. We implement three language models on Mandarin data sets and then discuss their entropies. In general, the entropies of the proposed smoothing method for the three models are lower than those of the other two methods.
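
To make the role of smoothing and entropy concrete, the following is a minimal sketch and not the method proposed in the paper, which the abstract does not specify. It assumes a character-level bigram model and uses plain additive (add-delta) smoothing as a stand-in; the function names, the toy Mandarin strings, and the `delta` parameter are all illustrative assumptions.

```python
import math
from collections import Counter

def train_bigram(tokens):
    """Count bigram contexts and bigrams in a token sequence."""
    contexts = Counter(tokens[:-1])             # every token that has a successor
    bigrams = Counter(zip(tokens, tokens[1:]))  # adjacent token pairs
    return contexts, bigrams

def additive_prob(prev, word, contexts, bigrams, vocab_size, delta=1.0):
    """P(word | prev) under additive smoothing (assumed stand-in method).

    Unseen bigrams receive nonzero probability, and the estimates for a
    fixed context still sum to 1 over the vocabulary.
    """
    return (bigrams[(prev, word)] + delta) / (contexts[prev] + delta * vocab_size)

def cross_entropy(tokens, contexts, bigrams, vocab_size, delta=1.0):
    """Average negative log2 probability per bigram of a test sequence."""
    pairs = list(zip(tokens, tokens[1:]))
    total = sum(
        math.log2(additive_prob(p, w, contexts, bigrams, vocab_size, delta))
        for p, w in pairs
    )
    return -total / len(pairs)

# Toy data (hypothetical): each Chinese character is treated as one token.
train = list("今天天氣很好今天很好")
test = list("天氣很好")
vocab = sorted(set(train))
contexts, bigrams = train_bigram(train)
h = cross_entropy(test, contexts, bigrams, len(vocab))
print(f"cross-entropy: {h:.3f} bits per bigram")
```

Under these assumptions, a lower cross-entropy on held-out text indicates that a smoothing method assigns probability mass to seen and unseen events more effectively, which is the kind of comparison the abstract reports.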

Author information

Authors and Affiliations

Department of Information Science, National Chung-Hsing University, Taichung, 40227, Taiwan
Ming-Shing Yu
Department of Computer Science and Information Engineering, National United University, MiaoLi, 360, Taiwan
Feng-Long Huang & Piyu Tsai

Editor information

Editors and Affiliations

Department of Computer Science, National University of Singapore, 3 Science Drive 2, 117543, Singapore
Hwee Tou Ng
Institute for Infocomm Research, 21 Heng Mui Keng Terrace, 119613, Singapore
Mun-Kew Leong
Department of Computer Science, School of Computing, National University of Singapore, 117543, Singapore
Min-Yen Kan
Institute for Infocomm Research, 21 Heng Mui Keng Terrace, P.O. Box, 119613, Singapore
Donghong Ji

About this paper

Cite this paper

Yu, MS., Huang, FL., Tsai, P. (2006). Statistical Behavior Analysis of Smoothing Methods for Language Models of Mandarin Data Sets. In: Ng, H.T., Leong, MK., Kan, MY., Ji, D. (eds) Information Retrieval Technology. AIRS 2006. Lecture Notes in Computer Science, vol 4182. Springer, Berlin, Heidelberg. https://doi.org/10.1007/11880592_14

DOI: https://doi.org/10.1007/11880592_14
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-540-45780-0
Online ISBN: 978-3-540-46237-8
eBook Packages: Computer Science, Computer Science (R0)
