
Statistical Behavior Analysis of Smoothing Methods for Language Models of Mandarin Data Sets

  • Conference paper

Part of the book series: Lecture Notes in Computer Science (LNISA, volume 4182)

Abstract

In this paper, we discuss the statistical behavior and entropies of three smoothing methods, two well-known and one proposed, applied to three language models on Mandarin data sets. Because of data sparseness, smoothing methods are employed to estimate the probability of every event, seen or unseen, in a language model. We propose a set of properties for analyzing the statistical behavior of the three smoothing methods, and our proposed method complies with all of them. We implement the three language models on Mandarin data sets and then discuss their entropies. In general, the entropies of the proposed smoothing method for the three models are lower than those of the other two methods.
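
As a rough illustration of the two ideas the abstract relies on, smoothing under data sparseness and entropy as the evaluation measure, the sketch below applies additive (add-delta) smoothing to a character-level bigram model and measures cross-entropy on held-out text. Additive smoothing is used here only because it is simple and widely known; it is not claimed to be the method proposed in the paper or one of the two methods it is compared with. The toy strings and all function names are hypothetical.

```python
import math
from collections import Counter

def bigram_counts(tokens):
    """Count contexts (tokens that have a successor) and bigrams."""
    contexts = Counter(tokens[:-1])
    bigrams = Counter(zip(tokens[:-1], tokens[1:]))
    return contexts, bigrams

def additive_prob(prev, word, contexts, bigrams, vocab_size, delta=1.0):
    """P(word | prev) under additive smoothing (an illustrative stand-in,
    not the paper's method); unseen bigrams get a nonzero probability."""
    return (bigrams[(prev, word)] + delta) / (contexts[prev] + delta * vocab_size)

def cross_entropy(test_tokens, contexts, bigrams, vocab_size, delta=1.0):
    """Average negative log2 probability per bigram of held-out text, in bits."""
    pairs = list(zip(test_tokens[:-1], test_tokens[1:]))
    total = sum(
        -math.log2(additive_prob(p, w, contexts, bigrams, vocab_size, delta))
        for p, w in pairs
    )
    return total / len(pairs)

# Hypothetical toy data: each Mandarin character is treated as one token.
train = list("我們學習語言模型我們研究語言")
test = list("我們研究模型")
vocab = set(train) | set(test)
contexts, bigrams = bigram_counts(train)

print(additive_prob("研", "模", contexts, bigrams, len(vocab)))  # unseen bigram, still > 0
print(cross_entropy(test, contexts, bigrams, len(vocab)))
```

A lower cross-entropy on held-out text means the smoothed model assigns higher probability to that text, which is the sense in which the abstract compares the entropies of the three methods.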




Copyright information

© 2006 Springer-Verlag Berlin Heidelberg

About this paper

Cite this paper

Yu, MS., Huang, FL., Tsai, P. (2006). Statistical Behavior Analysis of Smoothing Methods for Language Models of Mandarin Data Sets. In: Ng, H.T., Leong, MK., Kan, MY., Ji, D. (eds) Information Retrieval Technology. AIRS 2006. Lecture Notes in Computer Science, vol 4182. Springer, Berlin, Heidelberg. https://doi.org/10.1007/11880592_14


  • DOI: https://doi.org/10.1007/11880592_14

  • Publisher Name: Springer, Berlin, Heidelberg

  • Print ISBN: 978-3-540-45780-0

  • Online ISBN: 978-3-540-46237-8

  • eBook Packages: Computer Science, Computer Science (R0)
