A Similarity-Based Approach to Data Sparseness Problem of Chinese Language Modeling

  • Conference paper
MICAI 2005: Advances in Artificial Intelligence (MICAI 2005)

Part of the book series: Lecture Notes in Computer Science (LNAI, volume 3789)

Abstract

The data sparseness problem is inherent and severe in language modeling. Smoothing techniques are widely used to address it. However, traditional smoothing techniques rest solely on statistical hypotheses and take no account of linguistic knowledge. This paper introduces semantic information into smoothing and proposes a similarity-based smoothing method founded on both a statistical hypothesis and a linguistic hypothesis. An empirical iterative algorithm is presented to optimize the system parameters. Experimental results show that, compared with traditional smoothing techniques, the proposed method greatly improves the performance of the language model.
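The abstract does not give the smoothing formula, so the following is only a minimal sketch of the general idea of similarity-based smoothing, not the authors' method. It assumes a bigram model in which the estimate for an unseen word pair borrows probability mass from histories that are semantically similar to the given one. The function names, the toy corpus, the hand-assigned similarity scores in `neighbours`, and the interpolation weight `lam` are all hypothetical; in practice the similarities would come from a lexical resource and `lam` would be tuned on held-out data.

```python
from collections import Counter, defaultdict


def count_bigrams(sentences):
    """Count how often each word follows each history word."""
    bigrams = defaultdict(Counter)
    for sentence in sentences:
        for prev, cur in zip(sentence, sentence[1:]):
            bigrams[prev][cur] += 1
    return bigrams


def mle_prob(bigrams, prev, cur):
    """Maximum-likelihood estimate of P(cur | prev); 0.0 if prev is unseen."""
    followers = bigrams.get(prev)
    if not followers:
        return 0.0
    return followers[cur] / sum(followers.values())


def smoothed_prob(bigrams, neighbours, prev, cur, lam=0.7):
    """Interpolate the MLE with a similarity-weighted average over neighbours.

    neighbours maps a word to {similar_word: similarity score > 0}.
    These scores are assumed inputs, not derived from the paper.
    """
    p_mle = mle_prob(bigrams, prev, cur)
    # Keep only neighbours that were actually observed as histories.
    sims = {w: s for w, s in neighbours.get(prev, {}).items()
            if w in bigrams and s > 0}
    norm = sum(sims.values())
    if norm == 0:
        return p_mle  # no usable neighbours: fall back to the MLE
    p_sim = sum(s / norm * mle_prob(bigrams, w, cur) for w, s in sims.items())
    if prev not in bigrams:
        return p_sim  # unseen history: rely on the neighbours only
    return lam * p_mle + (1 - lam) * p_sim


# Toy data (hypothetical): "devour apple" never occurs in the corpus.
corpus = [["eat", "apple"], ["eat", "apple"], ["eat", "rice"], ["devour", "rice"]]
neighbours = {"devour": {"eat": 1.0}}
bigrams = count_bigrams(corpus)
```

With this toy data the unseen pair ("devour", "apple") has a maximum-likelihood probability of zero, while `smoothed_prob(bigrams, neighbours, "devour", "apple")` is approximately 0.2, because "eat" is listed as similar to "devour" and "eat apple" is frequent. The estimates for a given history still sum to one, since both interpolated components are themselves probability distributions.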

Copyright information

© 2005 Springer-Verlag Berlin Heidelberg

About this paper

Cite this paper

Xiao, J., Liu, B., Wang, X., Li, B. (2005). A Similarity-Based Approach to Data Sparseness Problem of Chinese Language Modeling. In: Gelbukh, A., de Albornoz, Á., Terashima-Marín, H. (eds) MICAI 2005: Advances in Artificial Intelligence. MICAI 2005. Lecture Notes in Computer Science, vol 3789. Springer, Berlin, Heidelberg. https://doi.org/10.1007/11579427_77

  • DOI: https://doi.org/10.1007/11579427_77

  • Publisher Name: Springer, Berlin, Heidelberg

  • Print ISBN: 978-3-540-29896-0

  • Online ISBN: 978-3-540-31653-4

  • eBook Packages: Computer Science, Computer Science (R0)
