Part of the book series: Lecture Notes in Computer Science (LNTCS, volume 5449)

Abstract

The goal of this work is to make it practical to compute corpus-based statistics for all substrings (ngrams). Anything you can do with words, we ought to be able to do with substrings. This paper will show how to compute many statistics of interest for all substrings (ngrams) in a large corpus. The method not only computes standard corpus frequency, freq, and document frequency, df, but also generalizes naturally to compute df_k(str), the number of documents that mention the substring str at least k times. df_k can be used to estimate the probability distribution of str across documents, as well as summary statistics of this distribution, e.g., mean, variance (and other moments), entropy and adaptation.
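To make the quantities in the abstract concrete, here is a minimal sketch of freq, df and df_k on a toy corpus. It is not the method of the paper, which targets large corpora: it enumerates substrings by brute force, and every name in it (`substring_counts`, `df_table`, `df_k`, `distribution`, the `max_len` cutoff) is an assumption made for illustration. The probability that str occurs exactly k times in a document is estimated as (df_k - df_{k+1}) / N, with df_0 = N. Adaptation is taken here to be df_2 / df_1, the chance of a second mention given a first, which is one common definition and is assumed rather than taken from the abstract.

```python
from collections import Counter, defaultdict
from math import log2


def substring_counts(doc, max_len):
    """Count every substring of length 1..max_len in one document (overlaps included)."""
    counts = Counter()
    for i in range(len(doc)):
        for j in range(i + 1, min(i + max_len, len(doc)) + 1):
            counts[doc[i:j]] += 1
    return counts


def df_table(docs, max_len=3):
    """Map each substring to a Counter: per-document count -> number of documents."""
    table = defaultdict(Counter)
    for doc in docs:
        for s, c in substring_counts(doc, max_len).items():
            table[s][c] += 1
    return table


def freq(table, s):
    """Corpus frequency: total occurrences of s over all documents."""
    return sum(c * n for c, n in table.get(s, {}).items())


def df_k(table, s, k):
    """Number of documents that contain s at least k times (k >= 1)."""
    return sum(n for c, n in table.get(s, {}).items() if c >= k)


def distribution(table, s, n_docs):
    """Estimate P(s occurs exactly k times in a document) as (df_k - df_{k+1}) / N."""
    dist = {}
    k, at_least_k = 0, n_docs  # df_0 is N: every document has s at least 0 times
    while at_least_k > 0:
        at_least_next = df_k(table, s, k + 1)
        dist[k] = (at_least_k - at_least_next) / n_docs
        k, at_least_k = k + 1, at_least_next
    return dist


def entropy(dist):
    """Shannon entropy (in bits) of the estimated distribution."""
    return -sum(p * log2(p) for p in dist.values() if p > 0)


# Toy corpus (hypothetical data, four tiny "documents").
docs = ["abab", "ab", "ba", "cc"]
table = df_table(docs)
dist = distribution(table, "ab", len(docs))
mean = sum(k * p for k, p in dist.items())
adaptation = df_k(table, "ab", 2) / df_k(table, "ab", 1)
```

In this sketch the mean of the estimated distribution equals freq / N, because freq is the sum of df_k over all k >= 1. The brute-force enumeration is quadratic in document length when `max_len` is unbounded, so it only serves to pin down the definitions.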

Copyright information

© 2009 Springer-Verlag Berlin Heidelberg

About this paper

Cite this paper

Umemura, K., Church, K. (2009). Substring Statistics. In: Gelbukh, A. (ed.) Computational Linguistics and Intelligent Text Processing. CICLing 2009. Lecture Notes in Computer Science, vol 5449. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-642-00382-0_5

  • DOI: https://doi.org/10.1007/978-3-642-00382-0_5

  • Publisher Name: Springer, Berlin, Heidelberg

  • Print ISBN: 978-3-642-00381-3

  • Online ISBN: 978-3-642-00382-0

  • eBook Packages: Computer Science, Computer Science (R0)
