Substring Statistics

Umemura, Kyoji; Church, Kenneth

doi:10.1007/978-3-642-00382-0_5

Part of the book series: Lecture Notes in Computer Science (LNTCS, volume 5449)

Included in the following conference series:

International Conference on Intelligent Text Processing and Computational Linguistics

Abstract

The goal of this work is to make it practical to compute corpus-based statistics for all substrings (ngrams). Anything you can do with words, we ought to be able to do with substrings. This paper will show how to compute many statistics of interest for all substrings (ngrams) in a large corpus. The method not only computes standard corpus frequency, freq, and document frequency, df, but generalizes naturally to compute df_k(str), the number of documents that mention the substring str at least k times. df_k can be used to estimate the probability distribution of str across documents, as well as summary statistics of this distribution, e.g., mean, variance (and other moments), entropy and adaptation.
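
To make the definitions of freq, df and df_k concrete, here is a minimal brute-force sketch in Python. It is not the paper's method: it scans every document for a single query substring, which would not scale to all substrings of a large corpus. The function names (count_occurrences, substring_stats), the toy documents, the choice to count overlapping occurrences, and the use of df_2/df_1 as an adaptation estimate are all assumptions made for illustration.

```python
def count_occurrences(text: str, s: str) -> int:
    """Count occurrences of s in text, allowing overlaps (an assumption)."""
    if not s:
        raise ValueError("substring must be non-empty")
    count = 0
    start = text.find(s)
    while start != -1:
        count += 1
        # Advance by one character so overlapping matches are counted.
        start = text.find(s, start + 1)
    return count


def substring_stats(docs, s, max_k=2):
    """Return (freq, df) where df[k] is the number of documents
    containing s at least k times, for k = 1..max_k."""
    per_doc = [count_occurrences(d, s) for d in docs]
    freq = sum(per_doc)  # corpus frequency: total occurrences
    df = {k: sum(1 for c in per_doc if c >= k) for k in range(1, max_k + 1)}
    return freq, df


# Toy corpus (hypothetical data, for illustration only).
docs = ["banana bandana", "an apple", "cabana", "cherry"]
freq, df = substring_stats(docs, "an", max_k=4)
print(freq, df)  # 6 {1: 3, 2: 1, 3: 1, 4: 1}

# One possible adaptation estimate (assumed here): the share of documents
# with at least one mention that also have a second mention.
adaptation = df[2] / df[1]

# Summing df_k over k gives freq when max_k covers the largest
# per-document count, so the mean count per document is sum(df_k) / N.
mean_count = sum(df.values()) / len(docs)
```

In this toy corpus the per-document counts are 4, 1, 1 and 0, so df_1 = 3 counts the ordinary document frequency, while df_2 through df_4 each equal 1 because only the first document repeats the substring.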

Author information

Authors and Affiliations

Toyohashi University of Technology, Tempaku, Toyohashi, Aichi, 441-8580, Japan
Kyoji Umemura
Microsoft, One Microsoft Way, Redmond, WA, 98052, USA
Kenneth Church

Editor information

Editors and Affiliations

National Polytechnic Institute, Center for Computing Research, 07738, Mexico City, Mexico
Alexander Gelbukh

About this paper

Cite this paper

Umemura, K., Church, K. (2009). Substring Statistics. In: Gelbukh, A. (ed.) Computational Linguistics and Intelligent Text Processing. CICLing 2009. Lecture Notes in Computer Science, vol 5449. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-642-00382-0_5

DOI: https://doi.org/10.1007/978-3-642-00382-0_5
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-642-00381-3
Online ISBN: 978-3-642-00382-0
eBook Packages: Computer Science, Computer Science (R0)
