Extending Zipf’s law to n-grams for large corpora

Artificial Intelligence Review

Abstract

Experiments show that, for a large corpus, Zipf’s law does not hold for all ranks of words: the frequencies fall below those predicted by Zipf’s law for ranks greater than about 5,000 word types in English and about 30,000 word types in the inflected languages Irish and Latin. Nor does it hold for syllables or words in the syllable-based languages Chinese and Vietnamese. However, when single words are combined with word n-grams in one list and put in rank order, the frequency of tokens in the combined list extends Zipf’s law with a slope close to −1 on a log-log plot in all five languages. Further experiments have demonstrated the validity of this extension of Zipf’s law to n-grams of letters, phonemes or binary bits in English. It is shown theoretically that probability theory alone can predict this behavior in randomly created n-grams of binary bits.
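The combined ranking described in the abstract can be illustrated with a minimal sketch. It is not the authors' code: the function names (`combined_ngram_counts`, `rank_frequency`, `loglog_slope`) and the parameter `max_n` are hypothetical, whitespace tokenization is assumed, and an ordinary least-squares fit stands in for whatever slope estimate the paper uses. The sketch counts word n-grams for n = 1 up to `max_n` in one list, sorts the frequencies into rank order, and estimates the slope of log frequency against log rank.

```python
from collections import Counter
import math


def combined_ngram_counts(tokens, max_n=3):
    """Count every word n-gram for n = 1..max_n in one combined Counter."""
    counts = Counter()
    for n in range(1, max_n + 1):
        for i in range(len(tokens) - n + 1):
            counts[tuple(tokens[i:i + n])] += 1
    return counts


def rank_frequency(counts):
    """Return frequencies in descending order; rank r sits at index r - 1."""
    return sorted(counts.values(), reverse=True)


def loglog_slope(freqs):
    """Least-squares slope of log(frequency) against log(rank).

    Assumption: a plain linear fit over all ranks; needs at least two ranks.
    """
    xs = [math.log(r) for r in range(1, len(freqs) + 1)]
    ys = [math.log(f) for f in freqs]
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    return sxy / sxx


if __name__ == "__main__":
    # Toy input only; a meaningful slope needs a corpus of millions of tokens.
    text = "the cat sat on the mat and the dog sat on the cat"
    freqs = rank_frequency(combined_ngram_counts(text.split(), max_n=3))
    print(len(freqs), "types; fitted slope:", loglog_slope(freqs))
```

On a toy sentence the fitted slope says little; the behavior reported in the abstract concerns large corpora, where a slope near −1 would indicate agreement with the extended law.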

Abbreviations

WSJ: Wall Street Journal

NANT: North American News Text

Author information

Corresponding author

Correspondence to F. J. Smith.

About this article

Cite this article

Ha, L.Q., Hanna, P., Ming, J. et al. Extending Zipf’s law to n-grams for large corpora. Artif Intell Rev 32, 101–113 (2009). https://doi.org/10.1007/s10462-009-9135-4

  • DOI: https://doi.org/10.1007/s10462-009-9135-4
