Abstract
Experiments on large corpora show that Zipf’s law does not hold across all word ranks: observed frequencies fall below the Zipf prediction for ranks above about 5,000 word types in English and about 30,000 word types in the inflected languages Irish and Latin. Nor does it hold for syllables or words in the syllable-based languages Chinese and Vietnamese. However, when single words and word n-grams are combined in one list and ranked by frequency, the token frequencies in the combined list extend Zipf’s law, with a slope close to −1 on a log-log plot, in all five languages. Further experiments demonstrate that this extension of Zipf’s law is also valid for n-grams of letters, phonemes and binary bits in English. It is shown theoretically that probability theory alone predicts this behavior for randomly generated n-grams of binary bits.
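The combined ranking described in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: the function names (`ngram_counts`, `rank_frequency`, `loglog_slope`), the whitespace tokenization, and the ordinary least-squares fit over all ranks are assumptions made only for illustration.

```python
from collections import Counter
import math


def ngram_counts(tokens, max_n):
    """Count every n-gram of length 1..max_n in one combined Counter.

    Hypothetical helper: single words (n = 1) and longer word n-grams
    share the same table, so they can be ranked together.
    """
    counts = Counter()
    for n in range(1, max_n + 1):
        for i in range(len(tokens) - n + 1):
            counts[tuple(tokens[i:i + n])] += 1
    return counts


def rank_frequency(counts):
    """Return frequencies in descending order; list index + 1 is the rank."""
    return sorted(counts.values(), reverse=True)


def loglog_slope(freqs):
    """Least-squares slope of log(frequency) against log(rank).

    Assumption: a plain fit over all ranks; the paper may fit differently.
    """
    xs = [math.log(rank) for rank in range(1, len(freqs) + 1)]
    ys = [math.log(f) for f in freqs]
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    return sxy / sxx


if __name__ == "__main__":
    # Toy input only; a meaningful slope needs a corpus of many millions of tokens.
    tokens = "the cat sat on the mat and the cat ran".split()
    freqs = rank_frequency(ngram_counts(tokens, max_n=3))
    print(loglog_slope(freqs))
```

For a list that follows Zipf’s law exactly, with frequency proportional to 1/rank, `loglog_slope` returns −1 up to floating-point error. On real text the value depends on corpus size, the largest n included, and which ranks enter the fit.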
Abbreviations
- WSJ: Wall Street Journal
- NANT: North American News Text
Cite this article
Ha, L.Q., Hanna, P., Ming, J. et al. Extending Zipf’s law to n-grams for large corpora. Artif Intell Rev 32, 101–113 (2009). https://doi.org/10.1007/s10462-009-9135-4
DOI: https://doi.org/10.1007/s10462-009-9135-4