ABSTRACT
The power-law approximation for the frequency distribution of words postulated by Zipf has been studied extensively for decades, leading to many variations on the theme. Comparatively little attention, however, has been paid to the frequency distribution of word frequencies itself, i.e., the number of distinct words that occur with a given frequency. In this paper, we derive an analytical expression for this distribution from the inverse of the underlying rank-size distribution as a function of the total word count, the vocabulary size, and the shape parameter, thereby providing a unified framework that explains the nonlinear behavior of low frequencies on the log-log scale. We also present an efficient method based on relative entropy minimization for robust estimation of the shape parameter from a small number of empirical low-frequency probabilities. Experiments on a selected set of languages with varying degrees of inflection demonstrate the effectiveness of the proposed approach.
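To make the derivation concrete, here is a minimal sketch in Python of the idea the abstract describes: invert a Zipfian rank-size law f(r) = C/r^alpha, with C normalized so that the V ranked frequencies sum to the total word count N, and difference the inverse rank function r(f) = (C/f)^(1/alpha) at consecutive integer frequencies to approximate the number of word types occurring exactly k times. All names, parameter values, and the cap at the vocabulary size V are illustrative assumptions, not the paper's exact formulation.

```python
import numpy as np

def zipf_spectrum(N, V, alpha, k_max=10):
    """Approximate number of word types occurring exactly k times, k = 1..k_max,
    under the rank-size law f(r) = C / r**alpha for ranks r = 1..V."""
    ranks = np.arange(1, V + 1)
    C = N / np.sum(ranks ** -alpha)                # normalize: sum_r C * r**-alpha = N
    inv_rank = lambda f: (C / f) ** (1.0 / alpha)  # r(f), inverse of the rank-size law
    k = np.arange(1, k_max + 1)
    # Types with frequency ~k occupy the ranks between r(k+1) and r(k); the cap
    # at V is where the vocabulary size enters and distorts the lowest frequencies.
    counts = np.minimum(inv_rank(k), V) - np.minimum(inv_rank(k + 1), V)
    return k, counts

k, counts = zipf_spectrum(N=1_000_000, V=100_000, alpha=1.0)
slopes = np.diff(np.log(counts)) / np.diff(np.log(k))
print(slopes)  # local log-log slopes: ~-1.59 at k=1, drifting toward -(1 + 1/alpha) = -2
```

The printed local slopes illustrate the point of the unified framework: even under an exact rank-size power law, the induced spectrum of low frequencies is not a straight line on the log-log scale, and the deviation depends on N, V, and alpha.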
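The estimation step can be sketched in the same spirit, reusing zipf_spectrum from above: a one-dimensional search for the shape parameter that minimizes the relative entropy (Kullback-Leibler divergence) between a handful of empirical low-frequency probabilities and the model spectrum restricted to the same frequencies. The bounded scalar minimizer, the search interval, and the function names are assumptions made for illustration, not the paper's procedure.

```python
from scipy.optimize import minimize_scalar

def fit_alpha(emp_counts, N, V, k_max=5, bounds=(0.5, 2.0)):
    """Estimate alpha from counts of types occurring exactly 1..k_max times."""
    emp = np.asarray(emp_counts, dtype=float)
    emp = emp / emp.sum()                 # empirical low-frequency probabilities

    def rel_entropy(alpha):
        _, model_counts = zipf_spectrum(N, V, alpha, k_max)
        model = model_counts / model_counts.sum()
        model = np.maximum(model, 1e-12)  # guard against empty model bins
        return float(np.sum(emp * np.log(emp / model)))  # D(emp || model)

    return minimize_scalar(rel_entropy, bounds=bounds, method="bounded").x

# Synthetic sanity check: probabilities generated with alpha = 1.2 are recovered.
_, synth = zipf_spectrum(N=1_000_000, V=100_000, alpha=1.2, k_max=5)
print(fit_alpha(synth, N=1_000_000, V=100_000))  # ~1.2
```

Because only the first few spectrum probabilities enter the objective, the fit relies exactly on the low-frequency bins that are most populated in real corpora, which is what makes estimation from a small number of empirical probabilities plausible.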