DOI: 10.1145/3539618.3591942
short-paper

A Unified Formulation for the Frequency Distribution of Word Frequencies using the Inverse Zipf's Law

Published: 18 July 2023

ABSTRACT

The power-law approximation for the frequency distribution of words postulated by Zipf has been studied extensively for decades, leading to many variations on the theme. However, comparatively little attention has been paid to the frequency distribution of the word frequencies themselves. In this paper, we derive its analytical expression from the inverse of the underlying rank-size distribution as a function of total word count, vocabulary size, and the shape parameter, thereby providing a unified framework that explains the nonlinear behavior of low frequencies on the log-log scale. We also present an efficient method based on relative entropy minimization for robust estimation of the shape parameter from a small number of empirical low-frequency probabilities. Experiments were carried out on a selected set of languages with varying degrees of inflection to demonstrate the effectiveness of the proposed approach.
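
As a rough illustration of the two ideas named in the abstract, the Python sketch below inverts a plain Zipf rank-size law to approximate how many word types occur exactly m times, then picks the shape parameter that minimizes the relative entropy between an observed and a modeled low-frequency spectrum. It is not the paper's analytical expression or estimator, which this page does not reproduce. The unshifted law f(r) = C * r**(-s), the half-integer binning, the grid search, and all function names are assumptions made for illustration only.

```python
import math


def zipf_constant(total_words, vocab_size, s):
    """Scale C so that the frequencies C * r**(-s), r = 1..V, sum to N."""
    harmonic = sum(r ** (-s) for r in range(1, vocab_size + 1))
    return total_words / harmonic


def expected_spectrum(total_words, vocab_size, s, max_freq):
    """Approximate number of word types occurring exactly m times, m = 1..max_freq.

    Inverting f = C * r**(-s) gives the rank r(f) = (C / f)**(1 / s). The types
    with frequency m are taken to be the ranks between r(m + 0.5) and r(m - 0.5),
    with ranks capped at the vocabulary size V (an assumed binning rule).
    """
    c = zipf_constant(total_words, vocab_size, s)

    def rank(f):
        return min(float(vocab_size), (c / f) ** (1.0 / s))

    return [rank(m - 0.5) - rank(m + 0.5) for m in range(1, max_freq + 1)]


def normalize(values):
    """Turn non-negative counts into probabilities."""
    total = sum(values)
    return [v / total for v in values]


def relative_entropy(p, q, eps=1e-12):
    """KL divergence D(p || q); q is floored at eps to avoid log(0)."""
    return sum(pi * math.log(pi / max(qi, eps)) for pi, qi in zip(p, q) if pi > 0)


def estimate_shape(observed_counts, total_words, vocab_size, grid=None):
    """Grid search for the s minimizing D(empirical || model) over low frequencies."""
    if grid is None:
        grid = [0.80 + 0.01 * i for i in range(81)]  # candidate s in 0.80 .. 1.60
    p = normalize(observed_counts)
    best_s, best_kl = None, math.inf
    for s in grid:
        model = expected_spectrum(total_words, vocab_size, s, len(observed_counts))
        if sum(model) <= 0:
            continue  # model puts no mass on these frequencies; skip this s
        kl = relative_entropy(p, normalize(model))
        if kl < best_kl:
            best_s, best_kl = s, kl
    return best_s


# Usage on synthetic data: generate a spectrum with a known s, then recover it.
N, V = 100_000, 10_000  # hypothetical total word count and vocabulary size
synthetic = expected_spectrum(N, V, 1.1, 5)
s_hat = estimate_shape(synthetic, N, V)
print(f"estimated shape parameter: {s_hat:.2f}")
```

Because ranks are capped at V, the modeled count for m = 1 can fall below what a pure power law in m would give, which is one simple way a low-frequency curve can bend on a log-log plot. Real corpora would supply the observed counts in place of the synthetic spectrum used here.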


Supplemental Material

SIGIR23-srp5360.mp4 (mp4, 42.6 MB)


    • Published in

      SIGIR '23: Proceedings of the 46th International ACM SIGIR Conference on Research and Development in Information Retrieval
      July 2023
      3567 pages
      ISBN:9781450394086
      DOI:10.1145/3539618

      Copyright © 2023 ACM

      Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than the author(s) must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected].

      Publisher

      Association for Computing Machinery

      New York, NY, United States



      Acceptance Rates

      Overall Acceptance Rate: 792 of 3,983 submissions, 20%
