ABSTRACT
The power-law approximation for the frequency distribution of words postulated by Zipf has been studied extensively for decades, leading to many variations on the theme. Comparatively little attention, however, has been paid to the frequency distribution of word frequencies itself, i.e., the number of distinct words that occur with a given frequency. In this paper, we derive an analytical expression for this distribution from the inverse of the underlying rank-size distribution as a function of the total word count, the vocabulary size, and the shape parameter, thereby providing a unified framework that explains the nonlinear behavior of low frequencies on the log-log scale. We also present an efficient method based on relative entropy minimization for robust estimation of the shape parameter from a small number of empirical low-frequency probabilities. Experiments on a selected set of languages with varying degrees of inflection demonstrate the effectiveness of the proposed approach.
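To make the derivation concrete, here is a minimal sketch in Python of the idea the abstract describes: invert a Zipfian rank-size law f(r) = C/r^alpha, with C normalized so that the V ranked frequencies sum to the total word count N, and difference the inverse rank function r(f) = (C/f)^(1/alpha) at consecutive integer frequencies to approximate the number of word types occurring exactly k times. All names, parameter values, and the cap at the vocabulary size V are illustrative assumptions, not the paper's exact formulation.

```python
import numpy as np

def zipf_spectrum(N, V, alpha, k_max=10):
    """Approximate number of word types occurring exactly k times, k = 1..k_max,
    under the rank-size law f(r) = C / r**alpha for ranks r = 1..V."""
    ranks = np.arange(1, V + 1)
    C = N / np.sum(ranks ** -alpha)                # normalize: sum_r C * r**-alpha = N
    inv_rank = lambda f: (C / f) ** (1.0 / alpha)  # r(f), inverse of the rank-size law
    k = np.arange(1, k_max + 1)
    # Types with frequency ~k occupy the ranks between r(k+1) and r(k); the cap
    # at V is where the vocabulary size enters and distorts the lowest frequencies.
    counts = np.minimum(inv_rank(k), V) - np.minimum(inv_rank(k + 1), V)
    return k, counts

k, counts = zipf_spectrum(N=1_000_000, V=100_000, alpha=1.0)
slopes = np.diff(np.log(counts)) / np.diff(np.log(k))
print(slopes)  # local log-log slopes: ~-1.59 at k=1, drifting toward -(1 + 1/alpha) = -2
```

The printed local slopes illustrate the point of the unified framework: even under an exact rank-size power law, the induced spectrum of low frequencies is not a straight line on the log-log scale, and the deviation depends on N, V, and alpha.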
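The estimation step can be sketched in the same spirit, reusing zipf_spectrum from above: a one-dimensional search for the shape parameter that minimizes the relative entropy (Kullback-Leibler divergence) between a handful of empirical low-frequency probabilities and the model spectrum restricted to the same frequencies. The bounded scalar minimizer, the search interval, and the function names are assumptions made for illustration, not the paper's procedure.

```python
from scipy.optimize import minimize_scalar

def fit_alpha(emp_counts, N, V, k_max=5, bounds=(0.5, 2.0)):
    """Estimate alpha from counts of types occurring exactly 1..k_max times."""
    emp = np.asarray(emp_counts, dtype=float)
    emp = emp / emp.sum()                 # empirical low-frequency probabilities

    def rel_entropy(alpha):
        _, model_counts = zipf_spectrum(N, V, alpha, k_max)
        model = model_counts / model_counts.sum()
        model = np.maximum(model, 1e-12)  # guard against empty model bins
        return float(np.sum(emp * np.log(emp / model)))  # D(emp || model)

    return minimize_scalar(rel_entropy, bounds=bounds, method="bounded").x

# Synthetic sanity check: probabilities generated with alpha = 1.2 are recovered.
_, synth = zipf_spectrum(N=1_000_000, V=100_000, alpha=1.2, k_max=5)
print(fit_alpha(synth, N=1_000_000, V=100_000))  # ~1.2
```

Because only the first few spectrum probabilities enter the objective, the fit relies exactly on the low-frequency bins that are most populated in real corpora, which is what makes estimation from a small number of empirical probabilities plausible.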