Abstract
Three models for word frequency distributions (the lognormal law, the generalized inverse Gauss-Poisson law, and the extended generalized Zipf's law) are compared and evaluated with respect to goodness of fit and rationale. Application of these models to the frequency distributions of a text, a corpus, and morphological data reveals that no model can lay claim to exclusive validity, while inspection of the extrapolated theoretical vocabulary sizes raises doubts as to whether the urn scheme with independent trials is the correct underlying model for word frequency data. The role of morphology in shaping word frequency distributions is discussed, as well as parallels between vocabulary richness in literary studies and morphological productivity in linguistics.
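As an illustration of what the urn scheme with independent trials implies for vocabulary size, the following is a minimal sketch and not the article's own procedure. It assumes a hypothetical population of word types with fixed probabilities p_i (here a plain Zipf-like 1/rank weighting, chosen only for convenience) and uses the standard result that, under independent draws, the expected number of distinct types in a sample of N tokens is the sum over i of 1 - (1 - p_i)^N. The function names and the population size of 5000 are assumptions made for this example.

```python
import random


def zipf_probabilities(num_types, exponent=1.0):
    """Hypothetical type probabilities proportional to 1 / rank**exponent."""
    weights = [1.0 / (rank ** exponent) for rank in range(1, num_types + 1)]
    total = sum(weights)
    return [w / total for w in weights]


def expected_vocabulary_size(probs, num_tokens):
    """Expected number of distinct types after num_tokens independent draws.

    A type with probability p is seen at least once with probability
    1 - (1 - p)**num_tokens; summing over types gives E[V(N)].
    """
    return sum(1.0 - (1.0 - p) ** num_tokens for p in probs)


def simulated_vocabulary_size(probs, num_tokens, seed=0):
    """Draw num_tokens tokens with replacement and count the distinct types."""
    rng = random.Random(seed)
    draws = rng.choices(range(len(probs)), weights=probs, k=num_tokens)
    return len(set(draws))


if __name__ == "__main__":
    probs = zipf_probabilities(5000)  # assumed population of 5000 types
    for n in (100, 1000, 10000):
        expected = expected_vocabulary_size(probs, n)
        observed = simulated_vocabulary_size(probs, n)
        print(n, round(expected, 1), observed)
```

Under this scheme the expected vocabulary grows monotonically with N and is bounded above by the number of types in the population, which is the quantity an extrapolated theoretical vocabulary size tries to estimate.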
Author information
R. Harald Baayen received his PhD at the Free University, Amsterdam, where he was involved in research on morphological productivity. He is now at the Max-Planck Institute for Psycholinguistics, Nijmegen, participating in a project on computational modelling of lexical representation and process.
Cite this article
Baayen, H. Statistical models for word frequency distributions: A linguistic evaluation. Comput Hum 26, 347–363 (1992). https://doi.org/10.1007/BF00136980
DOI: https://doi.org/10.1007/BF00136980