Abstract
The statistical extraction of multiwords (n-grams) from natural language corpora is hampered by computationally heavy searching and indexing, a cost that can be reduced by low-error prediction of the n-gram frequency distributions. For different n-gram sizes (\(n \ge 1\)), we model the sizes of the groups of equal-frequency n-grams at the low frequencies, \(k = 1, 2, \ldots\), by predicting the influence of the corpus size on the Zipf's law exponent and on the n-gram group size. The average relative errors of the model predictions, from 1-grams up to 6-grams, are near \(4\%\) for English and French corpora ranging from 62 million to 8.6 billion words.
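As an illustration of the quantity the abstract refers to, the following minimal Python sketch counts, for a given n, how many distinct n-grams occur exactly k times in a token sequence, and computes an average relative error between observed and predicted group sizes. It is not the authors' model: the function names (`ngrams`, `group_sizes`, `avg_relative_error`), the toy sentence, and the whitespace tokenization are assumptions made only for this example.

```python
from collections import Counter


def ngrams(tokens, n):
    """Return all contiguous n-grams of a token list as tuples."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def group_sizes(tokens, n):
    """Map each frequency k to the number of distinct n-grams seen exactly k times."""
    freq = Counter(ngrams(tokens, n))   # n-gram -> number of occurrences
    return Counter(freq.values())       # k -> size of the equal-frequency group


def avg_relative_error(observed, predicted, ks):
    """Mean of |predicted - observed| / observed over the frequencies in ks."""
    return sum(abs(predicted[k] - observed[k]) / observed[k] for k in ks) / len(ks)


# Toy corpus (hypothetical); real corpora in the paper span millions to billions of words.
tokens = "the cat sat on the mat and the cat ran".split()
sizes_1 = group_sizes(tokens, 1)   # group sizes for 1-grams
sizes_2 = group_sizes(tokens, 2)   # group sizes for 2-grams
```

On a real corpus, the observed group sizes for small k would be compared against a model's predictions with a measure such as `avg_relative_error`; how the predictions themselves are obtained is described in the paper and is not reproduced here.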
Acknowledgements to FCT MCTES, NOVA LINCS UIDB/04516/2020 and Carlos Gonçalves.
© 2021 Springer Nature Switzerland AG
About this paper
Cite this paper
Silva, J.F., Cunha, J.C. (2021). A Model for Predicting n-gram Frequency Distribution in Large Corpora. In: Paszynski, M., Kranzlmüller, D., Krzhizhanovskaya, V.V., Dongarra, J.J., Sloot, P.M.A. (eds) Computational Science – ICCS 2021. ICCS 2021. Lecture Notes in Computer Science, vol 12742. Springer, Cham. https://doi.org/10.1007/978-3-030-77961-0_55
DOI: https://doi.org/10.1007/978-3-030-77961-0_55
Publisher Name: Springer, Cham
Print ISBN: 978-3-030-77960-3
Online ISBN: 978-3-030-77961-0
eBook Packages: Computer Science (R0)