A Model for Predicting n-gram Frequency Distribution in Large Corpora

  • Conference paper

Part of the book series: Lecture Notes in Computer Science (LNTCS, volume 12742)

Abstract

The statistical extraction of multiwords (n-grams) from natural language corpora is challenged by computationally heavy searching and indexing, which can be improved by low-error prediction of the n-gram frequency distributions. For different n-gram sizes (\(n \ge 1\)), we model the sizes of the groups of equal-frequency n-grams at low frequencies, \(k=1, 2,\ldots\), by predicting the influence of the corpus size on the Zipf’s law exponent and on the n-gram group size. The average relative errors of the model predictions, for 1-grams up to 6-grams, are near \(4\%\) for English and French corpora ranging from 62 million to 8.6 billion words.
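The quantity being modelled, the size of each group of equal-frequency n-grams, can be measured directly on a small corpus. The following is a minimal sketch of that empirical count only, not the prediction model itself, which the abstract does not specify. The function names, whitespace tokenisation, and the particular form of the average relative error are assumptions made for illustration.

```python
from collections import Counter


def ngrams(tokens, n):
    """Return an iterator over every contiguous n-gram (as a tuple) in a token list."""
    return zip(*(tokens[i:] for i in range(n)))


def equal_frequency_group_sizes(tokens, n):
    """Map each frequency k to the number of distinct n-grams occurring exactly k times."""
    ngram_counts = Counter(ngrams(tokens, n))
    return Counter(ngram_counts.values())


def average_relative_error(observed, predicted):
    """Mean of |predicted - observed| / observed over the frequencies k in `observed`.

    Assumed definition for illustration; the paper may average differently.
    """
    errors = [abs(predicted.get(k, 0) - observed[k]) / observed[k] for k in observed]
    return sum(errors) / len(errors)


# Toy corpus, tokenised on whitespace (an assumption; real corpora need proper tokenisation).
tokens = "the cat sat on the mat and the cat ran".split()
unigram_groups = equal_frequency_group_sizes(tokens, 1)
bigram_groups = equal_frequency_group_sizes(tokens, 2)
```

On this toy corpus, `bigram_groups[1]` is the number of distinct 2-grams that occur exactly once. A model of the kind described would predict such group sizes for low \(k\) from the corpus size, and `average_relative_error` gives one way to compare those predictions against the observed counts.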

Acknowledgements to FCT MCTES, NOVA LINCS UIDB/04516/2020 and Carlos Gonçalves.

Author information

Corresponding author

Correspondence to Joaquim F. Silva.

Copyright information

© 2021 Springer Nature Switzerland AG

About this paper

Cite this paper

Silva, J.F., Cunha, J.C. (2021). A Model for Predicting n-gram Frequency Distribution in Large Corpora. In: Paszynski, M., Kranzlmüller, D., Krzhizhanovskaya, V.V., Dongarra, J.J., Sloot, P.M.A. (eds) Computational Science – ICCS 2021. ICCS 2021. Lecture Notes in Computer Science, vol 12742. Springer, Cham. https://doi.org/10.1007/978-3-030-77961-0_55

  • DOI: https://doi.org/10.1007/978-3-030-77961-0_55

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-030-77960-3

  • Online ISBN: 978-3-030-77961-0

  • eBook Packages: Computer Science, Computer Science (R0)
