Printed Romanian Modelling: A Corpus Linguistics Based Study with Orthography and Punctuation Marks Included

  • Conference paper
Computational Science and Its Applications – ICCSA 2007 (ICCSA 2007)

Part of the book series: Lecture Notes in Computer Science (LNTCS, volume 4705)

Abstract

This paper is part of a larger study by the authors devoted to describing the printed Romanian language as an information source. Here, the statistical investigation seeks the mathematical model of the language when orthography and punctuation marks are included in the alphabet. To obtain accurate results, the authors processed information from multiple data sets sampled from a linguistic corpus, using the following statistical inferences: probability estimation with multiple confidence intervals, a test of the hypothesis that the probability belongs to an interval, and a test of the equality between two probabilities. The probability of a type II statistical error in the tests was taken into account. The experimental results, which are new for printed Romanian, concern the letter, digram and trigram statistical structure in a corpus of 93 books (about 50 million characters).
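The kinds of inference named in the abstract can be illustrated with a minimal Python sketch. It is not the authors' procedure: the function names, the toy sample strings, the 95% level, and the use of the textbook normal-approximation (Wald) interval and pooled two-proportion z statistic are all assumptions made here for illustration. The sketch also treats n-gram occurrences as independent observations, which real text does not satisfy; the paper's sampling of multiple data sets from the corpus is not reproduced.

```python
import math
from collections import Counter

# Two-sided 95% quantile of the standard normal (assumed significance level).
Z_95 = 1.959963984540054


def ngram_counts(text, n):
    """Count overlapping character n-grams; spaces and punctuation count as symbols."""
    return Counter(text[i:i + n] for i in range(len(text) - n + 1))


def proportion_ci(k, total, z=Z_95):
    """Normal-approximation (Wald) confidence interval for a probability k/total."""
    p = k / total
    half_width = z * math.sqrt(p * (1 - p) / total)
    return max(0.0, p - half_width), min(1.0, p + half_width)


def two_proportion_z(k1, n1, k2, n2):
    """z statistic for H0: p1 == p2, using the pooled probability estimate."""
    pooled = (k1 + k2) / (n1 + n2)
    std_err = math.sqrt(pooled * (1 - pooled) * (1 / n1 + 1 / n2))
    return (k1 / n1 - k2 / n2) / std_err


if __name__ == "__main__":
    # Hypothetical toy samples; a real study would use large corpus samples.
    sample_a = "Ana are mere, pere si nuci. Ana are si prune."
    sample_b = "Ion are un mar, o para si o nuca. Ion nu are prune."

    counts_a = ngram_counts(sample_a, 1)
    counts_b = ngram_counts(sample_b, 1)
    n_a, n_b = sum(counts_a.values()), sum(counts_b.values())

    # Estimate the probability of the comma with a confidence interval.
    print(proportion_ci(counts_a[","], n_a))

    # Compare the comma probability between the two samples.
    z = two_proportion_z(counts_a[","], n_a, counts_b[","], n_b)
    print("reject equality at 5%" if abs(z) > Z_95 else "equality not rejected")
```

Passing `n=2` or `n=3` to `ngram_counts` gives digram or trigram counts over the same extended alphabet. A test of whether a probability lies in a given interval, and the control of the type II error, would need additional steps that this sketch leaves out.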

Editor information

Osvaldo Gervasi, Marina L. Gavrilova

Copyright information

© 2007 Springer-Verlag Berlin Heidelberg

About this paper

Cite this paper

Vlad, A., Mitrea, A., Mitrea, M. (2007). Printed Romanian Modelling: A Corpus Linguistics Based Study with Orthography and Punctuation Marks Included. In: Gervasi, O., Gavrilova, M.L. (eds) Computational Science and Its Applications – ICCSA 2007. ICCSA 2007. Lecture Notes in Computer Science, vol 4705. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-540-74472-6_33

  • DOI: https://doi.org/10.1007/978-3-540-74472-6_33

  • Publisher Name: Springer, Berlin, Heidelberg

  • Print ISBN: 978-3-540-74468-9

  • Online ISBN: 978-3-540-74472-6

  • eBook Packages: Computer Science, Computer Science (R0)
