Printed Romanian Modelling: A Corpus Linguistics Based Study with Orthography and Punctuation Marks Included

Vlad, Adriana; Mitrea, Adrian; Mitrea, Mihai

doi:10.1007/978-3-540-74472-6_33


Part of the book series: Lecture Notes in Computer Science (LNTCS, volume 4705)

Included in the following conference series:

International Conference on Computational Science and Its Applications


Abstract

This paper is part of a larger study by the authors that describes the printed Romanian language as an information source. Here, the statistical investigation seeks the mathematical model of the language when orthography and punctuation marks are included in the alphabet. To reach an accurate result, the authors processed information from multiple data sets sampled from a linguistic corpus, using the following statistical inferences: probability estimation with multiple confidence intervals, a test of the hypothesis that the probability belongs to an interval, and a test of the equality of two probabilities. The probability of a type II error in these tests was taken into account. The experimental results, which are new for printed Romanian, concern the letter, digram and trigram statistical structure in a corpus of 93 books (about 50 million characters).
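The inferences named in the abstract can be illustrated with a minimal Python sketch. The abstract does not give the formulas the authors used, so everything here is an assumption made for illustration: the function names, the normal-approximation (Wald) confidence interval, the pooled two-proportion z statistic, and the two toy sentences standing in for data sets sampled from the corpus. The Wald interval also assumes independent observations, which overlapping m-grams taken from running text satisfy only approximately.

```python
import math
from collections import Counter

Z_95 = 1.959964  # two-sided 95% quantile of the standard normal distribution


def ngram_counts(text: str, m: int) -> Counter:
    """Count overlapping m-grams; spaces and punctuation count as symbols."""
    return Counter(text[i:i + m] for i in range(len(text) - m + 1))


def estimate_with_ci(successes: int, n: int, z: float = Z_95):
    """Return (p_hat, lower, upper) using a normal-approximation interval."""
    p_hat = successes / n
    half_width = z * math.sqrt(p_hat * (1.0 - p_hat) / n)
    return p_hat, max(0.0, p_hat - half_width), min(1.0, p_hat + half_width)


def equal_probability_z(k1: int, n1: int, k2: int, n2: int) -> float:
    """z statistic for H0: p1 == p2, based on the pooled estimate."""
    pooled = (k1 + k2) / (n1 + n2)
    se = math.sqrt(pooled * (1.0 - pooled) * (1.0 / n1 + 1.0 / n2))
    if se == 0.0:  # both samples are all-success or all-failure
        return 0.0
    return (k1 / n1 - k2 / n2) / se


if __name__ == "__main__":
    # Hypothetical toy samples; the study sampled from 93 books.
    sample_a = "Ana are mere, iar Dan are pere."
    sample_b = "Mara are o carte; Radu are un caiet."

    counts_a = ngram_counts(sample_a, 2)
    counts_b = ngram_counts(sample_b, 2)
    n_a = sum(counts_a.values())
    n_b = sum(counts_b.values())

    digram = "ar"
    p_hat, low, high = estimate_with_ci(counts_a[digram], n_a)
    print(f"P({digram!r}) in sample A: {p_hat:.3f} in [{low:.3f}, {high:.3f}]")

    # Testing whether the probability belongs to an interval can be
    # approximated by checking whether the estimated interval overlaps it.
    z = equal_probability_z(counts_a[digram], n_a, counts_b[digram], n_b)
    print(f"z statistic for equal probabilities: {z:.3f}")
    print("reject H0 at 5%" if abs(z) > Z_95 else "do not reject H0 at 5%")
```

With samples this small the intervals are very wide. The sketch also does not address the type II error probability that the abstract mentions, which would require fixing an alternative hypothesis and a sample size.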



Author information

Authors and Affiliations

Faculty of Electronics, Telecommunications and Information Technology, POLITEHNICA University of Bucharest, Romania
Adriana Vlad, Adrian Mitrea & Mihai Mitrea
The Research Institute for Artificial Intelligence, Romanian Academy
Adriana Vlad
ARTEMIS Department, National Institute on Telecommunications, Evry, France
Mihai Mitrea


Editor information

Osvaldo Gervasi, Marina L. Gavrilova


Cite this paper

Vlad, A., Mitrea, A., Mitrea, M. (2007). Printed Romanian Modelling: A Corpus Linguistics Based Study with Orthography and Punctuation Marks Included. In: Gervasi, O., Gavrilova, M.L. (eds) Computational Science and Its Applications – ICCSA 2007. ICCSA 2007. Lecture Notes in Computer Science, vol 4705. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-540-74472-6_33


DOI: https://doi.org/10.1007/978-3-540-74472-6_33
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-540-74468-9
Online ISBN: 978-3-540-74472-6
eBook Packages: Computer Science, Computer Science (R0)
