Abstract
This paper is part of a larger study by the authors on printed Romanian as an information source. Here, the statistical investigation seeks a mathematical model of the language when orthography and punctuation marks are included in the alphabet. To obtain accurate results, the authors processed multiple data sets sampled from a linguistic corpus, using the following statistical inferences: probability estimation with multiple confidence intervals, a test of the hypothesis that a probability belongs to an interval, and a test of the equality of two probabilities. The probability of a type II statistical error in these tests was taken into account. The experimental results, which are new for printed Romanian, concern the letter, digram and trigram statistical structure in a corpus of 93 books (about 50 million characters).
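As a rough illustration of the kind of inference the abstract lists, the following minimal sketch counts m-grams over an alphabet that keeps punctuation and spaces, builds a normal-approximation confidence interval for one m-gram probability, and computes a pooled z statistic for the equality of two probabilities. It is not the authors' procedure: the function names, the toy sentences, the 95% level and the choice of the Wald interval are all assumptions made for illustration, and the sketch treats observations as independent, which consecutive characters of a natural text are not.

```python
import math
from collections import Counter

# Two-sided 95% standard normal quantile (assumed confidence level).
Z_95 = 1.959963984540054


def ngram_counts(text, m):
    """Count overlapping m-grams; punctuation and spaces stay in the alphabet."""
    return Counter(text[i:i + m] for i in range(len(text) - m + 1))


def proportion_ci(k, n, z=Z_95):
    """Wald (normal-approximation) interval for a probability estimated as k/n."""
    p = k / n
    half = z * math.sqrt(p * (1 - p) / n)
    return max(0.0, p - half), min(1.0, p + half)


def two_proportion_z(k1, n1, k2, n2):
    """Pooled z statistic for H0: p1 == p2 (returns 0.0 if the pooled variance is 0)."""
    pooled = (k1 + k2) / (n1 + n2)
    se = math.sqrt(pooled * (1 - pooled) * (1 / n1 + 1 / n2))
    if se == 0:
        return 0.0
    return (k1 / n1 - k2 / n2) / se


# Hypothetical toy samples standing in for two data sets drawn from a corpus.
sample_a = "ana are mere, iar maria are pere."
sample_b = "maria are mere; ana nu are pere!"

letters_a = ngram_counts(sample_a, 1)
letters_b = ngram_counts(sample_b, 1)
n_a, n_b = sum(letters_a.values()), sum(letters_b.values())

# Interval estimate for the probability of the letter "a" in sample_a.
low, high = proportion_ci(letters_a["a"], n_a)

# Compare the probability of "a" across the two samples; |z| > Z_95 would
# reject equality at the 5% significance level under the stated assumptions.
z = two_proportion_z(letters_a["a"], n_a, letters_b["a"], n_b)
print(f"p(a) in [{low:.3f}, {high:.3f}], z = {z:.3f}")
```

Passing `m=2` or `m=3` to `ngram_counts` gives digram or trigram counts in the same way. With samples this small the normal approximation is crude; it is used here only to keep the sketch short.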
Copyright information
© 2007 Springer-Verlag Berlin Heidelberg
About this paper
Cite this paper
Vlad, A., Mitrea, A., Mitrea, M. (2007). Printed Romanian Modelling: A Corpus Linguistics Based Study with Orthography and Punctuation Marks Included. In: Gervasi, O., Gavrilova, M.L. (eds) Computational Science and Its Applications – ICCSA 2007. ICCSA 2007. Lecture Notes in Computer Science, vol 4705. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-540-74472-6_33
DOI: https://doi.org/10.1007/978-3-540-74472-6_33
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-540-74468-9
Online ISBN: 978-3-540-74472-6
eBook Packages: Computer Science, Computer Science (R0)