Abstract
Large error corpora are unavailable for many languages, even though such corpora have multiple applications in natural language processing. The main reason is the high cost of creating corpora manually. In this paper we present methods for automatically extracting various kinds of errors (spelling, typographical, grammatical, syntactic, semantic, and stylistic) from text edition histories. By applying these methods to Wikipedia’s article revision history, we created a large, publicly available corpus of naturally occurring language errors for Polish, called PlEWi. Finally, we analyse and evaluate the error categories detected in our corpus.
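As an illustration of the general idea of mining corrections from an edition history, the following minimal sketch compares two consecutive revisions of a text word by word and keeps short replacements as candidate (error, correction) pairs. It is not the method used to build PlEWi: the function name `extract_corrections`, the `max_words` threshold, and the whitespace tokenisation are assumptions made for this example only, and it relies solely on Python's standard `difflib` module.

```python
import difflib


def extract_corrections(old_text, new_text, max_words=2):
    """Return (before, after) pairs for short word-level replacements.

    Hypothetical helper for illustration: a real pipeline would also need
    sentence splitting, vandalism filtering, and error categorisation.
    """
    # Naive whitespace tokenisation (an assumption of this sketch).
    old_words = old_text.split()
    new_words = new_text.split()

    # Align the two word sequences; autojunk is disabled so that frequent
    # words are not silently ignored in long revisions.
    matcher = difflib.SequenceMatcher(a=old_words, b=new_words, autojunk=False)

    pairs = []
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        # Keep only small replacements; large rewrites are more likely to be
        # content changes than error corrections.
        if tag == "replace" and (i2 - i1) <= max_words and (j2 - j1) <= max_words:
            pairs.append((" ".join(old_words[i1:i2]), " ".join(new_words[j1:j2])))
    return pairs


if __name__ == "__main__":
    older = "To jest bardzo wazny artykuł o historii"
    newer = "To jest bardzo ważny artykuł o historii"
    print(extract_corrections(older, newer))
```

The size threshold is a deliberately crude filter: it discards insertions, deletions, and long rewrites, so the sketch only surfaces edits that look like local corrections, such as a missing diacritic.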
Copyright information
© 2013 Springer-Verlag Berlin Heidelberg
About this paper
Cite this paper
Grundkiewicz, R. (2013). Automatic Extraction of Polish Language Errors from Text Edition History. In: Habernal, I., Matoušek, V. (eds) Text, Speech, and Dialogue. TSD 2013. Lecture Notes in Computer Science, vol 8082. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-642-40585-3_17
DOI: https://doi.org/10.1007/978-3-642-40585-3_17
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-642-40584-6
Online ISBN: 978-3-642-40585-3
eBook Packages: Computer Science, Computer Science (R0)