Abstract
The paper presents an original method for processing historical texts. A historical text is converted into its modernized equivalent by a tool called a diachronic normalizer, which is embedded in a linguistic toolkit. The solution has three merits. First, the toolkit architecture allows morphological constraints to be imposed on diachronization rules. Second, the diachronic normalizer can be run in a pipeline together with other NLP tools, such as parsers or translators. Third, the toolkit makes it possible to apply efficiently, during diachronic normalization, a long list of diachronic pairs discovered with the aid of word distribution vectors in historical corpora.
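The idea of rule-based normalization under a lexical constraint can be illustrated with a minimal Python sketch. It is not the authors' implementation and does not use the toolkit's API: the rules, the diachronic pair, the lexicon, and the function names are all hypothetical, and a plain modern-word list stands in for real morphological constraints.

```python
import re

# All data below is hypothetical and illustrative; none of it comes from the paper.

# Diachronic pairs: historical form -> modern form (looked up before any rule).
DIACHRONIC_PAIRS = {"tłomaczyć": "tłumaczyć"}

# Spelling rules: word-final "ja"/"ya" after a consonant becomes "ia".
RULES = [
    (re.compile(r"(?<=[bcdfghklmnprstwz])ja\b"), "ia"),
    (re.compile(r"(?<=[bcdfghklmnprstwz])ya\b"), "ia"),
]

# Stand-in for a morphological constraint: a rule's output is accepted
# only if it is a known modern word.
MODERN_LEXICON = {"geografia", "maria", "tłumaczyć"}


def normalize_token(token):
    """Return a modernized form of one token, or the token unchanged."""
    lower = token.lower()
    if lower in DIACHRONIC_PAIRS:
        return DIACHRONIC_PAIRS[lower]
    if lower in MODERN_LEXICON:
        return token  # already a modern form
    for pattern, replacement in RULES:
        candidate = pattern.sub(replacement, lower)
        # Accept the rewrite only when the lexicon licenses it.
        if candidate != lower and candidate in MODERN_LEXICON:
            return candidate  # note: original casing is not restored
    return token


def normalize(text):
    """Normalize whitespace-separated tokens independently."""
    return " ".join(normalize_token(t) for t in text.split())


print(normalize("tłomaczyć geografja"))
```

In this sketch the word "racja" is left untouched: the first rule proposes "racia", but the lexicon check rejects it, which is the kind of overgeneration a constraint on the rules is meant to prevent.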
Notes
- 5. Odkrywka [9] contains 40 billion tokens and consists of Polish publications (mostly newspapers and books) originating mainly from the years 1810–2013.
References
Archer, D., Ernst-Gerlach, A., Kempken, S., Pilz, T., Rayson, P.: The identification of spelling variants in English and German historical texts: manual or automatic? In: Digital Humanities 2006, CATI, Université Paris-Sorbonne, Paris, France, pp. 3–5 (2006)
Baron, A., Rayson, P., Archer, D.: Automatic standardization of spelling for historical text mining. In: Digital Humanities 2009 (June 2009)
Bollmann, M., Petran, F., Dipper, S.: Rule-based normalization of historical texts. In: Proceedings of the International Workshop on Language Technologies for Digital Humanities and Cultural Heritage, Hissar, Bulgaria, pp. 34–42 (2011)
Bronikowska, R., Modrzejewski, E.: The enrichment of the lexical information and the corpus resources by using the results of the morphological analysis of historical texts (2017). https://ijp.pan.pl/images/konferencje/elex-budapeszt-2017.pdf
Etxeberria, I., Alegria, I., Uria, L., Hulden, M.: Evaluating the noisy channel model for the normalization of historical texts: Basque, Spanish and Slovene. In: Proceedings of the 10th International Conference on Language Resources and Evaluation, LREC 2016 (2016)
Graliński, F., Jassem, K., Junczys-Dowmunt, M.: PSI-toolkit: a natural language processing pipeline. In: Przepiórkowski, A., Piasecki, M., Jassem, K., Fuglewicz, P. (eds.) Computational Linguistics. Studies in Computational Intelligence, vol. 458, pp. 27–39. Springer, Heidelberg (2013). https://doi.org/10.1007/978-3-642-34399-5_2
Graliński, F., Jaworski, R., Borchmann, Ł., Wierzchoń, P.: Gonito.net - open platform for research competition, cooperation and reproducibility. In: Branco, A., Calzolari, N., Choukri, K. (eds.) Proceedings of the 4REAL Workshop: Workshop on Research Results Reproducibility and Resources Citation in Science and Technology of Language, pp. 13–20 (2016)
Graliński, F., Jassem, K.: Mining historical texts for diachronic spelling variants. Poznan Stud. Contemp. Linguistics (2019). https://ai.wmi.amu.edu.pl/wp-content/uploads/2020/02/gralinski2019mining-2.pdf. Accepted 6 Mar 2019
Graliński, F., Wierzchoń, P.: Odkrywka, czyli leksykografia diachroniczna live. In: Bańko, M., Karaś, H. (eds.) Między teorią a praktyką. Metody współczesnej leksykografii, vol. 1, pp. 59–69. Wydawnictwa Uniwersytetu Warszawskiego, Warszawa (2018)
Hauser, A.W., Schulz, K.U.: Unsupervised learning of edit distance weights for retrieving historical spelling variations. In: Proceedings of the 1st Workshop on Finite-State Techniques and Approximate Search, Borovets, Bulgaria, pp. 1–6 (2007)
Jassem, K., Graliński, F., Obrębski, T., Wierzchoń, P.: Automatic diachronic normalization of Polish texts (2017, to appear)
Jassem, K., Graliński, F., Obrębski, T.: Pros and cons of normalizing text with Thrax. In: Proceedings of the 8th Language and Technology Conference, Poznań, pp. 230–235 (2017)
Malinowski, M.: Ortografia polska od II połowy XVIII wieku do współczesności. Kodyfikacja, reformy, recepcja. Ph.D. thesis, Uniwersytet Śląski w Katowicach, Katowice (2011)
Mikolov, T., Sutskever, I., Chen, K., Corrado, G.S., Dean, J.: Distributed representations of words and phrases and their compositionality. In: Burges, C.J.C., Bottou, L., Ghahramani, Z., Weinberger, K.Q. (eds.) 27th Annual Conference on Neural Information Processing Systems 2013. Advances in Neural Information Processing Systems, Lake Tahoe, Nevada, United States, 5–8 December 2013, vol. 26, pp. 3111–3119 (2013)
Miłkowski, M.: Developing an open-source, rule-based proofreading tool. Softw. Pract. Exp. 40(7), 543–566 (2010)
Mykowiecka, A., Rychlik, P., Waszczuk, J.: Building an electronic dictionary of Old Polish on the base of the paper resource. In: Osenov, P., Piperidis, S., Slavcheva, M., Vertan, C. (eds.) Proceedings of the Workshop on Adaptation of Language Resources and Tools for Processing Cultural Heritage at LREC 2012, pp. 16–21. European Language Resources Association (ELRA) (2012)
Nissim, M., Matheson, C., Reid, J.: Recognising geographical entities in Scottish historical documents. In: Proceedings of the Workshop on Geographic Information Retrieval at SIGIR ACM 2004, Sheffield, UK (2004)
Oravecz, C., Sass, B., Simon, E.: Semi-automatic normalization of Old Hungarian codices. In: Proceedings of the ECAI 2010 Workshop on Language Technology for Cultural Heritage, Social Sciences, and Humanities, LaTeCH 2010, Lisbon, Portugal, pp. 55–59 (2010)
Pettersson, E., Megyesi, B., Nivre, J.: Rule-based normalisation of historical text - a diachronic study. In: Empirical Methods in Natural Language Processing: Proceedings of the 11th Conference on Natural Language Processing, KONVENS 2012, Vienna, Austria, pp. 333–341. Österreichische Gesellschaft für Artificial Intelligence (ÖGAI) (2012)
Piotrowski, M.: Natural Language Processing for Historical Texts. Morgan & Claypool, San Rafael (2012). https://doi.org/10.2200/S00436ED1V01Y201207HLT017
Porta, J., Sancho, J.L., Gómez, J.: Edit transducers for spelling variation in Old Spanish. In: Proceedings of the Workshop on Computational Historical Linguistics at NODALIDA 2013; NEALT Proceedings, Oslo, Norway, pp. 70–79. No. 87 in 18, Linköping University Electronic Press; Linköpings Universitet (2013)
Rayson, P., Archer, D., Baron, A., Smith, N.: Tagging historical corpora - the problem of spelling variation. In: Dagstuhl Seminar Proceedings. Schloss Dagstuhl-Leibniz-Zentrum für Informatik (2007)
Roark, B., Sproat, R., Allauzen, C., Riley, M., Sorensen, J., Tai, T.: The OpenGrm open-source finite-state grammar software libraries. In: Proceedings of the ACL 2012 System Demonstrations, Jeju Island, Korea, pp. 61–66. Association for Computational Linguistics (July 2012). http://www.aclweb.org/anthology/P12-3011
Scherrer, Y., Erjavec, T.: Modernizing historical Slovene words with character-based SMT. In: 4th Biennial Workshop on Balto-Slavic Natural Language Processing, BSNLP 2013 (2013)
Sleator, D., Temperley, D.: Parsing English with a link grammar. In: 3rd International Workshop on Parsing Technologies (1993)
Sproat, R., Jaitly, N.: RNN approaches to text normalization: A challenge (2016)
Tai, T., Skut, W., Sproat, R.: Thrax: an open source grammar compiler built on OpenFst. In: IEEE Automatic Speech Recognition and Understanding Workshop, ASRU 2011, Waikoloa Resort, Hawaii, vol. 12 (2011)
Woliński, M., Miłkowski, M., Ogrodniczuk, M., Przepiórkowski, A., Szałkiewicz, Ł.: PoliMorf: a (not so) new open morphological dictionary for Polish. In: Proceedings of the 8th International Conference on Language Resources and Evaluation, LREC 2012, Istanbul, Turkey, pp. 860–864. European Language Resources Association (ELRA) (2012)
Copyright information
© 2020 Springer Nature Switzerland AG
About this paper
Cite this paper
Skórzewski, P., Jassem, K., Graliński, F. (2020). Automated Normalization and Analysis of Historical Texts. In: Vetulani, Z., Paroubek, P., Kubis, M. (eds) Human Language Technology. Challenges for Computer Science and Linguistics. LTC 2017. Lecture Notes in Computer Science, vol 12598. Springer, Cham. https://doi.org/10.1007/978-3-030-66527-2_6
DOI: https://doi.org/10.1007/978-3-030-66527-2_6
Publisher Name: Springer, Cham
Print ISBN: 978-3-030-66526-5
Online ISBN: 978-3-030-66527-2
eBook Packages: Computer Science, Computer Science (R0)