Automated Normalization and Analysis of Historical Texts

Skórzewski, Paweł; Jassem, Krzysztof; Graliński, Filip

doi:10.1007/978-3-030-66527-2_6

Part of the book series: Lecture Notes in Computer Science (LNAI, volume 12598)

Included in the following conference series:

Language and Technology Conference

Abstract

The paper presents an original method for processing historical texts. A historical text is converted into its modernized equivalent by a tool called a diachronic normalizer, which is embedded in a linguistic toolkit. The solution has three merits. Firstly, the toolkit architecture allows morphological constraints to be imposed on diachronization rules. Secondly, the diachronic normalizer may be run in a pipeline together with other NLP tools, such as parsers or translators. Lastly, the toolkit makes it possible to efficiently apply, during diachronic normalization, a long list of diachronic pairs discovered with the aid of word distribution vectors in historical corpora.
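As a rough illustration of the idea outlined in the abstract, the Python sketch below normalizes a historical word form in three steps: it first looks the word up in a list of diachronic pairs, then tries a handful of rewrite rules, and accepts a rewritten candidate only if it occurs in a lexicon of modern forms. The lexicon check is a simplified stand-in for the morphological constraints. The rules, the pairs, the lexicon, and all names (`RULES`, `DIACHRONIC_PAIRS`, `MODERN_LEXICON`, `normalize_word`, `normalize_text`) are hypothetical and chosen for illustration only; they are not taken from the paper or from the toolkit it describes.

```python
# Hypothetical lexicon of modern word forms; a real system would consult
# a morphological analyzer instead of a hand-written set.
MODERN_LEXICON = {"ksiądz", "też", "historia", "być"}

# Hypothetical diachronic pairs (historical form -> modern form).
DIACHRONIC_PAIRS = {"bydź": "być"}

# Hypothetical rewrite rules (old substring, new substring), tried in order.
RULES = [("x", "ks"), ("é", "e"), ("ya", "ia"), ("ya", "ja")]


def normalize_word(word: str) -> str:
    """Return a modernized form of `word`, or `word` itself if none is found."""
    if word in MODERN_LEXICON:
        return word  # already a modern form
    if word in DIACHRONIC_PAIRS:
        return DIACHRONIC_PAIRS[word]  # direct lookup in the pair list
    for old, new in RULES:
        candidate = word.replace(old, new)
        # Accept a rule's output only if the lexicon confirms it.
        if candidate != word and candidate in MODERN_LEXICON:
            return candidate
    return word  # no confirmed candidate: leave the word unchanged


def normalize_text(text: str) -> str:
    """Normalize a whitespace-tokenized text word by word."""
    return " ".join(normalize_word(w) for w in text.split())
```

Rejecting candidates that the lexicon does not confirm keeps an overly general rule from corrupting words it should not touch, which is the role the abstract assigns to morphological constraints.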

Notes

1. https://hackage.haskell.org/package/hist-pl-transliter.
2. http://synat.nlp.ipipan.waw.pl.
3. http://clip.ipipan.waw.pl/KORBA.
4. https://bitbucket.org/jsbien/pol.
5. Odkrywka [9] contains 40 billion tokens and consists of Polish publications (mostly newspapers and books) originating mainly from the years 1810–2013.
6. http://psi-toolkit.wmi.amu.edu.pl/help/documentation.html.
7. http://morfologik.blogspot.com.
8. https://www.docker.com.
9. https://docs.docker.com/engine/installation/linux/docker-ce/ubuntu.
10. https://store.docker.com/editions/community/docker-ce-desktop-windows.
11. http://psi-toolkit.wmi.amu.edu.pl/help/processor.psis?name=iayko.
12. http://www.openfst.org/twiki/bin/view/GRM/Thrax.

References

Archer, D., Ernst-Gerlach, A., Kempken, S., Pilz, T., Rayson, P.: The identification of spelling variants in English and German historical texts: manual or automatic? In: Digital Humanities 2006, CATI, Université Paris-Sorbonne, Paris, France, pp. 3–5 (2006)
Baron, A., Rayson, P., Archer, D.: Automatic standardization of spelling for historical text mining. In: Digital Humanities 2009 (June 2009)
Bollmann, M., Petran, F., Dipper, S.: Rule-based normalization of historical texts. In: Proceedings of the International Workshop on Language Technologies for Digital Humanities and Cultural Heritage, Hissar, Bulgaria, pp. 34–42 (2011)
Bronikowska, R., Modrzejewski, E.: The enrichment of the lexical information and the corpus resources by using the results of the morphological analysis of historical texts (2017). https://ijp.pan.pl/images/konferencje/elex-budapeszt-2017.pdf
Etxeberria, I., Alegria, I., Uria, L., Hulden, M.: Evaluating the noisy channel model for the normalization of historical texts: Basque, Spanish and Slovene. In: Proceedings of the 10th International Conference on Language Resources and Evaluation, LREC 2016 (2016)
Graliński, F., Jassem, K., Junczys-Dowmunt, M.: PSI-toolkit: a natural language processing pipeline. In: Przepiórkowski, A., Piasecki, M., Jassem, K., Fuglewicz, P. (eds.) Computational Linguistics. Studies in Computational Intelligence, vol. 458, pp. 27–39. Springer, Heidelberg (2013). https://doi.org/10.1007/978-3-642-34399-5_2
Graliński, F., Jaworski, R., Borchmann, Ł., Wierzchoń, P.: Gonito.net - open platform for research competition, cooperation and reproducibility. In: Branco, A., Calzolari, N., Choukri, K. (eds.) Proceedings of the 4REAL Workshop: Workshop on Research Results Reproducibility and Resources Citation in Science and Technology of Language, pp. 13–20 (2016)
Graliński, F., Jassem, K.: Mining historical texts for diachronic spelling variants. Poznan Stud. Contemp. Linguistics (2019). https://ai.wmi.amu.edu.pl/wp-content/uploads/2020/02/gralinski2019mining-2.pdf. Accepted 6 Mar 2019
Graliński, F., Wierzchoń, P.: Odkrywka, czyli leksykografia diachroniczna live. In: Bańko, M., Karaś, H. (eds.) Między teorią a praktyką. Metody współczesnej leksykografii, vol. 1, pp. 59–69. Wydawnictwa Uniwersytetu Warszawskiego, Warszawa (2018)
Hauser, A.W., Schulz, K.U.: Unsupervised learning of edit distance weights for retrieving historical spelling variations. In: Proceedings of the 1st Workshop on Finite-State Techniques and Approximate Search, Borovets, Bulgaria, pp. 1–6 (2007)
Jassem, K., Graliński, F., Obrębski, T., Wierzchoń, P.: Automatic diachronic normalization of Polish texts (2017, to appear)
Jassem, K., Graliński, F., Obrębski, T.: Pros and cons of normalizing text with Thrax. In: Proceedings of the 8th Language and Technology Conference, Poznań, pp. 230–235 (2017)
Malinowski, M.: Ortografia polska od II połowy XVIII wieku do współczesności. Kodyfikacja, reformy, recepcja. Ph.D. thesis, Uniwersytet Śląski w Katowicach, Katowice (2011)
Mikolov, T., Sutskever, I., Chen, K., Corrado, G.S., Dean, J.: Distributed representations of words and phrases and their compositionality. In: Burges, C.J.C., Bottou, L., Ghahramani, Z., Weinberger, K.Q. (eds.) 27th Annual Conference on Neural Information Processing Systems 2013. Advances in Neural Information Processing Systems, Lake Tahoe, Nevada, United States, 5–8 December 2013, vol. 26, pp. 3111–3119 (2013)
Miłkowski, M.: Developing an open-source, rule-based proofreading tool. Softw. Pract. Exp. 40(7), 543–566 (2010)
Mykowiecka, A., Rychlik, P., Waszczuk, J.: Building an electronic dictionary of Old Polish on the base of the paper resource. In: Osenov, P., Piperidis, S., Slavcheva, M., Vertan, C. (eds.) Proceedings of the Workshop on Adaptation of Language Resources and Tools for Processing Cultural Heritage at LREC 2012, pp. 16–21. European Language Resources Association (ELRA) (2012)
Nissim, M., Matheson, C., Reid, J.: Recognising geographical entities in Scottish historical documents. In: Proceedings of the Workshop on Geographic Information Retrieval at SIGIR ACM 2004, Sheffield, UK (2004)
Oravecz, C., Sass, B., Simon, E.: Semi-automatic normalization of Old Hungarian codices. In: Proceedings of the ECAI 2010 Workshop on Language Technology for Cultural Heritage, Social Sciences, and Humanities, LaTeCH 2010, Lisbon, Portugal, pp. 55–59 (2010)
Pettersson, E., Megyesi, B., Nivre, J.: Rule-based normalisation of historical text - a diachronic study. In: Empirical Methods in Natural Language Processing: Proceedings of the 11th Conference on Natural Language Processing, KONVENS 2012, Vienna, Austria, pp. 333–341. Österreichische Gesellschaft für Artificial Intelligence (ÖGAI) (2012)
Piotrowski, M.: Natural Language Processing for Historical Texts. Morgan & Claypool, San Rafael (2012). https://doi.org/10.2200/S00436ED1V01Y201207HLT017
Porta, J., Sancho, J.L., Gómez, J.: Edit transducers for spelling variation in Old Spanish. In: Proceedings of the Workshop on Computational Historical Linguistics at NODALIDA 2013; NEALT Proceedings, Oslo, Norway, pp. 70–79. No. 87 in 18, Linköping University Electronic Press; Linköpings Universitet (2013)
Rayson, P., Archer, D., Baron, A., Smith, N.: Tagging historical corpora - the problem of spelling variation. In: Dagstuhl Seminar Proceedings. Schloss Dagstuhl-Leibniz-Zentrum für Informatik (2007)
Roark, B., Sproat, R., Allauzen, C., Riley, M., Sorensen, J., Tai, T.: The OpenGrm open-source finite-state grammar software libraries. In: Proceedings of the ACL 2012 System Demonstrations, Jeju Island, Korea, pp. 61–66. Association for Computational Linguistics (July 2012). http://www.aclweb.org/anthology/P12-3011
Scherrer, Y., Erjavec, T.: Modernizing historical Slovene words with character-based SMT. In: 4th Biennial Workshop on Balto-Slavic Natural Language Processing, BSNLP 2013 (2013)
Sleator, D., Temperley, D.: Parsing English with a link grammar. In: 3rd International Workshop on Parsing Technologies (1993)
Sproat, R., Jaitly, N.: RNN approaches to text normalization: A challenge (2016)
Tai, T., Skut, W., Sproat, R.: Thrax: an open source grammar compiler built on OpenFst. In: IEEE Automatic Speech Recognition and Understanding Workshop, ASRU 2011, Waikoloa Resort, Hawaii, vol. 12 (2011)
Woliński, M., Miłkowski, M., Ogrodniczuk, M., Przepiórkowski, A., Szałkiewicz, Ł.: PoliMorf: a (not so) new open morphological dictionary for Polish. In: Proceedings of the 8th International Conference on Language Resources and Evaluation, LREC 2012, Istanbul, Turkey, pp. 860–864. European Language Resources Association (ELRA) (2012)

Author information

Authors and Affiliations

Adam Mickiewicz University, Poznań, Poland
Paweł Skórzewski, Krzysztof Jassem & Filip Graliński

Corresponding author

Correspondence to Paweł Skórzewski.

Editor information

Editors and Affiliations

Adam Mickiewicz University, Poznań, Poland
Zygmunt Vetulani
Laboratoire d’Informatique pour la Méca, Orsay, France
Patrick Paroubek
Adam Mickiewicz University, Poznań, Poland
Marek Kubis

About this paper

Cite this paper

Skórzewski, P., Jassem, K., Graliński, F. (2020). Automated Normalization and Analysis of Historical Texts. In: Vetulani, Z., Paroubek, P., Kubis, M. (eds) Human Language Technology. Challenges for Computer Science and Linguistics. LTC 2017. Lecture Notes in Computer Science, vol 12598. Springer, Cham. https://doi.org/10.1007/978-3-030-66527-2_6

DOI: https://doi.org/10.1007/978-3-030-66527-2_6
Published: 31 December 2020
Publisher Name: Springer, Cham
Print ISBN: 978-3-030-66526-5
Online ISBN: 978-3-030-66527-2
eBook Packages: Computer Science, Computer Science (R0)
