Abstract
The paper describes the combined results of several projects which constitute a basic language resource infrastructure for printed historical Slovene. The IMP language resources consist of a digital library, an annotated corpus and a lexicon, which are interlinked and uniformly encoded following the Text Encoding Initiative Guidelines. The library holds about 650 units (mostly complete books) consisting of facsimiles with 45,000 pages as well as hand-corrected and structured transcriptions. The hand-annotated corpus has 300,000 tokens, where each word is tagged with its modernised word form, lemma, part-of-speech and, in cases of archaic words, its nearest contemporary equivalents. This information was extracted into the lexicon, which also covers an extended target-annotated corpus, resulting in 20,000 lemmas (of these 4,000 archaic) with 50,000 modern word forms and 70,000 attested forms. We have also developed a program to modernise, tag and lemmatise historical Slovene, and annotated the digital library with it, producing an automatically annotated corpus of 15 million words. To serve the humanities, the digital library and lexicon are available for reading and browsing on the web and the corpora via a concordancer. For language technology research and development the resources are available in source TEI XML under the Creative Commons Attribution licence. The paper presents the IMP resources, available from http://nl.ijs.si/imp/, the process of their compilation, encoding and dissemination, and concludes with directions for future research.







Notes
We have registered with IANA the sl-bohoric sub-language tag for texts using the Bohorič alphabet, as well as sl-dajnko and sl-metelko for the Dajnko and Metelko alphabets, in which a number of books were also printed. These tags can be used, e.g., as the value of the @xml:lang attribute.
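As a minimal sketch of how such a tag might be attached to a TEI element, the following Python fragment builds a single paragraph and sets its @xml:lang. The helper name `make_paragraph` and the sample word are illustrative assumptions, not part of the IMP tooling; only the tag sl-bohoric and the standard TEI and XML namespaces are taken as given.

```python
import xml.etree.ElementTree as ET

# Standard namespace URIs for TEI P5 and for the reserved xml: prefix.
TEI_NS = "http://www.tei-c.org/ns/1.0"
XML_NS = "http://www.w3.org/XML/1998/namespace"

# Serialise TEI elements in the default namespace (no prefix).
ET.register_namespace("", TEI_NS)


def make_paragraph(text, lang):
    """Hypothetical helper: build a TEI <p> carrying an xml:lang value."""
    p = ET.Element(f"{{{TEI_NS}}}p")
    p.set(f"{{{XML_NS}}}lang", lang)
    p.text = text
    return p


# "zhlovek" is an illustrative Bohorič-style spelling of modern "človek".
p = make_paragraph("zhlovek", "sl-bohoric")
xml = ET.tostring(p, encoding="unicode")
print(xml)
```

Setting the attribute on a high-level element such as the text body would let all descendants inherit the language, so per-paragraph tagging is only needed where the alphabet changes within a work.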
TEI stylesheets are available at http://www.tei-c.org/Tools/Stylesheets/ and support the conversion of many formats to and from TEI P5.
We have given Slovene glosses to all the 546 elements defined by the TEI. The localisation is available at http://nl.ijs.si/tei/locale/.
The site with the AHLib DL and links to the Graz resources is http://nl.ijs.si/ahlib/.
While the original rtf2tei converter is no longer maintained, we have developed a new web-based Word to TEI P5 converter, which upgrades the current TEI XSLT stylesheets for docx2tei and tei2html conversion. The Web service accepts Office Open XML documents and converts them to TEI P5 and from there to HTML, and stores the complete results on a unique URL. It is available at http://nl.ijs.si/tei/convert/.
The project can be found at http://sl.wikisource.org/wiki/Wikivir:Slovenska_leposlovna_klasika.
The wiki2tei converter is available as a Web service at http://nl.ijs.si/wiki2tei, with its PHP source on https://github.com/domenk/wiki2tei. It has an interface both in Slovene and English and can also convert the works of other Wikisource languages, but the quality of the output depends on how much their format differs from the Wikivir one.
A previous, smaller and less well-annotated version is described in Erjavec (2012b).
Page sampling is unusual, as pages do not correspond to linguistically motivated units, but for historical texts it is difficult to find a better alternative: some texts do not distinguish divisions or paragraphs at all, or these are very long. Sampling by page also preserves the alignment of the text samples with the facsimiles.
The aim was to keep goo300k and foo3M disjoint, so that the former can be used as the training set and the latter as a realistic test set in lexicon-related experiments.
Currently, the TEI lexicon contains 9 random examples, and the HTML version 4. Including all examples would also be possible, but for very frequent words this would amount to a few thousand examples.
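The capping of examples can be sketched as follows. The note does not say how the random selection is made, so the function `sample_examples`, its parameters and the fixed seed are assumptions for illustration only; the limit of 9 is the figure given for the TEI lexicon.

```python
import random


def sample_examples(examples, k, seed=None):
    """Hypothetical helper: keep at most k randomly chosen usage examples."""
    rng = random.Random(seed)  # a fixed seed makes the selection repeatable
    if len(examples) <= k:
        # Rare words: keep every attested example.
        return list(examples)
    # Frequent words: draw k distinct examples without replacement.
    return rng.sample(examples, k)


# A very frequent word might have a few thousand attested examples.
frequent = [f"example {i}" for i in range(3000)]
rare = ["example a", "example b"]

tei_examples = sample_examples(frequent, 9, seed=0)
rare_examples = sample_examples(rare, 9, seed=0)
print(len(tei_examples), len(rare_examples))
```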
Sloleks is available for download under CC BY-NC from http://eng.slovenscina.eu/.
The home page of IMP is http://nl.ijs.si/imp/.
The eZISS digital library home page is http://nl.ijs.si/e-zrc/.
References
Arhar, Š. (2009). Učni korpus SSJ in leksikon besednih oblik za slovenščino (The SSJ training corpus and word form lexicon for Slovene). Jezik in Slovstvo, 54(3–4), 43–56.
Bień, J. S. (2014). The IMPACT project Polish Ground-Truth texts as a DjVu corpus. Cognitive Studies (Études Cognitives), 14, 75–84. http://bc.klf.uw.edu.pl/381/
Christ, O. (1994). A modular and flexible architecture for an integrated corpus query system. In Proceedings of COMPLEX ’94: 3rd conference on computational lexicography and text research, Budapest, Hungary (pp. 23–32).
Clausner, C., Pletschacher, S., & Antonacopoulos, A. (2011). Aletheia—An advanced document layout and text ground-truthing system for production environments. In IEEE Xplore Digital Library (pp. 48–52).
Dudczak, A., Kmieciak, M., & Werla, M. (2012). Creation of textual versions of historical documents from Polish digital libraries. In Lecture notes in computer science (Vol. 7489, pp. 89–94). Berlin: Springer.
Erjavec, T. (2007). An architecture for editing complex digital documents. In Proceedings of INFuture’07 “digital information and heritage” (pp. 105–114). University of Zagreb.
Erjavec, T. (2011). Automatic linguistic annotation of historical language: ToTrTaLe and XIX century Slovene. In Proceedings of the 5th ACL-HLT workshop on language technology for cultural heritage, social sciences, and humanities, association for computational linguistics, Portland, OR, USA (pp. 33–38). http://www.aclweb.org/anthology/W11-1505
Erjavec, T. (2012a). MULTEXT-East: Morphosyntactic resources for Central and Eastern European languages. Language Resources and Evaluation, 46(1), 131–142.
Erjavec, T. (2012b). The goo300k corpus of historical Slovene. In Proceedings of the eighth international conference on language resources and evaluation (LREC’12), European Language Resources Association (ELRA), Istanbul, Turkey.
Erjavec, T. (2014). Posodabljanje starejše slovenščine (Modernising historical Slovene). Uporabna informatika, 21(4), 186–195.
Erjavec, T., & Fišer, D. (2014). Recepcija virov starejše slovenščine IMP (The reception of the IMP historical language resources). In 33. simpozij Obdobja, Znanstvena založba Filozofske fakultete, Ljubljana.
Erjavec, T., Vodopivec, I., & Kodrič, M. (2011). Izdelava korpusa starejših slovenskih besedil v okviru projekta IMPACT (The compilation of a corpus of historical Slovene texts in the scope of the IMPACT project). In 30. simpozij Obdobja, Znanstvena založba Filozofske fakultete, Ljubljana (pp. 121–127).
Hladnik, M. (2009). Infrastruktura slovenistične literarne vede (The infrastructure of Slovene literary studies). In 28. simpozij Obdobja, Znanstvena založba Filozofske fakultete, Ljubljana (pp. 161–169). http://www.centerslo.net/files/file/simpozij/simp28/Hladnik
Kenter, T., Erjavec, T., Žorga, M., & Fišer, D. (2012). Lexicon construction and corpus annotation of historical language with the CoBaLT editor. In Proceedings of the EACL workshop on language technology for cultural heritage, social sciences, and humanities, ACL, Avignon, France.
Krauwer, S. (2003). The basic language resource kit (BLARK) as the first milestone for the language resources roadmap. In Proceedings of the international workshop speech and computer (SPECOM 2003) (pp. 8–15). Moscow State Linguistic University. http://www.elsnet.org/dox/krauwer-specom2003
Kroch, A., Santorini, B., & Diertani, A. (2004). Penn–Helsinki parsed corpus of Early Modern English. http://www.ling.upenn.edu/hist-corpora/PPCEME-RELEASE-2/
Kučera, K. (1999). The general principles of the diachronic part of the Czech National Corpus. In Text, speech and dialogue, lecture notes in computer science (Vol. 1692, pp. 841–842). Berlin: Springer.
Ljubešić, N., Erjavec, T., & Fišer, D. (2014). Standardizing tweets with character-level machine translation. In A. Gelbukh (Ed.), 15th International conference, CICLing 2014, proceedings, part II, lecture notes in computer science (Vol. 8404, pp. 164–175). Berlin: Springer.
Piotrowski, M. (2012). Natural language processing for historical texts. Synthesis lectures on human language technologies. San Rafael, USA: Morgan & Claypool Publishers.
Pletschacher, S., & Antonacopoulos, A. (2010). The PAGE (page analysis and ground-truth elements) format framework. In Proceedings of the 20th international conference on pattern recognition (ICPR), Istanbul.
Prunč, E. (2007). Deutsch-slowenische/kroatische Übersetzung 1848–1918. Ein Werkstättenbericht. Wiener Slavistisches Jahrbuch, 53, 63–176.
Rayson, P., Archer, D., Baron, A., Culpeper, J., & Smith, N. (2007). Tagging the bard: Evaluating the accuracy of a modern POS tagger on Early Modern English corpora. In Corpus linguistics conference (CL2007). University of Birmingham, Birmingham, UK. http://ucrel.lancs.ac.uk/publications/CL2007/paper/192_Paper
Reffle, U. (2011). Efficiently generating correction suggestions for garbled tokens of historical language. Natural Language Engineering, 17, 265–282.
Rychlý, P. (2007). Manatee/bonito—A modular corpus manager. In Proceedings of 1st workshop on recent advances in Slavonic natural language processing (pp. 65–70). Brno: Masaryk University.
Sánchez-Marco, C., Boleda, G., Fontana, J. M., & Domingo, J. (2010). Annotation and representation of a diachronic corpus of Spanish. In Proceedings of the seventh conference on language resources and evaluation (LREC’10), ELRA, Valletta, Malta.
Sánchez-Martínez, F., Martínez-Sempere, I., Ivars-Ribes, X., & Carrasco, R. C. (2013). An open diachronic corpus of historical Spanish. Language Resources and Evaluation, 47(4), 1327–1342.
Scheible, S., Whitt, R. J., Durrell, M., & Bennett, P. (2011). A gold standard corpus of Early Modern German. In Proceedings of the 5th linguistic annotation workshop, association for computational linguistics, Portland, Oregon, USA (pp. 124–128). http://www.aclweb.org/anthology/W11-0415
Scherrer, Y., & Erjavec, T. (2013). Modernizing historical Slovene words with character-based SMT. In BSNLP 2013—4th Biennial workshop on Balto-Slavic natural language processing, Sofia.
TEI Consortium (Ed.). (2012). Guidelines for electronic text encoding and interchange. TEI Consortium. http://www.tei-c.org/P5/
Verdonik, D., Kosem, I., Vitez, A. Z., Krek, S., & Stabej, M. (2013). Compilation, transcription and usage of a reference speech corpus: The case of the Slovene corpus GOS. Language Resources and Evaluation, 47(4), 1031–1048.
Wallenberg, J., Ingason, A. K., Sigurthsson, E. F., & Rögnvaldsson, E. (2011). Icelandic Parsed Historical Corpus (IcePaHC), version 0.9. http://www.linguist.is/icelandic_treebank
Acknowledgments
The author thanks the two anonymous reviewers for useful comments and suggestions. For collaborating in the compilation of the IMP language resources thanks are due to Kozma Ahačič, Simon Atelšek, Tina Benčina, Katja Cingerle, Metod Čepar, Darja Fišer, Miran Hladnik, Alenka Jelovšek, Urška Kamenšek, Alenka Kavčič Čolić, Domen Kermc, Maša Kodrič, Simon Krek, Nina Mikulin, Matija Ogrin, Daša Pokorn, Erich Prunč, Zala Šmid, Ines Vodopivec and Maja Žorga Dulmin. The work presented in this paper was supported by the Austrian Academy project “Deutsch-slowenische/kroatische Übersetzung 1848–1918”, the EU IMPACT project “Improving Access to Text”, the Google Digital Humanities Research Award “Language models for historical Slovenian”, and the Research Programme P2-0103 “Knowledge Technologies” funded by the Slovenian Research Agency.
Cite this article
Erjavec, T. The IMP historical Slovene language resources. Lang Resources & Evaluation 49, 753–775 (2015). https://doi.org/10.1007/s10579-015-9294-7