The IMP historical Slovene language resources

Erjavec, Tomaž

doi:10.1007/s10579-015-9294-7

Project Notes
Published: 21 January 2015

Volume 49, pages 753–775, (2015)
Language Resources and Evaluation

Abstract

The paper describes the combined results of several projects which constitute a basic language resource infrastructure for printed historical Slovene. The IMP language resources consist of a digital library, an annotated corpus and a lexicon, which are interlinked and uniformly encoded following the Text Encoding Initiative Guidelines. The library holds about 650 units (mostly complete books) consisting of facsimiles with 45,000 pages as well as hand-corrected and structured transcriptions. The hand-annotated corpus has 300,000 tokens, where each word is tagged with its modernised word form, lemma, part-of-speech and, in cases of archaic words, its nearest contemporary equivalents. This information was extracted into the lexicon, which also covers an extended target-annotated corpus, resulting in 20,000 lemmas (of these 4,000 archaic) with 50,000 modern word forms and 70,000 attested forms. We have also developed a program to modernise, tag and lemmatise historical Slovene, and annotated the digital library with it, producing an automatically annotated corpus of 15 million words. To serve the humanities, the digital library and lexicon are available for reading and browsing on the web and the corpora via a concordancer. For language technology research and development the resources are available in source TEI XML under the Creative Commons Attribution licence. The paper presents the IMP resources, available from http://nl.ijs.si/imp/, the process of their compilation, encoding and dissemination, and concludes with directions for future research.
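As a rough illustration of the per-word annotation described above (modernised form, lemma and part-of-speech on each token), the following minimal sketch reads such information from a TEI-style fragment. The markup is an assumption made for illustration only: the use of `<choice>`/`<orig>`/`<reg>`, the `lemma` and `ana` attributes, and the sample sentence itself are invented here and are not taken from the IMP schema or corpus.

```python
import xml.etree.ElementTree as ET

TEI = "{http://www.tei-c.org/ns/1.0}"
XML_LANG = "{http://www.w3.org/XML/1998/namespace}lang"

# Invented sample, NOT taken from the IMP corpus. The element and attribute
# layout is a hypothetical encoding, not the actual IMP one.
SAMPLE = """
<s xmlns="http://www.tei-c.org/ns/1.0" xml:lang="sl-bohoric">
  <w lemma="človek" ana="#Ncmsn"><choice><orig>zhlovek</orig><reg>človek</reg></choice></w>
  <w lemma="biti" ana="#Va-r3s-n"><choice><orig>je</orig><reg>je</reg></choice></w>
</s>
"""


def read_tokens(xml_text):
    """Return one dict per <w> element with its attested and modernised form,
    lemma, morphosyntactic tag and the language tag of the enclosing sentence."""
    root = ET.fromstring(xml_text)
    lang = root.get(XML_LANG)
    tokens = []
    for w in root.iter(TEI + "w"):
        tokens.append({
            "lang": lang,
            "attested": w.findtext(f"{TEI}choice/{TEI}orig"),
            "modern": w.findtext(f"{TEI}choice/{TEI}reg"),
            "lemma": w.get("lemma"),
            # Drop the leading '#' of the pointer to the tag definition.
            "msd": (w.get("ana") or "").lstrip("#"),
        })
    return tokens


for token in read_tokens(SAMPLE):
    print(token["attested"], token["modern"], token["lemma"], token["msd"])
```

Grouping the resulting records by lemma and modern form is one way a lexicon of attested forms could be derived from such a corpus; the paper itself should be consulted for the actual extraction procedure.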

Notes

We have registered with IANA the sl-bohoric sub-language tag for texts using the Bohorič alphabet, as well as sl-dajnko and sl-metelko for the Dajnko and Metelko alphabets, in which a number of books were also printed. These tags can be used, e.g., as the value of the @xml:lang attribute.
TEI stylesheets are available at http://www.tei-c.org/Tools/Stylesheets/ and support the conversion of many formats to and from TEI P5.
We have given Slovene glosses to all the 546 elements defined by the TEI. The localisation is available at http://nl.ijs.si/tei/locale/.
The site with the AHLib DL and links to the Graz resources is http://nl.ijs.si/ahlib/.
While the original rtf2tei converter is no longer maintained, we have developed a new web-based Word to TEI P5 converter, which upgrades the current TEI XSLT stylesheets for docx2tei and tei2html conversion. The Web service accepts Office Open XML documents and converts them to TEI P5 and from there to HTML, and stores the complete results on a unique URL. It is available at http://nl.ijs.si/tei/convert/.
The project can be found at http://sl.wikisource.org/wiki/Wikivir:Slovenska_leposlovna_klasika.
The wiki2tei converter is available as a Web service at http://nl.ijs.si/wiki2tei, with its PHP source on https://github.com/domenk/wiki2tei. It has an interface both in Slovene and English and can also convert the works of other Wikisource languages, but the quality of the output depends on how much their format differs from the Wikivir one.
A previous, smaller and less well-annotated version is described in Erjavec (2012b).
Page sampling is unusual, as pages do not correspond to linguistically motivated units, but in the case of historical texts it is difficult to come up with a better alternative: some texts do not even distinguish divisions or paragraphs, or these are very long. Furthermore, the alignment of text samples with the facsimiles is also preserved in this way.
In IMPACT such lexicons, and the corpora they were based on, were also developed for other languages, e.g., for Polish (Bień 2014) and Spanish (Sánchez-Martínez et al. 2013), where an approach similar to ours was taken.
The aim was to keep goo300k and foo3M disjoint, so that the former can be used as the training set and the latter as a realistic test set in lexicon-related experiments.
Currently, the TEI lexicon contains 9 random examples per entry, and the HTML version 4. Including all examples would also be possible, but for very frequent words this would amount to a few thousand examples.
Sloleks is available for download under CC BY-NC from http://eng.slovenscina.eu/.
The home page of IMP is http://nl.ijs.si/imp/.
http://ota.ahds.ac.uk/.
The eZISS digital library home page is http://nl.ijs.si/e-zrc/.
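The note on lexicon examples above (9 random examples in the TEI lexicon, 4 in HTML) describes capping the number of corpus examples shown per entry. A minimal sketch of such a cap follows; the function name `pick_examples`, the fixed seed and the sample data are assumptions for illustration and do not describe the actual IMP tooling.

```python
import random


def pick_examples(concordances, limit, seed=0):
    """Return at most `limit` examples, drawn at random without replacement.

    A fixed seed keeps the selection reproducible between lexicon builds.
    """
    if len(concordances) <= limit:
        # Rare words: keep every attested example, in corpus order.
        return list(concordances)
    rng = random.Random(seed)
    return rng.sample(list(concordances), limit)


# Hypothetical data: a frequent word with thousands of hits, a rare one with two.
frequent = [f"example {i}" for i in range(3000)]
rare = ["example a", "example b"]

tei_examples = pick_examples(frequent, limit=9)   # cap used for the TEI lexicon
html_examples = pick_examples(frequent, limit=4)  # cap used for the HTML view
print(len(tei_examples), len(html_examples), len(pick_examples(rare, limit=9)))
```

Sampling without replacement ensures that no example is repeated within an entry, while words with fewer examples than the cap simply keep all of them.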

References

Arhar, Š. (2009). Učni korpus SSJ in leksikon besednih oblik za slovenščino (The SSJ training corpus and word form lexicon for Slovene). Jezik in Slovstvo, 54(3–4), 43–56.
Bień, J. S. (2014). The IMPACT project Polish Ground-Truth texts as a DjVu corpus. Cognitive Studies (Études Cognitives), 14, 75–84. http://bc.klf.uw.edu.pl/381/
Christ, O. (1994). A modular and flexible architecture for an integrated corpus query system. In Proceedings of COMPLEX ’94: 3rd conference on computational lexicography and text research, Budapest, Hungary (pp. 23–32).
Clausner, C., Pletschacher, S., & Antonacopoulos, A. (2011). Aletheia—An advanced document layout and text ground-truthing system for production environments. In IEEE Xplore Digital Library (pp. 48–52).
Dudczak, A., Kmieciak, M., & Werla, M. (2012). Creation of textual versions of historical documents from Polish digital libraries. In Lecture notes in computer science (Vol. 7489, pp. 89–94). Berlin: Springer.
Erjavec, T. (2007). An architecture for editing complex digital documents. In Proceedings of INFuture’07 “digital information and heritage” (pp. 105–114). University of Zagreb.
Erjavec, T. (2011). Automatic linguistic annotation of historical language: ToTrTaLe and XIX century Slovene. In Proceedings of the 5th ACL-HLT workshop on language technology for cultural heritage, social sciences, and humanities, association for computational linguistics, Portland, OR, USA (pp. 33–38). http://www.aclweb.org/anthology/W11-1505
Erjavec, T. (2012a). MULTEXT-East: Morphosyntactic resources for Central and Eastern European languages. Language Resources and Evaluation, 46(1), 131–142.
Erjavec, T. (2012b). The goo300k corpus of historical Slovene. In Proceedings of the eighth international conference on language resources and evaluation (LREC’12), European Language Resources Association (ELRA), Istanbul, Turkey.
Erjavec, T. (2014). Posodabljanje starejše slovenščine (Modernising historical Slovene). Uporabna informatika, 21(4), 186–195.
Erjavec, T., & Fišer, D. (2014). Recepcija virov starejše slovenščine IMP (The reception of the IMP historical language resources). In 33. simpozij Obdobja, Znanstvena založba Filozofske fakultete, Ljubljana.
Erjavec, T., Vodopivec, I., & Kodrič, M. (2011). Izdelava korpusa starejših slovenskih besedil v okviru projekta IMPACT (The compilation of a corpus of historical Slovene texts in the scope of the IMPACT project). In 30. simpozij Obdobja, Znanstvena založba Filozofske fakultete, Ljubljana (pp. 121–127).
Hladnik, M. (2009). Infrastruktura slovenistične literarne vede (The infrastructure of Slovene literary studies). In 28. simpozij Obdobja, Znanstvena založba Filozofske fakultete, Ljubljana (pp. 161–169). http://www.centerslo.net/files/file/simpozij/simp28/Hladnik
Kenter, T., Erjavec, T., Žorga, M., & Fišer, D. (2012). Lexicon construction and corpus annotation of historical language with the CoBaLT editor. In Proceedings of the EACL workshop on language technology for cultural heritage, social sciences, and humanities, ACL, Avignon, France.
Krauwer, S. (2003). The basic language resource kit (BLARK) as the first milestone for the language resources roadmap. In Proceedings of the international workshop speech and computer (SPECOM 2003) (pp. 8–15). Moscow State Linguistic University. http://www.elsnet.org/dox/krauwer-specom2003
Kroch, A., Santorini, B., & Diertani, A. (2004). Penn–Helsinki parsed corpus of Early Modern English. http://www.ling.upenn.edu/hist-corpora/PPCEME-RELEASE-2/
Kučera, K. (1999). The general principles of the diachronic part of the Czech National Corpus. In Text, speech and dialogue, lecture notes in computer science (Vol. 1692, pp. 841–842). Berlin: Springer.
Ljubešić, N., Erjavec, T., & Fišer, D. (2014). Standardizing tweets with character-level machine translation. In A. Gelbukh (Ed.), 15th International conference, CICLing 2014, proceedings, part II, lecture notes in computer science (Vol. 8404, pp. 164–175). Berlin: Springer.
Piotrowski, M. (2012). Natural language processing for historical texts. Synthesis lectures on human language technologies. San Rafael, USA: Morgan & Claypool Publishers.
Pletschacher, S., & Antonacopoulos, A. (2010). The PAGE (page analysis and ground-truth elements) format framework. In Proceedings of the 20th international conference on pattern recognition (ICPR), Istanbul.
Prunč, E. (2007). Deutsch-slowenische/kroatische Übersetzung 1848–1918. Ein Werkstättenbericht. Wiener Slavistisches Jahrbuch, 53, 63–176.
Rayson, P., Archer, D., Baron, A., Culpeper, J., & Smith, N. (2007). Tagging the bard: Evaluating the accuracy of a modern POS tagger on Early Modern English corpora. In Corpus linguistics conference (CL2007). University of Birmingham, Birmingham, UK. http://ucrel.lancs.ac.uk/publications/CL2007/paper/192_Paper
Reffle, U. (2011). Efficiently generating correction suggestions for garbled tokens of historical language. Natural Language Engineering, 17, 265–282.
Rychlý, P. (2007). Manatee/bonito—A modular corpus manager. In Proceedings of 1st workshop on recent advances in Slavonic natural language processing (pp. 65–70). Brno: Masaryk University.
Sánchez-Marco, C., Boleda, G., Fontana, J. M., & Domingo, J. (2010). Annotation and representation of a diachronic corpus of Spanish. In Proceedings of the seventh conference on language resources and evaluation (LREC’10), ELRA, Valletta, Malta.
Sánchez-Martínez, F., Martínez-Sempere, I., Ivars-Ribes, X., & Carrasco, R. C. (2013). An open diachronic corpus of historical Spanish. Language Resources and Evaluation, 47(4), 1327–1342.
Scheible, S., Whitt, R. J., Durrell, M., & Bennett, P. (2011). A gold standard corpus of Early Modern German. In Proceedings of the 5th linguistic annotation workshop, association for computational linguistics, Portland, Oregon, USA (pp. 124–128). http://www.aclweb.org/anthology/W11-0415
Scherrer, Y., & Erjavec, T. (2013). Modernizing historical Slovene words with character-based SMT. In BSNLP 2013—4th Biennial workshop on Balto-Slavic natural language processing, Sofia.
TEI Consortium (Ed.). (2012). Guidelines for electronic text encoding and interchange. TEI Consortium. http://www.tei-c.org/P5/
Verdonik, D., Kosem, I., Vitez, A. Z., Krek, S., & Stabej, M. (2013). Compilation, transcription and usage of a reference speech corpus: The case of the Slovene corpus GOS. Language Resources and Evaluation, 47(4), 1031–1048.
Wallenberg, J., Ingason, A. K., Sigurthsson, E. F., & Rögnvaldsson, E. (2011). Icelandic Parsed Historical Corpus (IcePaHC), version 0.9. http://www.linguist.is/icelandic_treebank

Acknowledgments

The author thanks the two anonymous reviewers for useful comments and suggestions. For collaborating in the compilation of the IMP language resources thanks are due to Kozma Ahačič, Simon Atelšek, Tina Benčina, Katja Cingerle, Metod Čepar, Darja Fišer, Miran Hladnik, Alenka Jelovšek, Urška Kamenšek, Alenka Kavčič Čolić, Domen Kermc, Maša Kodrič, Simon Krek, Nina Mikulin, Matija Ogrin, Daša Pokorn, Erich Prunč, Zala Šmid, Ines Vodopivec and Maja Žorga Dulmin. The work presented in this paper was supported by the Austrian Academy project “Deutsch-slowenische/kroatische Übersetzung 1848–1918”, the EU IMPACT project “Improving Access to Text”, the Google Digital Humanities Research Award “Language models for historical Slovenian”, and the Research Programme P2-0103 “Knowledge Technologies” funded by the Slovenian Research Agency.

Author information

Authors and Affiliations

Department of Knowledge Technologies, Jožef Stefan Institute, Jamova cesta 39, 1000, Ljubljana, Slovenia
Tomaž Erjavec

Corresponding author

Correspondence to Tomaž Erjavec.

About this article

Cite this article

Erjavec, T. The IMP historical Slovene language resources. Lang Resources & Evaluation 49, 753–775 (2015). https://doi.org/10.1007/s10579-015-9294-7

Published: 21 January 2015
Issue Date: September 2015
DOI: https://doi.org/10.1007/s10579-015-9294-7
