The KAS corpus of Slovenian academic writing

  • Project Notes
Language Resources and Evaluation

Abstract

The paper presents the KAS corpus of Slovenian academic writing, which consists of almost 65,000 B.A./B.Sc., 16,000 M.A./M.Sc. and 1,600 Ph.D. theses (5 million pages or 1.7 billion tokens) gathered from the digital libraries of Slovenian higher education institutions via the Slovenian Open Science portal. We discuss the compilation, metadata, annotation, and distribution of the corpus, which can be queried freely through online concordancers and is openly available for research through the CLARIN.SI research infrastructure. We also present the tools for mono- and bilingual term extraction and for thesis structure annotation that were developed within the project, together with the manually annotated datasets used to train them. This specialised corpus, large by any standard, is a substantial language resource for the study of Slovenian academic writing and for terminology extraction.

Notes

  1. https://www.gov.si/assets/ministrstva/MK/Slovenski-jezik/ReNPJP-20-24/Resolution-on-the-national-programme-for-language-policy-20142018-.docx.

  2. http://nl.ijs.si/kas/.

  3. http://openscience.si/.

  4. https://www.cobiss.si/en/.

  5. https://tika.apache.org/.

  6. The four largest universities in Slovenia are the University of Ljubljana with 40,000 students, the University of Maribor with 14,000, the University of Primorska with 6,500, and the University of Nova Gorica with 800.

  7. http://www.udcc.org/.

  8. https://www.eurocris.org/cerif/main-features-cerif.

  9. https://github.com/clarinsi/reldi-tokeniser.

  10. https://github.com/clarinsi/reldi-tagger.

  11. https://github.com/clarinsi/kas-term.

  12. http://hdl.handle.net/11346/clarin.si-SRGD.

  13. http://hdl.handle.net/11346/clarin.si-UNOY.

  14. http://hdl.handle.net/11346/clarin.si-SZNC.

  15. https://github.com/clarinsi/kas-biterm.

  16. https://github.com/clarinsi/kas-struct.

  17. The TEI ODD schema and the derived XML schemas are available from https://github.com/clarinsi/TEI-schema.

  18. https://www.clarin.si/kontext/, source available from https://github.com/ufal/lindat-kontext.

  19. https://www.clarin.si/noske/, source available from https://nlp.fi.muni.cz/trac/noske.

  20. http://clarin.si/repository/xmlui.

  21. KAS is distributed under the CLARIN.SI ACA-ID-BY-NC-INF-NORED 1.0 licence, available at https://clarin.si/repository/xmlui/page/licence-aca-id-by-nc-inf-nored-1.0.

  22. https://www.anglistik.rwth-aachen.de/cms/Anglistik/Forschung/Laufende-Projekte/~dajzq/ACAW-Aachen-Corpus-of-Academic-Writing/.

  23. https://vlo.clarin.eu/?9.

  24. https://catalog.ldc.upenn.edu.

  25. http://www.elra.info/en/catalogues/.

  26. The PoS-tagged and lemmatized version of this corpus is also available for online querying in the Sketch Engine concordancer: https://app.sketchengine.eu/#dashboard?corpname=preloaded%2Faclarc_2.

  27. https://metashare.ut.ee/repository/browse/corpus-of-estonian-scientific-texts/c5f0fd7258e211e2a6e4005056b40024979168bd6780454f980729788272c9f2/.

  28. http://muchmore.dfki.de/resources1.htm.

  29. https://www.termania.net/slovarji/223/kas-slovar-splosnostrokovne-leksike.


Acknowledgements

The authors would like to thank the two anonymous reviewers for their helpful comments and suggestions. We are indebted to Milan Ojsteršek, Marko Ferme, Mladen Borovič, Borko Boškovič and Goran Hrovat from the University of Maribor for providing the source data from which the KAS corpus was built, to Špela Arhar Holdt and Maja Bitenc for conducting the annotation campaigns, to the students who participated in the annotation process, and to their supervisors, Urban Bren, Marko Robnik Šikonja, and Boštjan Udovič. The research described in the paper was supported by the project ARRS J6-7094 “Slovene scientific texts: resources and description” and by the research programme ARRS P2-0103 (B) “Knowledge Technologies”.

Author information

Corresponding author

Correspondence to Tomaž Erjavec.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

The work presented in this paper was supported by the basic research project J6-7094: “Slovenian scientific texts: resources and description” and by the research programme P2-0103 (B) “Knowledge Technologies”, financed by the Slovenian research agency.

About this article

Cite this article

Erjavec, T., Fišer, D. & Ljubešić, N. The KAS corpus of Slovenian academic writing. Lang Resources & Evaluation 55, 551–583 (2021). https://doi.org/10.1007/s10579-020-09506-4

  • DOI: https://doi.org/10.1007/s10579-020-09506-4
