Abstract
Web corpora have become an attractive source of linguistic content, yet for many languages they are still not available. This paper introduces two new annotated web corpora: the Croatian hrWaC and the Slovene slWaC. Both were built with a modified version of the standard “Web as Corpus” pipeline, adapted to the limited amount of available web data. The paper describes the modifications, focusing on content extraction from HTML pages, which combines high precision of the extracted language content with decent recall. It also investigates the text types of the acquired corpora using topic modeling, comparing the two corpora with each other and with ukWaC.
Copyright information
© 2011 Springer-Verlag Berlin Heidelberg
About this paper
Cite this paper
Ljubešić, N., Erjavec, T. (2011). hrWaC and slWac: Compiling Web Corpora for Croatian and Slovene. In: Habernal, I., Matoušek, V. (eds) Text, Speech and Dialogue. TSD 2011. Lecture Notes in Computer Science, vol 6836. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-642-23538-2_50
DOI: https://doi.org/10.1007/978-3-642-23538-2_50
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-642-23537-5
Online ISBN: 978-3-642-23538-2
eBook Packages: Computer Science, Computer Science (R0)