Comparing web-crawled and traditional corpora

  • Original Paper
Language Resources and Evaluation

Abstract

Using a multi-dimensional (MD) analysis of register variability, the study compares two corpora of Czech: Koditex, a “traditional” corpus carefully designed using various sources with rich metadata, and Araneum Bohemicum Maximum, a web-crawled corpus with an opportunistic composition representative of the “searchable” web. Both types of corpora are projected onto the space induced by the MD model, with the main objective being to find out whether they overlap in the linguistic variation they cover, or whether one introduces some specific variation which cannot be found in the other. We also document a crucial methodological point which has broader relevance for MD analyses in general, namely that texts have to be of similar lengths in order for their scores on the dimensions to be comparable. Results indicate that some traditional text categories, such as journalism or non-fiction, are characterized by language phenomena which are equally well covered by web-crawled data, though of course traditional corpora keep their edge in terms of the richness of the accompanying metadata. But overall, the range of variation in Koditex is broader as it contains texts which have no adequate substitute (i.e. texts with a comparable set of linguistic characteristics, regardless of their extratextual label) in data acquired through general-purpose web-crawling techniques. These include informal conversations, private correspondence, some types of fiction, but also user-generated content (comments on Facebook, forums etc.).

Notes

  1. We consider the following projects as prototypes of traditional corpora: BNC for English (Aston and Burnard 1998) or NKJP for Polish (Górski and Łaziński 2012), each containing a majority of written texts and some proportion of spoken and/or web communication.

  2. We selected web texts for Koditex based on broad genres derived from extratextual metadata, as opposed to more fine-grained registers (as in Biber and Egbert 2016), for two reasons: (1) a typology of web registers in Czech has not been established to date, and (2) while compiling Koditex, we had no information about the web texts other than their source, so we could not classify them by register before the MDA was carried out.

  3. A similar motivation for using text excerpts can be found in the Brown corpus (Francis and Kučera 1964).

  4. The data set is available via the TROLLing repository (doi: https://doi.org/10.18710/QAJKZW). It also includes the full list of linguistic features employed.

  5. It should be pointed out that apart from general opportunistic web-crawled corpora such as Araneum Bohemicum, there are also specialized web corpora concentrating on specific domains (corpora of tweets etc.), typically requiring more targeted approaches to data collection, e.g. through provided custom APIs.

  6. Cf. https://stats.nic.cz/reports/2013/ (visited November 2019).

  7. As a matter of fact, in the domain of hypertext media, the concept of “entire text” is problematic anyway.

  8. We did not take any special measures to deal with the noise in the web-crawled data as we wanted to keep the operationalizations identical.

  9. The 2% limit is derived from the class quota in the Koditex corpus, which is at least 200,000 words, i.e. 2.21% of the corpus, or a minimum of 67 chunks, i.e. 2.04% of the corpus. This range ensures that even in the worst case scenario, none of the text classes in the Koditex corpus would be fully excluded as outliers.

  10. Notice that these properties are characteristic of a catalog (list, table, or other condensed data presentation format) and similar text categories. These are indeed abundant on the web. They are also not very linguistically interesting.

  11. Again, the claim is not that there is e.g. no actual written private correspondence in web-crawled data (that would hardly be surprising), but that the web-crawled data does not even yield texts which would be linguistically equivalent to and could act as surrogates for private correspondence.

  12. See e.g. https://en.wikipedia.org/wiki/Languages_used_on_the_Internet (visited October 2019).
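
The arithmetic behind note 9 can be checked with a short Python sketch. It uses only the figures quoted in the note (200,000 words as 2.21% of the corpus, 67 chunks as 2.04%). The function names are illustrative, the implied corpus totals are back-calculated approximations rather than values reported by the authors, and the sketch assumes that the 2% limit is the maximum share of data that may be discarded as outliers.

```python
def implied_total(count: float, share: float) -> float:
    """Back-calculate a total from a part and the fraction it represents."""
    if not 0 < share <= 1:
        raise ValueError("share must be a fraction in (0, 1]")
    return count / share


# Assumption: the 2% limit is the largest share of data treated as outliers.
OUTLIER_LIMIT = 0.02

# Figures quoted in note 9 for the minimum class quota in Koditex.
quota = {
    "words": (200_000, 0.0221),
    "chunks": (67, 0.0204),
}


def survives_trimming(share: float, limit: float = OUTLIER_LIMIT) -> bool:
    """True if a class with this share cannot be removed entirely
    when at most `limit` of the data is discarded."""
    return share > limit


for unit, (count, share) in quota.items():
    total = implied_total(count, share)
    print(f"{unit}: implied total ~{total:,.0f}, "
          f"class survives trimming: {survives_trimming(share)}")
```

Since both quoted shares lie above 2%, even the smallest text class is larger than the portion that can be discarded, which is the worst-case guarantee the note describes.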

References

  • Anthony, L. (2018). AntCorGen. Tokyo: Waseda University. Retrieved November 23, 2018, from http://www.laurenceanthony.net/software.

  • Aston, G., & Burnard, L. (1998). The BNC handbook: Exploring the British national corpus with SARA. Edinburgh: Edinburgh University Press.

  • Baron, N. (2010). Always on: Language in an online and mobile world (1st ed.). Oxford: Oxford University Press.

  • Baroni, M., Bernardini, S., Ferraresi, A., & Zanchetta, E. (2009). The WaCky wide web: A collection of very large linguistically processed web-crawled corpora. Language Resources and Evaluation, 43(3), 209–226. https://doi.org/10.1007/s10579-009-9081-4.

  • Baroni, M., Kilgarriff, A., Pomikálek, J., & Rychlý, P. (2006). WebBootCaT: A web tool for instant corpora. In Proceedings of the EuraLex Conference (pp. 123–132).

  • Benešová, L., Křen, M., & Waclawicová, M. (2013). ORAL2013: Representative corpus of informal spoken Czech. Praha: Institute of the Czech National Corpus. FF UK. Retrieved March 18, 2020, from http://www.korpus.cz.

  • Benko, V. (2014). Aranea: Yet another family of (comparable) web corpora. In International Conference on Text, Speech, and Dialogue (pp. 257–264). Berlin: Springer.

  • Benko, V. (2016a). Two years of Aranea: Increasing counts and tuning the pipeline. In LREC (pp. 4245–4248).

  • Benko, V. (2016b). Feeding the “Brno Pipeline”: The case of Araneum Slovacum. RASLAN 2016 Recent Advances in Slavonic Natural Language Processing, 10, 19–27.

  • Biber, D. (1988). Variation across speech and writing. Cambridge: Cambridge University Press.

  • Biber, D. (1993). Representativeness in corpus design. Literary and Linguistic Computing, 8(4), 243–257.

  • Biber, D. (1995). Dimensions of register variation: A cross-linguistic comparison. Cambridge: Cambridge University Press.

  • Biber, D. (2014). Using multi-dimensional analysis to explore cross-linguistic universals of register variation. Languages in Contrast, 14(1), 7–34. https://doi.org/10.1075/lic.14.1.02bib.

  • Biber, D., & Egbert, J. (2016). Register variation on the searchable web: A multi-dimensional analysis. Journal of English Linguistics, 44(2), 95–137. https://doi.org/10.1177/0075424216628955.

  • Čermák, F., Adamovičová, A., & Pešička, J. (2001). PMK: Prague spoken corpus. Praha: Institute of the Czech National Corpus. FF UK. Retrieved March 18, 2020, from http://www.korpus.cz.

  • Cvrček, V., Čermáková, A., & Křen, M. (2016). Nová koncepce synchronních korpusů psané češtiny. Slovo a slovesnost, 77(2), 83–101.

  • Cvrček, V., Komrsková, Z., Lukeš, D., Poukarová, P., Řehořková, A., & Zasina, A. J. (2018a). Variabilita češtiny: multidimenzionální analýza [Variability of Czech: A multi-dimensional analysis]. Slovo a slovesnost, 79(4), 293–321.

  • Cvrček, V., Komrsková, Z., Lukeš, D., Poukarová, P., Řehořková, A., & Zasina, A. J. (2018b). From extra- to intratextual characteristics: Charting the space of variation in Czech through MDA. Corpus Linguistics and Linguistic Theory. https://doi.org/10.1515/cllt-2018-0020.

  • Cvrček, V., Komrsková, Z., Lukeš, D., Poukarová, P., Řehořková, A., & Zasina, A. J. (forthcoming). Register variability of elicited texts.

  • Davies, M. (2018). The 14 Billion Word iWeb Corpus. Retrieved May 10, 2019, from https://www.english-corpora.org/iweb/.

  • Francis, W. N., & Kučera, H. (1964, 1979). Manual of information to accompany A Standard Corpus of Present-Day Edited American English, for use with Digital Computers. Brown Corpus Manual. Retrieved December 13, 2018, from http://clu.uni.no/icame/manuals/BROWN/INDEX.HTM.

  • Górski, R. L., & Łaziński, M. (2012). Reprezentatywność i zrównoważenie korpusu. In A. Przepiórkowski, M. Bańko, R. L. Górski, & B. Lewandowska-Tomaszczyk (Eds.), Narodowy korpus języka polskiego: praca zbiorowa (pp. 25–36). Warszawa: Wydawnictwo Naukowe PWN.

  • Grice, J. W. (2001). Computing and evaluating factor scores. Psychological Methods, 6(4), 430–450.

  • Herring, S. C. (2010). Computer-mediated conversation Part I: Introduction and overview. Language@Internet, 7(2). Retrieved March 18, 2020, from https://www.languageatinternet.org/articles/2010/2801.

  • Hladká, Z. (2002). BMK: Brno spoken corpus. Praha: Institute of the Czech National Corpus. FF UK. Retrieved March 18, 2020, from http://www.korpus.cz.

  • Hoffmannová, J., Homoláč, J., Chvalovská, E., Jílková, L., Kaderka, P., Mareš, P., et al. (2016). Stylistika mluvené a psané češtiny (1st ed.). Praha: Academia.

  • Ide, N., Reppen, R., & Suderman, K. (2002). The American National Corpus: More Than the Web Can Provide. In Proceedings of the Third Language Resources and Evaluation Conference (LREC) (pp. 839–844). Presented at the LREC 2002, Las Palmas, Canary Islands, Spain: Citeseer. Retrieved March 18, 2020, from http://www.lrec-conf.org/proceedings/lrec2002/pdf/303.pdf.

  • Jakubíček, M., Kilgarriff, A., Kovář, V., Rychlý, P., & Suchomel, V. (2013). The TenTen corpus family. In 7th International Corpus Linguistics Conference CL (pp. 125–127).

  • Kaderka, P. (2012). Dialog: corpus of broadcasted Czech discussions. Praha: Ústav pro jazyk český, AV ČR. Retrieved March 18, 2020, from http://www.korpus.cz.

  • Kilgarriff, A. (2001). Comparing corpora. International Journal of Corpus Linguistics, 6(1), 97–133.

  • Kilgarriff, A. (2012). Getting to know your corpus. In P. Sojka, A. Horák, I. Kopeček, & K. Pala (Eds.), Text, speech and dialogue (pp. 3–15). Berlin: Springer.

  • Kilgarriff, A., Reddy, S., Pomikálek, J., & Avinesh, P. V. S. (2010). A corpus factory for many languages. In Proceedings of the International Conference on Language Resources and Evaluation, LREC 2010, 17–23 May 2010, Valletta, Malta (pp. 17–23). Valletta, Malta. Retrieved March 18, 2020, from http://www.lrec-conf.org/proceedings/lrec2010/summaries/79.html.

  • Křen, M., Cvrček, V., Čapka, T., Čermáková, A., Hnátková, M., Jelínek, T., et al. (2016). SYN2015: Representative corpus of contemporary written Czech. In Proceedings of the Tenth International Conference on Language Resources and Evaluation (pp. 2522–2528). Presented at the LREC’16, Portorož: ELRA.

  • Leech, G. (2007). New resources, or just better old ones? The Holy Grail of representativeness. In M. Hundt, N. Nesselhauf, & C. Biewer (Eds.), Corpus Linguistics and the Web (pp. 133–149). Amsterdam: Rodopi.

  • Michelfeit, J., Pomikálek, J., & Suchomel, V. (2014). Text tokenisation using unitok. In 8th Workshop on Recent Advances in Slavonic Natural Language Processing, Brno, Tribun EU (pp. 71–75). Presented at the RASLAN 2014, Brno: NLP Consulting.

  • Piperski, A. (2017). Sum of Minimum Frequencies as a Measure of Corpus Similarity. Presented at the Corpus Linguistics 2017, Birmingham. Retrieved March 18, 2020, from https://www.birmingham.ac.uk/Documents/college-artslaw/corpus/conference-archives/2017/general/paper143.pdf.

  • Piperski, A. (2018). Corpus size and the robustness of measures of corpus distance. In Computational Linguistics and Intellectual Technologies (pp. 590–600). Presented at the Dialogue 2018, Moscow. http://www.dialog-21.ru/media/4327/piperskiach.pdf.

  • Pomikálek, J. (2011). Removing boilerplate and duplicate content from web corpora (PhD Thesis). Masarykova univerzita, Fakulta informatiky, Brno. Retrieved March 18, 2020, from https://is.muni.cz/th/o6om2/phdthesis.pdf.

  • R Core Team. (2018). R: A language and environment for statistical computing. Vienna: R Foundation for Statistical Computing. Retrieved March 18, 2020, from https://www.R-project.org/.

  • Rayson, P., & Garside, R. (2000). Comparing corpora using frequency profiling. In Proceedings of the Workshop on Comparing Corpora, Volume 9 (pp. 1–6). Stroudsburg, PA, USA: Association for Computational Linguistics. https://doi.org/10.3115/1117729.1117730.

  • Revelle, W. (2018). psych: Procedures for Psychological, Psychometric, and Personality Research. Evanston, IL: Northwestern University. Retrieved March 18, 2020, from https://CRAN.R-project.org/package=psych.

  • Sharoff, S. (2018). Functional text dimensions for the annotation of web corpora. Corpora, 13(1), 65–95. https://doi.org/10.3366/cor.2018.0136.

  • Suchomel, V., & Pomikálek, J. (2012). Efficient web crawling for large text corpora. In Proceedings of the seventh Web as Corpus Workshop (WAC7) (pp. 39–43). Lyon.

  • Válková, L., Waclawicová, M., & Křen, M. (2012). Balanced data repository of spontaneous spoken Czech. In Proceedings of the Eighth International Conference on Language Resources and Evaluation (LREC’12) (pp. 3345–3349). Presented at the LREC’12, Istanbul: ELRA. Retrieved March 18, 2020, from http://www.lrec-conf.org/proceedings/lrec2012/pdf/179_Paper.pdf.

  • Zasina, A. J., & Komrsková, Z. (2019). Koditex – korpus diverzifikovaných textů. Studie z aplikované lingvistiky - Studies in Applied Linguistics, 10(1), 127–132.

  • Zasina, A. J., Lukeš, D., Komrsková, Z., Poukarová, P., & Řehořková, A. (2018). Koditex: corpus of diversified texts. Prague: Institute of the Czech National Corpus. FF UK. Retrieved November 26, 2018, from http://www.korpus.cz.

Acknowledgements

This study was supported by the European Regional Development Fund project “Language Variation in the CNC” no. CZ.02.1.01/0.0/0.0/16_013/0001758 and has been, in part, funded by the Slovak KEGA and VEGA Grant Agencies, Project No. K-16-022-00 and 2/0017/17, respectively.

Author information

Corresponding author

Correspondence to Václav Cvrček.

Ethics declarations

Conflict of interest

The authors declare that they have no conflict of interest.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

About this article

Cite this article

Cvrček, V., Komrsková, Z., Lukeš, D. et al. Comparing web-crawled and traditional corpora. Lang Resources & Evaluation 54, 713–745 (2020). https://doi.org/10.1007/s10579-020-09487-4

  • DOI: https://doi.org/10.1007/s10579-020-09487-4
