Building and evaluating web corpora representing national varieties of English

  • Original Paper
Language Resources and Evaluation

Abstract

Corpora are essential resources for language studies, as well as for training statistical natural language processing systems. Although very large English corpora have been built, only relatively small corpora are available for many varieties of English. National top-level domains (e.g., .au, .ca) could be exploited to automatically build web corpora, but it is unclear whether such corpora would reflect the corresponding national varieties of English; i.e., would a web corpus built from the .ca domain correspond to Canadian English? In this article we build web corpora from national top-level domains corresponding to countries in which English is widely spoken. We then carry out statistical analyses of these corpora in terms of keywords, measures of corpus comparison based on the Chi-square test and spelling variants, and the frequencies of words known to be marked in particular varieties of English. We find evidence that the web corpora indeed reflect the corresponding national varieties of English. We then demonstrate, through a case study on the analysis of Canadianisms, that these corpora could be valuable lexicographical resources.

Notes

  1. http://www.queensu.ca/strathy/corpus.

  2. http://www.scottishcorpus.ac.uk/.

  3. http://corpus.byu.edu/glowbe/.

  4. http://www.lemurproject.org/clueweb09/.

  5. The data for ClueWeb09 is split into several sections. The number of documents for the top-level domains of interest differs greatly. For the .uk corpus we used the first four sections of ClueWeb09, to give a final corpus of roughly similar size to the ukWaC (Ferraresi et al. 2008). For all other domains, we used the first seven sections, to give final corpora of roughly one billion tokens for each of .au, .ca, and .us, which will feature prominently in our analysis in Sects. 5 and 6.

  6. http://www.lemurproject.org/clueweb09/languageIndentification.php.

  7. http://www.anc.org/data/oanc/.

  8. http://www.queensu.ca/strathy/corpus.

  9. The Australian National Corpus (Peters 2009, http://www.ausnc.org.au/) is an effort to build a larger corpus of Australian English, but is currently a collection of many corpora of diverse types—many of which are spoken—and so does not appear to be suitable for our purposes.

  10. In this method the Chi-square value is not used for statistical hypothesis testing.

  11. These observations for AmE, BrE, and CanE are based on OANC, BNC, and Strathy, respectively.

  12. The findings using this method could, nevertheless, be influenced if such spelling preferences differed over time, or between text types that are more common on the web than in national corpora (e.g., social media text).

  13. This is necessary because the chi-square measure of corpus similarity is only applicable to equal-size corpora.

  14. http://wordlist.aspell.net/.

  15. By looking down the columns in Table 5 we see that each national corpus is also most similar to the corresponding web corpus for each similarity measure, except in the case of OANC for chi-square, where OANC is more similar to .ca than .us.

  16. Although we do not have a sufficiently large national corpus for Australia to use in these experiments, we measured the similarity between sub-corpora from .au and the national corpora. Here we found that for chi-square, .au is most similar to Strathy, but for cosine it is most similar to BNC.

  17. http://dare.wisc.edu/sites/dare.wisc.edu/files/DAREindex.htm.

  18. http://www.australiannationaldictionary.com/.

  19. http://www.dchp.ca/.

  20. http://aspell.net/.

  21. Bank of Canadian English, http://www.dchp.ca/.

  22. Although we have used CanOx as a source of Canadianisms for the analyses in this study, Dollinger and Gaylie (2015) argue that this dictionary might have incorrectly labelled some terms as Canadianisms. For example, term deposit—a fixed-term, fixed-interest deposit at a financial institution—appears to be widely used outside of Canada, and could be one such case.
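
Notes 10 and 13 describe the chi-square measure as a similarity score computed over equal-size corpora rather than as a hypothesis test. A minimal Python sketch in the spirit of Kilgarriff (2001) is given here. The function names, the `top_n` cutoff of 500, and the truncation used to equalise corpus sizes are illustrative assumptions, not details taken from the article.

```python
from collections import Counter


def equal_size(tokens_a, tokens_b):
    """Truncate both token lists to the length of the shorter one (assumed strategy)."""
    n = min(len(tokens_a), len(tokens_b))
    return tokens_a[:n], tokens_b[:n]


def chi_square_distance(tokens_a, tokens_b, top_n=500):
    """Sum (observed - expected)^2 / expected over the top_n most frequent
    words of the combined corpora. Lower values mean more similar corpora.
    The value is used only as a score, not for significance testing."""
    if not tokens_a or not tokens_b:
        raise ValueError("both corpora must be non-empty")
    freq_a, freq_b = Counter(tokens_a), Counter(tokens_b)
    size_a, size_b = len(tokens_a), len(tokens_b)
    total = size_a + size_b
    joint = freq_a + freq_b
    score = 0.0
    for word, joint_count in joint.most_common(top_n):
        for observed, size in ((freq_a[word], size_a), (freq_b[word], size_b)):
            # Expected count if the word were spread evenly by corpus size.
            expected = joint_count * size / total
            score += (observed - expected) ** 2 / expected
    return score


if __name__ == "__main__":
    ca = "the colour of the centre".split()
    us = "the color of the center of town".split()
    ca, us = equal_size(ca, us)
    print(chi_square_distance(ca, us))
```

With real data, each web corpus would be sampled down to the size of the national corpus it is compared against before the score is computed.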

References

  • Atkins, B. T. S. (2010). The DANTE database: Its contribution to English lexical research, and in particular to complementing the FrameNet data. In G. M. de Schryver (Ed.), A way with words: Recent advances in lexical theory and analysis. A Festschrift for Patrick Hanks. Kampala: Menha Publishers.

  • Banko, M., & Brill, E. (2001). Scaling to very very large corpora for natural language disambiguation. In Proceedings of 39th annual meeting of the Association for Computational Linguistics (ACL 2001) (pp. 26–33), Toulouse, France.

  • Barber, K. (Ed.). (2005). Canadian Oxford dictionary (2nd ed.). Oxford: Oxford University Press.

  • Baroni, M., & Bernardini, S. (2004). BootCaT: Bootstrapping corpora and terms from the Web. In Proceedings of the fourth international conference on language resources and evaluation (LREC 2004) (pp. 1313–1316), Lisbon, Portugal.

  • Baroni, M., Bernardini, S., Ferraresi, A., & Zanchetta, E. (2009). The WaCky Wide Web: A collection of very large linguistically processed Web-crawled corpora. Language Resources and Evaluation, 43(3), 209–226.

  • Baroni, M., Chantree, F., Kilgarriff, A., & Sharoff, S. (2008). Cleaneval: A competition for cleaning Web pages. In Proceedings of the sixth international conference on language resources and evaluation (LREC 2008) (pp. 638–643), Marrakech, Morocco.

  • Baroni, M., Kilgarriff, A., Pomikálek, J., & Rychlý, P. (2006). WebBootCaT: A web tool for instant corpora. In Proceedings XII EURALEX International Congress (EURALEX 2006) (pp. 123–131), Torino, Italy.

  • Bird, S., Loper, E., & Klein, E. (2009). Natural language processing with Python. Sebastopol, CA: O’Reilly Media Inc.

  • Brewington, B. E., & Cybenko, G. (2000). How dynamic is the web? In Proceedings of the 9th international world wide web conference (pp. 257–276), Amsterdam, Netherlands.

  • Burnard, L. (2000). The British National Corpus users reference guide. Oxford: Oxford University Computing Services.

  • Burnard, L. (2007). Reference guide for the British National Corpus (XML edition). Oxford: Oxford University Computing Services.

  • Cavnar, W. B., & Trenkle, J. M. (1994). N-gram-based text categorization. In Proceedings of the 3rd annual symposium on document analysis and information retrieval (SDAIR-94) (pp. 161–175), Las Vegas, USA.

  • Chambers, J. K. (2008). The tangled garden: Relics and vestiges in Canadian English. Anglistik, 19(2), 7–21. (special issue: Focus on Canadian English).

  • Clarke, C. L. A., Craswell, N., Soboroff, I., & Voorhees, E. M. (2011). Overview of the TREC 2011 Web Track. In Proceedings of the twentieth Text REtrieval Conference (TREC 2011). NIST special publication: SP 500-295.

  • Cook, P., & Hirst, G. (2012). Do Web corpora from top-level domains represent national varieties of English? In Actes des 11es Journées internationales d’Analyse statistique des Données Textuelles/Proceedings of the 11th international conference on textual data statistical analysis (pp. 281–293), Liège, Belgium.

  • Cook, P., & Lui, M. (2012). langid.py for better language modelling. In Proceedings of the Australasian Language Technology Association workshop 2012 (ALTA 2012) (pp. 107–112), Dunedin, New Zealand.

  • Davies, M. (2009). The 385+ million word Corpus of Contemporary American English (1990–2008+): Design, architecture, and linguistic insights. International Journal of Corpus Linguistics, 14(2), 159–190.

  • Dillon, G. (2010). Building webcorpora of academic prose with BootCaT. In Proceedings of the NAACL HLT 2010 sixth web as corpus workshop (pp. 26–31), Los Angeles.

  • Dollinger, S. (2016). Googleology as smart lexicography: Big messy data for better regional labels. Dictionaries: Journal of the Dictionary Society of North America, 37, 60–98.

  • Dollinger, S., & Clarke, S. (2012). On the autonomy and homogeneity of Canadian English. World Englishes, 31(4), 449–466.

  • Dollinger, S., & Gaylie, S. (2015). Canadianisms in Canadian desk dictionaries: Scope, accuracy, desiderata. Presented at the 20th Biennial Dictionary Society of North America Meeting (DSNA-20) and the 9th Studies in the History of the English Language Conference (SHEL-9), Vancouver, Canada.

  • Ferraresi, A., Zanchetta, E., Baroni, M., & Bernardini, S. (2008). Introducing and evaluating ukWaC, a very large web-derived corpus of English. In Proceedings of the 4th web as corpus workshop: Can we beat Google? (pp. 47–54), Marrakech, Morocco.

  • Green, E., & Peters, P. (1991). The Australian corpus project and Australian English. International Computer Archive of Modern English, 15, 37–53.

  • Hall, J. H. (Ed.). (2012). Dictionary of American regional English (Vol. V: SI-Z). Cambridge: The Belknap Press of Harvard University Press.

  • Kilgarriff, A. (2001). Comparing corpora. International Journal of Corpus Linguistics, 6(1), 97–133.

  • Kilgarriff, A. (2007). Googleology is bad science. Computational Linguistics, 33(1), 147–151.

  • Kilgarriff, A. (2009). Simple maths for keywords. In Proceedings of the corpus linguistics conference, Liverpool, UK.

  • Kilgarriff, A., Rychlý, P., Smrž, P., & Tugwell, D. (2004). The sketch engine. In Proceedings of Euralex (pp. 105–116), Lorient, France.

  • Kilgarriff, A., Reddy, S., Pomikálek, J., & Avinesh, P. V. S. (2010). A corpus factory for many languages. In Proceedings of the seventh conference on international language resources and evaluation (LREC 2010) (pp. 904–910), Valletta, Malta.

  • Ljubešić, N., & Klubička, F. (2014). bs, hr, srwac—Web corpora of Bosnian, Croatian and Serbian. In Proceedings of the 9th web as corpus workshop (WaC-9) (pp. 29–35), Gothenburg, Sweden.

  • Lui, M., & Baldwin, T. (2011). Cross-domain feature selection for language identification. In Proceedings of the fifth international joint conference on natural language processing (IJCNLP 2011) (pp. 553–561), Chiang Mai, Thailand.

  • Manning, C. D., Surdeanu, M., Bauer, J., Finkel, J., Bethard, S. J., & McClosky, D. (2014). The Stanford CoreNLP natural language processing toolkit. In Association for Computational Linguistics (ACL) system demonstrations (pp. 55–60).

  • Murphy, B., & Stemle, E. (2011). PaddyWaC: A minimally-supervised Web-corpus of Hiberno-English. In Proceedings of the first workshop on algorithms and resources for modelling of dialects and language varieties (pp. 22–29), Edinburgh, Scotland.

  • Passonneau, R. J., Ide, N., Su, S., & Stuart, J. (2014). Biber Redux: Reconsidering dimensions of variation in American English. In Proceedings of COLING 2014, the 25th international conference on computational linguistics: Technical papers (pp. 565–576), Dublin, Ireland.

  • Peirsman, Y., Geeraerts, D., & Speelman, D. (2010). The automatic identification of lexical variation between language varieties. Natural Language Engineering, 16(4), 469–491.

  • Peters, P. (2009). The architecture of a multipurpose Australian national corpus. In Selected Proceedings of the 2008 HCSNet workshop on designing an Australian National Corpus (pp. 1–9), Somerville, MA.

  • Pomikálek, J. (2011). Removing boilerplate and duplicate content from web corpora. Ph.D. thesis, Masaryk University.

  • Pomikálek, J., Jakubíček, M., & Rychlý, P. (2012). Building a 70 billion word corpus of English from ClueWeb. In Proceedings of the eighth international conference on language resources and evaluation (LREC 2012) (pp. 502–506), Istanbul, Turkey.

  • Ramson, W. S. (Ed.). (1988). The Australian National Dictionary: A dictionary of Australianisms on historical principles. Oxford: Oxford University Press.

  • Resnik, P., & Smith, N. A. (2003). The web as a parallel corpus. Computational Linguistics, 29(3), 349–380.

  • Rezapour Asheghi, N., Markert, K., & Sharoff, S. (2014). Semi-supervised graph-based genre classification for web pages. In Proceedings of TextGraphs-9: The workshop on graph-based methods for natural language processing (pp. 39–47), Doha, Qatar.

  • Roth, T. (2012). Using web corpora for the recognition of regional variation in standard German collocations. In Proceedings of the seventh web as corpus workshop (WAC7) (pp. 31–38), Lyon, France.

  • Schäfer, R., & Bildhauer, F. (2013). Web corpus construction. San Rafael, CA: Morgan and Claypool.

  • Schulz, S., Lyding, V., & Nicolas, L. (2013). STirWaC—Compiling a diverse corpus based on texts from the web for south Tyrolean German. In Proceedings of the 8th web as corpus workshop (WAC-8) (pp. 37–45), Lancaster, UK.

  • Sharoff, S. (2006). Creating general-purpose corpora using automated search engine queries. In M. Baroni & S. Bernardini (Eds.), Wacky! Working papers on the Web as Corpus (pp. 63–98), GEDIT, Bologna, Italy.

Acknowledgements

This research was started while the first author was a McKenzie Postdoctoral Fellow at The University of Melbourne. This research was financially supported by the University of Melbourne, the University of New Brunswick, and the Natural Sciences and Engineering Research Council of Canada.

Author information

Correspondence to Paul Cook.

About this article

Cite this article

Cook, P., Brinton, L.J. Building and evaluating web corpora representing national varieties of English. Lang Resources & Evaluation 51, 643–662 (2017). https://doi.org/10.1007/s10579-016-9378-z

  • DOI: https://doi.org/10.1007/s10579-016-9378-z
