In the Garden and in the Jungle

Comparing Genres in the BNC and Internet

Genres on the Web

Part of the book series: Text, Speech and Language Technology ((TLTB,volume 42))

Abstract

The jungle metaphor is quite common in genre studies. This paper presents a set of genre categories to compare web-derived corpora to their traditional counterparts (such as the BNC), and a set of methods for automatic assessment of the genre composition of newly collected webcorpora.

Notes

  1. Throughout this chapter I refer to BNC texts using their ids from the BNC Index, which is available from http://clix.to/davidlee00

  2. The quote refers to the purposes Michael Halliday intended for his “Introduction to Functional Grammar” [11].

  3. This example assumes that the function of narration is actively used in the respective societies for approximately the same purposes, but for modern corpora this can be taken for granted.

  4. A similar pattern is evident in the accuracy drop from about 90% in the “crisp” 7-webgenre corpus to 66% in a fuzzy KI-04 corpus in experiments described in [22].

  5. The BNC has been retagged with TreeTagger, the same tool used for tagging I-EN, so there was no difference in the tagset and tagging between the two corpora (this could have caused variations in accuracy otherwise).

  6. The use of keywords for genre detection has been studied, e.g., in [29] or [8].

References

  1. Allen, P., J.A. Bateman, and J.L. Delin. 1999. Genre and layout in multimodal documents: Towards an empirical account. In Proceedings of the AAAI Fall Symposium on Using Layout for the Generation, Understanding, or Retrieval of Documents, eds. R. Power and D. Scott, 27–34. Cape Cod, MA: American Association for Artificial Intelligence. URL http://www.fb10.uni-bremen.de/anglistik/langpro/projects/gem/downloads/allen-bateman-delin.PDF

  2. Baroni, M., and S. Bernardini. 2004. Bootcat: Bootstrapping corpora and terms from the web. In Proceedings of LREC2004. Lisbon.

  3. Baroni, M., and A. Kilgarriff. 2006. Large linguistically-processed Web corpora for multiple languages. In Companion Volume to Proceedings of the European Association of Computational Linguistics, 87–90. Trento.

  4. Baroni, M., F. Chantree, A. Kilgarriff, and S. Sharoff. 2008. Cleaneval: A competition for cleaning web pages. In Proceedings of the 6th Language Resources and Evaluation Conference, LREC 2008. Marrakech. URL http://corpus.leeds.ac.uk/serge/publications/lrec2008-cleaneval.pdf

  5. Biber, D. 1988. Variation across speech and writing. Cambridge: Cambridge University Press.

  6. Biber, D., and J. Kurjian. 2006. Towards a taxonomy of web registers and text types: A multidimensional analysis. In Corpus linguistics and the web, eds. M. Hundt, N. Nesselhauf, and C. Biewer, 109–131. Amsterdam: Rodopi.

  7. Braslavski, P. 2004. Document style recognition using shallow statistical analysis. In ESSLLI 2004 Workshop on Combining Shallow and Deep Processing for NLP, 1–9. Nancy.

  8. Crossley, S.A., and M. Louwerse. 2007. Multi-dimensional register classification using bigrams. International Journal of Corpus Linguistics 12(4):453–478.

  9. EAGLES. 1996. Preliminary recommendations on text typology. Technical Report EAG-TCWG-TTYP/P, Expert Advisory Group on Language Engineering Standards document. URL http://www.ilc.cnr.it/EAGLES96/texttyp/texttyp.html

  10. Ferraresi, A. 2007. Building a very large corpus of English obtained by web crawling: ukwac. Master’s thesis, University of Bologna.

  11. Halliday, M.A.K. 1985. An introduction to functional grammar. London: Edward Arnold.

  12. Jakobson, R. 1960. Linguistics and poetics. In Style in Language, ed. T.A. Sebeok, 350–377. Cambridge, MA: MIT Press.

  13. Joho, H., and M. Sanderson. 2004. The SPIRIT collection: An overview of a large web collection. SIGIR Forum 38(2):57–61. doi: http://doi.acm.org/10.1145/1041394.1041395

  14. Kessler, B., G. Nunberg, and H. Schütze. 1997. Automatic detection of text genre. In Proceedings of the 35th ACL/8th EACL, 32–38. Madrid.

  15. Kilgarriff, A. 2001. The web as corpus. In Proceedings of Corpus Linguistics 2001. Lancaster. URL http://www.itri.bton.ac.uk/techreports/ITRI-01-14.abs.html

  16. Lee, D. 2001. Genres, registers, text types, domains, and styles: Clarifying the concepts and navigating a path through the BNC jungle. Language Learning and Technology 5(3):37–72. URL http://llt.msu.edu/vol5num3/pdf/lee.pdf

  17. Macdonald, C., and I. Ounis. 2006. The TREC blogs06 collection: Creating and analysing a blog test collection. Technical Report TR-2006-224, Department of Computing Science, University of Glasgow. URL http://ir.dcs.gla.ac.uk/terrier/publications/macdonald06creating.pdf

  18. Martin, J.R. 1984. Language, register and genre. In Children Writing: Reader (ECT language studies: Children writing), ed. F. Christie, 21–30. Geelong, VIC: Deakin University Press.

  19. Mehler, A., and R. Gleim. 2006. The net for the graphs – towards webgenre representation for corpus linguistic studies. In WaCky! Working papers on the Web as Corpus, eds. M. Baroni and S. Bernardini. Bologna: Gedit.

  20. Meyer zu Eissen, S., and B. Stein. 2004. Genre classification of web pages. In Proceedings of the 27th German Conference on Artificial Intelligence. Ulm.

  21. Rehm, G., M. Santini, A. Mehler, P. Braslavski, R. Gleim, A. Stubbe, S. Symonenko, M. Tavosanis, and V. Vidulin. 2008. Towards a reference corpus of web genres for the evaluation of genre identification systems. In Proceedings of the 6th Language Resources and Evaluation Conference, LREC 2008. Marrakech.

  22. Santini, M. 2007. Automatic identification of genre in web pages. PhD thesis, University of Brighton.

  23. Sharoff, S. 2005. Methods and tools for development of the Russian reference corpus. In Corpus linguistics around the world, eds. D. Archer, A. Wilson, and P. Rayson, 167–180. Amsterdam: Rodopi.

  24. Sharoff, S. 2006. Creating general-purpose corpora using automated search engine queries. In WaCky! Working papers on the Web as Corpus, eds. M. Baroni and S. Bernardini. Bologna: Gedit. http://wackybook.sslmit.unibo.it

  25. Sharoff, S. 2007. Classifying web corpora into domain and genre using automatic feature identification. In Proceedings of Web as Corpus Workshop. Louvain-la-Neuve.

  26. Sinclair, J. 2003. Corpora for lexicography. In A practical guide to lexicography, ed. P. van Sterkenburg, 167–178. Amsterdam: Benjamins.

  27. Vidulin, V., M. Luštrek, and M. Gams. 2007. Using genres to improve search engines. In Proceedings of Towards Genre-Enabled Search Engines: The Impact of NLP, RANLP. URL http://dis.ijs.si/MitjaL/documents/Vidulin-Using_Genres_to_Improve_Search_Engines-RANLP-07-TGESE.pdf

  28. Witten, I.H., and E. Frank. 2005. Data Mining: Practical machine learning tools and techniques. San Francisco, CA: Morgan Kaufmann.

  29. Xiao, Z., and A. McEnery. 2005. Three genres in modern American English. Journal of English Linguistics 33(1):62–82.

Acknowledgements

I’m grateful to Silvia Bernardini, Adam Kilgarriff, Katja Markert and Marina Santini for useful discussions. The usual disclaimers apply. The tools for genre classification described in this chapter and the results of classifications of the Internet corpora are available from http://corpus.leeds.ac.uk/serge/webgenres/

Author information

Correspondence to Serge Sharoff.

Copyright information

© 2010 Springer Science+Business Media B.V.

About this chapter

Cite this chapter

Sharoff, S. (2010). In the Garden and in the Jungle. In: Mehler, A., Sharoff, S., Santini, M. (eds) Genres on the Web. Text, Speech and Language Technology, vol 42. Springer, Dordrecht. https://doi.org/10.1007/978-90-481-9178-9_7

  • DOI: https://doi.org/10.1007/978-90-481-9178-9_7

  • Publisher Name: Springer, Dordrecht

  • Print ISBN: 978-90-481-9177-2

  • Online ISBN: 978-90-481-9178-9

  • eBook Packages: Computer Science, Computer Science (R0)
