Cross-Testing a Genre Classification Model for the Web

Genres on the Web

Part of the book series: Text, Speech and Language Technology (TLTB, volume 42)

Abstract

The main aim of the experiments described in this chapter is to investigate ways of assessing the robustness and stability of an Automatic Genre Identification (AGI) model for the web. More specifically, a series of comparisons using four genre collections is illustrated and analysed. I call this comparative approach cross-testing.

Notes

  1. Many of them are available through the webgenrewiki, at <http://purl.org/net/webgenres>. Copyright of genre collections built with web material may vary according to national laws. The copyright of the web pages contained in the genre collections used in this chapter is held by the author/owner(s) of the web pages. These web pages are used for research purposes only.

  2. Quite a long initial phase, considering that this research field was initiated in 1994 with Karlgren and Cutting’s extensively cited paper based on the Brown Corpus and discriminant analysis [18].

  3. Cf. Lee [19] for the application of these three levels following the prototype theory to genre.

  4. See <http://en.wikipedia.org/wiki/LOB_Corpus>, retrieved April 2009.

  5. See <http://en.wikipedia.org/wiki/Brown_Corpus>, retrieved April 2009.

  6. See <http://www.amazon.co.uk/Books-Categories/b/ref=sv_b_1?ie=UTF8&node=1025612>, retrieved April 2009.

  7. This model has already been presented to the genre community with a partial evaluation in Santini [27, 29, 33].

  8. In April 2005 – when the genre model described in this chapter was designed and built – Google could search 8,058,044,651 web pages.

  9. The concept of “noise” can be applied to different situations. For example, while in Stubbe et al. [37] “noise” refers to orthographical errors, in the present study “noise” refers to documents that straddle more than one genre and to documents that belong to no genre.

  10. As the authors point out, “By splitting the multi-labeled ML problem into 20 binary sub-problems, we got 20 unbalanced data sets with high numbers of negative and low number of positive examples. Sub-classifiers that would recognize only negative examples would still be highly accurate” [41].

  11. The objective sources are listed in Santini [29, Appendix A].

  12. The spreadsheet containing my standoff annotation is available at <http://sites.google.com/site/marinasantiniacademicsite/>: see my_manual_genre_labelling_1000SPIRIT_webpages_NOVEMBER2008_matching_with_the_initial_corpus.xls.

  13. It would be interesting to define the critical mass for genre annotation, i.e. to establish the point at which the majority agrees on a number of labels for the same document. It seems that genre annotation based on the agreement of a small number of people (2, 3, 4, or a few more) does not guarantee reliability. For instance, Mikael Gunnarsson made the following observations on the article genre included in the KI-04 corpus, which is defined as “Documents with longer passages of text, such as research articles, reviews, technical reports, or book chapters” [22]. In this class, Gunnarsson found a book announcement, a redirect page, a table of contents, a bibliography, three documents authored in German, 2 commercial portrayals, 2 help pages, 2 discussion pages, 1 link list, and 1 personal homepage among the 127 articles (personal communication). Although intra-genre variation is, in my opinion, a positive characteristic, as is a certain degree of noise, after Gunnarsson’s breakdown one might wonder about the criteria for representing a genre class.

  14. Following Biber’s tradition [2], I had named them “text types” in my previous publications.

  15. For example, Rosso suggested that genre tags could be added (with a special genre-enabled tool) within social networks (personal communication).

  16. Cf. also the interesting experiments with “heavy” visual features carried out by Levering et al. [20] in order to detect subgenres.

  17. “The notion of function is closely associated with the notion of situation. A primary motivation for analysis of the components of situation is the desire to link the functions of particular linguistic features to variation in the communicative situation” [2, p. 33].

  18. See all the Excel files whose names start with “GIMs” at <http://sites.google.com/site/marinasantiniacademicsite/>.

  19. Chi-square calculator: <http://www.physics.csbsju.edu/cgi-bin/stats/contingency_form.sh?nrow=2&ncolumn=2>, retrieved April 2009.

  20. The same method can be used for language identification and subject-based text classification.

  21. The spreadsheet containing the matches is available at <http://sites.google.com/site/marinasantiniacademicsite/>: see my_manual_genre_labelling_1000SPIRIT_webpages.xls.
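Two of the notes above lend themselves to a small worked sketch. Note 10 quotes the observation that a sub-classifier recognising only negative examples can still be highly accurate on an unbalanced binary sub-problem, and note 19 points to a chi-square calculator for a 2×2 contingency table. The Python below illustrates both under stated assumptions: the function names and all counts (50 positive pages out of 1,000, and the 90/10 versus 70/30 table) are hypothetical and are not taken from the chapter, and the chi-square is the plain Pearson statistic without continuity correction, which may or may not match what the cited calculator computes.

```python
from typing import Sequence


def accuracy(gold: Sequence[int], predicted: Sequence[int]) -> float:
    """Fraction of examples whose predicted label equals the gold label."""
    return sum(g == p for g, p in zip(gold, predicted)) / len(gold)


def recall(gold: Sequence[int], predicted: Sequence[int], positive: int = 1) -> float:
    """Fraction of gold-positive examples that were predicted positive."""
    on_positives = [p for g, p in zip(gold, predicted) if g == positive]
    if not on_positives:
        return 0.0
    return sum(p == positive for p in on_positives) / len(on_positives)


def chi_square_2x2(a: int, b: int, c: int, d: int) -> float:
    """Pearson chi-square (no continuity correction) for the table [[a, b], [c, d]]."""
    n = a + b + c + d
    denominator = (a + b) * (c + d) * (a + c) * (b + d)
    if denominator == 0:
        raise ValueError("every row and column total must be non-zero")
    return n * (a * d - b * c) ** 2 / denominator


# Note 10 (hypothetical counts): one binary sub-problem with 50 positive
# and 950 negative pages, and a classifier that always answers "negative".
gold = [1] * 50 + [0] * 950
always_negative = [0] * len(gold)
print(accuracy(gold, always_negative))  # 0.95
print(recall(gold, always_negative))    # 0.0

# Note 19 (hypothetical counts): rows are two collections, columns are
# correctly / incorrectly classified pages.
print(chi_square_2x2(90, 10, 70, 30))   # 12.5
```

With one degree of freedom, the conventional critical value at the 0.05 level is 3.841, so a statistic of 12.5 on these made-up counts would indicate a significant difference. The first part shows why accuracy alone says little on unbalanced sub-problems: the trivial classifier scores 0.95 while finding none of the positive pages.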

References

  1. Berninger, V., Y. Kim, and R. Ross. 2008. Building a document genre corpus: A profile of the KRYS I corpus. Corpus profiling for information retrieval and natural language processing. Workshop Held in Conjunction with IIiX 2008, 18th Oct 2008. London.

  2. Biber, D. 1988. Variation across speech and writing. Cambridge, UK: Cambridge University Press.

  3. Biber, D., and J. Kurjian. 2007. Towards a taxonomy of web registers and text types: A multi-dimensional analysis. In Corpus linguistics and the web, eds. M. Hundt, N. Nesselhauf, and C. Biewer, 109–131. Amsterdam – New York: Rodopi.

  4. Blood, R. 2000. Weblogs: A history and perspective. Rebecca’s pocket. http://www.rebeccablood.net/essays/weblog_history.html. Accessed 7 Sep 2000.

  5. Bruce, I. 2008. Academic writing and genre. A systematic analysis. London-New York: Continuum International Publishing Group Ltd.

  6. Dewdney, N., C. Vaness-Dikema, and R. Macmillan. 2001. The form is the substance: Classification of genres in text. In Proceedings of the 39th Annual Meeting of the Association for Computational Linguistics and 10th Conference of the European Chapter of the Association for Computational Linguistics. Toulouse.

  7. Dewe, J., J. Karlgren, and I. Bretan. 1998. Assembling a balanced corpus from the internet. In Proceedings of the 11th Nordic Conference of Computational Linguistics. Copenhagen.

  8. Döring, N. 2002. Personal home pages on the web: A review of research. Journal of Computer-Mediated Communication (JCMC) 7(3).

  9. Duda, R., J. Gaschnig, and P. Hart. 1979. Model design in the prospector consultant system for mineral exploration. In Expert systems in the micro-electronic age, ed. D. Michie, 153–167. Edinburgh: Edinburgh University Press. Reprinted in 1984.

  10. Duda, R., P. Hart, and N. Nilsson. 1981. Subjective methods for rule-based inference system. In Readings in artificial intelligence, eds. B. Weber and N. Nilsson, 192–199. Palo Alto, CA: Tioga Publishing Company.

  11. Freund, L. 2008. Exploiting task-document relations in support of information retrieval in the workplace. Doctoral dissertation, Faculty of Information Studies, University of Toronto, Toronto. http://faculty.arts.ubc.ca/lfreund/Publications/Freund_Luanne_S_200811_PhD_thesis.pdf

  12. Freund, L., C.L.A. Clarke, and E.G. Toms. 2006. Genre classification for IR in the workplace. In Proceedings of Information Interaction in Context (IIiX 2006) Copenhagen, Denmark.

  13. Görlach, M. 2004. Text types and the history of English. Berlin-New York: Mouton de Gruyter.

  14. Heyd, T. 2008. Email Hoaxes. Form, function, genre ecology. Amsterdam; Philadelphia, PA: J. Benjamins Publishing Company.

  15. Joho, H., and M. Sanderson. 2004. The SPIRIT collection: An overview of a large web collection. SIGIR Forum, 38(2), December 2004.

  16. Kanaris, I. and E. Stamatatos. 2007. Webpage genre identification using variable-length character n-grams. In Proceedings of the 19th IEEE Int. Conf. on Tools with Artificial Intelligence. Washington, DC.

  17. Kanaris, I., and E. Stamatatos. 2009. Learning to recognize webpage genres. Information Processing and Management 45(5):499–512.

  18. Karlgren, J., and D. Cutting. 1994. Recognizing text genre with simple metrics using discriminant analysis. In Proceedings of the 15th International Conference on Computational Linguistics (COLING 1994). Kyoto.

  19. Lee, D. 2001. Genres, registers, text types, domains, and styles: Clarifying the concepts and navigating a path through the BNC Jungle. Language Learning & Technology 5(3):37–72.

  20. Levering, R., M. Cutler, and L. Yu. 2008. Using visual features for fine-grained genre classification of web pages. In Proceedings of the 41st Hawaii International Conference on System Sciences. Big Island, Hawaii.

  21. Mason, J., M. Shepherd, and J. Duffy. 2009. An n-gram based approach to automatically identifying web page genre. In Proceedings of the 42nd Annual Hawaii International Conference on System Sciences. Big Island, Hawaii.

  22. Meyer zu Eissen, S., and B. Stein. 2004. Genre classification of web pages: User study and feasibility analysis. In Advances in artificial intelligence, eds. S. Biundo, T. Frühwirth, and G. Palm, 256–269. Berlin: Springer.

  23. Rehm, G., M. Santini, A. Mehler, P. Braslavski, R. Gleim, A. Stubbe, S. Symonenko, M. Tavosanis, and V. Vidulin. 2008. Towards a reference corpus of web genres for the evaluation of genre identification systems. In Proceedings of LREC 2008, May 28–30. Marrakech, Morocco.

  24. Rosso, M. 2008. User-based identification of Web genres. Journal of the American Society for Information Science and Technology 59(7):1053–1072.

  25. Santini, M. 2005. Building on syntactic annotation: Labelling subordinate clauses. In Proceedings of the Workshop on Exploring Syntactically Annotated Corpora (held in conjunction with Corpus Linguistics 2005 Conference). Birmingham.

  26. Santini, M. 2006. Common criteria for genre classification: Annotation and granularity. In Proceedings of the Workshop on Text-based Information Retrieval (TIR-06) (held in conjunction with ECAI 2006). Riva del Garda.

  27. Santini, M. 2007a. Automatic genre identification: Towards a flexible classification scheme. BCS IRSG Symposium: Future Directions in Information Access 2007 (FDIA 2007) (held in conjunction with the European Summer School on IR (ESSIR 2007)), Tuesday, 28th and Wednesday, 29th of Aug. Glasgow.

  28. Santini, M. 2007b. Characterizing genres of web pages: Genre hybridism and individualization. In Proceedings of the 40th Hawaii International Conference on System Sciences (HICSS-40). Hawaii.

  29. Santini, M. 2007c. Automatic identification of genre in web pages. PhD thesis, University of Brighton, Brighton.

  30. Santini, M. 2008. Zero, single, or multi? Genre of web pages through the users’ perspective. Information Processing and Management 44(2):702–737.

  31. Santini, M., and M. Rosso. 2008. Testing a genre-enabled application: A preliminary assessment. In Proceedings of Future Direction in Information Access (FDIA-2008). BCS, London.

  32. Santini, M., and S. Sharoff. 2009. Web genre benchmark under construction. Journal for Language Technology and Computational Linguistics (JLCL) 24(1):129–145.

  33. Santini, M., R. Power, and R. Evans. 2006. Implementing a characterization of genre for automatic genre identification of web pages. In Proceedings of the 21st International Conference on Computational Linguistics and 44th Annual Meeting of the Association for Computational Linguistics (ACL/COLING 2006). Main Conference Poster Paper. Sydney.

  34. Shepherd, M., C. Watters, and A. Kennedy. 2004. Cybergenre: Automatic identification of home pages on the web. Journal of Web Engineering 3(3–4):236–251.

  35. Stein, B., and S. Meyer zu Eissen. 2008. Retrieval Models for Genre Classification. Scandinavian Journal of Information Systems (SJIS) 20(1):91–117.

  36. Stubbe, A., and C. Ringlstetter. 2007. Recognizing Genres. In Abstract Proceedings of the Colloquium “Towards a Reference Corpus of Web Genres” (held in conjunction with Corpus Linguistics 2007), 27 Jul 2007, eds. M. Santini and S. Sharoff. Birmingham.

  37. Stubbe, A., C. Ringlstetter, and K. Schulz. 2007. Genre to classify noise – noise to classify genre. In Proceedings of the IJCAI-2007 Workshop on Analytics for Noisy Unstructured Text Data, 8 Jan 2007. Hyderabad, India. International Journal on Document Analysis and Recognition (IJDAR), Dec 2007.

  38. Thelwall, M. 2008a. Text in social network web sites: A word frequency analysis of Live Spaces. First Monday 13(2).

  39. Thelwall, M. 2008b. Quantitative comparisons of search engine results. Journal of the American Society for Information Science and Technology 59(11):1702–1710.

  40. Thelwall, M. 2008c. Extracting accurate and complete results from search engines: Case study Windows Live. Journal of the American Society for Information Science and Technology 59(1):38–50.

  41. Vidulin, V., M. Luštrek, and M. Gams. 2007. Using genres to improve search engines. In Proceedings of Towards Genre-enabled Search Engines: The Impact of Natural Language Processing Workshop, Sept 2007. Borovets, Bulgaria.

  42. Vidulin, V., M. Luštrek, and M. Gams. 2009. Multi-label approaches to web genre identification. Journal for Language Technology and Computational Linguistics (JLCL) 24(1):97–114.

  43. Waltinger, U., and A. Mehler. 2009. The feature difference coefficient: Classification by means of feature distributions. In Proceedings of the Conference on Text Mining Services (TMS 2009), 159–168. Leipzig, Germany.

  44. Xu, J., Y. Cao, H. Li, N. Craswell, and Y. Huang. 2007. Searching documents based on relevance and type. In Proceedings of ECIR 2007. Rome, Italy.

  45. Yeung, P., S. Büttcher, C. Clarke, and M. Kolla. 2007a. A Bayesian approach for learning document type relevance. ECIR 2007. Rome.

  46. Yeung, P., C. Clarke, and S. Büttcher. 2007b. Improving retrieval accuracy by weighting document types with clickthrough data. SIGIR’07. Amsterdam, The Netherlands.

  47. Yeung, P., L. Freund, and C. Clarke. 2007c. X-Site: A workplace search tool for software engineers. System demo presented at the 30th International ACM SIGIR Conference. Amsterdam.


Author information

Corresponding author

Correspondence to Marina Santini.

Appendix

The appendix contains tables describing the genre corpora used in the experiments explained in Chapter 5.

1.1 SANTINIS (2,480 Web Pages). Cf. Also Santini ([29], Appendix B)

Table 5.15 SANTINIS composition

1.2 KI-04 (1,205 Web Pages). Cf. Also Meyer zu Eissen and Stein [22]

Table 5.16 KI-04 composition

1.3 HGC (Used 1,180 for Crosstesting). Cf. Also Stubbe et al. [37]

Table 5.17 HGC composition

1.4 MGC (1,539 Web Pages). Cf. Also Vidulin et al. [41]

Table 5.18 MGC composition

1.5 100 Facets

Table 5.19 100 facets

Copyright information

© 2010 Springer Science+Business Media B.V.

About this chapter

Cite this chapter

Santini, M. (2010). Cross-Testing a Genre Classification Model for the Web. In: Mehler, A., Sharoff, S., Santini, M. (eds) Genres on the Web. Text, Speech and Language Technology, vol 42. Springer, Dordrecht. https://doi.org/10.1007/978-90-481-9178-9_5

  • DOI: https://doi.org/10.1007/978-90-481-9178-9_5

  • Publisher Name: Springer, Dordrecht

  • Print ISBN: 978-90-481-9177-2

  • Online ISBN: 978-90-481-9178-9

  • eBook Packages: Computer Science, Computer Science (R0)
