Abstract
The main aim of the experiments described in this chapter is to investigate ways of assessing the robustness and stability of an Automatic Genre Identification (AGI) model for the web. More specifically, a series of comparisons using four genre collections is illustrated and analysed. I call this comparative approach cross-testing.
Notes
- 1.
Many of them are available through the webgenrewiki, at <http://purl.org/net/webgenres>. Copyright of genre collections built with web material may vary according to national laws. The copyright of the web pages contained in the genre collections used in this chapter is held by the author/owner(s) of the web pages. These web pages are used for research purposes only.
- 2.
Quite a long initial phase, if we consider that this research field was initiated in 1994 with Karlgren and Cutting’s extensively cited paper based on the Brown Corpus and discriminant analysis [18].
- 3.
Cf. Lee [19] for the application of these three levels to genre, following prototype theory.
- 4.
See <http://en.wikipedia.org/wiki/LOB_Corpus>, retrieved April 2009.
- 5.
See <http://en.wikipedia.org/wiki/Brown_Corpus>, retrieved April 2009.
- 6.
See <http://www.amazon.co.uk/Books-Categories/b/ref=sv_b_1?ie=UTF8&node=1025612>, retrieved April 2009.
- 7.
- 8.
In April 2005 – when the genre model described in this chapter was designed and built – Google could search 8,058,044,651 web pages.
- 9.
The concept of “noise” can be applied to different situations. For example, while in Stubbe et al. [37] “noise” refers to orthographical errors, in the present study “noise” refers to documents that straddle more than one genre and to documents that belong to no genre.
- 10.
As the authors point out: “By splitting the multi-labeled ML problem into 20 binary sub-problems, we got 20 unbalanced data sets with high numbers of negative and low number of positive examples. Sub-classifiers that would recognize only negative examples would still be highly accurate” [41].
- 11.
The objective sources are listed in Santini [29, Appendix A].
- 12.
The spreadsheet containing my standoff annotation is available at <http://sites.google.com/site/marinasantiniacademicsite/>: see my_manual_genre_labelling_1000SPIRIT_webpages_NOVEMBER2008_matching_with_the_initial_corpus.xls.
- 13.
It would be interesting to define the critical mass for genre annotation, i.e. to establish the point at which the majority agrees on a number of labels for the same document. It seems that genre annotation based on the agreement of a small number of people (2, 3, 4, or a few more) does not guarantee reliability. For instance, Mikael Gunnarsson made the following observations on the article genre class included in the KI-04 corpus, which is defined as “Documents with longer passages of text, such as research articles, reviews, technical reports, or book chapters” [22]. In this class, Gunnarsson found a book announcement, a redirect page, a table of contents, a bibliography, three documents authored in German, two commercial portrayals, two help pages, two discussion pages, one link list, and one personal homepage among the 127 articles (personal communication). Although intra-genre variation is, in my opinion, a positive characteristic, as is a certain degree of noise, after Gunnarsson’s breakdown one might wonder about the criteria for representing a genre class.
- 14.
Following Biber’s tradition [2], I had named them “text types” in my previous publications.
- 15.
For example, Rosso suggested that genre tags could be added (with a special genre-enabled tool) within social networks (personal communication).
- 16.
Cf. also the interesting experiments with “heavy” visual features carried out by Levering et al. [20] in order to detect subgenres.
- 17.
“The notion of function is closely associated with the notion of situation. A primary motivation for analysis of the components of situation is the desire to link the functions of particular linguistic features to variation in the communicative situation” [2, p. 33].
- 18.
See all the Excel files whose names start with “GIMs” at <http://sites.google.com/site/marinasantiniacademicsite/>.
- 19.
Chi-square calculator: <http://www.physics.csbsju.edu/cgi-bin/stats/contingency_form.sh?nrow=2&ncolumn=2>, retrieved April 2009.
- 20.
The same method can be used for language identification and subject-based text classification.
- 21.
The spreadsheet containing the matches is available at <http://sites.google.com/site/marinasantiniacademicsite/>: see my_manual_genre_labelling_1000SPIRIT_webpages.xls.
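The point quoted in note 10 (a sub-classifier that recognizes only negative examples can still be highly accurate on an unbalanced binary sub-problem) can be illustrated with a minimal Python sketch. The counts are hypothetical: the collection size of 1,539 pages is taken from the MGC entry in the appendix, but the figure of 77 positive pages for one genre is an assumption made purely for illustration and is not reported in the chapter. The function names are likewise illustrative.

```python
def confusion_counts(y_true, y_pred):
    """Return (tp, fp, tn, fn) for binary labels, where 1 marks the positive class."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    return tp, fp, tn, fn


def accuracy(tp, fp, tn, fn):
    """Share of all pages that were labelled correctly."""
    return (tp + tn) / (tp + fp + tn + fn)


def recall(tp, fn):
    """Share of positive pages that were found (0.0 if there are none)."""
    return tp / (tp + fn) if (tp + fn) else 0.0


# Hypothetical binary sub-problem: one genre versus the rest.
n_pages = 1539      # size of the MGC collection (from the appendix)
n_positive = 77     # ASSUMPTION: illustrative count, not from the chapter

y_true = [1] * n_positive + [0] * (n_pages - n_positive)
y_pred = [0] * n_pages  # trivial sub-classifier: always answers "negative"

tp, fp, tn, fn = confusion_counts(y_true, y_pred)
acc = accuracy(tp, fp, tn, fn)
rec = recall(tp, fn)

print(f"accuracy = {acc:.3f}, recall on the genre = {rec:.3f}")
```

Under these assumed counts the trivial sub-classifier is right on every negative page and wrong on every positive one, so its accuracy stays high while its recall on the genre is zero. This is why accuracy alone says little about such sub-problems.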
References
Berninger, V., Y. Kim, and R. Ross. 2008. Building a document genre corpus: A profile of the KRYS I corpus. Corpus profiling for information retrieval and natural language processing. Workshop held in conjunction with IIiX 2008, 18 Oct 2008. London.
Biber, D. 1988. Variation across speech and writing. Cambridge, UK: Cambridge University Press.
Biber, D., and J. Kurjian. 2007. Towards a taxonomy of web registers and text types: A multi-dimensional analysis. In Corpus linguistics and the web, eds. M. Hundt, N. Nesselhauf, and C. Biewer, 109–131. Amsterdam-New York: Rodopi.
Blood, R. 2000. Weblogs: A history and perspective. Rebecca’s pocket. http://www.rebeccablood.net/essays/weblog_history.html. Accessed 7 Sep 2000.
Bruce, I. 2008. Academic writing and genre. A systematic analysis. London-New York: Continuum International Publishing Group Ltd.
Dewdney, N., C. Vaness-Dikema, and R. Macmillan. 2001. The form is the substance: Classification of genres in text. In Proceedings of the 39th Annual Meeting of the Association for Computational Linguistics and 10th Conference of the European Chapter of the Association for Computational Linguistics. Toulouse.
Dewe, J., J. Karlgren, and I. Bretan. 1998. Assembling a balanced corpus from the internet. In Proceedings of the 11th Nordic Conference of Computational Linguistics. Copenhagen.
Döring, N. 2002. Personal home pages on the web: A review of research. Journal of Computer-Mediated Communication (JCMC) 7(3).
Duda, R., J. Gaschnig, and P. Hart. 1979. Model design in the prospector consultant system for mineral exploration. In Expert systems in the micro-electronic age, ed. D. Michie, 153–167. Edinburgh: Edinburgh University Press. Reprinted in 1984.
Duda, R., P. Hart, and N. Nilsson. 1981. Subjective Bayesian methods for rule-based inference systems. In Readings in artificial intelligence, eds. B. Webber and N. Nilsson, 192–199. Palo Alto, CA: Tioga Publishing Company.
Freund, L. 2008. Exploiting task-document relations in support of information retrieval in the workplace. Doctoral dissertation, Faculty of Information Studies, University of Toronto, Toronto. http://faculty.arts.ubc.ca/lfreund/Publications/Freund_Luanne_S_200811_PhD_thesis.pdf
Freund, L., C.L.A. Clarke, and E.G. Toms. 2006. Genre classification for IR in the workplace. In Proceedings of Information Interaction in Context (IIiX 2006) Copenhagen, Denmark.
Görlach, M. 2004. Text types and the history of English. Berlin-New York: Mouton de Gruyter.
Heyd, T. 2008. Email Hoaxes. Form, function, genre ecology. Amsterdam; Philadelphia, PA: J. Benjamins Publishing Company.
Joho, H., and M. Sanderson. 2004. The SPIRIT collection: An overview of a large web collection. SIGIR Forum, 38(2), December 2004.
Kanaris, I., and E. Stamatatos. 2007. Webpage genre identification using variable-length character n-grams. In Proceedings of the 19th IEEE Int. Conf. on Tools with Artificial Intelligence. Washington, DC.
Kanaris, I., and E. Stamatatos. 2009. Learning to recognize webpage genres. Information Processing and Management 45(5):499–512.
Karlgren, J., and D. Cutting. 1994. Recognizing text genre with simple metrics using discriminant analysis. In Proceedings of the 15th International Conference on Computational Linguistics (COLING 1994). Kyoto.
Lee, D. 2001. Genres, registers, text types, domains, and styles: Clarifying the concepts and navigating a path through the BNC Jungle. Language Learning & Technology 5(3):37–72.
Levering, R., M. Cutler, and L. Yu. 2008. Using visual features for fine-grained genre classification of web pages. In Proceedings of the 41st Hawaii International Conference on System Sciences. Big Island, Hawaii.
Mason, J., M. Shepherd, and J. Duffy. 2009. An n-gram based approach to automatically identifying web page genre. In Proceedings of the 42nd Annual Hawaii International Conference on System Sciences. Big Island, Hawaii.
Meyer zu Eissen, S., and B. Stein. 2004. Genre classification of web pages: User study and feasibility analysis. In Advances in artificial intelligence, eds. S. Biundo, T. Frühwirth, and G. Palm, 256–269. Berlin: Springer.
Rehm, G., M. Santini, M. Mehler, P. Braslavski, R. Gleim, A. Stubbe, S. Symonenko, M. Tavosanis, and V. Vidulin. 2008. Towards a reference corpus of web genres for the evaluation of genre identification systems. In Proceedings of LREC 2008, May 28–30. Marrakech, Morocco.
Rosso, M. 2008. User-based identification of Web genres. Journal of the American Society for Information Science and Technology 59(7):1053–1072.
Santini, M. 2005. Building on syntactic annotation: Labelling subordinate clauses. In Proceedings of the Workshop on Exploring Syntactically Annotated Corpora (held in conjunction with Corpus Linguistics 2005 Conference). Birmingham.
Santini, M. 2006. Common criteria for genre classification: Annotation and granularity. In Proceedings of the Workshop on Text-based Information Retrieval (TIR-06) (held in conjunction with ECAI 2006). Riva del Garda.
Santini, M. 2007a. Automatic genre identification: Towards a flexible classification scheme. BCS IRSG Symposium: Future Directions in Information Access 2007 (FDIA 2007) (held in conjunction with the European Summer School on IR (ESSIR 2007)), 28–29 Aug 2007. Glasgow.
Santini, M. 2007b. Characterizing genres of web pages: Genre hybridism and individualization. In Proceedings of the 40th Hawaii International Conference on System Sciences (HICSS-40). Hawaii.
Santini, M. 2007c. Automatic identification of genre in web pages. PhD thesis, University of Brighton, Brighton.
Santini, M. 2008. Zero, single, or multi? Genre of web pages through the users’ perspective. Information Processing and Management 44(2):702–737.
Santini, M., and M. Rosso. 2008. Testing a genre-enabled application: A preliminary assessment. In Proceedings of Future Directions in Information Access (FDIA-2008). BCS, London.
Santini, M., and S. Sharoff. 2009. Web genre benchmark under construction. Journal for Language Technology and Computational Linguistics (JLCL) 24(1):129–145.
Santini, M., R. Power, and R. Evans. 2006. Implementing a characterization of genre for automatic genre identification of web pages. In Proceedings of the 21st International Conference on Computational Linguistics and 44th Annual Meeting of the Association for Computational Linguistics (ACL/COLING 2006). Main Conference Poster Paper. Sydney.
Shepherd, M., C. Watters, and A. Kennedy. 2004. Cybergenre: Automatic identification of home pages on the web. Journal of Web Engineering 3(3–4):236–251.
Stein, B., and S. Meyer zu Eissen. 2008. Retrieval Models for Genre Classification. Scandinavian Journal of Information Systems (SJIS) 20(1):91–117.
Stubbe, A., and C. Ringlstetter. 2007. Recognizing Genres. In Abstract Proceedings of the Colloquium “Towards a Reference Corpus of Web Genres” (held in conjunction with Corpus Linguistics 2007), 27 Jul 2007, eds. M. Santini and S. Sharoff. Birmingham.
Stubbe, A., C. Ringlstetter, and K. Schulz. 2007. Genre to classify noise – noise to classify genre. In Proceedings of the IJCAI-2007 Workshop on Analytics for Noisy Unstructured Text Data, 8 Jan 2007. Hyderabad, India. International Journal on Document Analysis and Recognition (IJDAR), Dec 2007.
Thelwall, M. 2008a. Text in social network web sites: A word frequency analysis of Live Spaces. First Monday 13(2).
Thelwall, M. 2008b. Quantitative comparisons of search engine results. Journal of the American Society for Information Science and Technology 59(11):1702–1710.
Thelwall, M. 2008c. Extracting accurate and complete results from search engines: Case study Windows Live. Journal of the American Society for Information Science and Technology 59(1):38–50.
Vidulin, V., M. Luštrek, and M. Gams. 2007. Using genres to improve search engines. In Proceedings of Towards Genre-Enabled Search Engines: The Impact of Natural Language Processing Workshop, Sept 2007. Borovets, Bulgaria.
Vidulin, V., M. Luštrek, and M. Gams. 2009. Multi-label approaches to web genre identification. Journal for Language Technology and Computational Linguistics (JLCL) 24(1):97–114.
Waltinger, U., and A. Mehler. 2009. The feature difference coefficient: Classification by means of feature distributions. In Proceedings of the Conference on Text Mining Services (TMS 2009), 159–168. Leipzig, Germany.
Xu, J., Y. Cao, H. Li, N. Craswell, and Y. Huang. 2007. Searching documents based on relevance and type. In Proceedings of ECIR 2007. Rome, Italy.
Yeung, P., S. Büttcher, C. Clarke, and M. Kolla. 2007a. A Bayesian approach for learning document type relevance. ECIR 2007. Rome.
Yeung, P., C. Clarke, and S. Büttcher. 2007b. Improving retrieval accuracy by weighting document types with clickthrough data. SIGIR’07. Amsterdam, The Netherlands.
Yeung, P., L. Freund, and C. Clarke. 2007c. X-Site: A workplace search tool for software engineers. System demo presented at the 30th International ACM SIGIR Conference. Amsterdam.
Appendix
The appendix contains tables describing the genre corpora used in the experiments explained in Chapter 5.
1.1 SANTINIS (2,480 Web Pages). Cf. also Santini ([29], Appendix B)
1.2 KI-04 (1,205 Web Pages). Cf. also Meyer zu Eissen and Stein [22]
1.3 HGC (Used 1,180 for Cross-Testing). Cf. also Stubbe et al. [37]
1.4 MGC (1,539 Web Pages). Cf. also Vidulin et al. [41]
1.5 100 Facets
Copyright information
© 2010 Springer Science+Business Media B.V.
About this chapter
Cite this chapter
Santini, M. (2010). Cross-Testing a Genre Classification Model for the Web. In: Mehler, A., Sharoff, S., Santini, M. (eds) Genres on the Web. Text, Speech and Language Technology, vol 42. Springer, Dordrecht. https://doi.org/10.1007/978-90-481-9178-9_5
DOI: https://doi.org/10.1007/978-90-481-9178-9_5
Publisher Name: Springer, Dordrecht
Print ISBN: 978-90-481-9177-2
Online ISBN: 978-90-481-9178-9
eBook Packages: Computer Science (R0)