Cross-Testing a Genre Classification Model for the Web

Santini, Marina

doi:10.1007/978-90-481-9178-9_5

Part of the book series: Text, Speech and Language Technology (TLTB, volume 42)

Abstract

The main aim of the experiments described in this chapter is to investigate ways of assessing the robustness and stability of an Automatic Genre Identification (AGI) model for the web. More specifically, a series of comparisons using four genre collections is illustrated and analysed. I call this comparative approach cross-testing.

Notes

1.
Many of them are available through the webgenrewiki, at <http://purl.org/net/webgenres>. Copyright of genre collections built with web material may vary according to national laws. The copyright of the web pages contained in the genre collections used in this chapter is held by the author/owner(s) of the web pages. These web pages are used for research purposes only.
2.
A quite long initial phase, if we consider that this research field was initiated in 1994 with Karlgren and Cutting’s extensively cited paper based on the Brown Corpus and discriminant analysis [18].
3.
Cf. Lee [19] for the application of these three levels to genre, following prototype theory.
4.
See <http://en.wikipedia.org/wiki/LOB_Corpus>, retrieved April 2009.
5.
See <http://en.wikipedia.org/wiki/Brown_Corpus>, retrieved April 2009.
6.
See <http://www.amazon.co.uk/Books-Categories/b/ref=sv_b_1?ie=UTF8&node=1025612>, retrieved April 2009.
7.
This model has already been presented to the genre community with a partial evaluation in Santini [27, 29, 33].
8.
In April 2005 – when the genre model described in this chapter was designed and built – Google could search 8,058,044,651 web pages.
9.
The concept of “noise” can be applied to different situations. For example, while in Stubbe et al. [37] “noise” refers to orthographical errors, in the present study “noise” refers to documents that straddle more than one genre and to documents that belong to no genre.
10.
As the authors point out, “By splitting the multi-labeled ML problem into 20 binary sub-problems, we got 20 unbalanced data sets with high numbers of negative and low number of positive examples. Sub-classifiers that would recognize only negative examples would still be highly accurate” [41].
11.
The objective sources are listed in Santini [29, Appendix A].
12.
The spreadsheet containing my standoff annotation is available at <http://sites.google.com/site/marinasantiniacademicsite/>: see my_manual_genre_labelling_1000SPIRIT_webpages_NOVEMBER2008_matching_with_the_initial_corpus.xls.
13.
It would be interesting to quantify the critical mass for genre annotation, i.e. to establish the point at which the majority agrees on a number of labels for the same document. It seems that genre annotation based on the agreement of a small number of people (2, 3, 4, or a few more) does not guarantee reliability. For instance, Mikael Gunnarsson made the following observations on the article genre included in the KI-04 corpus, which is defined as “Documents with longer passages of text, such as research articles, reviews, technical reports, or book chapters” [22]. In this class, Gunnarsson found a book announcement, a redirect page, a table of contents, a bibliography, three documents written in German, two commercial portrayals, two help pages, two discussion pages, one link list, and one personal homepage among the 127 articles (personal communication). Although intra-genre variation is, in my opinion, a positive characteristic, as is a certain degree of noise, Gunnarsson’s breakdown makes one wonder about the criteria for representing a genre class.
14.
Following Biber’s tradition [2], I had named them “text types” in my previous publications.
15.
For example, Rosso suggested that genre tags could be added (with a special genre-enabled tool) within social networks (personal communication).
16.
Cf. also the interesting experiments with “heavy” visual features carried out by Levering et al. [20] in order to detect subgenres.
17.
“The notion of function is closely associated with the notion of situation. A primary motivation for analysis of the components of situation is the desire to link the functions of particular linguistic features to variation in the communicative situation” [2, p. 33].
18.
See all the Excel files whose names start with “GIMs” at <http://sites.google.com/site/marinasantiniacademicsite/>.
19.
Chi-square calculator: <http://www.physics.csbsju.edu/cgi-bin/stats/contingency_form.sh?nrow=2&ncolumn=2>, retrieved April 2009.
20.
The same method can be used for language identification and subject-based text classification.
21.
The spreadsheet containing the matches is available at <http://sites.google.com/site/marinasantiniacademicsite/>: see my_manual_genre_labelling_1000SPIRIT_webpages.xls.
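
The point quoted in note 10, that a sub-classifier recognizing only negative examples can still be highly accurate on an unbalanced binary sub-problem, can be illustrated with a minimal sketch. The counts below are hypothetical and are not taken from Vidulin et al. [41] or from this chapter. The function name always_negative_scores is likewise an assumption made for illustration only.

```python
def always_negative_scores(n_positive: int, n_negative: int) -> dict:
    """Score a degenerate classifier that labels every page as negative."""
    total = n_positive + n_negative
    true_negatives = n_negative  # every negative page is labelled correctly
    true_positives = 0           # no positive page is ever recognized
    accuracy = (true_positives + true_negatives) / total
    recall = true_positives / n_positive if n_positive else 0.0
    return {"accuracy": accuracy, "recall": recall}


# Hypothetical binary sub-problem: 1,000 pages, only 50 carry the genre label.
scores = always_negative_scores(n_positive=50, n_negative=950)
print(scores)
```

With these assumed counts, accuracy is high while recall on the positive class is zero, which is why accuracy alone can be a misleading measure for such sub-problems.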

References

1. Berninger, V., Y. Kim, and R. Ross. 2008. Building a document genre corpus: A profile of the KRYS I corpus. Corpus profiling for information retrieval and natural language processing. Workshop held in conjunction with IIiX 2008, 18 Oct 2008. London.
2. Biber, D. 1988. Variation across speech and writing. Cambridge, UK: Cambridge University Press.
3. Biber, D., and J. Kurjian. 2007. Towards a taxonomy of web registers and text types: A multi-dimensional analysis. In Corpus linguistics and the web, eds. M. Hundt, N. Nesselhauf, and C. Biewer, 109–131. Amsterdam-New York: Rodopi.
4. Blood, R. 2000. Weblogs: A history and perspective. Rebecca’s pocket. http://www.rebeccablood.net/essays/weblog_history.html. Accessed 7 Sep 2000.
5. Bruce, I. 2008. Academic writing and genre. A systematic analysis. London-New York: Continuum International Publishing Group Ltd.
6. Dewdney, N., C. VanEss-Dykema, and R. MacMillan. 2001. The form is the substance: Classification of genres in text. In Proceedings of the 39th Annual Meeting of the Association for Computational Linguistics and 10th Conference of the European Chapter of the Association for Computational Linguistics. Toulouse.
7. Dewe, J., J. Karlgren, and I. Bretan. 1998. Assembling a balanced corpus from the internet. In Proceedings of the 11th Nordic Conference of Computational Linguistics. Copenhagen.
8. Döring, N. 2002. Personal home pages on the web: A review of research. Journal of Computer-Mediated Communication (JCMC) 7(3).
9. Duda, R., J. Gaschnig, and P. Hart. 1979. Model design in the prospector consultant system for mineral exploration. In Expert systems in the micro-electronic age, ed. D. Michie, 153–167. Edinburgh: Edinburgh University Press. Reprinted in 1984.
10. Duda, R., P. Hart, and N. Nilsson. 1981. Subjective methods for rule-based inference system. In Readings in artificial intelligence, eds. B. Weber and N. Nilsson, 192–199. Palo Alto, CA: Tioga Publishing Company.
11. Freund, L. 2008. Exploiting task-document relations in support of information retrieval in the workplace. Doctoral dissertation, Faculty of Information Studies, University of Toronto, Toronto. http://faculty.arts.ubc.ca/lfreund/Publications/Freund_Luanne_S_200811_PhD_thesis.pdf
12. Freund, L., C.L.A. Clarke, and E.G. Toms. 2006. Genre classification for IR in the workplace. In Proceedings of Information Interaction in Context (IIiX 2006). Copenhagen, Denmark.
13. Görlach, M. 2004. Text types and the history of English. Berlin-New York: Mouton de Gruyter.
14. Heyd, T. 2008. Email hoaxes: Form, function, genre ecology. Amsterdam; Philadelphia, PA: J. Benjamins Publishing Company.
15. Joho, H., and M. Sanderson. 2004. The SPIRIT collection: An overview of a large web collection. SIGIR Forum 38(2), December 2004.
16. Kanaris, I., and E. Stamatatos. 2007. Webpage genre identification using variable-length character n-grams. In Proceedings of the 19th IEEE Int. Conf. on Tools with Artificial Intelligence. Washington, DC.
17. Kanaris, I., and E. Stamatatos. 2009. Learning to recognize webpage genres. Information Processing and Management 45(5):499–512.
18. Karlgren, J., and D. Cutting. 1994. Recognizing text genre with simple metrics using discriminant analysis. In Proceedings of the 15th International Conference on Computational Linguistics (COLING 1994). Kyoto.
19. Lee, D. 2001. Genres, registers, text types, domains, and styles: Clarifying the concepts and navigating a path through the BNC Jungle. Language Learning & Technology 5(3):37–72.
20. Levering, R., M. Cutler, and L. Yu. 2008. Using visual features for fine-grained genre classification of web pages. In Proceedings of the 41st Hawaii International Conference on System Sciences. Big Island, Hawaii.
21. Mason, J., M. Shepherd, and J. Duffy. 2009. An n-gram based approach to automatically identifying web page genre. In Proceedings of the 42nd Annual Hawaii International Conference on System Sciences. Big Island, Hawaii.
22. Meyer zu Eissen, S., and B. Stein. 2004. Genre classification of web pages: User study and feasibility analysis. In Advances in artificial intelligence, eds. S. Biundo, T. Frühwirth, and G. Palm, 256–269. Berlin: Springer.
23. Rehm, G., M. Santini, M. Mehler, P. Braslavski, R. Gleim, A. Stubbe, S. Symonenko, M. Tavosanis, and V. Vidulin. 2008. Towards a reference corpus of web genres for the evaluation of genre identification systems. In Proceedings of LREC 2008, May 28–30. Marrakech, Morocco.
24. Rosso, M. 2008. User-based identification of Web genres. Journal of the American Society for Information Science and Technology 59(7):1053–1072.
25. Santini, M. 2005. Building on syntactic annotation: Labelling subordinate clauses. In Proceedings of the Workshop on Exploring Syntactically Annotated Corpora (held in conjunction with Corpus Linguistics 2005 Conference). Birmingham.
26. Santini, M. 2006. Common criteria for genre classification: Annotation and granularity. In Proceedings of the Workshop on Text-based Information Retrieval (TIR-06) (held in conjunction with ECAI 2006). Riva del Garda.
27. Santini, M. 2007a. Automatic genre identification: Towards a flexible classification scheme. BCS IRSG Symposium: Future Directions in Information Access 2007 (FDIA 2007) (held in conjunction with the European Summer School on IR (ESSIR 2007)), 28–29 Aug. Glasgow.
28. Santini, M. 2007b. Characterizing genres of web pages: Genre hybridism and individualization. In Proceedings of the 40th Hawaii International Conference on System Sciences (HICSS-40). Hawaii.
29. Santini, M. 2007c. Automatic identification of genre in web pages. PhD thesis, University of Brighton, Brighton.
30. Santini, M. 2008. Zero, single, or multi? Genre of web pages through the users’ perspective. Information Processing and Management 44(2):702–737.
31. Santini, M., and M. Rosso. 2008. Testing a genre-enabled application: A preliminary assessment. In Proceedings of Future Directions in Information Access (FDIA-2008). BCS, London.
32. Santini, M., and S. Sharoff. 2009. Web genre benchmark under construction. Journal for Language Technology and Computational Linguistics (JLCL) 24(1):129–145.
33. Santini, M., R. Power, and R. Evans. 2006. Implementing a characterization of genre for automatic genre identification of web pages. In Proceedings of the 21st International Conference on Computational Linguistics and 44th Annual Meeting of the Association for Computational Linguistics (ACL/COLING 2006). Main Conference Poster Paper. Sydney.
34. Shepherd, M., C. Watters, and A. Kennedy. 2004. Cybergenre: Automatic identification of home pages on the web. Journal of Web Engineering 3(3–4):236–251.
35. Stein, B., and S. Meyer zu Eissen. 2008. Retrieval models for genre classification. Scandinavian Journal of Information Systems (SJIS) 20(1):91–117.
36. Stubbe, A., and C. Ringlstetter. 2007. Recognizing genres. In Abstract Proceedings of the Colloquium “Towards a Reference Corpus of Web Genres” (held in conjunction with Corpus Linguistics 2007), 27 Jul 2007, eds. M. Santini and S. Sharoff. Birmingham.
37. Stubbe, A., C. Ringlstetter, and K. Schulz. 2007. Genre to classify noise – noise to classify genre. In Proceedings of the IJCAI-2007 Workshop on Analytics for Noisy Unstructured Text Data, 8 Jan 2007. Hyderabad, India. International Journal on Document Analysis and Recognition (IJDAR), Dec 2007.
38. Thelwall, M. 2008a. Text in social network web sites: A word frequency analysis of Live Spaces. First Monday 13(2).
39. Thelwall, M. 2008b. Quantitative comparisons of search engine results. Journal of the American Society for Information Science and Technology 59(11):1702–1710.
40. Thelwall, M. 2008c. Extracting accurate and complete results from search engines: Case study Windows Live. Journal of the American Society for Information Science and Technology 59(1):38–50.
41. Vidulin, V., M. Luštrek, and M. Gams. 2007. Using genres to improve search engines. In Proceedings of Towards Genre-enabled Search Engines: The Impact of Natural Language Processing Workshop, Sept 2007. Borovets, Bulgaria.
42. Vidulin, V., M. Luštrek, and M. Gams. 2009. Multi-label approaches to web genre identification. Journal for Language Technology and Computational Linguistics (JLCL) 24(1):97–114.
43. Waltinger, U., and A. Mehler. 2009. The feature difference coefficient: Classification by means of feature distributions. In Proceedings of the Conference on Text Mining Services (TMS 2009), 159–168. Leipzig, Germany.
44. Xu, J., Y. Cao, H. Li, N. Craswell, and Y. Huang. 2007. Searching documents based on relevance and type. In Proceedings of ECIR 2007. Rome, Italy.
45. Yeung, P., S. Büttcher, C. Clarke, and M. Kolla. 2007a. A Bayesian approach for learning document type relevance. ECIR 2007. Rome.
46. Yeung, P., C. Clarke, and S. Büttcher. 2007b. Improving retrieval accuracy by weighting document types with clickthrough data. SIGIR’07. Amsterdam, The Netherlands.
47. Yeung, P., L. Freund, and C. Clarke. 2007c. X-Site: A workplace search tool for software engineers. System demo presented at the 30th International ACM SIGIR Conference. Amsterdam.

Author information

Authors and Affiliations

KYH, Stockholm, Sweden
Marina Santini

Corresponding author

Correspondence to Marina Santini.

Editor information

Editors and Affiliations

Text Technology/Applied Comp. Ling., Bielefeld University, Universitätsstrasse 25, Bielefeld, 33615, Germany
Alexander Mehler
LS2 9JT Leeds, United Kingdom
Serge Sharoff
Varvsgatan 25, Stockholm, 117 29, Sweden
Marina Santini

Appendix

The appendix contains tables describing the genre corpora used in the experiments explained in Chapter 5.

1.1 SANTINIS (2,480 Web Pages). Cf. Also Santini ([29], Appendix B)

Table 5.15 SANTINIS composition

1.2 KI-04 (1,205 Web Pages). Cf. Also Meyer zu Eissen and Stein [22]

Table 5.16 KI-04 composition

1.3 HGC (Used 1,180 for Crosstesting). Cf. Also Stubbe et al. [37]

Table 5.17 HGC composition

1.4 MGC (1,539 Web Pages). Cf. Also Vidulin et al. [41]

Table 5.18 MGC composition

1.5 100 Facets

Table 5.19 100 facets

About this chapter

Cite this chapter

Santini, M. (2010). Cross-Testing a Genre Classification Model for the Web. In: Mehler, A., Sharoff, S., Santini, M. (eds) Genres on the Web. Text, Speech and Language Technology, vol 42. Springer, Dordrecht. https://doi.org/10.1007/978-90-481-9178-9_5

DOI: https://doi.org/10.1007/978-90-481-9178-9_5
Published: 16 August 2010
Publisher Name: Springer, Dordrecht
Print ISBN: 978-90-481-9177-2
Online ISBN: 978-90-481-9178-9
eBook Packages: Computer Science, Computer Science (R0)
