Abstract
This chapter outlines the state of the art of empirical and computational webgenre research. First, it highlights why the concept of genre is profitable for a range of disciplines. At the same time, it lists a number of recent interpretations that can inform and influence present and future genre research. Last but not least, it breaks down a series of open issues that relate to the modelling of the concept of webgenre in empirical and computational studies.
Notes
- 1.
More precisely, “in the Poetics, Aristotle writes, ‘the medium being the same, and the objects [of imitation] the same, the poet may imitate by narration – in which case he can either take another personality as Homer does, or speak in his own person, unchanged – or he may present all his characters as living and moving before us’ …. The Poetics sketches out the basic framework of genre; yet this framework remains loose, since Aristotle establishes genre in terms of both convention and historical observation, and defines genre in terms of both convention and purpose”. Glossary available at The Chicago School of Media Theory, retrieved April 2008.
- 2.
For instance, see “PAN’09: 3rd Int. PAN Workshop – 1st Competition on Plagiarism Detection”.
- 3.
For instance, see “ECIR 2009 Workshop on Contextual Information Access, Seeking and Retrieval Evaluation”.
- 4.
For instance, see “CyberEmotions” http://www.cyberemotions.eu/
- 5.
For instance, see “WI/IAT’09 Workshop on Web Personalization, Reputation and Recommender Systems”.
- 6.
- 7.
Global edition: http://www.timesonline.co.uk/tol/global/, or UK edition: http://www.timesonline.co.uk/tol/news/
- 8.
As noted by Bateman [9], functionality belongs to both paper and web documents.
- 9.
- 10.
This is another example where a difference in the domain of a text contributes to a difference in its genre.
- 11.
After collecting texts, developers of traditional corpora often introduce their own set of annotation layers, such as POS tagging, semantic or metatextual markup, but such layers are not taken from original texts in the form they have been published.
- 12.
See Lim et al. [56] for a study of the impact of different types of features including structural ones.
- 13.
There are indeed many other scholars in other parts of the world, such as the Mao school in ancient China, who have pondered the concept of genre.
References
Amitay, E., D. Carmel, A. Darlow, R. Lempel, and A. Soffer. 2003. The connectivity sonar: Detecting site functionality by structural patterns. In Proceedings of the 14th ACM Conference on Hypertext and Hypermedia, 38–47. University of Nottingham, UK.
Andersen, J. 2008. The concept of genre in information studies. Annual Review of Information Science & Technology 42:339, 2007.
Andersen, J. 2008. Bringing genre into focus: LIS and genre between people, texts, activity and situation. Bulletin of the American Society for Information Science and Technology 34(5):31–34.
Askehave, I., and A.E. Nielsen. 2005. Digital genres: A challenge to traditional genre theory. Information Technology & People 18(2):120–141.
Aston, G., and L. Burnard. 1998. The BNC handbook: Exploring the British National Corpus with SARA. Edinburgh: Edinburgh University Press.
Barnard, D.T., L. Burnard, S.J. DeRose, D.G. Durand, and C.M. Sperberg-McQueen. 1995. Lessons for the World Wide Web from the text encoding initiative. In Proceedings of the 4th international World Wide Web conference “The Web Revolution”. Boston, MA.
Baroni, M., and A. Kilgarriff. 2006. Large linguistically-processed Web corpora for multiple languages. In Companion Volume to Proceedings of the European Association of Computational Linguistics, 87–90. Trento.
Baroni, M., F. Chantree, A. Kilgarriff, and S. Sharoff. 2008. Cleaneval: A competition for cleaning web pages. In Proceedings of the 6th Language Resources and Evaluation Conference (LREC 2008). Marrakech.
Bateman, J.A. 2008. Multimodality and genre: A foundation for the systematic analysis of multimodal documents. London: Palgrave Macmillan.
Bateman, J.A., T. Kamps, J. Kleinz, and K. Reichenberger. 2001. Towards constructive text, diagram, and layout generation for information presentation. Computational Linguistics 27(3):409–449.
Biber, D. 1988. Variation across speech and writing. Cambridge: Cambridge University Press.
Biber, D. 1989. A typology of English texts. Linguistics 27:3–43.
Biber, D. 1995. Dimensions of register variation: A cross-linguistic comparison. Cambridge: Cambridge University Press.
Biber, D., U. Connor, and T.A. Upton. 2007. Discourse on the move: Using corpus analysis to describe discourse structure. Amsterdam: Benjamins.
Björneborn, L. 2004. Small-world link structures across an academic web space: A library and information science approach. PhD thesis, Royal School of Library and Information Science, Department of Information Studies, Denmark.
Björneborn, L. 2010. Genre connectivity and genre drift in a web of genres. In Genres on the web: Computational models and empirical studies, eds. A. Mehler, S. Sharoff, and M. Santini, Text, Speech and Language Technology. Dordrecht: Springer.
Braslavski, P. 2010. Marrying relevance and genre rankings: An exploratory study. In Genres on the web: Computational models and empirical studies, eds. A. Mehler, S. Sharoff, and M. Santini, Text, Speech and Language Technology. Dordrecht: Springer.
Bruce, I. 2008. Academic writing and genre: A systematic analysis. London: Continuum.
Bruce, I. 2010. Evolving genres in online domains: The hybrid genre of the participatory news article. In Genres on the web: Computational models and empirical studies, eds. A. Mehler, S. Sharoff, and M. Santini, Text, Speech and Language Technology. Dordrecht: Springer.
Chakrabarti, S. 2001. Integrating the document object model with hyperlinks for enhanced topic distillation and information extraction. In Proceedings of the 10th International World Wide Web Conference, May 1–5, 211–220. Hong Kong.
Chakrabarti, S., M. van den Berg, and B. Dom. 1999. Focused crawling: A new approach to topic-specific web resource discovery. In Proceedings of the 8th International World Wide Web Conference. Toronto, ON.
Chakrabarti, S., M. Joshi, K. Punera, and D.M. Pennock. 2002. The structure of broad topics on the web. In Proceedings of the 11th International World Wide Web Conference, 251–262. New York, NY: ACM Press.
Cohn, D.A., and T. Hofmann. 2000. The missing link – a probabilistic model of document content and hypertext connectivity. In Advances in Neural Information Processing Systems 13, Papers from Neural Information Processing Systems (NIPS), eds. T.K. Leen, T.G. Dietterich, and V. Tresp, 430–436. Denver, CO: MIT Press.
Condamines, A. 2008. Taking genre into account when analysing conceptual relation patterns. Corpora 3(2):115–140.
Craven, M., D. DiPasquo, D. Freitag, A.K. McCallum, T.M. Mitchell, K. Nigam, and S. Slattery. 2000. Learning to construct knowledge bases from the World Wide Web. Artificial Intelligence 118(1–2):69–113.
Dehmer, M., and F. Emmert-Streib. 2010. Mining graph patterns in web-based systems: A conceptual view. In Genres on the web: Computational models and empirical studies, eds. A. Mehler, S. Sharoff, and M. Santini, Text, Speech and Language Technology. Dordrecht: Springer.
Denoyer, L., and P. Gallinari. 2004. Un modèle de mixture de modèles génératifs pour les documents structurés multimédias. Document numérique 8(3):35–54.
Diligenti, M., M. Gori, M. Maggini, and F. Scarselli. 2001. Classification of HTML documents by hidden tree-markov models. In Proceedings of the International Conference on Document Analysis and Recognition (ICDAR), 849–853. Seattle, WA.
Dillon, A. 2008. Bringing genre into focus: Why information has shape. Bulletin of the American Society for Information Science and Technology 34(5):17–19.
Donato, D., L. Laura, S. Leonardi, and S. Millozzi. 2007. The web as a graph: How far we are. ACM Transactions on Internet Technology 7(1):4.
Eiron, N., and K.S. McCurley. 2003. Untangling compound documents on the web. In Proceedings of the 14th ACM Conference on Hypertext and Hypermedia, 85–94. Nottingham.
Ester, M., H.-P. Kriegel, and M. Schubert. 2002. Web site mining: A new way to spot competitors, customers and suppliers in the world wide web. In KDD ’02: Proceedings of the 8th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, 249–258. New York, NY: ACM Press.
Ferraresi, A., E. Zanchetta, S. Bernardini, and M. Baroni. 2008. Introducing and evaluating ukWaC, a very large web-derived corpus of English. In The 4th Web as Corpus Workshop: Can We Beat Google? (At LREC 2008). Marrakech.
Fletcher, W.H. 2004. Making the web more useful as a source for linguistic corpora. In Corpus linguistics in North America 2002: Selections from the 4th North American Symposium of the American Association for Applied Corpus Linguistics, eds. U. Connor and T. Upton. Amsterdam/New York: Rodopi.
Frasconi, P., G. Soda, and A. Vullo. 2002. Hidden Markov models for text categorization in multi-page documents. Journal of Intelligent Information Systems 18(2–3):195–217.
Freund, L. 2008. Exploiting task-document relationships to support information retrieval in the workplace. PhD thesis, University of Toronto.
Freund, L., and C. Nilsen. 2008. Assessing a genre-based approach to online government information. In Proceedings of the 36th Annual Conference of the Canadian Association for Information Science (CAIS). University of British Columbia, Vancouver.
Grieve, J., D. Biber, E. Friginal, and T. Nekrasova. 2010. Variation among blogs: A multi-dimensional analysis. In Genres on the web: Computational models and empirical studies, eds. A. Mehler, S. Sharoff, and M. Santini, Text, Speech and Language Technology. Dordrecht: Springer.
Gunnarsson, M. 2010. Classification along genre dimensions. PhD, Inst. f. Biblioteks- och Informationsvetenskap, Göteborgs Universitet.
Gupta, S., H. Becker, G. Kaiser, and S. Stolfo. 2006. Verifying genre-based clustering approach to content extraction. In Proceedings of the 15th International Conference on World Wide Web, 875–876. New York, NY: ACM Press.
He, B., M. Patel, Z. Zhang, and K. Chen-Chuan Chang. 2007. Accessing the deep web: A survey. Communications of the ACM 50(2):94–101.
Herring, S.C., I. Kouper, J.C. Paolillo, L.A. Scheidt, M. Tyworth, P. Welsch, E. Wright, and N. Yu. 2005. Conversations in the blogosphere: An analysis “from the bottom up”. In Proceedings of the 38th Annual Hawaii International Conference on System Sciences (HICSS’05). Big Island, Hawaii.
Heyd, T. 2008. Email hoaxes: Form, function, genre ecology. Amsterdam: Benjamins.
Ide, N., R. Reppen, and K. Suderman. 2002. The American National Corpus: More than the Web can provide. In Proceedings of the 3rd Language Resources and Evaluation Conference, 839–844. Las Palmas.
Joachims, T., N. Cristianini, and J. Shawe-Taylor. 2001. Composite kernels for hypertext categorisation. In Proceedings of the 11th International Conference on Machine Learning, 250–257. San Francisco, CA: Morgan Kaufmann.
Kanaris, I., and E. Stamatatos. 2007. Webpage genre identification using variable-length character n-grams. In Proceedings of the 19th IEEE International Conference on Tools with Artificial Intelligence (ICTAI’07). Washington, DC: IEEE Computer Society.
Karlgren, J., and D. Cutting. 1994. Recognizing text genres with simple metrics using discriminant analysis. In Proceedings of the 15th Conference on Computational Linguistics, vol. 2, 1071–1075. Kyoto.
Kessler, B., G. Nunberg, and H. Schütze. 1997. Automatic detection of text genre. In Proceedings of the 35th Annual Meeting of the Association for Computational Linguistics and Eighth Conference of the European Chapter of the Association for Computational Linguistics, 32–38. Madrid, Spain.
Kim, Y., and S. Ross. 2010. Formulating representative features with respect to genre classification. In Genres on the web: Computational models and empirical studies, eds. A. Mehler, S. Sharoff, and M. Santini, Text, Speech and Language Technology. Dordrecht: Springer.
Kriegel, H.-P., and M. Schubert. 2004. Classification of websites as sets of feature vectors. In Databases and applications, ed. M.H. Hamza, 127–132. Anaheim, CA: IASTED/ACTA Press.
Kucera, H., and W.N. Francis. 1967. Computational analysis of present-day American English. Providence, RI: Brown University Press.
Kumar, R., J. Novak, P. Raghavan, and A. Tomkins. 2004. Structure and evolution of blogspace. Communications of the ACM 47(12):35–39.
Lee, D. 2001. Genres, registers, text types, domains, and styles: Clarifying the concepts and navigating a path through the BNC jungle. Language Learning and Technology 5(3):37–72.
Li, W.-S., O. Kolak, Q. Vu, and H. Takano. 2000. Defining logical domains in a web site. In Proceedings of the 11th ACM Conference on Hypertext and Hypermedia, 123–132. San Antonio, TX.
Li, W.-S., K.S. Candan, Q. Vu, and D. Agrawal. 2002. Query relaxation by structure and semantics for retrieval of logical web documents. IEEE Transactions on Knowledge and Data Engineering 14(4):768–791.
Lim, C.S., K.J. Lee, and G.C. Kim. 2005. Multiple sets of features for automatic genre classification of web documents. Information Processing & Management 41(5):1263–1276.
Lindemann, C., and L. Littig. 2010. Classification of web sites at super-genre level. In Genres on the web: Computational models and empirical studies, eds. A. Mehler, S. Sharoff, and M. Santini, Text, Speech and Language Technology. Dordrecht: Springer.
Marshman, E., M.-C. L’Homme, and V. Surtees. 2008. Portability of cause-effect relation markers across specialised domains and text genres: a comparative evaluation. Corpora 3(2):141–172.
Martin, J.R. 1994. Macro-genres: The ecology of the page. Network 21:29–52.
Martin, J.R., and D. Rose. 2008. Genre relations: Mapping culture. London & Oakland: Equinox Pub.
Mehler, A. 2008. Structural similarities of complex networks: A computational model by example of wiki graphs. Applied Artificial Intelligence 22(7&8):619–683.
Mehler, A. 2010. Structure formation in the web. A graph-theoretical model of hypertext types. In Linguistic modeling of information and markup languages. Contributions to language technology, eds. A. Witt and D. Metzing, Text, Speech and Language Technology, 225–247. Dordrecht: Springer.
Mehler, A. 2009b. Generalised shortest paths trees: A novel graph class applied to semiotic networks. In Analysis of complex networks: From biology to linguistics, eds. M. Dehmer and F. Emmert-Streib. Weinheim: Wiley-VCH.
Mehler, A. 2010. A quantitative graph model of social ontologies by example of Wikipedia. In Towards an information theory of complex networks: Statistical methods and applications, eds. M. Dehmer, F. Emmert-Streib, and A. Mehler. Boston, MA/Basel: Birkhäuser.
Mehler, A., M. Dehmer, and R. Gleim. 2006. Towards logical hypertext structure: A graph-theoretic perspective. In Proceedings of the 4th International Workshop on Innovative Internet Computing Systems (I2CS ’04), eds. T. Böhme and G. Heyer, Lecture Notes in Computer Science, vol. 3473, 136–150. Berlin/New York, NY: Springer.
Mehler, A., R. Gleim, and A. Wegner. 2007. Structural uncertainty of hypertext types. An empirical study. In Proceedings of the Workshop “Towards Genre-Enabled Search Engines: The Impact of NLP”, September, 30, 2007, in Conjunction with RANLP 2007, 13–19. Borovets, Bulgaria.
Menczer, F. 2004. Lexical and semantic clustering by web links. Journal of the American Society for Information Science and Technology 55(14):1261–1269.
Montesi, M., and T. Navarrete. 2008. Classifying web genres in context: A case study documenting the web genres used by a software engineer. Information Processing and Management 44:1410–1430.
Ounis, I., M. de Rijke, C. Macdonald, G. Mishne, and I. Soboroff. 2006. Overview of the TREC 2006 blog track. In Proceedings of the Text Retrieval Conference (TREC). NIST.
Päivärinta, T., M. Shepherd, L. Svensson, and M. Rossi. 2008. A special issue editorial. Scandinavian Journal of Information Systems 20(1).
Pirolli, P., J. Pitkow, and R. Rao. 1996. Silk from a sow’s ear: Extracting usable structures from the web. In Proceedings of the ACM SIGCHI Conference on Human Factors in Computing, 118–125. New York, NY: ACM Press.
Power, R., D. Scott, and N. Bouayad-Agha. 2003. Document structure. Computational Linguistics 29(2):211–260.
Raiko, T., K. Kersting, J. Karhunen, and L. de Raedt. 2002. Bayesian learning of logical hidden Markov models. In Proceedings of the Finnish AI Conference (STeP-2002), 64–71. Finland.
Rehm, G. 2002. Towards automatic web genre identification – A corpus-based approach in the domain of academia by example of the academic’s personal homepage. In Proceedings of the Hawaii International Conference on System Sciences. Big Island, Hawaii.
Rehm, G. 2010. Hypertext types and markup languages. The relationship between HTML and web genres. In Linguistic Modeling of Information and Markup Languages. Contributions to Language Technology, eds. A. Witt and D. Metzing, Text, Speech and Language Technology, 143–164. Dordrecht: Springer.
Rosso, M.A. 2008. Bringing genre into focus: Stalking the wild web genre (with apologies to Euell Gibbons). Bulletin of the American Society for Information Science and Technology 34(5):20–22.
Rosso, M.A., and S.W. Haas. 2010. Identification of web genres by user warrant. In Genres on the web: Computational models and empirical studies, eds. A. Mehler, S. Sharoff, and M. Santini, Text, Speech and Language Technology. Dordrecht: Springer.
Santini, M. 2007a. Characterizing genres of web pages: Genre hybridism and individualization. In Proceedings of the 40th Annual Hawaii International Conference on System Sciences (HICSS’07). Big Island, Hawaii.
Santini, M. 2007b. Automatic identification of genre in Web pages. PhD thesis, University of Brighton, Brighton.
Santini, M. 2010. Cross-testing a genre classification model for the web. In Genres on the web: Computational models and empirical studies, eds. A. Mehler, S. Sharoff, and M. Santini, Text, Speech and Language Technology. Dordrecht: Springer.
Sharoff, S. 2006. Creating general-purpose corpora using automated search engine queries. In WaCky! Working Papers on the Web as Corpus, eds. M. Baroni and S. Bernardini, 63–68. Bologna: Gedit.
Sharoff, S. 2007. Classifying web corpora into domain and genre using automatic feature identification. In Proceedings of Web as Corpus Workshop. Louvain-la-Neuve.
Sharoff, S. 2010. In the garden and in the jungle. Comparing genres in the BNC and Internet. In Genres on the web: Computational models and empirical studies, eds. A. Mehler, S. Sharoff, and M. Santini, Text, Speech and Language Technology. Dordrecht: Springer.
Sinclair, J. ed. 1987. Looking up: An account of the COBUILD project in lexical computing. London and Glasgow: Collins.
Sinclair, J. 2003. Corpora for lexicography. In A practical guide to lexicography, ed. P. van Sterkenburg, 167–178. Amsterdam: Benjamins.
Stein, B., S. Meyer zu Eissen, and N. Lipka. 2010. Web genre analysis: Use cases, retrieval models, and implementation issues. In Genres on the web: Computational models and empirical studies, eds. A. Mehler, S. Sharoff, and M. Santini, Text, Speech and Language Technology. Dordrecht: Springer.
Stewart, J.G. 2008. Genre oriented summarization. PhD thesis, Carnegie Mellon University.
Sun, A., and E.-P. Lim. 2003. Web unit mining: Finding and classifying subgraphs of web pages. In CIKM ’03: Proceedings of the 12th International Conference on Information and Knowledge Management, 108–115. New York, NY: ACM Press.
Swales, J.M. 1990. Genre analysis: English in academic and research settings. Cambridge: Cambridge University Press.
Tajima, K., Y. Mizuuchi, M. Kitagawa, and K. Tanaka. 1998. Cut as a querying unit for WWW, netnews, e-mail. In Proceedings of the 9th ACM Conference on Hypertext and Hypermedia, 235–244. New York, NY: ACM Press.
Tajima, K., and K. Tanaka. 1999. New techniques for the discovery of logical documents in web. In International Symposium on Database Applications in Non-traditional Environments. IEEE, 125–132.
Thelwall, M., L. Vaughan, and L. Björneborn. 2006. Webometrics. Annual Review of Information Science & Technology 6(8):81–135.
Tian, Y.H., T.J. Huang, W. Gao, J. Cheng, and P. Bo Kang. 2003. Two-phase web site classification based on hidden Markov tree models. In WI ’03: Proceedings of the 2003 IEEE/WIC International Conference on Web Intelligence. IEEE Computer Society, 227, Washington, DC.
Waltinger, U., A. Mehler, and A. Wegner. 2009. A two-level approach to web genre classification. In Proceedings of the 5th International Conference on Web Information Systems and Technologies (WEBIST ’09), March 23–26, 2009. Lisboa.
Wisniewski, G., F. Maes, L. Denoyer, and P. Gallinari. 2007. Modèle probabiliste pour l’extraction de structures dans les documents web. Document numérique 10(1):151–170.
Wodak, R. 2008. Introduction: Discourse studies – important concepts and terms. In Qualitative Discourse Analysis in the Social Sciences, eds. R. Wodak and M. Krzyzanowski, 1–29. Palgrave.
Yates, S.J., and T.R. Sumner. 1997. Digital genres and the new burden of fixity. In Proceedings of the 30th Hawaii International Conference on System Sciences, vol. 6. Maui, HI.
Copyright information
© 2010 Springer Science+Business Media B.V.
About this chapter
Cite this chapter
Santini, M., Mehler, A., Sharoff, S. (2010). Riding the Rough Waves of Genre on the Web. In: Mehler, A., Sharoff, S., Santini, M. (eds) Genres on the Web. Text, Speech and Language Technology, vol 42. Springer, Dordrecht. https://doi.org/10.1007/978-90-481-9178-9_1
DOI: https://doi.org/10.1007/978-90-481-9178-9_1
Publisher Name: Springer, Dordrecht
Print ISBN: 978-90-481-9177-2
Online ISBN: 978-90-481-9178-9
eBook Packages: Computer Science, Computer Science (R0)