
Genre Classification of Web Pages

User Study and Feasibility Analysis

  • Conference paper
KI 2004: Advances in Artificial Intelligence (KI 2004)

Part of the book series: Lecture Notes in Computer Science (LNAI, volume 3238)

Abstract

Genre classification means discriminating between documents by means of their form, their style, or their targeted audience. Put another way, genre classification is orthogonal to a classification based on the documents’ contents.

While most existing investigations of automated genre classification are based on corpora of news articles, the idea here is applied to arbitrary Web pages. We see genre classification as a powerful instrument to bring Web-based search services closer to a user’s information need. This objective raises two questions:

  1. What are useful genres when searching the WWW?

  2. Can these genres be reliably identified?

This paper presents results from a user study on the usefulness of Web genres, as well as results from constructing a genre classifier using discriminant analysis, neural network learning, and support vector machines. Particular attention is paid to the classifier’s underlying feature set: aside from the standard feature types, we introduce new features that are based on word frequency classes and that can be computed with minimal computational effort. They allow us to construct compact feature sets with few elements, with which a satisfactory genre diversification is achieved. About 70% of the Web documents are assigned to their true genre; note in this connection that no genre classification benchmark for Web pages has been published so far.
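The abstract does not spell out how the word frequency class features are computed. As a rough illustration only, the sketch below assumes the common definition in which a word's class is floor(log2(f(w*) / f(w))), with f the frequency in a reference corpus and w* the most frequent word. The function names, the two summary features, and the toy corpus are all hypothetical and are not taken from the paper.

```python
import math
from collections import Counter


def word_frequency_classes(reference_tokens):
    """Map each word of a reference corpus to its frequency class.

    Assumed definition (not stated in the abstract):
    class(w) = floor(log2(f(w*) / f(w))), where w* is the most frequent
    word. Class 0 therefore holds the most common words.
    """
    counts = Counter(reference_tokens)
    if not counts:
        return {}
    top = max(counts.values())
    return {word: math.floor(math.log2(top / count))
            for word, count in counts.items()}


def class_features(page_tokens, classes):
    """Compute two illustrative (hypothetical) features for one page:
    the mean frequency class of the known words, and the share of
    tokens that do not occur in the reference corpus at all."""
    known = [classes[t] for t in page_tokens if t in classes]
    mean_class = sum(known) / len(known) if known else 0.0
    unknown_share = (1 - len(known) / len(page_tokens)) if page_tokens else 0.0
    return {"mean_class": mean_class, "unknown_share": unknown_share}


# Toy reference corpus; a real one would be a large word list with counts.
reference = ["the"] * 8 + ["page"] * 4 + ["genre"] * 2 + ["orthogonal"]
classes = word_frequency_classes(reference)
features = class_features(["the", "genre", "orthogonal", "zzz"], classes)
```

Both functions need only one pass over the tokens and a dictionary lookup per word, which fits the abstract's remark that such features are cheap to compute. Which statistics over the classes actually enter the feature set is a design choice the abstract leaves open.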




Copyright information

© 2004 Springer-Verlag Berlin Heidelberg

About this paper

Cite this paper

Meyer zu Eissen, S., Stein, B. (2004). Genre Classification of Web Pages. In: Biundo, S., Frühwirth, T., Palm, G. (eds) KI 2004: Advances in Artificial Intelligence. KI 2004. Lecture Notes in Computer Science, vol 3238. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-540-30221-6_20

  • DOI: https://doi.org/10.1007/978-3-540-30221-6_20

  • Publisher Name: Springer, Berlin, Heidelberg

  • Print ISBN: 978-3-540-23166-0

  • Online ISBN: 978-3-540-30221-6

  • eBook Packages: Springer Book Archive
