Abstract
Genre classification means discriminating between documents by means of their form, their style, or their targeted audience. Put another way, genre classification is orthogonal to a classification based on the documents’ contents.
While most existing investigations of automated genre classification are based on corpora of news articles, the idea is applied here to arbitrary Web pages. We see genre classification as a powerful instrument to bring Web-based search services closer to a user’s information need. This objective raises two questions:
1. What are useful genres when searching the WWW?
2. Can these genres be reliably identified?
This paper presents results from a user study on Web genre usefulness as well as results from the construction of a genre classifier using discriminant analysis, neural network learning, and support vector machines. Particular attention is paid to the classifier’s underlying feature set: aside from the standard feature types, we introduce new features that are based on word frequency classes and that can be computed with minimal computational effort. They allow us to construct compact feature sets with few elements, with which a satisfactory genre diversification is achieved. About 70% of the Web documents are assigned to their true genre; note in this connection that no genre classification benchmark for Web pages has been published so far.
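The abstract does not spell out how word frequency classes are computed. A minimal sketch, assuming the common definition in which a word's class is floor(log2(f_max / f_word)) with f_max the count of the most frequent word in a reference corpus, could look as follows. The function names, the toy counts, and the choice of ten bins are illustrative assumptions, not the authors' implementation.

```python
import math
from collections import Counter


def word_frequency_classes(reference_counts):
    """Map each word to floor(log2(f_max / f_word)).

    Assumed definition: the most frequent word gets class 0,
    rarer words get higher classes.
    """
    f_max = max(reference_counts.values())
    return {
        word: int(math.floor(math.log2(f_max / count)))
        for word, count in reference_counts.items()
        if count > 0
    }


def frequency_class_features(tokens, classes, n_bins=10):
    """Return the share of tokens per frequency class as a fixed-length vector.

    Classes >= n_bins - 1 and words missing from the reference list
    share the last bin (a simplification chosen for this sketch).
    """
    last = n_bins - 1
    histogram = Counter(min(classes.get(tok, last), last) for tok in tokens)
    total = len(tokens)
    if total == 0:
        return [0.0] * n_bins
    return [histogram[b] / total for b in range(n_bins)]


# Toy reference counts for illustration only, not real corpus data.
reference_counts = {"the": 1024, "page": 256, "genre": 4}
classes = word_frequency_classes(reference_counts)
features = frequency_class_features(
    ["the", "page", "the", "genre", "zzz"], classes
)
```

Because the class lookup is a single dictionary access per token, the resulting vector is cheap to compute and short, which matches the abstract's emphasis on compact feature sets. Such a vector would then be passed, possibly alongside other features, to a classifier of the reader's choice.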
© 2004 Springer-Verlag Berlin Heidelberg
About this paper
Cite this paper
Meyer zu Eissen, S., Stein, B. (2004). Genre Classification of Web Pages. In: Biundo, S., Frühwirth, T., Palm, G. (eds) KI 2004: Advances in Artificial Intelligence. KI 2004. Lecture Notes in Computer Science, vol 3238. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-540-30221-6_20
DOI: https://doi.org/10.1007/978-3-540-30221-6_20
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-540-23166-0
Online ISBN: 978-3-540-30221-6
eBook Packages: Springer Book Archive