Automatic Genre Detection of Web Documents

Lim, Chul Su; Lee, Kong Joo; Kim, Gil Chang

doi:10.1007/978-3-540-30211-7_33

Chul Su Lim,
Kong Joo Lee &
Gil Chang Kim

Conference paper

Part of the book series: Lecture Notes in Computer Science (LNAI, volume 3248)

Abstract

Genre, or style, offers a view of a document that is distinct from its subject or topic, and it can also serve as a criterion for classifying documents. Several studies have addressed genre detection for textual documents, but only a few have dealt with web documents. In this paper we propose sets of features for detecting the genre of web documents. Web documents differ from plain textual documents in that they carry URLs and HTML tags. We introduce features specific to web documents, extracted from the URL and the HTML tags. Experimental results allow us to evaluate the characteristics and performance of these features.
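To make the idea of URL-based and HTML-tag-based features concrete, the following is a minimal sketch of how such features might be computed. It is not the feature set used in the paper, which the abstract does not enumerate: the function name `extract_features`, the class `TagCounter`, and every feature key (URL depth, file extension, tag counts, link ratio) are assumptions chosen for illustration only.

```python
from collections import Counter
from html.parser import HTMLParser
from urllib.parse import urlparse


class TagCounter(HTMLParser):
    """Counts every opening tag seen while parsing an HTML page."""

    def __init__(self):
        super().__init__()
        self.tags = Counter()

    def handle_starttag(self, tag, attrs):
        # Tag names arrive lower-cased from HTMLParser.
        self.tags[tag] += 1


def extract_features(url, html):
    """Return illustrative URL and HTML-tag features (hypothetical set)."""
    # URL features: number of path segments and extension of the last one.
    segments = [s for s in urlparse(url).path.split("/") if s]
    last = segments[-1] if segments else ""
    extension = last.rsplit(".", 1)[1].lower() if "." in last else ""

    # HTML features: raw tag counts and the share of tags that are links.
    counter = TagCounter()
    counter.feed(html)
    counter.close()
    total = sum(counter.tags.values())

    return {
        "url_depth": len(segments),
        "url_extension": extension,
        "num_tags": total,
        "num_links": counter.tags["a"],
        "num_images": counter.tags["img"],
        "num_forms": counter.tags["form"],
        "link_ratio": counter.tags["a"] / total if total else 0.0,
    }
```

A feature dictionary of this kind could then be passed to any classifier; which features actually discriminate between genres is an empirical question that the paper's experiments address.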

Author information

Authors and Affiliations

Division of Computer Science, Department of EECS, KAIST, Taejon
Chul Su Lim & Gil Chang Kim
School of Computer & Information Technology, KyungIn Women’s College, Incheon
Kong Joo Lee

Editor information

Editors and Affiliations

Behavior Design Corporation, IV Science-Based Industrial Park Hsinchu, 2F, No.5, Industry E. Rd, Taiwan
Keh-Yih Su
University of Tokyo, Hongo 7-3-1, Bunkyo-ku, Tokyo 113-0033, JST CREST, Honcho 4-1-8, Kawaguchi-shi, 332-0012, Saitama
Jun’ichi Tsujii
Pohang University of Science and Technology (POSTECH), AITrc, Republic of Korea
Jong-Hyeok Lee
Language Information Sciences Research Centre, City University of Hong Kong, Tat Chee Avenue, Kowloon, Hong Kong
Oi Yee Kwong

About this paper

Cite this paper

Lim, C.S., Lee, K.J., Kim, G.C. (2005). Automatic Genre Detection of Web Documents. In: Su, KY., Tsujii, J., Lee, JH., Kwong, O.Y. (eds) Natural Language Processing – IJCNLP 2004. IJCNLP 2004. Lecture Notes in Computer Science, vol 3248. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-540-30211-7_33

DOI: https://doi.org/10.1007/978-3-540-30211-7_33
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-540-24475-2
Online ISBN: 978-3-540-30211-7
eBook Packages: Computer Science, Computer Science (R0)
