Abstract
Hypertext categorization is the task of automatically assigning category labels to hypertext units. Like text categorization, it remains within the area of function learning based on the bag-of-features approach. This scenario faces the problem of a many-to-many relation between websites and their hidden logical document structure. The paper argues that this relation is a prevalent characteristic which interferes with any effort to apply the classical apparatus of categorization to web genres. This is confirmed by a threefold experiment in hypertext categorization. To outline a solution to this problem, the paper sketches an alternative method of unsupervised learning which aims at bridging the gap between statistical and structural pattern recognition (Bunke et al. 2001) in the area of web mining.
References
AMITAY, E. and CARMEL, D. and DARLOW, A. and LEMPEL, R. and SOFFER, A. (2003): The connectivity sonar. Proc. of the 14th ACM Conference on Hypertext, 28–47.
BOCK, H.H. (1974): Automatische Klassifikation. Vandenhoeck & Ruprecht, Göttingen.
BUNKE, H. and GÜNTER, S. and JIANG, X. (2001): Towards bridging the gap between statistical and structural pattern recognition. Proc. of the 2nd Int. Conf. on Advances in Pattern Recognition, Berlin, Springer, 1–11.
CHAKRABARTI, S. and DOM, B. and INDYK, P. (1998): Enhanced hypertext categorization using hyperlinks. Proc. of ACM SIGMOD, International Conf. on Management of Data, ACM Press, 307–318.
DEHMER, M. and MEHLER, A. (2004): A new method of similarity measuring for a specific class of directed graphs. Submitted to Tatra Mountain Journal, Slovakia.
FÜRNKRANZ, J. (2002): Hyperlink ensembles: a case study in hypertext classification. Information Fusion, 3(4), 299–312.
GIBSON, D. and KLEINBERG, J. and RAGHAVAN, P. (1998): Inferring web communities from link topology. Proc. of the 9th ACM Conf. on Hypertext, 225–234.
GLEIM, R. (2005): Ein Framework zur Extraktion, Repräsentation und Analyse webbasierter Hypertexte. Proc. of GLDV ’05, 42–53.
HSU, C.-W. and CHANG, C.-C. and LIN, C.-J. (2003): A practical guide to SVM classification. Technical report, Department of Computer Science and Information Technology, National Taiwan University.
JOACHIMS, T. (2002): Learning to classify text using support vector machines. Kluwer, Boston.
JOACHIMS, T. and CRISTIANINI, N. and SHAWE-TAYLOR, J. (2001): Composite kernels for hypertext categorisation. Proc. of the 11th ICML, 250–257.
KLEINBERG, J. (1999): Authoritative sources in a hyperlinked environment. Journal of the ACM, 46(5), 604–632.
KOSALA, R. and BLOCKEEL, H. (2000): Web mining research: A survey. SIGKDD Explorations, 2(1), 1–15.
MEHLER, A. and DEHMER, M. and GLEIM, R. (2004): Towards logical hypertext structure — a graph-theoretic perspective. Proc. of I2CS’ 04, Berlin, Springer.
MIZUUCHI, Y. and TAJIMA, K. (1999): Finding context paths for web pages. Proc. of the 10th ACM Conference on Hypertext and Hypermedia, 13–22.
REHM, G. (2002): Towards automatic web genre identification. Proc. of the Hawai’i Int. Conf. on System Sciences.
RIEGER, B. (1989): Unscharfe Semantik. Peter Lang, Frankfurt a.M.
YANG, Y. (1999): An evaluation of statistical approaches to text categorization. Journal of Information Retrieval, 1(1/2), 67–88.
YANG, Y. and SLATTERY, S. and GHANI, R. (2002): A study of approaches to hypertext categorization. Journal of Intelligent Information Systems, 18(2–3), 219–241.
YOSHIOKA, T. and HERMAN, G. (2000): Coordinating information using genres. Technical report, Massachusetts Institute of Technology.
Copyright information
© 2006 Springer Berlin · Heidelberg
About this paper
Cite this paper
Mehler, A., Gleim, R., Dehmer, M. (2006). Towards Structure-sensitive Hypertext Categorization. In: Spiliopoulou, M., Kruse, R., Borgelt, C., Nürnberger, A., Gaul, W. (eds) From Data and Information Analysis to Knowledge Engineering. Studies in Classification, Data Analysis, and Knowledge Organization. Springer, Berlin, Heidelberg. https://doi.org/10.1007/3-540-31314-1_49
DOI: https://doi.org/10.1007/3-540-31314-1_49
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-540-31313-7
Online ISBN: 978-3-540-31314-4
eBook Packages: Mathematics and Statistics (R0)