Abstract
Faced with the retrieval problem posed by the overwhelming number of documents online, researchers have recently pushed the adaptation of text categorization to web units. The aim is to use the categories of websites and pages as an additional retrieval criterion. In this context, the bag-of-words model has been used, as have HTML tags and link structures. Despite promising results, this adaptation remains within the framework of IR-specific models, since it neglects the content-based structuring inherent to hypertext units. This paper approaches hypertext modelling from the perspective of graph theory. It presents an XML-based format for representing websites as hypergraphs. These hypergraphs are used to shed light on the relation between hypertext structure types and their web-based instances. We emphasize two characteristics of this relation. Realizational ambiguity refers to different, functionally equivalent ways of manifesting the same structure type. Polymorphism refers to a single web unit that manifests different structure types. We show that polymorphism is a prevalent characteristic of web-based units by means of a categorization experiment that analyses a corpus of hypergraphs representing the structure and content of pages of conference websites. Against this background, we argue for a revision of text representation models by means of hypergraphs that are sensitive to the manifold structuring of web documents.
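The notion of polymorphism described in the abstract can be illustrated with a minimal Python sketch. It is not the XML-based format of the paper, which is not reproduced here. It is a toy in-memory hypergraph under stated assumptions: vertices stand for pages, each hyperedge is a labelled set of pages standing for one structure type, and a page counts as polymorphic when it belongs to more than one hyperedge. The class name `Hypergraph`, its methods, the page names, and the structure-type labels are all hypothetical.

```python
from __future__ import annotations

from dataclasses import dataclass, field


@dataclass
class Hypergraph:
    """Toy hypergraph: vertices are page identifiers, and each hyperedge
    is a labelled set of vertices (any number of them, not just two)."""

    vertices: set[str] = field(default_factory=set)
    hyperedges: dict[str, set[str]] = field(default_factory=dict)

    def add_hyperedge(self, label: str, members: set[str]) -> None:
        # Register the member pages and attach them to the labelled hyperedge.
        self.vertices |= members
        self.hyperedges.setdefault(label, set()).update(members)

    def labels_of(self, vertex: str) -> set[str]:
        # All hyperedge labels (assumed structure types) that contain the vertex.
        return {
            label
            for label, members in self.hyperedges.items()
            if vertex in members
        }

    def polymorphic_vertices(self) -> set[str]:
        # Pages that belong to more than one hyperedge.
        return {v for v in self.vertices if len(self.labels_of(v)) > 1}


# Hypothetical conference website: the labels are illustrative only.
site = Hypergraph()
site.add_hyperedge("call_for_papers", {"index.html", "cfp.html"})
site.add_hyperedge("programme", {"programme.html", "index.html"})
site.add_hyperedge("venue", {"venue.html"})

print(sorted(site.polymorphic_vertices()))  # prints ['index.html']
```

In this sketch the start page carries both the call for papers and the programme, so it is the only polymorphic unit. Representing structure types as hyperedges rather than as one label per page is what lets a single page belong to several types at once.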
Copyright information
© 2006 Springer-Verlag Berlin Heidelberg
About this paper
Cite this paper
Mehler, A., Dehmer, M., Gleim, R. (2006). Towards Logical Hypertext Structure. In: Böhme, T., Larios Rosillo, V.M., Unger, H., Unger, H. (eds) Innovative Internet Community Systems. IICS 2004. Lecture Notes in Computer Science, vol 3473. Springer, Berlin, Heidelberg. https://doi.org/10.1007/11553762_14
DOI: https://doi.org/10.1007/11553762_14
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-540-28880-0
Online ISBN: 978-3-540-33995-3
eBook Packages: Computer Science, Computer Science (R0)