Abstract
This chapter focuses on the structure-based classification of websites according to their hypertext type or genre. A website usually consists of several web pages, and the hyperlinks between them define its structure as a directed graph. To represent the logical structure of a website, this graph is modeled as a so-called directed Generalized Tree (GT), in which a rooted spanning tree captures the logical core structure of the site. The remaining arcs are classified, with respect to this kernel tree, as reflexive, lateral, or vertical (upward or downward) arcs.
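The arc classification described above can be illustrated with a minimal sketch. It assumes the kernel tree is already given as a child-to-parent map; how the chapter actually selects the spanning tree is not stated in this abstract. The function name `classify_arcs`, the label strings, and the toy site are hypothetical, chosen only for illustration.

```python
def classify_arcs(arcs, parent):
    """Label each arc of a site graph relative to a given kernel tree.

    arcs   -- iterable of (source, target) page pairs (the hyperlinks)
    parent -- dict mapping every page to its parent in the kernel tree
              (the root maps to None); assumed to be supplied by the caller
    """

    def is_ancestor(a, d):
        # True if page a lies strictly above page d in the kernel tree.
        node = parent[d]
        while node is not None:
            if node == a:
                return True
            node = parent[node]
        return False

    labels = {}
    for u, v in arcs:
        if u == v:
            labels[(u, v)] = "reflexive"   # self-loop
        elif parent[v] == u:
            labels[(u, v)] = "kernel"      # arc of the spanning tree itself
        elif is_ancestor(v, u):
            labels[(u, v)] = "up"          # points back to an ancestor
        elif is_ancestor(u, v):
            labels[(u, v)] = "down"        # shortcut to a deeper descendant
        else:
            labels[(u, v)] = "lateral"     # connects different branches
    return labels


# Hypothetical toy website with four pages.
parent = {"home": None, "blog": "home", "about": "home", "post": "blog"}
arcs = [
    ("home", "blog"), ("home", "about"), ("blog", "post"),  # kernel tree
    ("post", "home"), ("home", "post"),
    ("about", "blog"), ("blog", "blog"),
]
labels = classify_arcs(arcs, parent)
for arc, label in labels.items():
    print(arc, label)
```

Taking the tree as an input keeps the sketch independent of any particular spanning-tree construction.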
We consider unsupervised and supervised approaches for learning classifiers from a given web corpus. Quantitative Structure Analysis (QSA) describes GTs by a series of attributes that characterize their structural complexity and combines feature selection with unsupervised learning techniques. Kernel methods, the second class of approaches we consider, focus on typical substructures that characterize the classes. We present a series of tree, graph, and GT kernels suited to this problem and discuss their scalability. All learning approaches are evaluated on a web corpus of classified websites.
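To make the idea of a substructure-based kernel concrete, the following sketch counts complete subtrees that two unlabeled trees share. This is a generic subtree-counting kernel used here as a stand-in; it is not claimed to be one of the tree, graph, or GT kernels presented in the chapter. The names `subtree_counts` and `subtree_kernel` and the two toy sites are assumptions for illustration.

```python
from collections import Counter


def subtree_counts(children, root):
    """Count the complete subtrees of a tree by their canonical shape.

    children -- dict mapping a node to the list of its child nodes
    root     -- root node of the tree
    """
    counts = Counter()

    def canon(node):
        # Canonical bracket string: child codes are sorted so that
        # sibling order does not matter.
        code = "(" + "".join(sorted(canon(c) for c in children.get(node, []))) + ")"
        counts[code] += 1
        return code

    canon(root)
    return counts


def subtree_kernel(children1, root1, children2, root2):
    """Inner product of the two subtree-count vectors."""
    c1 = subtree_counts(children1, root1)
    c2 = subtree_counts(children2, root2)
    return sum(n * c2[code] for code, n in c1.items())


# Two hypothetical kernel trees of small websites.
site_a = {"home": ["blog", "about"], "blog": ["post"]}
site_b = {"r": ["x", "y"]}

k_ab = subtree_kernel(site_a, "home", site_b, "r")
k_aa = subtree_kernel(site_a, "home", site_a, "home")
print(k_ab, k_aa)
```

Because the kernel is an inner product of explicit count vectors, it is positive semidefinite by construction. In principle its values could be passed to a support vector machine as a precomputed kernel matrix.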
Copyright information
© 2011 Springer-Verlag Berlin Heidelberg
About this chapter
Cite this chapter
Geibel, P., Mehler, A., Kühnberger, KU. (2011). Learning Methods for Graph Models of Document Structure. In: Mehler, A., Kühnberger, KU., Lobin, H., Lüngen, H., Storrer, A., Witt, A. (eds) Modeling, Learning, and Processing of Text Technological Data Structures. Studies in Computational Intelligence, vol 370. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-642-22613-7_14
DOI: https://doi.org/10.1007/978-3-642-22613-7_14
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-642-22612-0
Online ISBN: 978-3-642-22613-7
eBook Packages: Engineering, Engineering (R0)