Abstract
This work presents a methodology for grouping structurally similar XML documents using clustering algorithms. Modeling XML documents as trees, we treat the problem of clustering XML documents by structure as a tree clustering problem, using distances that estimate the similarity between trees in terms of the hierarchical relationships of their nodes. We propose using structural summaries of the trees to speed up the distance calculation while maintaining or even improving its quality. Experimental results are provided using a prototype testbed.
Work supported in part by DELOS Network of Excellence on Digital Libraries, IST programme of the EC FP6, no G038-507618, and by PYTHAGORAS EPEAEK II programme, EU and Greek Ministry of Education.
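The pipeline outlined in the abstract (tree model, structural summary, tree distance, clustering) can be illustrated with a minimal sketch. Everything below is an assumption made for illustration and is not taken from the paper: the summary rule (dropping repeated identical sibling subtrees), the simple top-down edit distance used as a stand-in for the tree distance, the single-link grouping with a fixed threshold, and all function names.

```python
import xml.etree.ElementTree as ET

def to_tree(xml_text):
    """Parse XML into a nested (tag, children) tuple; text and attributes are ignored."""
    def build(elem):
        return (elem.tag, tuple(build(child) for child in elem))
    return build(ET.fromstring(xml_text))

def summarize(tree):
    """Assumed summary rule: keep only the first of any identical sibling subtrees."""
    tag, children = tree
    seen, kept = set(), []
    for child in children:
        s = summarize(child)
        if s not in seen:
            seen.add(s)
            kept.append(s)
    return (tag, tuple(kept))

def size(tree):
    """Number of nodes in the tree."""
    return 1 + sum(size(child) for child in tree[1])

def distance(a, b):
    """Simple top-down edit distance (stand-in): relabel a node at cost 1,
    insert or delete a whole subtree at a cost equal to its node count."""
    relabel = 0 if a[0] == b[0] else 1
    ca, cb = a[1], b[1]
    # Sequence edit distance over the two child lists.
    d = [[0] * (len(cb) + 1) for _ in range(len(ca) + 1)]
    for i in range(1, len(ca) + 1):
        d[i][0] = d[i - 1][0] + size(ca[i - 1])
    for j in range(1, len(cb) + 1):
        d[0][j] = d[0][j - 1] + size(cb[j - 1])
    for i in range(1, len(ca) + 1):
        for j in range(1, len(cb) + 1):
            d[i][j] = min(
                d[i - 1][j] + size(ca[i - 1]),                   # delete subtree
                d[i][j - 1] + size(cb[j - 1]),                   # insert subtree
                d[i - 1][j - 1] + distance(ca[i - 1], cb[j - 1]),  # match children
            )
    return relabel + d[len(ca)][len(cb)]

def cluster(trees, threshold):
    """Single-link grouping: merge two clusters when any pair of their
    members lies within the (hypothetical) distance threshold."""
    clusters = [[i] for i in range(len(trees))]
    merged = True
    while merged:
        merged = False
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                if any(distance(trees[p], trees[q]) <= threshold
                       for p in clusters[i] for q in clusters[j]):
                    clusters[i] += clusters.pop(j)
                    merged = True
                    break
            if merged:
                break
    return clusters

docs = [
    "<lib><book><title/><author/></book><book><title/><author/></book></lib>",
    "<lib><book><title/><author/></book></lib>",
    "<cfg><opt/><opt/><flag/></cfg>",
]
trees = [to_tree(doc) for doc in docs]
summaries = [summarize(tree) for tree in trees]
groups = cluster(summaries, threshold=1)
```

In this toy input the first two documents differ only in how many times a `book` subtree repeats, so their summaries coincide and the distance between them drops to zero, while the summaries are also smaller trees to compare. Whether the paper's summaries and distance behave the same way is not something this sketch establishes.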
Copyright information
© 2004 Springer-Verlag Berlin Heidelberg
About this paper
Cite this paper
Dalamagas, T., Cheng, T., Winkel, KJ., Sellis, T. (2004). Clustering XML Documents Using Structural Summaries. In: Lindner, W., Mesiti, M., Türker, C., Tzitzikas, Y., Vakali, A.I. (eds) Current Trends in Database Technology - EDBT 2004 Workshops. EDBT 2004. Lecture Notes in Computer Science, vol 3268. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-540-30192-9_54
DOI: https://doi.org/10.1007/978-3-540-30192-9_54
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-540-23305-3
Online ISBN: 978-3-540-30192-9
eBook Packages: Computer Science (R0)