Clustering XML Documents Using Structural Summaries

Dalamagas, Theodore; Cheng, Tao; Winkel, Klaas-Jan; Sellis, Timos

doi:10.1007/978-3-540-30192-9_54


Part of the book series: Lecture Notes in Computer Science (LNCS, volume 3268)

Included in the following conference series:

International Conference on Extending Database Technology

Abstract

This work presents a methodology for grouping structurally similar XML documents using clustering algorithms. Modeling XML documents as tree-like structures, we treat the ‘clustering XML documents by structure’ problem as a ‘tree clustering’ problem, exploiting distances that estimate the similarity between those trees in terms of the hierarchical relationships of their nodes. We propose using tree structural summaries to speed up the distance calculation while maintaining or even improving its quality. Experimental results are provided using a prototype testbed.
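
The idea in the abstract can be illustrated with a minimal sketch, which is not the authors' algorithm. It assumes a simplified structural summary that merges sibling elements sharing a tag, and it substitutes a Jaccard distance over root-to-node tag paths for the tree distance the paper relies on. The function names (`summarize`, `label_paths`, `structural_distance`, `single_link_clusters`) and the distance threshold are hypothetical choices made for illustration.

```python
import xml.etree.ElementTree as ET


def summarize(elem):
    """Collapse repeated sibling tags into a nested dict {tag: {child_tag: ...}}.

    Assumption: a simplified stand-in for a structural summary.
    """
    summary = {}
    for child in elem:
        _merge(summary.setdefault(child.tag, {}), summarize(child))
    return summary


def _merge(dst, src):
    """Recursively fold the summary `src` into `dst`."""
    for tag, sub in src.items():
        _merge(dst.setdefault(tag, {}), sub)


def label_paths(summary, prefix=()):
    """Return the set of root-to-node tag paths found in a summary."""
    paths = set()
    for tag, sub in summary.items():
        path = prefix + (tag,)
        paths.add(path)
        paths |= label_paths(sub, path)
    return paths


def structural_distance(xml_a, xml_b):
    """1 - Jaccard similarity of the two documents' summary path sets.

    Assumption: swapped in for a tree edit distance to keep the sketch short.
    """
    sets = []
    for text in (xml_a, xml_b):
        root = ET.fromstring(text)
        sets.append(label_paths({root.tag: summarize(root)}))
    a, b = sets
    return 1.0 - len(a & b) / len(a | b)


def single_link_clusters(docs, threshold):
    """Merge clusters while any cross-cluster pair lies within `threshold`."""
    clusters = [[i] for i in range(len(docs))]
    merged = True
    while merged:
        merged = False
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                if any(
                    structural_distance(docs[p], docs[q]) <= threshold
                    for p in clusters[i]
                    for q in clusters[j]
                ):
                    clusters[i] += clusters.pop(j)
                    merged = True
                    break
            if merged:
                break
    return clusters
```

Because the summary discards repeated siblings, two documents that differ only in how many times an element repeats end up at distance zero, and the distance is computed over smaller inputs than the full trees.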

Work supported in part by DELOS Network of Excellence on Digital Libraries, IST programme of the EC FP6, no G038-507618, and by PYTHAGORAS EPEAEK II programme, EU and Greek Ministry of Education.

Author information

Authors and Affiliations

School of Electr. and Comp. Engineering, National Technical University of Athens, Zographou, 15773, Athens, Greece
Theodore Dalamagas & Timos Sellis
Department of Computer Science, University of California, Santa Barbara, CA, 93106, USA
Tao Cheng
Faculty of Computer Science, University of Twente, 7500 AE, Enschede, The Netherlands
Klaas-Jan Winkel

Editor information

Editors and Affiliations

Sidonia Systems, Grubmühl 20, D-82131, Stockdorf, Germany
Wolfgang Lindner
Università di Milano, Italy
Marco Mesiti
Functional Genomics Center Zurich (FGCZ), UZH / ETH Zurich, Winterthurerstrasse 190, CH–8057, Zurich, Switzerland
Can Türker
Computer Science Department, University of Crete, Greece, and Institute of Computer Science, FORTH-ICS, Greece
Yannis Tzitzikas
Aristotle University of Thessaloniki
Athena I. Vakali

Cite this paper

Dalamagas, T., Cheng, T., Winkel, KJ., Sellis, T. (2004). Clustering XML Documents Using Structural Summaries. In: Lindner, W., Mesiti, M., Türker, C., Tzitzikas, Y., Vakali, A.I. (eds) Current Trends in Database Technology - EDBT 2004 Workshops. EDBT 2004. Lecture Notes in Computer Science, vol 3268. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-540-30192-9_54

DOI: https://doi.org/10.1007/978-3-540-30192-9_54
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-540-23305-3
Online ISBN: 978-3-540-30192-9
eBook Packages: Computer Science, Computer Science (R0)
