A Fine-Grained XML Structural Comparison Approach

  • Conference paper
Conceptual Modeling - ER 2007 (ER 2007)

Part of the book series: Lecture Notes in Computer Science (LNISA, volume 4801)

Abstract

As the Web continues to grow and evolve, more and more information is being placed in structurally rich documents, XML documents in particular, so as to improve the efficiency of similarity clustering, information retrieval and data management applications. Various algorithms for comparing hierarchically structured data, e.g., XML documents, have been proposed in the literature. Most of them rely on techniques for finding the edit distance between tree structures, with XML documents modeled as ordered labeled trees. Nevertheless, a thorough investigation of current approaches led us to identify several structural similarity aspects, namely sub-tree related similarities, which are not sufficiently addressed when comparing XML documents. In this paper, we provide an improved comparison method that deals with fine-grained sub-trees and leaf node repetitions without increasing overall complexity with respect to current XML comparison methods. Our approach consists of two main algorithms, one for discovering the structural commonality between sub-trees and one for computing tree-based edit operation costs. A prototype has been developed to evaluate the optimality and performance of our method. Experimental results, on both real and synthetic XML data, demonstrate better performance with respect to alternative XML comparison methods.
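
To make the notion of an edit distance between ordered labeled trees concrete, the following is a minimal sketch of a classic top-down (Selkow-style) tree edit distance. It is not the method proposed in this paper: it does not handle the sub-tree commonality or leaf node repetitions the abstract refers to. The function names (`subtree_size`, `tree_distance`, `xml_distance`) and the unit costs are assumptions chosen for illustration.

```python
import xml.etree.ElementTree as ET


def subtree_size(node):
    """Number of nodes in the sub-tree rooted at `node`."""
    return 1 + sum(subtree_size(child) for child in node)


def tree_distance(a, b):
    """Top-down edit distance between two ordered labeled trees.

    Assumed costs (illustrative only): relabeling a node costs 1,
    inserting or deleting a whole sub-tree costs its node count.
    """
    relabel = 0 if a.tag == b.tag else 1
    kids_a, kids_b = list(a), list(b)
    m, n = len(kids_a), len(kids_b)

    # d[i][j]: cost of turning the first i children of a
    # into the first j children of b (sequence edit distance).
    d = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(1, m + 1):
        d[i][0] = d[i - 1][0] + subtree_size(kids_a[i - 1])
    for j in range(1, n + 1):
        d[0][j] = d[0][j - 1] + subtree_size(kids_b[j - 1])
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            d[i][j] = min(
                d[i - 1][j] + subtree_size(kids_a[i - 1]),   # delete sub-tree
                d[i][j - 1] + subtree_size(kids_b[j - 1]),   # insert sub-tree
                d[i - 1][j - 1] + tree_distance(kids_a[i - 1], kids_b[j - 1]),
            )
    return relabel + d[m][n]


def xml_distance(xml_a, xml_b):
    """Parse two XML strings and compare their element structure only."""
    return tree_distance(ET.fromstring(xml_a), ET.fromstring(xml_b))


if __name__ == "__main__":
    doc_a = "<paper><title/><author/><author/></paper>"
    doc_b = "<paper><title/><author/></paper>"
    print(xml_distance(doc_a, doc_b))
```

In this sketch only element tags are compared; text content and attributes are ignored, which matches a purely structural comparison but is a simplification of what a full XML comparison would consider.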

Editor information

Christine Parent, Klaus-Dieter Schewe, Veda C. Storey, Bernhard Thalheim

Copyright information

© 2007 Springer-Verlag Berlin Heidelberg

About this paper

Cite this paper

Tekli, J., Chbeir, R., Yetongnon, K. (2007). A Fine-Grained XML Structural Comparison Approach. In: Parent, C., Schewe, KD., Storey, V.C., Thalheim, B. (eds) Conceptual Modeling - ER 2007. ER 2007. Lecture Notes in Computer Science, vol 4801. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-540-75563-0_39

  • DOI: https://doi.org/10.1007/978-3-540-75563-0_39

  • Publisher Name: Springer, Berlin, Heidelberg

  • Print ISBN: 978-3-540-75562-3

  • Online ISBN: 978-3-540-75563-0

  • eBook Packages: Computer Science, Computer Science (R0)
