Approximate Top-k Structural Similarity Search over XML Documents

  • Conference paper
Frontiers of WWW Research and Development - APWeb 2006 (APWeb 2006)

Part of the book series: Lecture Notes in Computer Science (LNISA, volume 3841)

Abstract

With the growth of XML applications such as digital libraries, XML publish/subscribe systems, and other XML repositories, top-k structural similarity search over XML documents is attracting increasing attention. The similarity of two XML documents can be measured by the edit distance between their XML trees, as defined in previous work. Because computing edit distances is time-consuming, recent work has proposed computing the edit distance over structural summaries to improve performance. However, most existing tree edit distance algorithms ignore the fact that nodes in a tree may differ in significance, and inappropriately assume the same edit operation costs for all nodes in an XML document tree. This paper addresses the problem by proposing a summary structure that makes the tree-based edit distance more rational; furthermore, a novel weighting scheme is proposed to indicate that some nodes are more important than others with respect to structural similarity. We introduce a new cost model for computing structural distance that takes node weights into account. Compared with former techniques, our approach can answer top-k queries approximately and efficiently. We verify the approach through a series of experiments, and the results show that using weighted structural summaries for top-k queries is efficient and practical.
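To make the idea of a weighted structural summary concrete, the following is a minimal illustrative sketch, not the authors' algorithm. It assumes a summary is the set of distinct root-to-node label paths, that a node's weight decays geometrically with depth (the `decay` parameter is hypothetical), and that distance is the total weight of paths found in only one summary, a cheap stand-in for a weighted tree edit distance. The function names `weighted_paths`, `weighted_distance`, and `top_k` are invented for this example.

```python
import heapq
import xml.etree.ElementTree as ET


def weighted_paths(xml_text, decay=0.5):
    """Summarize a document as {label path: weight}.

    Assumption: repeated sibling structures collapse into one path, and
    a path at depth d gets weight decay**d, so nodes near the root count more.
    """
    root = ET.fromstring(xml_text)
    summary = {}
    stack = [(root, (root.tag,), 0)]
    while stack:
        node, path, depth = stack.pop()
        summary[path] = decay ** depth
        for child in node:
            stack.append((child, path + (child.tag,), depth + 1))
    return summary


def weighted_distance(a, b):
    """Sum the weights of paths that occur in exactly one summary."""
    only_one = a.keys() ^ b.keys()
    return sum(a[p] if p in a else b[p] for p in only_one)


def top_k(query_xml, docs, k):
    """Return the k (distance, name) pairs closest to the query."""
    q = weighted_paths(query_xml)
    scored = (
        (weighted_distance(q, weighted_paths(text)), name)
        for name, text in docs.items()
    )
    return heapq.nsmallest(k, scored)
```

Under these assumptions, a mismatch near the root costs more than one deep in the tree, which is the intuition behind giving nodes different significance; the paper's actual weighting scheme and cost model may differ.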

Copyright information

© 2006 Springer-Verlag Berlin Heidelberg

About this paper

Cite this paper

Xie, T., Sha, C., Wang, X., Zhou, A. (2006). Approximate Top-k Structural Similarity Search over XML Documents. In: Zhou, X., Li, J., Shen, H.T., Kitsuregawa, M., Zhang, Y. (eds) Frontiers of WWW Research and Development - APWeb 2006. APWeb 2006. Lecture Notes in Computer Science, vol 3841. Springer, Berlin, Heidelberg. https://doi.org/10.1007/11610113_29

  • DOI: https://doi.org/10.1007/11610113_29

  • Publisher Name: Springer, Berlin, Heidelberg

  • Print ISBN: 978-3-540-31142-3

  • Online ISBN: 978-3-540-32437-9

  • eBook Packages: Computer Science, Computer Science (R0)
