Approximate Top-k Structural Similarity Search over XML Documents

Xie, Tao; Sha, Chaofeng; Wang, Xiaoling; Zhou, Aoying

doi:10.1007/11610113_29

Part of the book series: Lecture Notes in Computer Science (LNISA, volume 3841)

Included in the following conference series:

Asia-Pacific Web Conference

Abstract

With the development of XML applications such as digital libraries, XML publish/subscribe systems, and other XML repositories, top-k structural similarity search over XML documents is attracting growing attention. Previous work measures the similarity of two XML documents by the edit distance between their XML trees. Because computing edit distances is time-consuming, some recent approaches compute them over structural summaries to improve performance. However, most existing algorithms for computing the edit distance between trees ignore the fact that nodes in a tree may differ in significance, and they inappropriately assume the same edit operation costs for all nodes in an XML document tree. This paper addresses the problem by proposing a summary structure that makes the tree-based edit distance more rational; furthermore, a novel weighting scheme is proposed to indicate that some nodes are more important than others with respect to structural similarity. We introduce a new cost model for computing structural distance that takes node weights into account. Compared with earlier techniques, our approach can answer top-k queries approximately and efficiently. We verify the approach through a series of experiments, and the results show that using weighted structural summaries for top-k queries is efficient and practical.
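
The general idea in the abstract (summarize each tree, weight its nodes, then rank documents by a weighted edit distance) can be illustrated with a minimal sketch. The abstract does not specify the summary structure, the weighting scheme, or the cost model, so everything below is an assumption made for illustration: the summary merges sibling subtrees that share a label, the weight of a node halves with each level of depth, and the distance is a Selkow-style top-down edit distance (see the references). The names `summarize`, `weight`, `distance`, and `top_k` are hypothetical and do not come from the paper.

```python
import heapq

# A tree is a (label, children) pair; children is a tuple of trees.

def summarize(tree):
    """Assumed summary rule: merge sibling subtrees that share a label."""
    label, children = tree
    merged = {}  # child label -> pooled grandchildren, in first-seen order
    for child_label, grandchildren in children:
        merged.setdefault(child_label, []).extend(grandchildren)
    return (label, tuple(summarize((l, tuple(g))) for l, g in merged.items()))

def weight(depth):
    """Assumed weighting: a node's weight halves with each level of depth."""
    return 0.5 ** depth

def subtree_cost(tree, depth):
    """Cost of inserting or deleting a whole subtree rooted at this depth."""
    _, children = tree
    return weight(depth) + sum(subtree_cost(c, depth + 1) for c in children)

def distance(a, b, depth=0):
    """Selkow-style top-down edit distance with depth-weighted costs."""
    relabel = 0.0 if a[0] == b[0] else weight(depth)
    xs, ys = a[1], b[1]
    # dp[i][j]: cost of aligning the first i children of a
    # with the first j children of b.
    dp = [[0.0] * (len(ys) + 1) for _ in range(len(xs) + 1)]
    for i in range(1, len(xs) + 1):
        dp[i][0] = dp[i - 1][0] + subtree_cost(xs[i - 1], depth + 1)
    for j in range(1, len(ys) + 1):
        dp[0][j] = dp[0][j - 1] + subtree_cost(ys[j - 1], depth + 1)
    for i in range(1, len(xs) + 1):
        for j in range(1, len(ys) + 1):
            dp[i][j] = min(
                dp[i - 1][j] + subtree_cost(xs[i - 1], depth + 1),  # delete
                dp[i][j - 1] + subtree_cost(ys[j - 1], depth + 1),  # insert
                dp[i - 1][j - 1] + distance(xs[i - 1], ys[j - 1], depth + 1),
            )
    return relabel + dp[-1][-1]

def top_k(query, docs, k):
    """Return the names of the k documents closest to the query summary."""
    q = summarize(query)
    return heapq.nsmallest(
        k, docs, key=lambda name: distance(q, summarize(docs[name]))
    )

# Hypothetical toy collection.
docs = {
    "d1": ("a", (("b", ()), ("c", ()))),
    "d2": ("a", (("b", ()),)),
    "d3": ("x", ()),
}
query = ("a", (("b", ()), ("b", ()), ("c", ())))
print(top_k(query, docs, 2))
```

Under these assumed weights, a mismatch at the root costs more than a missing leaf, so `d3` ranks below `d2` for this query. Computing distances on summaries rather than on full trees is what makes the ranking approximate.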

Author information

Authors and Affiliations

Department of Computer Science and Engineering, Fudan University, Shanghai, 200433, China
Tao Xie, Chaofeng Sha, Xiaoling Wang & Aoying Zhou

Editor information

Editors and Affiliations

School of ITEE, The University of Queensland, Australia
Xiaofang Zhou
School of Computer Science and Technology, Heilongjiang University, China
Jianzhong Li
School of Information Technology and Electrical Engineering, The University of Queensland, Brisbane, QLD, Australia
Heng Tao Shen
Institute of Industrial Science, The University of Tokyo, 4-6-1 Komaba, Meguro-ku, 153-8505, Tokyo, Japan
Masaru Kitsuregawa
Victoria University, Australia
Yanchun Zhang

About this paper

Cite this paper

Xie, T., Sha, C., Wang, X., Zhou, A. (2006). Approximate Top-k Structural Similarity Search over XML Documents. In: Zhou, X., Li, J., Shen, H.T., Kitsuregawa, M., Zhang, Y. (eds) Frontiers of WWW Research and Development - APWeb 2006. APWeb 2006. Lecture Notes in Computer Science, vol 3841. Springer, Berlin, Heidelberg. https://doi.org/10.1007/11610113_29

DOI: https://doi.org/10.1007/11610113_29
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-540-31142-3
Online ISBN: 978-3-540-32437-9
eBook Packages: Computer Science, Computer Science (R0)
