Abstract
With the development of XML applications such as digital libraries, XML publish/subscribe systems, and other XML repositories, top-k structural similarity search over XML documents is attracting more attention. The similarity of two XML documents can be measured by the edit distance defined between XML trees in previous work. Since computing edit distances is time consuming, recent work has used structural summaries to speed up the computation. However, most existing algorithms for calculating the edit distance between trees ignore the fact that nodes in a tree may differ in significance, and they inappropriately assume the same edit operation costs for all nodes in an XML document tree. This paper addresses the problem by proposing a summary structure that makes the tree-based edit distance more rational; furthermore, a novel weighting scheme is proposed to indicate that some nodes are more important than others with respect to structural similarity. We introduce a new cost model for computing structural distance that takes the weight information of nodes into account. Compared with former techniques, our approach can answer top-k queries approximately and efficiently. We verify the approach through a series of experiments, and the results show that using weighted structural summaries for top-k queries is efficient and practical.
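To make the idea of node-dependent edit costs concrete, the following is a minimal sketch and not the paper's algorithm. It assumes a hypothetical depth-based weight (each level costs half as much as the one above), uses a simple top-down tree distance in which whole subtrees are inserted or deleted, and ranks documents by exhaustive comparison. The names `weight`, `subtree_cost`, `weighted_distance`, and `top_k` are illustrative, and the structural summary and approximation steps described in the abstract are not reproduced.

```python
import heapq

# A tree is a nested tuple: (label, (child, child, ...)).

def weight(depth, decay=0.5):
    # Assumed weighting: nodes closer to the root matter more.
    return decay ** depth

def subtree_cost(tree, depth):
    # Cost of inserting or deleting a whole subtree rooted at this depth.
    _, children = tree
    return weight(depth) + sum(subtree_cost(c, depth + 1) for c in children)

def weighted_distance(t1, t2, depth=0):
    # Top-down edit distance between ordered trees with depth-weighted costs.
    (label1, kids1), (label2, kids2) = t1, t2
    relabel = 0.0 if label1 == label2 else weight(depth)
    m, n = len(kids1), len(kids2)
    # d[i][j]: cost of aligning the first i children of t1 with the first j of t2.
    d = [[0.0] * (n + 1) for _ in range(m + 1)]
    for i in range(1, m + 1):
        d[i][0] = d[i - 1][0] + subtree_cost(kids1[i - 1], depth + 1)
    for j in range(1, n + 1):
        d[0][j] = d[0][j - 1] + subtree_cost(kids2[j - 1], depth + 1)
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            d[i][j] = min(
                d[i - 1][j] + subtree_cost(kids1[i - 1], depth + 1),  # delete subtree
                d[i][j - 1] + subtree_cost(kids2[j - 1], depth + 1),  # insert subtree
                d[i - 1][j - 1]
                + weighted_distance(kids1[i - 1], kids2[j - 1], depth + 1),  # match
            )
    return relabel + d[m][n]

def top_k(query, docs, k):
    # Return the k (distance, name) pairs with the smallest weighted distance.
    scored = ((weighted_distance(query, tree), name) for name, tree in docs.items())
    return heapq.nsmallest(k, scored)

query = ("a", (("b", ()), ("c", ())))
docs = {
    "same": ("a", (("b", ()), ("c", ()))),
    "missing_leaf": ("a", (("b", ()),)),
    "renamed_root": ("x", (("b", ()), ("c", ()))),
}
print(top_k(query, docs, 2))
```

Under this assumed weighting, losing a leaf costs less than renaming the root, which is the kind of distinction a uniform cost model cannot express.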
Copyright information
© 2006 Springer-Verlag Berlin Heidelberg
About this paper
Cite this paper
Xie, T., Sha, C., Wang, X., Zhou, A. (2006). Approximate Top-k Structural Similarity Search over XML Documents. In: Zhou, X., Li, J., Shen, H.T., Kitsuregawa, M., Zhang, Y. (eds) Frontiers of WWW Research and Development - APWeb 2006. APWeb 2006. Lecture Notes in Computer Science, vol 3841. Springer, Berlin, Heidelberg. https://doi.org/10.1007/11610113_29
DOI: https://doi.org/10.1007/11610113_29
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-540-31142-3
Online ISBN: 978-3-540-32437-9
eBook Packages: Computer Science, Computer Science (R0)