Abstract
Structural similarity search is a fundamental technology for XML data management. However, existing methods do not scale well to large volumes of XML documents. The pq-gram is an efficient way of extracting substructures from tree-structured data for approximate structural similarity search. In this paper, we propose GRAMS3, an effective framework for evaluating the structural similarity of XML data. First, the pq-grams of each XML document are extracted; then we study the characteristics of the pq-grams of XML and generate a doc-gram vector for each XML tree using the TGF-IGF model; finally, we employ locality sensitive hashing for efficient structural similarity search over XML documents. An empirical study using both synthetic and real datasets demonstrates that the framework is efficient.
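The three stages named in the abstract can be illustrated with a minimal sketch. It is not the GRAMS3 implementation, and several details are assumptions: the `(label, children)` tree representation, the helper names `pq_grams`, `weight_vectors`, and `signature`, the reading of TGF-IGF as a TF-IDF-style weight (the abstract does not give the formula), and the choice of random-hyperplane hashing as the locality sensitive hash family.

```python
import math
import random
from collections import Counter

# Assumption: a tree node is a (label, [children]) pair; "*" is the null label.
def pq_grams(tree, p=2, q=3):
    """Return the bag of pq-grams of a tree as a Counter of label tuples."""
    grams = []

    def visit(node, ancestors):
        label, children = node
        stem = (ancestors + (label,))[-p:]          # node plus p-1 ancestors
        if not children:
            grams.append(stem + ("*",) * q)         # leaf: base is all nulls
        else:
            labels = tuple(child[0] for child in children)
            padded = ("*",) * (q - 1) + labels + ("*",) * (q - 1)
            for i in range(len(padded) - q + 1):    # slide a window of q children
                grams.append(stem + padded[i:i + q])
            for child in children:
                visit(child, stem)

    visit(tree, ("*",) * (p - 1))
    return Counter(grams)

# Assumption: TGF-IGF is treated like TF-IDF (gram count times a log-scaled
# inverse document frequency); the exact weighting in the paper may differ.
def weight_vectors(gram_bags):
    n = len(gram_bags)
    df = Counter(g for bag in gram_bags for g in bag)   # documents containing g
    return [{g: c * math.log(1 + n / df[g]) for g, c in bag.items()}
            for bag in gram_bags]

# Assumption: random-hyperplane LSH; one bit per hyperplane, shared seed so
# every document is hashed against the same hyperplanes.
def signature(vec, vocab, bits=16, seed=0):
    rng = random.Random(seed)
    sig = []
    for _ in range(bits):
        plane = {g: rng.gauss(0, 1) for g in vocab}
        dot = sum(w * plane[g] for g, w in vec.items())
        sig.append(1 if dot >= 0 else 0)
    return tuple(sig)

# Hypothetical toy documents: doc_a and doc_b share a structure, doc_c does not.
doc_a = ("a", [("b", []), ("c", [])])
doc_b = ("a", [("b", []), ("c", [])])
doc_c = ("x", [("y", [("z", [])])])

bags = [pq_grams(d) for d in (doc_a, doc_b, doc_c)]
vectors = weight_vectors(bags)
vocab = sorted(set().union(*bags))
sig_a, sig_b, sig_c = (signature(v, vocab) for v in vectors)
```

Documents whose signatures agree on many bits become candidates for a closer comparison. In this sketch the structurally identical `doc_a` and `doc_b` always receive the same signature, because they produce the same weighted vector and are hashed against the same hyperplanes.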
Copyright information
© 2010 Springer-Verlag Berlin Heidelberg
About this paper
Cite this paper
Yuan, P., Wang, X., Sha, C., Gao, M., Zhou, A. (2010). GRAMS3: An Efficient Framework for XML Structural Similarity Search. In: Yoshikawa, M., Meng, X., Yumoto, T., Ma, Q., Sun, L., Watanabe, C. (eds) Database Systems for Advanced Applications. DASFAA 2010. Lecture Notes in Computer Science, vol 6193. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-642-14589-6_43
DOI: https://doi.org/10.1007/978-3-642-14589-6_43
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-642-14588-9
Online ISBN: 978-3-642-14589-6
eBook Packages: Computer Science, Computer Science (R0)