GRAMS³: An Efficient Framework for XML Structural Similarity Search

Yuan, Peisen; Wang, Xiaoling; Sha, Chaofeng; Gao, Ming; Zhou, Aoying

doi:10.1007/978-3-642-14589-6_43

Conference paper

Part of the book series: Lecture Notes in Computer Science (LNISA, volume 6193)

Abstract

Structural similarity search is a fundamental technology for XML data management. However, existing methods do not scale well to large volumes of XML documents. The pq-gram is an efficient way of extracting substructures from tree-structured data for approximate structural similarity search. In this paper, we propose GRAMS³, an effective framework for evaluating the structural similarity of XML data. First, the pq-grams of each XML document are extracted; then we study the characteristics of the pq-grams of XML and generate a doc-gram vector for each XML tree using the TGF-IGF model; finally, we employ locality sensitive hashing for efficient structural similarity search over XML documents. An empirical study using both synthetic and real datasets demonstrates that the framework is efficient.
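
The three stages named in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation. The abstract does not define the TGF-IGF model, so the weighting below assumes a TF-IDF-style product (the count of a pq-gram in a document times a smoothed log inverse document frequency). The hashing step assumes random-hyperplane signatures, one common locality sensitive hashing family for cosine similarity. The function names `pq_grams`, `weight_vectors` and `signatures`, and all parameter defaults, are hypothetical.

```python
import math
import random
import xml.etree.ElementTree as ET
from collections import Counter


def pq_grams(root, p=2, q=3):
    """Return the bag of pq-grams (tuples of p + q labels) of an XML tree."""
    grams = Counter()

    def visit(node, ancestors):
        stem = ancestors[1:] + [node.tag]   # p labels ending at the anchor node
        base = ["*"] * q                    # sliding window over the children
        children = list(node)
        if not children:
            grams[tuple(stem + base)] += 1  # leaf: the base is all dummy nodes
            return
        for child in children:
            base = base[1:] + [child.tag]
            grams[tuple(stem + base)] += 1
            visit(child, stem)
        for _ in range(q - 1):              # pad with dummy nodes on the right
            base = base[1:] + ["*"]
            grams[tuple(stem + base)] += 1

    visit(root, ["*"] * p)                  # dummy ancestors above the root
    return grams


def weight_vectors(profiles):
    """Assumed TF-IDF-style weighting; the paper's TGF-IGF may differ."""
    n = len(profiles)
    df = Counter(g for prof in profiles for g in prof)  # document frequency
    return [{g: tf * math.log(1 + n / df[g]) for g, tf in prof.items()}
            for prof in profiles]


def signatures(vectors, bits=16, seed=0):
    """Assumed LSH step: one bit per random hyperplane (sign of dot product)."""
    rng = random.Random(seed)
    vocab = sorted({g for v in vectors for g in v})
    planes = [{g: rng.gauss(0, 1) for g in vocab} for _ in range(bits)]
    return [tuple(int(sum(w * plane[g] for g, w in v.items()) >= 0)
                  for plane in planes)
            for v in vectors]


docs = ["<a><b/><c/></a>", "<a><b/><c/></a>", "<x><y><z/></y></x>"]
profiles = [pq_grams(ET.fromstring(d), p=2, q=2) for d in docs]
sigs = signatures(weight_vectors(profiles))
```

In this sketch, documents with the same element structure produce the same pq-gram bag and therefore the same signature, so they would land in the same hash bucket. Structurally different documents are expected to differ in a number of signature bits that grows with the angle between their weighted vectors.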

Author information

Authors and Affiliations

School of Computer Science, Fudan University, Shanghai, 200433, P.R. China
Peisen Yuan, Chaofeng Sha & Ming Gao
Shanghai Key Laboratory of Intelligent Information Processing, Shanghai, 200433, P.R. China
Peisen Yuan, Chaofeng Sha, Ming Gao & Aoying Zhou
Shanghai Key Laboratory of Trustworthy Computing, Software Engineering Institute, East China Normal University, Shanghai, 200062, P.R. China
Xiaoling Wang & Aoying Zhou

Editor information

Editors and Affiliations

Graduate School of Informatics, Kyoto University, Yoshida Honmachi, Sakyo, 606-8501, Kyoto, Japan
Masatoshi Yoshikawa
Information School, Renmin University of China, 100872, Beijing, China
Xiaofeng Meng
Graduate School of Engineering, University of Hyogo, 2167 Shosha, Himeji, 671-2280, Hyogo, Japan
Takayuki Yumoto
Graduate School of Informatics, Kyoto University, Yoshidahonmachi, Sakyo, 606-8501, Kyoto, Japan
Qiang Ma
Institute of HCI and Media Integration, Tsinghua University, 100084, Beijing, China
Lifeng Sun
Department of Information Science, Ochanomizu University, 2-1-1, Otsuka, Bunkyo-ku, 112-8610, Tokyo, Japan
Chiemi Watanabe

About this paper

Cite this paper

Yuan, P., Wang, X., Sha, C., Gao, M., Zhou, A. (2010). GRAMS³: An Efficient Framework for XML Structural Similarity Search. In: Yoshikawa, M., Meng, X., Yumoto, T., Ma, Q., Sun, L., Watanabe, C. (eds) Database Systems for Advanced Applications. DASFAA 2010. Lecture Notes in Computer Science, vol 6193. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-642-14589-6_43

DOI: https://doi.org/10.1007/978-3-642-14589-6_43
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-642-14588-9
Online ISBN: 978-3-642-14589-6
eBook Packages: Computer Science, Computer Science (R0)
