Approximate Joins for XML Using g-String

  • Conference paper

Part of the book series: Lecture Notes in Computer Science (LNISA, volume 6309)

Abstract

When integrating XML documents from autonomous databases, exact joins often fail because data items that represent the same real-world object may not be exactly identical. The join must therefore be approximate. Join methods based on tree edit distance achieve high join quality but are inefficient. Other, more efficient methods cannot perform the join as effectively as tree edit distance does.

To balance efficiency and effectiveness, we propose a novel method for approximately joining XML documents. In our method, trees are transformed into g-strings, in which each entry is a tiny subtree. The distance between two trees is then evaluated as the g-string distance between their corresponding g-strings. To make the g-string-based join scale to large XML databases, we propose the g-bag distance as a lower bound of the g-string distance. With the g-bag distance, only a very small fraction of the g-string distances needs to be computed directly, so the whole join can be performed very efficiently. We theoretically analyze the properties of the g-string distance. Experiments on synthetic and various real-world data confirm the effectiveness and efficiency of our method and suggest that our technique is both scalable and useful.
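The filter-and-refine idea described above (a cheap lower bound prunes pairs before the expensive distance is computed) can be illustrated with a minimal sketch. The abstract does not give the definitions of the g-string or g-bag distance, so this sketch substitutes two well-known stand-ins: unit-cost edit distance over token sequences in place of the g-string distance, and the multiset "bag distance", a standard lower bound of edit distance, in place of the g-bag distance. The function names, the threshold `tau`, and the representation of each tree as a flat sequence of hashable tokens are all assumptions made for illustration, not the paper's method.

```python
from collections import Counter
from typing import Hashable, List, Sequence, Tuple


def edit_distance(a: Sequence[Hashable], b: Sequence[Hashable]) -> int:
    """Unit-cost edit distance between two token sequences.

    Stand-in (assumption) for the paper's g-string distance.
    """
    prev = list(range(len(b) + 1))
    for i, x in enumerate(a, start=1):
        curr = [i]
        for j, y in enumerate(b, start=1):
            curr.append(min(prev[j] + 1,              # delete x
                            curr[j - 1] + 1,          # insert y
                            prev[j - 1] + (x != y)))  # substitute or match
        prev = curr
    return prev[-1]


def bag_distance(a: Sequence[Hashable], b: Sequence[Hashable]) -> int:
    """Multiset lower bound of edit_distance.

    Stand-in (assumption) for the paper's g-bag distance: it ignores
    token order and only counts tokens that cannot be paired up.
    """
    ca, cb = Counter(a), Counter(b)
    only_a = sum((ca - cb).values())
    only_b = sum((cb - ca).values())
    return max(only_a, only_b)


def approximate_join(left: List[Sequence[Hashable]],
                     right: List[Sequence[Hashable]],
                     tau: int) -> Tuple[List[Tuple[int, int]], int]:
    """Return index pairs within distance tau, plus the number of
    pairs for which the expensive distance was actually computed."""
    matches: List[Tuple[int, int]] = []
    verified = 0
    for i, s in enumerate(left):
        for j, t in enumerate(right):
            # Filter: if the lower bound already exceeds tau,
            # the true distance must exceed it as well.
            if bag_distance(s, t) > tau:
                continue
            # Refine: compute the exact distance for surviving pairs only.
            verified += 1
            if edit_distance(s, t) <= tau:
                matches.append((i, j))
    return matches, verified
```

Because the bag distance never exceeds the edit distance, pruning on it cannot discard a true match; it only reduces how many exact distance computations are needed. The nested loop is kept for clarity and is not meant to reflect how the paper organizes the join.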

Supported by the National Science Foundation of China (No. 60703012, 60773063), the NSFC-RGC of China (No. 60831160525), National Grant of Fundamental Research 973 Program of China (No. 2006CB303000), National Grant of High Technology 863 Program of China (No. 2009AA01Z149), Key Program of the National Natural Science Foundation of China (No. 60933001), National Postdoctor Foundation of China (No. 20090450126), and Development Program for Outstanding Young Teachers in Harbin Institute of Technology (No. HITQNJS.2009.052).

Copyright information

© 2010 Springer-Verlag Berlin Heidelberg

About this paper

Cite this paper

Li, F., Wang, H., Zhang, C., Hao, L., Li, J., Gao, H. (2010). Approximate Joins for XML Using g-String. In: Lee, M.L., Yu, J.X., Bellahsène, Z., Unland, R. (eds) Database and XML Technologies. XSym 2010. Lecture Notes in Computer Science, vol 6309. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-642-15684-7_2

  • DOI: https://doi.org/10.1007/978-3-642-15684-7_2

  • Publisher Name: Springer, Berlin, Heidelberg

  • Print ISBN: 978-3-642-15683-0

  • Online ISBN: 978-3-642-15684-7

  • eBook Packages: Computer Science (R0)