Similarity Join on XML Based on k-Generation Set Distance

Wang, Yue; Wang, Hongzhi; Wang, Yang; Gao, Hong

doi:10.1007/978-3-642-28635-3_11

Part of the book series: Lecture Notes in Computer Science (LNISA, volume 7142)

Included in the following conference series:

International Conference on Web-Age Information Management

Abstract

Similarity join is widely applied because data items that represent the same real-world object may differ owing to differing conventions. Traditional similarity join methods, however, are inefficient, so a method that offers both high efficiency and high join quality is needed. In this paper, we propose two new edit operations (reversing and mapping), together with related similarity join algorithms based on the newly defined measure. Our method replaces the computation of tree edit distance with the computation of the k-generation set distance between trees, which greatly simplifies the join process. The time complexity of our method is O(n²), where n is the tree size. We show that our method has advantages over existing methods and that it scales to large data sets.
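The abstract does not define the k-generation set or its distance, so the following Python sketch is only an illustrative stand-in for the general idea of replacing tree edit distance with a set-based distance in a threshold join. Everything in it is an assumption rather than the authors' definition: the `Node` type, the reading of a "k-generation set" as a node's label paired with the labels of its descendants at most k levels below it, the multiset symmetric difference used as the distance, and the threshold `tau`. The reversing and mapping operations are not modelled.

```python
from collections import Counter
from dataclasses import dataclass, field
from itertools import product


@dataclass
class Node:
    """Hypothetical labeled tree node standing in for an XML element."""
    label: str
    children: list = field(default_factory=list)


def labels_within(node, k):
    """Labels of all descendants of `node` that lie 1..k levels below it."""
    if k == 0:
        return []
    found = []
    for child in node.children:
        found.append(child.label)
        found.extend(labels_within(child, k - 1))
    return found


def k_generation_profile(root, k):
    """Multiset with one entry per node of the tree.

    ASSUMPTION: each entry is (node label, sorted labels of descendants
    within k generations). Sorting makes this sketch ignore sibling order.
    """
    profile = Counter()
    stack = [root]
    while stack:
        node = stack.pop()
        profile[(node.label, tuple(sorted(labels_within(node, k))))] += 1
        stack.extend(node.children)
    return profile


def set_distance(p, q):
    """Size of the multiset symmetric difference of two profiles (assumed measure)."""
    return sum(((p - q) + (q - p)).values())


def similarity_join(left, right, k, tau):
    """Index pairs (i, j) whose profile distance is at most the threshold tau."""
    left_profiles = [k_generation_profile(t, k) for t in left]
    right_profiles = [k_generation_profile(t, k) for t in right]
    return [
        (i, j)
        for (i, p), (j, q) in product(enumerate(left_profiles), enumerate(right_profiles))
        if set_distance(p, q) <= tau
    ]
```

In this sketch each tree's profile is computed once and then reused for every pair, so the join itself only compares multisets and never runs a tree edit distance computation. The nested-loop join is the simplest possible choice and is not meant to reflect the paper's algorithms or its stated complexity.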

This research is partially supported by the National Science Foundation of China under Grants No. 61003046, No. 60831160525, and No. 61111130189; the Key Program of the National Natural Science Foundation of China under Grant No. 60933001; the National Postdoctoral Foundation of China under Grants No. 20090450126 and No. 201003447; and the Doctoral Fund of the Ministry of Education of China under Grant No. 20102302120054.

Author information

Authors and Affiliations

The School of Computer Science and Technology, Harbin Institute of Technology, China
Yue Wang, Hongzhi Wang, Yang Wang & Hong Gao

Editor information

Editors and Affiliations

Wuhan University, 430072, Hubei, China
Liwei Wang, Jingjue Jiang, Liang Hong & Bin Liu
Renmin University of China, 100872, Beijing, China
Jiaheng Lu

About this paper

Cite this paper

Wang, Y., Wang, H., Wang, Y., Gao, H. (2012). Similarity Join on XML Based on k-Generation Set Distance. In: Wang, L., Jiang, J., Lu, J., Hong, L., Liu, B. (eds) Web-Age Information Management. WAIM 2011. Lecture Notes in Computer Science, vol 7142. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-642-28635-3_11

DOI: https://doi.org/10.1007/978-3-642-28635-3_11
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-642-28634-6
Online ISBN: 978-3-642-28635-3
eBook Packages: Computer Science, Computer Science (R0)
