PG-Join: Proximity Graph Based String Similarity Joins

Kazimianec, Michail; Augsten, Nikolaus

doi:10.1007/978-3-642-22351-8_17

Part of the book series: Lecture Notes in Computer Science (LNISA, volume 6809)

Included in the following conference series:

International Conference on Scientific and Statistical Database Management

Abstract

In many applications, for example, in data integration scenarios, strings must be matched if they are similar. String similarity joins, which match all pairs of similar strings from two datasets, are of particular interest and have recently received much attention in the database research community. Most approaches, however, assume a global similarity threshold; all string pairs that exceed the threshold form a match in the join result. The global threshold approach has two major problems: (a) the threshold depends on the (mostly unknown) data distribution, (b) often there is no single threshold that is good for all string pairs.
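To make the global threshold approach concrete, the following is a minimal sketch of a nested-loop similarity join. It is not taken from the paper: the Jaccard similarity over padded 2-grams, the function names `qgrams`, `jaccard`, and `threshold_join`, and the parameter `tau` are all assumptions chosen for illustration.

```python
def qgrams(s, q=2):
    """Return the set of q-grams of s (hypothetical helper).

    The string is lowercased and padded with '#' so that the first and
    last characters also contribute q-grams.
    """
    padded = "#" * (q - 1) + s.lower() + "#" * (q - 1)
    return {padded[i:i + q] for i in range(len(padded) - q + 1)}


def jaccard(a, b, q=2):
    """Jaccard similarity of the q-gram sets of a and b (assumed measure)."""
    ga, gb = qgrams(a, q), qgrams(b, q)
    union = ga | gb
    return len(ga & gb) / len(union) if union else 1.0


def threshold_join(left, right, tau, q=2):
    """Return all pairs (a, b) whose similarity reaches the global threshold tau.

    A plain nested loop, quadratic in the input sizes; shown only to
    illustrate the join semantics, not an efficient algorithm.
    """
    return [(a, b) for a in left for b in right if jaccard(a, b, q) >= tau]


if __name__ == "__main__":
    names = ["smith", "smyth", "jones"]
    # The same data yields different results depending on the single value tau.
    print(threshold_join(["smith"], names, tau=0.5))
    print(threshold_join(["smith"], names, tau=0.6))
```

In this sketch the single value `tau` decides every match: at 0.5 the pair ("smith", "smyth") is included, at 0.6 it is not. That sensitivity to one dataset-wide parameter is the limitation described in problems (a) and (b) above.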

In this paper we propose the PG-Join algorithm, a novel string similarity join that requires no configuration and uses an adaptive threshold. PG-Join computes a so-called proximity graph to derive an individual threshold for each string. Computing the proximity graph efficiently is essential for the scalability of PG-Join. To this end we develop a new and fast algorithm, PG-I, that computes the proximity graph in two steps: First an efficient approximation is computed, then the approximation error is fixed incrementally until the adaptive threshold is stable. Our extensive experiments on real-world and synthetic data show that PG-I is up to five times faster than the state-of-the-art algorithm and suggest that PG-Join is a useful and effective join paradigm.

Author information

Authors and Affiliations

Faculty of Computer Science, Free University of Bozen-Bolzano, Dominikanerplatz 3, 39100, Bozen, Italy
Michail Kazimianec & Nikolaus Augsten

Editor information

Editors and Affiliations

The Evergreen State College, 98505, Olympia, WA, USA
Judith Bayard Cushing
CNRI and University of Virginia, 22908, Charlottesville, VA, USA
James French
Gonzaga University, 99258, Spokane, WA, USA
Shawn Bowers

About this paper

Cite this paper

Kazimianec, M., Augsten, N. (2011). PG-Join: Proximity Graph Based String Similarity Joins. In: Bayard Cushing, J., French, J., Bowers, S. (eds) Scientific and Statistical Database Management. SSDBM 2011. Lecture Notes in Computer Science, vol 6809. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-642-22351-8_17

DOI: https://doi.org/10.1007/978-3-642-22351-8_17
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-642-22350-1
Online ISBN: 978-3-642-22351-8
eBook Packages: Computer Science (R0)
