FrepJoin: an efficient partition-based algorithm for edit similarity join

Luo, Ji-zhou; Shi, Sheng-fei; Wang, Hong-zhi; Li, Jian-zhong

doi:10.1631/FITEE.1601347

Published: 15 December 2017

Volume 18, pages 1499–1510, (2017)

Frontiers of Information Technology & Electronic Engineering

Ji-zhou Luo¹,² (ORCID: orcid.org/0000-0002-3302-3917), Sheng-fei Shi¹, Hong-zhi Wang¹ & Jian-zhong Li¹

Abstract

String similarity join (SSJ) is essential for many applications in which near-duplicate objects must be found. This paper targets SSJ with edit distance constraints. Existing algorithms usually adopt the filter-and-refine framework. They cannot capture the dissimilarity between subsets of strings, and they do not fully exploit statistics such as character frequencies. We develop a partition-based algorithm that uses such statistics. Frequency vectors are used to partition a dataset into data chunks such that the dissimilarity between chunks can be detected easily. A novel algorithm is designed to accelerate SSJ over the partitioned data. A new filter is proposed that leverages the statistics to avoid computing edit distances for a noticeable proportion of the candidate pairs that survive existing filters. Our algorithm notably outperforms alternative methods on real datasets.
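
To illustrate how character frequencies can rule out candidate pairs before any edit distance is computed, the sketch below uses the classic frequency-distance lower bound: one edit operation changes the surplus and the deficit of character counts by at most one each, so the larger of the two cannot exceed the edit distance. This is not the filter or the partitioning scheme proposed in the paper, whose details are not given in the abstract; the function names `frequency_lower_bound`, `edit_distance`, and `similarity_join`, and the naive all-pairs loop, are assumptions made for illustration only.

```python
from collections import Counter


def frequency_lower_bound(s: str, t: str) -> int:
    """Lower bound on edit distance derived from character frequency vectors."""
    fs, ft = Counter(s), Counter(t)
    surplus = sum((fs - ft).values())  # characters s has more of than t
    deficit = sum((ft - fs).values())  # characters t has more of than s
    # Each insertion, deletion, or substitution lowers each sum by at most 1.
    return max(surplus, deficit)


def edit_distance(s: str, t: str) -> int:
    """Standard dynamic-programming (Levenshtein) edit distance."""
    prev = list(range(len(t) + 1))
    for i, a in enumerate(s, 1):
        cur = [i]
        for j, b in enumerate(t, 1):
            cur.append(min(prev[j] + 1,              # delete a
                           cur[j - 1] + 1,           # insert b
                           prev[j - 1] + (a != b)))  # substitute or match
        prev = cur
    return prev[-1]


def similarity_join(strings: list[str], tau: int) -> list[tuple[str, str]]:
    """Naive filter-and-refine self-join with threshold tau (illustrative only)."""
    results = []
    for i in range(len(strings)):
        for j in range(i + 1, len(strings)):
            s, t = strings[i], strings[j]
            if abs(len(s) - len(t)) > tau:         # length filter
                continue
            if frequency_lower_bound(s, t) > tau:  # frequency filter
                continue
            if edit_distance(s, t) <= tau:         # refine step
                results.append((s, t))
    return results


if __name__ == "__main__":
    print(similarity_join(["kitten", "sitting", "mitten", "abcdef"], tau=1))
```

With a threshold of 1, only the pair ("kitten", "mitten") reaches the refine step here; every other pair is discarded by the frequency filter without running the quadratic-time dynamic program.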

Author information

Authors and Affiliations

School of Computer Science and Technology, Harbin Institute of Technology, Harbin, 150001, China
Ji-zhou Luo, Sheng-fei Shi, Hong-zhi Wang & Jian-zhong Li
Guangdong Key Laboratory of Popular High Performance Computers, Key Laboratory of Service Computing and Application, Shenzhen, 518000, China
Ji-zhou Luo

Corresponding author

Correspondence to Ji-zhou Luo.

About this article

Cite this article

Luo, Jz., Shi, Sf., Wang, Hz. et al. FrepJoin: an efficient partition-based algorithm for edit similarity join. Frontiers Inf Technol Electronic Eng 18, 1499–1510 (2017). https://doi.org/10.1631/FITEE.1601347

Received: 17 June 2016
Accepted: 15 December 2016
Published: 15 December 2017
Issue Date: October 2017
DOI: https://doi.org/10.1631/FITEE.1601347

Keywords

CLC number

TP311.13
