Skip to main content
Log in

FrepJoin: an efficient partition-based algorithm for edit similarity join

  • Published:
Frontiers of Information Technology & Electronic Engineering Aims and scope Submit manuscript

Abstract

String similarity join (SSJ) is essential for many applications where near-duplicate objects need to be found. This paper targets SSJ with edit distance constraints. The existing algorithms usually adopt the filter-andrefine framework. They cannot catch the dissimilarity between string subsets, and do not fully exploit the statistics such as the frequencies of characters. We investigate to develop a partition-based algorithm by using such statistics. The frequency vectors are used to partition datasets into data chunks with dissimilarity between them being caught easily. A novel algorithm is designed to accelerate SSJ via the partitioned data. A new filter is proposed to leverage the statistics to avoid computing edit distances for a noticeable proportion of candidate pairs which survive the existing filters. Our algorithm outperforms alternative methods notably on real datasets.

This is a preview of subscription content, log in via an institution to check access.

Access this article

Price excludes VAT (USA)
Tax calculation will be finalised during checkout.

Instant access to the full article PDF.

Similar content being viewed by others

References

  • Afrati, F.N., Sarma, A.D., Menestrina, D., et al., 2012. Fuzzy joins using MapReduce. Int. Conf. on Data Engineering, p.498–509. https://doi.org/10.1109/ICDE.2012.66

    Google Scholar 

  • Arasu, A., Ganti, V., Kaushik, R., 2006. Efficient exact set-similarity joins. Int. Conf. on Very Large Data Bases, p.918–929.

    Google Scholar 

  • Bayardo, R.J., Ma, Y., Srikant, R., 2007. Scaling up all pairs similarity search. Int. World Wide Web Conf., p.131–140. https://doi.org/10.1145/1242572.1242591

    Google Scholar 

  • Chaudhuri, S., Ganjam, K., Ganti, V., et al., 2003. Robust and efficient fuzzy match for online data cleaning. Int. SIGMOD Conf. on Management of Data, p.313–324. https://doi.org/10.1145/872757.872796

    Google Scholar 

  • Chaudhuri, S., Ganti, V., Kaushik, R., 2006a. Data debugger: an operator-centric approach for data quality solutions. IEEE Data Eng. Bull., 29(2): 60–66.

    Google Scholar 

  • Chaudhuri, S., Ganti, V., Kaushik, R., 2006b. A primitive operator for similarity joins in data cleaning. Int. Conf. on Data Engineering, p.687–698. https://doi.org/10.1109/ICDE.2006.9

    Google Scholar 

  • Dong, X., Halevy, A.Y., Yu, C., 2007. Data integration with uncertainty. Int. Conf. on Very Large Data Bases, p.687–698.

    Google Scholar 

  • Feng, J., Wang, J., Li, G., 2012. Trie-join: a trie-based method for efficient string similarity joins. VLDB J., 21(4): 437–461. https://doi.org/10.1007/s00778-011-0252-8

    Article  Google Scholar 

  • Ge, T., Li, Z., 2011. Approximate substring matching over uncertain strings. Proc. VLDB Endow., 4(11): 772–782.

    Google Scholar 

  • Gravano, L., Ipeirotis, P.G., Jagadish, H.V., et al., 2001. Approximate string joins in a database (almost) for free. Int. Conf. on Very Large Data Bases, p.491–500.

    Google Scholar 

  • Hadjieleftheriou, M., Srivastava, D., 2010. Weighted Set-Based String Similarity. Bulletin of the IEEE Computer Society Technical Committee on Data Engineering. AT&T Lab-Research.

    Google Scholar 

  • Henzinger, M.R., 2006. Finding near-duplicate web pages: a large-scale evaluation of algorithms. Int. ACM SIGIR Conf. on Research and Development in Information Retrieval, p.284–291. https://doi.org/10.1145/1148170.1148222

    Google Scholar 

  • Ji, S., Li, G., Li, C., et al., 2009. Efficient interactive fuzzy keyword search. Int. World Wide Web Conf., p.371–380. https://doi.org/10.1145/1526709.1526760

    Google Scholar 

  • Li, C., Lu, J., Lu, Y., 2008. Efficient merging and filtering algorithms for approximate string searches. Int. Conf. on Data Engineering, p.257–266. https://doi.org/10.1109/ICDE.2008.4497434

    Google Scholar 

  • Li, G., Deng, D., Wang, J., et al., 2011. Pass-Join: a partition-based method for similarity joins. Proc. VLDB Endow., 5(3): 253–264. https://doi.org/10.14778/2078331.2078340

    Article  Google Scholar 

  • Metwally, A., Agrawal, D., Abbadi, A.E., 2007. Detectives: detecting coalition hit inflation attacks in advertising networks streams. Int. World Wide Web Conf., p.241–250. https://doi.org/10.1145/1242572.1242606

    Google Scholar 

  • Navarro, G., Salmela, L., 2009. Indexing variable length substrings for exact and approximate matching. Int. Symp. on String Processing and Information Retrieval, p.214–221. https://doi.org/10.1007/978-3-642-03784-9_21

    Google Scholar 

  • Qin, J., Wang, W., Xiao, C., et al., 2013. Vchunkjoin: an efficient algorithm for edit similarity joins. Trans. Knowl. Dat. Eng., 25(8): 1916–1929. https://doi.org/10.1109/TKDE.2012.79

    Article  Google Scholar 

  • Sarawagi, S., Kirpal, A., 2004. Efficient set joins on similarity predicates. Int. SIGMOD Conf. on Management of Data, p.743–754. https://doi.org/10.1145/1007568.1007652

    Google Scholar 

  • Scott, D.W., 1979. On optimal and data-based histograms. Biometrika, 66: 605–610. https://doi.org/10.1093/biomet/66.3.605

    Article  MathSciNet  Google Scholar 

  • Vernica, R., Carey, M.J., Li, C., 2010. Efficient parallel set-similarity joins using MapReduce. Int. SIGMOD Conf. on Management of Data, p.495–506. https://doi.org/10.1145/1807167.1807222

    Google Scholar 

  • Wang, J., Li, G., Feng, J., 2011. Fast-join: an efficient method for fuzzy token matching based string similarity join. Int. Conf. on Data Engineering, p.458–469. https://doi.org/10.1109/ICDE.2011.5767865

    Google Scholar 

  • Wang, W., Xiao, C., Lin, X., et al., 2009. Efficient approximate entity extraction with edit distance constraints. Int. SIGMOD Conf. on Management of Data, p.759–770. https://doi.org/10.1145/1559845.1559925

    Google Scholar 

  • Xiao, C., Wang, W., Lin, X., 2008a. Ed-Join: an efficient algorithm for similarity joins with edit distance constraints. Proc. VLDB Endow., 1(1): 933–944. https://doi.org/10.14778/1453856.1453957

    Article  MathSciNet  Google Scholar 

  • Xiao, C., Wang, W., Lin, X., et al., 2008b. Efficient similarity joins for near duplicate detection. Int. WorldWideWeb Conf., p.131–140. https://doi.org/10.1145/2000824.2000825

    Google Scholar 

Download references

Author information

Authors and Affiliations

Authors

Corresponding author

Correspondence to Ji-zhou Luo.

Rights and permissions

Reprints and permissions

About this article

Check for updates. Verify currency and authenticity via CrossMark

Cite this article

Luo, Jz., Shi, Sf., Wang, Hz. et al. FrepJoin: an efficient partition-based algorithm for edit similarity join. Frontiers Inf Technol Electronic Eng 18, 1499–1510 (2017). https://doi.org/10.1631/FITEE.1601347

Download citation

  • Received:

  • Accepted:

  • Published:

  • Issue Date:

  • DOI: https://doi.org/10.1631/FITEE.1601347

Keywords

CLC number

Navigation