String Similarity Join with Different Thresholds

  • Conference paper
Knowledge Science, Engineering and Management (KSEM 2015)

Part of the book series: Lecture Notes in Computer Science (LNAI, volume 9403)

Abstract

String similarity join is an essential operation in many applications that need to find all similar string pairs from two given collections. Existing approaches use a uniform, predefined similarity threshold. In real applications, however, longer string pairs typically tolerate more typos, so it is necessary to apply variable thresholds to different strings instead of a constant one. We therefore propose a solution for string similarity joins with different similarity thresholds in one procedure. To support different similarity thresholds, we devise a similarity-aware index and an index probing technique. To the best of our knowledge, this is the first work to address the problem. Experimental results on real-world datasets show that our solution handles different similarity thresholds efficiently.
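
To make the problem statement concrete, the following is a minimal brute-force sketch of a similarity join in which the threshold depends on string length. It is not the similarity-aware index or probing technique of the paper, whose details are not given in the abstract. The choice of edit distance as the similarity measure, the helper `allowed_edits`, the `chars_per_edit` parameter, and the rule of taking the smaller of the two per-string thresholds for a pair are all assumptions made for illustration.

```python
def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance computed with the standard row-by-row DP."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # delete ca
                            curr[j - 1] + 1,      # insert cb
                            prev[j - 1] + cost))  # substitute or match
        prev = curr
    return prev[-1]


def allowed_edits(s: str, chars_per_edit: int = 5) -> int:
    """Hypothetical per-string threshold: one tolerated edit per
    `chars_per_edit` characters, so longer strings tolerate more typos."""
    return len(s) // chars_per_edit


def similarity_join(left, right, chars_per_edit: int = 5):
    """Nested-loop join: return all pairs (r, s) whose edit distance is
    within the pair threshold (assumed here: the smaller of the two
    per-string thresholds)."""
    results = []
    for r in left:
        for s in right:
            tau = min(allowed_edits(r, chars_per_edit),
                      allowed_edits(s, chars_per_edit))
            # Length filter: the distance is at least the length difference.
            if abs(len(r) - len(s)) > tau:
                continue
            if edit_distance(r, s) <= tau:
                results.append((r, s))
    return results


if __name__ == "__main__":
    left = ["kitten", "similarity join"]
    right = ["sitten", "similarty joins", "mitten"]
    for pair in similarity_join(left, right):
        print(pair)
```

In this sketch the short string "kitten" tolerates one edit while the 15-character "similarity join" tolerates three, so the pair ("similarity join", "similarty joins"), which is two edits apart, is reported even though a uniform threshold of one edit would miss it. The nested loop is quadratic in the collection sizes; an index-based approach such as the one the abstract describes is aimed at avoiding exactly that cost.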

Author information

Corresponding author

Correspondence to Chuitian Rong.

Copyright information

© 2015 Springer International Publishing Switzerland

About this paper

Cite this paper

Rong, C., Zhang, X. (2015). String Similarity Join with Different Thresholds. In: Zhang, S., Wirsing, M., Zhang, Z. (eds) Knowledge Science, Engineering and Management. KSEM 2015. Lecture Notes in Computer Science, vol 9403. Springer, Cham. https://doi.org/10.1007/978-3-319-25159-2_24

  • DOI: https://doi.org/10.1007/978-3-319-25159-2_24

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-319-25158-5

  • Online ISBN: 978-3-319-25159-2

  • eBook Packages: Computer Science, Computer Science (R0)
