String Joins with Synonyms

  • Conference paper
Database Systems for Advanced Applications (DASFAA 2020)

Part of the book series: Lecture Notes in Computer Science (LNISA, volume 12114)

Abstract

String matching is a fundamental operation in many applications such as data integration, information retrieval and text mining. Since users express the same meaning in a variety of ways that are not textually similar, existing works have proposed variants of Jaccard similarity by using synonyms to consider semantics beyond textual similarities. However, they may produce a non-negligible number of false positives in some applications by employing set semantics and miss some true positives due to approximations. In this paper, we define new match relationships between a pair of strings under synonym rules and develop an efficient algorithm to verify the match relationships for a pair of strings. In addition, we propose two filtering methods to prune non-matching string pairs. We also develop join algorithms with synonyms based on the filtering methods and the match relationships. Experimental results with real-life datasets confirm the effectiveness of our proposed algorithms.
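The contrast the abstract draws between textual similarity and synonym-aware similarity under set semantics can be illustrated with a small sketch. This is not the match relationship or verification algorithm defined in the paper; it is a minimal, assumed variant in which synonym rules are applied greedily to token spans before a plain Jaccard comparison. The rule table `RULES` and the function names `apply_rules`, `jaccard`, and `synonym_jaccard` are hypothetical and introduced only for illustration.

```python
def apply_rules(tokens, rules):
    """Greedily rewrite token spans with synonym rules (lhs tuple -> rhs tuple).

    Assumption: the longest matching left-hand side wins at each position.
    """
    out, i = [], 0
    max_len = max((len(lhs) for lhs in rules), default=1)
    while i < len(tokens):
        for n in range(min(max_len, len(tokens) - i), 0, -1):
            span = tuple(tokens[i:i + n])
            if span in rules:
                out.extend(rules[span])
                i += n
                break
        else:
            # No rule matched at this position: keep the token unchanged.
            out.append(tokens[i])
            i += 1
    return out


def jaccard(a, b):
    """Plain Jaccard similarity over token sets (order and duplicates ignored)."""
    a, b = set(a), set(b)
    return len(a & b) / len(a | b) if a | b else 1.0


def synonym_jaccard(s, t, rules):
    """Jaccard similarity after rewriting both strings with the synonym rules."""
    return jaccard(apply_rules(s.split(), rules), apply_rules(t.split(), rules))


# Hypothetical synonym rules, for illustration only.
RULES = {("intl",): ("international",), ("conf",): ("conference",)}

s = "intl conf on data"
t = "international conference on data"
print(jaccard(s.split(), t.split()))   # textual similarity only
print(synonym_jaccard(s, t, RULES))    # after applying the synonym rules
# Set semantics ignores token order, so a shuffled string scores the same:
print(synonym_jaccard("data on conf intl", t, RULES))
```

In this sketch the shuffled string receives the same score as the correctly ordered one, which is one way a set-based measure can admit pairs that a sequence-aware match relationship might reject.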

Acknowledgements

This research was supported by Next-Generation Information Computing Development Program through the National Research Foundation of Korea (NRF) funded by the Ministry of Science and ICT (NRF-2017M3C4A7063570).

Author information

Corresponding author

Correspondence to Kyuseok Shim.

Copyright information

© 2020 Springer Nature Switzerland AG

About this paper

Cite this paper

Song, G., Lee, H., Shim, K., Park, Y., Kim, W. (2020). String Joins with Synonyms. In: Nah, Y., Cui, B., Lee, SW., Yu, J.X., Moon, YS., Whang, S.E. (eds) Database Systems for Advanced Applications. DASFAA 2020. Lecture Notes in Computer Science, vol 12114. Springer, Cham. https://doi.org/10.1007/978-3-030-59419-0_24

  • DOI: https://doi.org/10.1007/978-3-030-59419-0_24

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-030-59418-3

  • Online ISBN: 978-3-030-59419-0

  • eBook Packages: Computer Science, Computer Science (R0)
