String Joins with Synonyms

Song, Gwangho; Lee, Hongrae; Shim, Kyuseok; Park, Yoonjae; Kim, Wooyeol

doi:10.1007/978-3-030-59419-0_24

Part of the book series: Lecture Notes in Computer Science (LNISA, volume 12114)

Included in the following conference series:

International Conference on Database Systems for Advanced Applications

Abstract

String matching is a fundamental operation in many applications such as data integration, information retrieval and text mining. Since users express the same meaning in a variety of ways that are not textually similar, existing works have proposed variants of Jaccard similarity by using synonyms to consider semantics beyond textual similarities. However, they may produce a non-negligible number of false positives in some applications by employing set semantics and miss some true positives due to approximations. In this paper, we define new match relationships between a pair of strings under synonym rules and develop an efficient algorithm to verify the match relationships for a pair of strings. In addition, we propose two filtering methods to prune non-matching string pairs. We also develop join algorithms with synonyms based on the filtering methods and the match relationships. Experimental results with real-life datasets confirm the effectiveness of our proposed algorithms.
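As a rough illustration of the motivation above, the following Python sketch compares plain token-set Jaccard similarity with Jaccard computed after rewriting tokens with synonym rules. It is not the paper's match relationship or verification algorithm. The helper names (`tokens`, `jaccard`, `apply_rules`), the rule table `RULES`, and the example strings are all hypothetical, and the reading that set semantics over-match by ignoring token order is an assumption made only for this example.

```python
# Illustrative sketch only: plain Jaccard vs. Jaccard after synonym rewriting.
# RULES and all helper names are hypothetical, not taken from the paper.

def tokens(s):
    """Lowercase a string and split it on whitespace."""
    return s.lower().split()

def jaccard(a, b):
    """Jaccard similarity of two token lists under set semantics."""
    set_a, set_b = set(a), set(b)
    if not set_a and not set_b:
        return 1.0
    return len(set_a & set_b) / len(set_a | set_b)

# Hypothetical synonym rules: left-hand token sequence -> right-hand sequence.
RULES = [
    (("nyc",), ("new", "york", "city")),
    (("univ",), ("university",)),
]

def apply_rules(toks, rules):
    """Greedily rewrite a token list left to right with the first matching rule."""
    out = []
    i = 0
    while i < len(toks):
        for lhs, rhs in rules:
            if tuple(toks[i:i + len(lhs)]) == lhs:
                out.extend(rhs)
                i += len(lhs)
                break
        else:
            # No rule starts at position i, so keep the token unchanged.
            out.append(toks[i])
            i += 1
    return out

s = tokens("NYC Univ")
t = tokens("New York City University")

plain = jaccard(s, t)                          # no shared tokens
rewritten = jaccard(apply_rules(s, RULES), t)  # identical after rewriting

# Set semantics ignore token order, so two reordered names look identical.
reordered = jaccard(tokens("wang li"), tokens("li wang"))
```

Here the two strings share no tokens until the hypothetical rules are applied, while the reordered pair scores as a perfect match under set semantics even though the token sequences differ.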

Acknowledgements

This research was supported by the Next-Generation Information Computing Development Program through the National Research Foundation of Korea (NRF), funded by the Ministry of Science and ICT (NRF-2017M3C4A7063570).

Author information

Authors and Affiliations

Seoul National University, Seoul, South Korea
Gwangho Song, Kyuseok Shim, Yoonjae Park & Wooyeol Kim
Google, Mountain View, USA
Hongrae Lee

Corresponding author

Correspondence to Kyuseok Shim.

Editor information

Editors and Affiliations

Dankook University, Yongin, Korea (Republic of)
Yunmook Nah
Peking University, Haidian, China
Bin Cui
Sungkyunkwan University, Suwon, Korea (Republic of)
Sang-Won Lee
Department of Systems Engineering and En, The Chinese University of Hong Kong, Hong Kong, Hong Kong
Jeffrey Xu Yu
Kangwon National University, Chunchon, Korea (Republic of)
Yang-Sae Moon
Korea Advanced Institute of Science and, Daejeon, Korea (Republic of)
Steven Euijong Whang

About this paper

Cite this paper

Song, G., Lee, H., Shim, K., Park, Y., Kim, W. (2020). String Joins with Synonyms. In: Nah, Y., Cui, B., Lee, SW., Yu, J.X., Moon, YS., Whang, S.E. (eds) Database Systems for Advanced Applications. DASFAA 2020. Lecture Notes in Computer Science, vol 12114. Springer, Cham. https://doi.org/10.1007/978-3-030-59419-0_24

DOI: https://doi.org/10.1007/978-3-030-59419-0_24
Published: 22 September 2020
Publisher Name: Springer, Cham
Print ISBN: 978-3-030-59418-3
Online ISBN: 978-3-030-59419-0
eBook Packages: Computer Science (R0)
