String similarity join with different similarity thresholds based on novel indexing techniques

Rong, Chuitian; Silva, Yasin N.; Li, Chunqing

doi:10.1007/s11704-016-5231-1

Research Article
Published: 11 October 2016

Volume 11, pages 307–319 (2017)

Frontiers of Computer Science

Chuitian Rong¹,
Yasin N. Silva² &
Chunqing Li¹

Abstract

String similarity join is an essential operation in many applications that need to find all similar string pairs from two given collections. A quantitative way to determine whether two strings are similar is to compute their similarity using a given similarity function. The string pairs whose similarity is above a certain threshold are regarded as results. The current approach to solving the similarity join problem is to use a single threshold value. There are, however, several scenarios that require support for multiple thresholds, for instance, when the dataset includes strings of various lengths. In this scenario, longer string pairs typically tolerate many more typos than shorter ones. Therefore, we propose a solution for string similarity joins that supports different similarity thresholds in a single operator. To support different thresholds, we devise two novel indexing techniques: partition based indexing and similarity aware indexing. To utilize the new indices and improve the join performance, we propose new filtering methods and index probing techniques. To the best of our knowledge, this is the first work that addresses this problem. Experimental results on real-world datasets show that our solution performs efficiently while providing a more flexible threshold specification.
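To make the problem statement concrete, the following is a minimal sketch of a similarity join in which the threshold depends on string length. It is not the indexing-based method described in the abstract: it is a naive nested-loop baseline, and it assumes edit distance as the similarity function, which the abstract does not specify. The names `length_threshold` and `naive_multi_threshold_join`, as well as the rule of one allowed edit per five characters, are hypothetical and chosen only for illustration.

```python
from typing import Callable, List, Tuple


def edit_distance(a: str, b: str) -> int:
    """Classic dynamic-programming Levenshtein distance."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # delete ca
                            curr[j - 1] + 1,      # insert cb
                            prev[j - 1] + cost))  # substitute or match
        prev = curr
    return prev[-1]


def length_threshold(length: int) -> int:
    """Hypothetical rule: allow one edit per five characters."""
    return length // 5


def naive_multi_threshold_join(
    left: List[str],
    right: List[str],
    tau: Callable[[int], int] = length_threshold,
) -> List[Tuple[str, str]]:
    """Return all pairs (r, s) whose edit distance is within tau(max length).

    Assumption: the threshold for a pair is derived from the longer string.
    """
    results = []
    for r in left:
        for s in right:
            allowed = tau(max(len(r), len(s)))
            # Length filter: the length difference is a lower bound
            # on the edit distance, so such pairs can be skipped.
            if abs(len(r) - len(s)) > allowed:
                continue
            if edit_distance(r, s) <= allowed:
                results.append((r, s))
    return results


if __name__ == "__main__":
    left = ["join", "similarity", "threshold"]
    right = ["joim", "similarty", "treshold", "index"]
    print(naive_multi_threshold_join(left, right))
```

With the length-dependent rule, the short pair ("join", "joim") is rejected because strings of length four tolerate no edits, while the longer pairs with a single typo are accepted. Passing a constant function such as `lambda n: 1` as `tau` reproduces the single-threshold behavior that the abstract contrasts against.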

Acknowledgements

This work was supported by China Scholarship Council and the National Natural Science Foundation of China (Grant Nos. 61402329 and 51378350).

Author information

Authors and Affiliations

School of Computer Science & Software Engineering, Tianjin Polytechnic University, Tianjin, 300387, China
Chuitian Rong & Chunqing Li
School of Mathematical & Natural Sciences, Arizona State University, Tempe, AZ, 85281, USA
Yasin N. Silva

Corresponding author

Correspondence to Chuitian Rong.

Additional information

Chuitian Rong is an associate professor at Tianjin Polytechnic University, China. He received his PhD degree from Renmin University of China, China in 2013. His research interests are database systems, information retrieval, and big data analysis.

Yasin N. Silva is an associate professor of applied computing in the School of Mathematical & Natural Sciences at Arizona State University, USA. He received his PhD (2010) and MS (2006) in computer science from Purdue University, USA and his BS (2000) in computer engineering from the Pontificia Universidad Catolica, Peru. Yasin’s research areas deal with data management systems and privacy preservation in general. More specifically, he has been working in the areas of query processing and optimization, privacy assurance in database systems, big data management systems, scientific database systems, and the integration of new data processing technologies into the computing curricula.

Chunqing Li is a professor at Tianjin Polytechnic University, China. His research interests are database systems and applications, big data analysis, and network management and applications.

Electronic supplementary material

Supplementary material, approximately 202 KB.

About this article

Cite this article

Rong, C., Silva, Y.N. & Li, C. String similarity join with different similarity thresholds based on novel indexing techniques. Front. Comput. Sci. 11, 307–319 (2017). https://doi.org/10.1007/s11704-016-5231-1

Received: 12 June 2015
Accepted: 01 February 2016
Published: 11 October 2016
Issue Date: April 2017
DOI: https://doi.org/10.1007/s11704-016-5231-1
