Skip to main content
Log in

String similarity search and join: a survey

  • Review Article
  • Published:
Frontiers of Computer Science Aims and scope Submit manuscript

Abstract

String similarity search and join are two important operations in data cleaning and integration, which extend traditional exact search and exact join operations in databases by tolerating the errors and inconsistencies in the data. They have many real-world applications, such as spell checking, duplicate detection, entity resolution, and webpage clustering. Although these two problems have been extensively studied in the recent decade, there is no thorough survey. In this paper, we present a comprehensive survey on string similarity search and join. We first give the problem definitions and introduce widely-used similarity functions to quantify the similarity. We then present an extensive set of algorithms for string similarity search and join. We also discuss their variants, including approximate entity extraction, type-ahead search, and approximate substring matching. Finally, we provide some open datasets and summarize some research challenges and open problems.

This is a preview of subscription content, log in via an institution to check access.

Access this article

Price excludes VAT (USA)
Tax calculation will be finalised during checkout.

Instant access to the full article PDF.

Similar content being viewed by others

References

  1. Zhang C J, Chen L, Tong Y, Liu Z. Cleaning uncertain data with a noisy crowd. In: Proceedings of the 31st IEEE International Conference on Data Engineering. 2015, 6–17

    Google Scholar 

  2. Papotti P, Naumann F, Kruse S. Estimating data integration and cleaning effort. In: Proceedings of International Conference on Extending Database Technology. 2015, 61–72

    Google Scholar 

  3. Chu X, Morcos J, Ilyas I F, Ouzzani M, Papotti P, Tang N, Ye Y. KATARA: a data cleaning system powered by knowledge bases and crowdsourcing. In: Proceedings of the 2015 ACM SIGMOD International Conference on Management of Data, 2015, 1247–1261

    Google Scholar 

  4. Verma P, Kesswani N. Web usage mining framework for data cleaning and IP address identification. 2014, arXiv: 1408.5460v1

    Google Scholar 

  5. Maccio V J, Chiang F, Down D G. Models for distributed, large scale data cleaning. In: Proceedings of Pacific–Asia Conference on Advances in Knowledge Discovery and Data Mining. 2014, 369–380

    Google Scholar 

  6. Almeida R, Oliveira P, Braga L, Barroso J. Ontologies for reusing data cleaning knowledge. In: Proceedings of International Catholic Stewardship Council. 2012, 238–241

    Google Scholar 

  7. Fan J, Li G, Zhou L, Chen S, Hu J. SEAL: spatio–textual similarity search. The Proceedings of the VLDB Endowment, 2012, 5(9): 824–835

    Article  Google Scholar 

  8. Yu M, Li G, Wang T, Feng J, Gong Z. Efficient filtering algorithms for location–aware publish/subscribe. IEEE Transactions on Knowledge and Data Engineering, 2015, 27(4): 950–963

    Article  Google Scholar 

  9. Li G, Ooi B C, Feng J, Wang J, Zhou L. EASE: an effective 3–in–1 keyword search method for unstructured, semi–structured and structured data. In: Proceedings of ACM SIGMOD International Conference on Management of Data. 2008, 903–914

    Google Scholar 

  10. Badgeley M A, Sealfon S C, Chikina M D. Hybrid Bayesian–rank integration approach improves the predictive power of genomic dataset aggregation. Bioinformatics, 2015, 31(2): 209–215

    Article  Google Scholar 

  11. Lui T, Tsui N, Chan L W, Wong C, Siu P, Yung B Y M. DECODE: an integrated differential co–expression and differential expression analysis of gene expression data. BMC Bioinformatics, 2015, 16: 182

    Article  Google Scholar 

  12. Arfaoui N, Akaichi J. Automating schema integration technique case study: generating data warehouse schema from data mart schemas. Communications in Computer and Information Science, 2015, 521: 200–209

    Article  Google Scholar 

  13. Nastase V, Fahrni A. Coarse–grained cross–lingual alignment of comparable texts with topic models and encyclopedic knowledge. 2014, arXiv: 1411.7820v1

    Google Scholar 

  14. Srikantaiah K C, Suraj M, Venugopal K R, Patnaik L.M. Similarity based dynamic web data extraction and integration system from search engine result pages for web content mining. ACEEE International Journal on Information Technology, 2013, 3(1): 42–49

    Google Scholar 

  15. Cevahir A. Scalable textual similarity search on large document collections through random indexing and K–means clustering. In: Proceedings of Pacific–Asia Conference on Advances in Knowledge Discovery and Data Mining. 2014, 231–238

    Google Scholar 

  16. Yin J, Wang J. A dirichlet multinomial mixture model–based approach for short text clustering. In: Proceedings of ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. 2014, 233–242

    Google Scholar 

  17. Dai Z, Sun A, Liu X. Crest: cluster–based representation enrichment for short text classification. In: Proceedings of Pacific–Asia Conference on Advances in Knowledge Discovery and Data Mining. 2013, 256–267

    Chapter  Google Scholar 

  18. SureshReddy G, Rajinikanth T V, Rao A A. Design and analysis of novel similarity measure for clustering and classification of high dimensional text documents. In: Proceedings of the 15th International Conference on Computer Systems and Technologies. 2014, 194–201

    Google Scholar 

  19. Liu S, Li G, Feng J. A prefix–filter based method for spatio–textual similarity join. IEEE Transactions on Knowledge and Data Engineering, 2014, 26(10): 2354–2367

    Article  Google Scholar 

  20. Wang J, Li G, Kraska T, Franklin M J, Feng J. Leveraging transitive relations for crowdsourced joins. In: Proceedings of ACM SIGMOD International Conference on Management of Data. 2013, 229–240

    Google Scholar 

  21. Wang J, Li G, Yu J X, Feng J. Entity matching: how similar is similar. The Proceedings of the VLDB Endowment, 2011, 4(10): 622–633

    Article  Google Scholar 

  22. Chaudhuri S, Ganjam K, Ganti V, Motwani R. Robust and efficient fuzzy match for online data cleaning. In: Proceedings of ACM SIGMOD international conference on Management of data. 2003, 313–324

    Google Scholar 

  23. Wang J, Li G, Feng J. Fast–join: an efficient method for fuzzy token matching based string similarity join. In: Proceedings of the 27th IEEE International Conference on Data Engineering. 2011, 458–469

    Google Scholar 

  24. Wang J, Li G, Feng J. Extending string similarity join to tolerant fuzzy token matching. ACM Transactions on Database Systems, 2014, 39(1): 7

    Article  MathSciNet  MATH  Google Scholar 

  25. Nandi A, Jagadish H V. Effective phrase prediction. In: Proceedings of the 33rd International Conference on Very Large Databases, 2007, 219–230

    Google Scholar 

  26. Ji S, Li G, Li C, Feng J. Efficient interactive fuzzy keyword search. In: Proceedings of the 18th International Conference onWorld Wide Web. 2009, 371–380

    Chapter  Google Scholar 

  27. Chaudhuri S, Kaushik R. Extending autocompletion to tolerate errors. In: Proceedings of ACM SIGMOD International Conference on Management of Data. 2009, 707–718

    Chapter  Google Scholar 

  28. Zheng Y, Bao Z, Shou L, Tung A K. MESA: a map service to support fuzzy type–ahead search over geo–textual data. Proceedings of the VLDB Endowment, 2014, 7(13): 1545–1548

    Article  Google Scholar 

  29. Li G, Ji S, Li C, Feng J. Efficient type–ahead search on relational data: a TASTIER approach. In: Proceedings of ACM SIGMOD International Conference on Management of Data. 2009, 695–706

    Chapter  Google Scholar 

  30. Kavila S D, Ravva R, Bandaru R. Fuzzy type–ahead keyword search in RDF data. In: Proceedings of International Conference on Frontiers of Intelligent Computing: Theory and Applications. 2013, 67–73

    Google Scholar 

  31. Chandel A, Nagesh P C, Sarawagi S. Efficient batch top–k search for dictionary–based entity recognition. In: Proceedings of the 22nd IEEE International Conference on Data Engineering. 2006, 28

    Google Scholar 

  32. Cowan B, Zethelius S, Luk B, Baras T, Ukarde P, Zhang D. Named entity recognition in travel–related search queries. In: Proceedings of Association for the Advancement of Artificial Intelligence Conference, 2015, 3935–3941

    Google Scholar 

  33. Tang Z, Jiang L, Yang L, Li K, Li K. CRFs based parallel biomedical named entity recognition algorithm employing MapReduce framework. Cluster Computing, 2015, 18(2): 493–505

    Article  Google Scholar 

  34. Lu W, Fung G P C, Du X, Zhou X, Chen L, Deng K. Approximate entity extraction in temporal databases. World Wide Web, 2011, 14(2): 157–186

    Article  Google Scholar 

  35. Wang W, Xiao C, Lin X, Zhang C. Efficient approximate entity extraction with edit distance constraints. In: Proceedings of ACM SIGMOD International Conference on Management of Data. 2009, 759–770

    Chapter  Google Scholar 

  36. Deng D, Li G, Feng J. An efficient trie–based method for approximate entity extraction with edit–distance constraints. In: Proceedings of the 28th IEEE International Conference on Data Engineering. 2012, 762–773

    Google Scholar 

  37. Nakajima D, Mitsui Y, Samejima M, Akiyoshi M. An information extraction method from different structural web sites by word distances between a user instantiated label and similar entity. In: Proceedings of International Conference on Data Mining Workshops, 2011, 1177–1182

    Google Scholar 

  38. Deng D, Li G, Feng J, Duan Y, Gong Z. A unified framework for approximate dictionary–based entity extraction. The International Journal on Very Large Data Bases, 2015, 24(1): 143–167

    Article  Google Scholar 

  39. Kim Y, Shim K. Efficient top–k algorithms for approximate substring matching. In: Proceedings of ACMSIGMOD International Conference on Management of Data. 2013, 385–396

    Google Scholar 

  40. Tang N, Sidirourgos L, Boncz P A. Space–economical partial gram indices for exact substring matching. In: Proceedings of the 18th ACM International Conference on Information and Knowledge Management. 2009, 285–294

    Google Scholar 

  41. Ge T, Li Z. Approximate substring matching over uncertain strings. Proceedings of the VLDB Endowment, 2011, 4(11): 772–782

    Google Scholar 

  42. Warren R H, Tompa F W. Multi–column substring matching for database schema translation. In: Proceedings of the 32nd International Conference on Very Large Databases. 2006, 331–342

    Google Scholar 

  43. Jokinen P, Ukkonen E. Two algorithms for approximate string matching in static texts. In: Proceedings of the 16th International Symposium on Mathematical Foundations of Computer Science. 1991, 240–248

    Google Scholar 

  44. Li C, Wang B, Yang X. VGRAM: improving performance of approximate queries on string collections using variable–length grams. In: Proceedings of the 33rd International Conference on Very Large Data Bases. 2007, 303–314

    Google Scholar 

  45. Yang X, Wang B, Li C. Cost–based variable–length–gram selection for string collections to support approximate queries efficiently. In: Proceedings of ACM SIGMOD International Conference on Management of Data. 2008, 353–364

    Google Scholar 

  46. Wang J, Li G, Deng D, Zhang Y, Feng J. Two birds with one stone: an efficient hierarchical framework for top–k and threshold–based string similarity search. In: Proceedings of International Conference on Data Engineering. 2015, 519–530

    Google Scholar 

  47. Deng D, Li G, Feng J, Li W S. Top–k string similarity search with edit distance constraints. In: Proceedings of the 29th IEEE International Conference on Data Engineering. 2013, 925–936

    Google Scholar 

  48. Wang X, Ding X, Tung A K H., Zhang Z. Efficient and effective KNN sequence search with approximate n–grams. Proceedings of the VLDB Endowment, 2013, 7(1): 1–12

    Article  Google Scholar 

  49. Fagin R, Lotem A, Naor M. Optimal aggregation algorithms for middleware. Journal of Computer and System Sciences, 2003, 66(4): 614–656

    Article  MathSciNet  MATH  Google Scholar 

  50. Li C, Lu J, Lu Y. Efficient merging and filtering algorithms for approximate string searches. In: Proceedings of the 24th IEEE International Conference on Data Engineering. 2008, 257–266

    Google Scholar 

  51. Siragusa E, Weese D, Reinert K. Scalable string similarity search/join with approximate seeds and multiple backtracking. In: Proceedings of EDBT/ICDT Joint Conference. 2013, 370–374

    Google Scholar 

  52. Liu X, Li G, Feng J, Zhou L. Effective indices for efficient approximate string search and similarity join. In: Proceedings of the 9th IEEE International Conference on Web–Age Information Management. 2008, 127–134

    Google Scholar 

  53. Cui J, Meng D, Chen Z. Leveraging deletion neighborhoods and trie for efficient string similarity search and join. Lecture Notes in Computer Science, 2014, 8870: 1–13

    Article  Google Scholar 

  54. Gravano L, Ipeirotis P G, Jagadish H V, Koudas N, Muthukrishnan S, Srivastava D. Approximate string joins in a database (almost) for free. In: Proceedings of International Conference on Very Large Data Bases. 2001, 491–500

    Google Scholar 

  55. Chaudhuri S, Ganti V, Kaushik R. A primitive operator for similarity joins in data cleaning. In: Proceedings of the 22nd IEEE International Conference on Data Engineering. 2006, 5

    Google Scholar 

  56. Qin J, Wang W, Lu Y, Xiao C, Lin X. Efficient exact edit similarity query processing with the asymmetric signature scheme. In: Proceedings of ACM SIGMOD International Conference on Management of Data. 2011, 1033–1044

    Google Scholar 

  57. Rong C, Lu W, Wang X, Du X, Chen Y, Tung A K H. Efficient and scalable processing of string similarity join. IEEE Transactions on Knowledge and Data Engineering, 2013, 25(10): 2217–2230

    Article  Google Scholar 

  58. Xiao C, Wang W, Lin X. Ed–Join: an efficient algorithm for similarity joins with edit distance constraints. Proceedings of the VLDB Endowment, 2008, 1(1): 933–944

    Article  Google Scholar 

  59. Xiao C, Wang W, Lin X, Yu J X. Efficient similarity joins for near duplicate detection. In: Proceedings of International World Wide Web Conference. 2008, 131–140

    Google Scholar 

  60. Xiao C, Wang W, Lin X, Yu J X, Wang G. Efficient similarity joins for near–duplicate detection. ACM Transactions on Database Systems, 2011, 36(3): 15

    Article  Google Scholar 

  61. Wang W, Qin J, Xiao C, Lin X, Shen H T. VChunkJoin: an efficient algorithm for edit similarity joins. IEEE Transactions on Knowledge and Data Engineering, 2013, 25(8): 1916–1929

    Article  Google Scholar 

  62. Qin J, Wang W, Xiao C, Lu Y, Lin X, Wang H. Asymmetric signature schemes for efficient exact edit similarity query processing. ACM Transactions on Database Systems, 2013, 38(3): 16

    Article  MathSciNet  MATH  Google Scholar 

  63. Wang J, Li G, Feng J. Can we beat the prefix filtering?: an adaptive framework for similarity join and search. In: Proceedings of ACM SIGMOD International Conference on Management of Data. 2012, 85–96

    Google Scholar 

  64. Deng D, Li G, Feng J. A pivotal prefix based filtering algorithm for string similarity search. In: Proceedings of ACM SIGMOD International Conference on Management of Data. 2014, 673–684

    Google Scholar 

  65. Li G, Deng D, Wang J, Feng J. PASS–JOIN: a partition–based method for similarity joins. Proceedings of the VLDB Endowment, 2011, 5(3): 253–264

    Article  Google Scholar 

  66. Li G, Deng D, Feng J. A partition–based method for string similarity joins with edit–distance constraints. ACM Transactions on Database Systems, 2013, 38(2): 9

    Article  MathSciNet  MATH  Google Scholar 

  67. Li G, He J, Deng D, Li J. Efficient similarity join and search on multiattribute data. In: Proceedings of ACM SIGMOD International Conference on Management of Data. 2015, 1137–1151

    Google Scholar 

  68. Ciaccia P, Patella M, Zezula P. M–tree: an efficient access method for similarity search in metric spaces. In: Proceedings of International Conference on Very Large Databases. 1997, 426–435

    Google Scholar 

  69. Aßfalg J, Borgwardt K M, Kriegel H P. 3D String: a feature string kernel for 3D object classification on voxelized data. In: Proceedings of the 15th ACM International Conference on Information and Knowledge Management. 2006, 198–207

    Google Scholar 

  70. Bartolini I, Ciaccia P, Patella M. String matching with metric trees using an approximate distance. In: Proceedings of 9th International Symposium on String Processing and Information Retrieval. 2002, 271–283

    Chapter  Google Scholar 

  71. Wang J, Feng J, Li G. Trie–join: efficient trie–based string similarity joins with edit–distance constraints. Proceedings of the VLDB Endowment, 2013, (1–2): 1219–1230

    Google Scholar 

  72. Feng J, Wang J, Li G. Trie–join: a trie–based method for efficient string similarity joins. The International Journal on Very Large Data Bases, 2012, 21(4): 437–461

    Article  Google Scholar 

  73. Arasu A, Ganti V, Kaushik R. Efficient exact set–similarity joins. In: Proceedings of the 32nd International Conference on Very Large Data Bases. 2006, 918–929

    Google Scholar 

  74. Xiao C, Wang W, Lin X, Shang H. Top–k set similarity joins. In: Proceedings of the 25th IEEE International Conference on Data Engineering. 2009, 916–927

    Google Scholar 

  75. Zhang Z, Hadjieleftheriou M, Ooi B C, Srivastava D. Bed–tree: an allpurpose index structure for string similarity search based on edit distance. In: Proceedings of ACM SIGMOD International Conference on Management of Data. 2010, 915–926

    Google Scholar 

  76. Deng D, Li G, Hao S, Wang J, Feng J. Massjoin: a MapReduce–based method for scalable string similarity joins. In: Proceedings of the 30th IEEE International Conference on Data Engineering. 2014, 340–351

    Google Scholar 

  77. Afrati F N, Sarma A D, Menestrina D, Parameswaran A G, Ullman J D. Fuzzy joins using MapReduce. In: Proceedings of the 28th IEEE International Conference on Data Engineering. 2012, 498–509

    Google Scholar 

  78. Vernica R, CareyMJ, Li C. Efficient parallel set–similarity joins using MapReduce. In: Proceedings of ACM SIGMOD International Conference on Management of Data. 2010, 495–506

    Google Scholar 

  79. Metwally A, Faloutsos C. V–SMART–join: a scalable MapReduce framework for all–pair similarity joins of multisets and vectors. Proceedings of the VLDB Endowment, 2012, 5(8): 704–715

    Article  Google Scholar 

  80. Deng D, Jiang Y, Li G, Li J, Yu C. Scalable column concept determination for web tables using large knowledge bases. Proceedings of the VLDB Endowment, 2013, 6(13): 1606–1617

    Article  Google Scholar 

  81. Li G, Wang J, Li C, Feng J. Supporting efficient top–k queries in typeahead search. In: Proceedings of the 35th International ACM SIGIR Conference on Research and Development in Information Retrieval. 2012, 355–364

    Google Scholar 

  82. Li G, Ji S, Li C, Feng J. Efficient fuzzy full–text type–ahead search. The International Journal on Very Large Data Bases, 2011, 20(4): 617–640

    Article  Google Scholar 

  83. Xiao C, Qin J, Wang W, Ishikawa Y, Tsuda K, Sadakane K. Efficient error–tolerant query autocompletion. Proceedings of the VLDB Endowment, 2013, 6(6): 373–384

    Article  Google Scholar 

  84. Li G, Feng J, Li C. Supporting search–as–you–type using SQL in databases. IEEE Transactions on Knowledge and Data Engineering, 2013, 25(2): 461–475

    Article  Google Scholar 

  85. Li G, Deng D, Feng J. Faerie: efficient filtering algorithms for approximate dictionary–based entity extraction. In: Proceedings of ACM SIGMOD International Conference on Management of Data. 2011, 529–540

    Google Scholar 

  86. Li G, Hu J, Feng J, Tan K. Effective location identification from microblogs. In: Proceedings of the 30th IEEE International Conference on Data Engineering. 2014, 880–891

    Google Scholar 

  87. Ukkonen E. Approximate string matching with q–grams and maximal matches. Theoretical Computer Science, 1992, 92(1): 191–211

    Article  MathSciNet  MATH  Google Scholar 

  88. Navarro G, Baeza–Yates R A, Sutinen E, Tarhio J. Indexing Methods for Approximate String Matching. IEEE Data Engineering Bulletin, 2001, 24(4): 19–27

    Google Scholar 

  89. Jiang Y, Li G, Feng J, Li W. String similarity joins: an experimental evaluation. Proceedings of the VLDB Endowment, 2014, 7(8): 625–636

    Article  Google Scholar 

  90. Jiang Y, Deng D, Wang J, Li G, Feng J. Efficient parallel partitionbased algorithms for similarity search and join with edit distance constraints. In: Proceedings of the Joint EDBT/ICDT 2013 Workshops. 2013, 341–348

    Chapter  Google Scholar 

  91. Wandelt S, Deng D, Gerdjikov S, Mishra S, Mitankin P, Patil M, Siragusa E, Tiskin A, Wang W, Wang J, Leser U. State–of–the–art in string similarity search and join. ACM SIGMOD Record, 2014, 43(1): 64–76

    Article  Google Scholar 

Download references

Author information

Authors and Affiliations

Authors

Corresponding author

Correspondence to Guoliang Li.

Additional information

Minghe Yu is currently a PhD candidate in the Department of Computer Science, Tsinghua University, China. She obtained her Bachelor degree in Department of Computer Science, Northeast University, China. Her research interests include data integration and spatio-textual data query.

Guoliang Li is an associate professor in the Department of Computer Science, Tsinghua University, China. He received his PhD in computer Science from Tsinghua University in 2009. He is an Awardee of the NSFC Excellent Young Scholars Program in 2014. His research interests include data cleaning and integration, spatial databases, and crowdsourcing.

Dong Deng is a PhD candidate in the Department of Computer Science, Tsinghua University, China. He obtained his Bachelor degree in Department of Computer Science, Beihang University, China. His research interests include data cleaning and integration, and database usability.

Jianhua Feng received his BS,MS and PhD degrees in computer science from Tsinghua University, China. He is currently working as a professor in Department of Computer Science, Tsinghua University. His main research interests include large-scale data management and analysis.

Electronic supplementary material

Rights and permissions

Reprints and permissions

About this article

Check for updates. Verify currency and authenticity via CrossMark

Cite this article

Yu, M., Li, G., Deng, D. et al. String similarity search and join: a survey. Front. Comput. Sci. 10, 399–417 (2016). https://doi.org/10.1007/s11704-015-5900-5

Download citation

  • Received:

  • Accepted:

  • Published:

  • Issue Date:

  • DOI: https://doi.org/10.1007/s11704-015-5900-5

Keywords

Navigation