Leveraging Deletion Neighborhoods and Trie for Efficient String Similarity Search and Join

  • Conference paper

Part of the book series: Lecture Notes in Computer Science (LNISA, volume 8870)

Abstract

String similarity search and join are primitive operations in databases and information retrieval, used to cope with poor data quality. Because of the high complexity of deletion neighborhoods, existing methods resort to hashing schemes to reduce the space requirement of the index. However, the hash collisions this introduces must be verified by costly edit distance computations. In this paper, we focus on achieving faster query speed with affordable memory consumption. We propose a novel method that leverages deletion neighborhoods and a trie to answer edit-distance-based string similarity queries efficiently. We use the trie to share common prefixes of deletion neighborhoods and propose a subtree merging optimization to reduce the index size. We then discuss index partition strategies and propose a bit-vector-based verification method to speed up queries. Experimental results show that our method outperforms state-of-the-art methods on real data.
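
To make the core idea concrete, the following is a minimal Python sketch of deletion-neighborhood search backed by a plain trie. It is an illustration under stated assumptions, not the paper's implementation: the names `deletion_neighborhood`, `trie_insert`, `trie_lookup`, `build_index`, and `search` are hypothetical, the trie is an uncompressed nested dictionary, and the subtree merging, index partitioning, and bit-vector verification mentioned in the abstract are not reproduced. Candidates are verified here with an ordinary dynamic-programming edit distance instead.

```python
from itertools import combinations

TERMINAL = "<ids>"  # multi-character key, so it cannot collide with a single-character edge


def deletion_neighborhood(s, k):
    """Return every string obtained by deleting at most k characters from s."""
    variants = set()
    for d in range(min(k, len(s)) + 1):
        for positions in combinations(range(len(s)), d):
            drop = set(positions)
            variants.add("".join(c for i, c in enumerate(s) if i not in drop))
    return variants


def edit_distance(a, b):
    """Classic row-by-row Levenshtein distance."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1, cur[j - 1] + 1, prev[j - 1] + (ca != cb)))
        prev = cur
    return prev[-1]


def trie_insert(root, key, value):
    """Walk or create the path for key; variants with a common prefix share nodes."""
    node = root
    for ch in key:
        node = node.setdefault(ch, {})
    node.setdefault(TERMINAL, set()).add(value)


def trie_lookup(root, key):
    """Return the set of source strings stored under key, or an empty set."""
    node = root
    for ch in key:
        node = node.get(ch)
        if node is None:
            return set()
    return node.get(TERMINAL, set())


def build_index(strings, k):
    """Index every deletion variant of every string in one trie."""
    root = {}
    for s in strings:
        for variant in deletion_neighborhood(s, k):
            trie_insert(root, variant, s)
    return root


def search(root, query, k):
    """Return all indexed strings within edit distance k of query."""
    candidates = set()
    for variant in deletion_neighborhood(query, k):
        candidates |= trie_lookup(root, variant)
    # A shared variant only bounds the distance by 2k, so each candidate is verified.
    return sorted(s for s in candidates if edit_distance(s, query) <= k)


index = build_index(["apple", "apply", "ample", "maple", "banana"], k=1)
print(search(index, "aple", 1))  # ['ample', 'apple', 'maple']
```

The filter is complete because two strings within edit distance k always share at least one variant reachable by at most k deletions from each side. The verification step is still needed because sharing a variant does not by itself guarantee a distance of at most k.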


Copyright information

© 2014 Springer International Publishing Switzerland

About this paper

Cite this paper

Cui, J., Meng, D., Chen, ZT. (2014). Leveraging Deletion Neighborhoods and Trie for Efficient String Similarity Search and Join. In: Jaafar, A., et al. Information Retrieval Technology. AIRS 2014. Lecture Notes in Computer Science, vol 8870. Springer, Cham. https://doi.org/10.1007/978-3-319-12844-3_1

  • DOI: https://doi.org/10.1007/978-3-319-12844-3_1

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-319-12843-6

  • Online ISBN: 978-3-319-12844-3

  • eBook Packages: Computer Science, Computer Science (R0)
