Abstract
String similarity search and joins are primitive operations in databases and information retrieval for addressing poor data quality. Because of the high complexity of deletion neighborhoods, existing methods resort to hashing schemes to reduce the space requirement of the index. However, the hash collisions this introduces must be verified by costly edit distance computation. In this paper, we focus on achieving faster query speed with affordable memory consumption. We propose a novel method that leverages deletion neighborhoods and a trie to answer edit-distance-based string similarity queries efficiently. We use the trie to share common prefixes of deletion neighborhoods and propose a subtree merging optimization to reduce the index size. We then discuss index partition strategies and propose a bit-vector-based verification method to speed up queries. Experimental results show that our method outperforms state-of-the-art methods on real data.
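To make the core idea concrete, here is a minimal Python sketch of deletion neighborhoods stored in a trie. It is an illustration under stated assumptions, not the paper's implementation: the names `deletion_neighborhood`, `DeletionTrie`, and `edit_distance` are hypothetical, the subtree merging and index partition optimizations are omitted, and verification uses plain dynamic-programming edit distance rather than the bit-vector method the abstract mentions. It relies on the standard property that two strings within edit distance k share at least one string obtainable by deleting at most k characters from each.

```python
from itertools import combinations


def deletion_neighborhood(s, k):
    """Return every string obtained from s by deleting at most k characters."""
    variants = set()
    for d in range(min(k, len(s)) + 1):
        for positions in combinations(range(len(s)), d):
            removed = set(positions)
            variants.add("".join(c for i, c in enumerate(s) if i not in removed))
    return variants


def edit_distance(a, b):
    """Classic dynamic-programming Levenshtein distance (row by row)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1, cur[j - 1] + 1, prev[j - 1] + (ca != cb)))
        prev = cur
    return prev[-1]


class DeletionTrie:
    """Hypothetical index: a trie over the deletion neighborhoods of all strings."""

    def __init__(self, k):
        self.k = k
        self.root = {}

    def add(self, s):
        # Variants with a common prefix share trie nodes.
        for variant in deletion_neighborhood(s, self.k):
            node = self.root
            for ch in variant:
                node = node.setdefault(ch, {})
            # "$ids" is a multi-character key, so it cannot clash with a child edge.
            node.setdefault("$ids", set()).add(s)

    def search(self, q):
        candidates = set()
        for variant in deletion_neighborhood(q, self.k):
            node = self.root
            for ch in variant:
                node = node.get(ch)
                if node is None:
                    break
            else:
                candidates |= node.get("$ids", set())
        # A shared variant is necessary but not sufficient, so verify each candidate.
        return sorted(s for s in candidates if edit_distance(s, q) <= self.k)


index = DeletionTrie(k=1)
for word in ["trie", "tree", "tire", "join"]:
    index.add(word)
print(index.search("tree"))  # ['tree', 'trie']
```

In this example "tire" shares the variant "tre" with the query "tree" and so becomes a candidate, but its edit distance to the query is 2, so the verification step discards it. This is why a verification step is still needed even without hash collisions.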
Copyright information
© 2014 Springer International Publishing Switzerland
About this paper
Cite this paper
Cui, J., Meng, D., Chen, ZT. (2014). Leveraging Deletion Neighborhoods and Trie for Efficient String Similarity Search and Join. In: Jaafar, A., et al. Information Retrieval Technology. AIRS 2014. Lecture Notes in Computer Science, vol 8870. Springer, Cham. https://doi.org/10.1007/978-3-319-12844-3_1
DOI: https://doi.org/10.1007/978-3-319-12844-3_1
Publisher Name: Springer, Cham
Print ISBN: 978-3-319-12843-6
Online ISBN: 978-3-319-12844-3
eBook Packages: Computer Science (R0)