Leveraging Deletion Neighborhoods and Trie for Efficient String Similarity Search and Join

Cui, Jia; Meng, Dan; Chen, Zhong-Tao

doi:10.1007/978-3-319-12844-3_1

Jia Cui, Dan Meng & Zhong-Tao Chen

Conference paper


Part of the book series: Lecture Notes in Computer Science (LNISA, volume 8870)

Abstract

String similarity search and join are primitive operations in databases and information retrieval for addressing poor data quality. Because of the high complexity of deletion neighborhoods, existing methods resort to hashing schemes to reduce the space requirement of the index. However, the resulting hash collisions must be verified by costly edit distance computations. In this paper, we focus on achieving faster query speed with affordable memory consumption. We propose a novel method that leverages deletion neighborhoods and a trie to answer edit-distance-based string similarity queries efficiently. We use the trie to share common prefixes of deletion neighborhoods and propose a subtree merging optimization to reduce the index size. We then discuss index partition strategies and propose a bit-vector-based verification method to speed up queries. Experimental results show that our method outperforms state-of-the-art methods on real-world data.
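
To make the core idea concrete, the following is a minimal Python sketch of a deletion-neighborhood index stored in a trie. It is not the authors' implementation: the names `deletion_neighborhood`, `DeletionTrie`, and `edit_distance` are hypothetical, and the subtree merging, index partitioning, and bit-vector verification mentioned in the abstract are not reproduced. It relies only on the known property that two strings within edit distance k share at least one variant obtained by deleting at most k characters from each, so shared variants yield candidates that are then verified with a plain dynamic-programming edit distance.

```python
from itertools import combinations


def deletion_neighborhood(s, k):
    """Return every string obtainable from s by deleting at most k characters."""
    variants = set()
    for d in range(min(k, len(s)) + 1):
        for positions in combinations(range(len(s)), d):
            drop = set(positions)
            variants.add("".join(c for i, c in enumerate(s) if i not in drop))
    return variants


def edit_distance(a, b):
    """Classic row-by-row dynamic-programming Levenshtein distance."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                # delete ca
                           cur[j - 1] + 1,             # insert cb
                           prev[j - 1] + (ca != cb)))  # substitute or match
        prev = cur
    return prev[-1]


class _Node:
    """Trie node: child links plus the indexed strings whose variant ends here."""
    __slots__ = ("children", "strings")

    def __init__(self):
        self.children = {}
        self.strings = set()


class DeletionTrie:
    """Hypothetical index: deletion variants share common prefixes in one trie."""

    def __init__(self, k):
        self.k = k
        self.root = _Node()

    def insert(self, s):
        for variant in deletion_neighborhood(s, self.k):
            node = self.root
            for ch in variant:
                node = node.children.setdefault(ch, _Node())
            node.strings.add(s)

    def search(self, query):
        """Return indexed strings within edit distance k of query, sorted."""
        candidates = set()
        for variant in deletion_neighborhood(query, self.k):
            node = self.root
            for ch in variant:
                node = node.children.get(ch)
                if node is None:
                    break
            else:
                candidates |= node.strings
        # A shared variant only bounds the distance by 2k, so verify candidates.
        return sorted(s for s in candidates if edit_distance(s, query) <= self.k)


index = DeletionTrie(k=1)
for word in ["trie", "tree", "tried", "join"]:
    index.insert(word)
print(index.search("trie"))  # ['tree', 'trie', 'tried']
```

Storing the variants in a trie rather than a hash table avoids hash collisions, but a shared variant still does not guarantee that two strings are within distance k, which is why the sketch keeps a verification step.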



Author information

Authors and Affiliations

Institute of Computing Technology, Chinese Academy of Sciences, Beijing, China
Jia Cui & Zhong-Tao Chen
Institute of Information Engineering, Chinese Academy of Sciences, Beijing, China
Dan Meng
University of Chinese Academy of Sciences, Beijing, China
Jia Cui & Zhong-Tao Chen


Editor information

Editors and Affiliations

Institute of Visual Informatics, Universiti Kebangsaan Malaysia, 43600, Bangi, Selangor, Malaysia
Azizah Jaafar
Institute of Visual Informatics, Universiti Kebangsaan Malaysia, 43600, Bangi, Selangor, Malaysia
Nazlena Mohamad Ali
Faculty of Information Science and Technology, Universiti Kebangsaan Malaysia, 43600, Bangi, Selangor, Malaysia
Shahrul Azman Mohd Noah
Insight Centre for Data Analytics, Dublin City University, Glasnevin, 9, Dublin, Ireland
Alan F. Smeaton
Information Systems, Queensland University of Technology, 4001, Brisbane, QLD, Australia
Peter Bruza
Faculty of Computer and Mathematical Sciences, Universiti Teknologi MARA, 40450, Shah Alam, Selangor, Malaysia
Zainab Abu Bakar & Nursuriati Jamil
Cyber Security Center, Universiti Pertahanan Nasional Malaysia, Kem Sungai Besi, 57000, Kuala Lumpur, Malaysia
Tengku Mohd Tengku Sembok


About this paper

Cite this paper

Cui, J., Meng, D., Chen, ZT. (2014). Leveraging Deletion Neighborhoods and Trie for Efficient String Similarity Search and Join. In: Jaafar, A., et al. Information Retrieval Technology. AIRS 2014. Lecture Notes in Computer Science, vol 8870. Springer, Cham. https://doi.org/10.1007/978-3-319-12844-3_1

DOI: https://doi.org/10.1007/978-3-319-12844-3_1
Publisher Name: Springer, Cham
Print ISBN: 978-3-319-12843-6
Online ISBN: 978-3-319-12844-3
eBook Packages: Computer Science, Computer Science (R0)
