Abstract
Locality Sensitive Hashing is a well-known technique for finding similar texts; it has been applied to plagiarism detection, mirror page identification and identifying the original source of a news article. In this paper we show how Locality Sensitive Hashing can be applied to identify misspelled personal names (first name, middle name and last name) or near duplicates. In our case, because the texts are so short, measuring the similarity of the names with two similarity functions (the Jaccard similarity and the full Damerau-Levenshtein distance) gave better results than using a single one. All the experimental work was carried out using the statistical software R and the libraries textreuse and stringdist.
References
Chollampatt S, Ng HT (2018) A multilayer convolutional encoder-decoder neural network for grammatical error correction. In: Proceedings of the Association for the Advancement of Artificial Intelligence. New Orleans, Louisiana, USA
Csardi G, Nepusz T (2006) The igraph software package for complex network research. Int J Complex Syst 1695:1–9
Damerau FJ (1964) A technique for computer detection and correction of spelling errors. Commun ACM 7(3):171–176
Datar M, Immorlica N, Indyk P, Mirrokni VS (2004) Locality-sensitive hashing scheme based on P-stable distributions. In: Proceedings of the Twentieth Annual Symposium on Computational Geometry, pp 253–262, New York, NY, USA
Fivez P, Suster S, Daelemans W (2017) Unsupervised context-sensitive spelling correction of English and Dutch clinical free-text with word and character N-Gram embeddings. Comput Linguistics Netherlands 7:39–52
Hopcroft J, Tarjan R (1973) Algorithm 447: efficient algorithms for graph manipulation. Commun ACM 16(6):372–378
Karp R (1972) Reducibility among combinatorial problems. Complexity of Computer Computations 40:85–103
Koga H, Ishibashi T, Watanabe T (2007) Fast agglomerative hierarchical clustering algorithm using locality-sensitive hashing. Knowl Inf Syst 12(1):25–53
Lai KH, Topaz M, Goss F, Zhou L (2015) Automated misspelling detection and correction in clinical free-text records. J Biomed Inform 55:188–195
Levenshtein V (1966) Binary codes capable of correcting deletions, insertions and reversals. Soviet Phys Doklady 10(8)
Malhotra P, Agarwal P, Shroff G (2014) Graph-parallel entity resolution using LSH and IMM. In: Proceedings of the Workshops of the EDBT/ICDT, vol 1133, pp 41–49
Morris MR, Fourney A, Ali A, Vonessen L (2018) Understanding the needs of searchers with dyslexia. In: Proceedings of the 2018 CHI Conference on Human Factors in Computing Systems, vol 32. Communications of the ACM, New York, pp 1–35
Mullen L (2016) textreuse: Detect Text Reuse and Document Similarity. R package version 0.1.4
Paulevé L, Jégou H, Amsaleg L (2010) Locality sensitive hashing: a comparison of hash function types and querying mechanisms. Pattern Recogn Lett 31(11):1348–1358
Pérez A, Atutxa A, Casillas A, Gojenola K, Sellart A (2018) Inferred joint multigram models for medical term normalization according to ICD. Int J Med Inform 110:111–117
R Core Team (2015) R: a language and environment for statistical computing. R Foundation for Statistical Computing, Vienna, Austria
Rajaraman A, Ullman JD (2011) Mining of massive datasets. Cambridge University Press, New York
Satuluri V, Parthasarathy S (2012) Bayesian locality sensitive hashing for fast similarity search. In: Proceedings of the 38th International Conference on Very Large Data Bases, vol 5, pp 430–441, Istanbul, Turkey
Shapiro D, Japkowicz N, Lemay M, Bolic M (2018) Fuzzy string matching with a deep neural network. Appl Artif Intell 32(1):1–12
Subhashree VK, Tharini C (2017) An energy efficient routing and fault tolerant data aggregation (EERFTDA) algorithm for wireless sensor networks. J High Speed Netw 23:15–32
Tong Q, Li X, Yuan B (2017) A highly scalable clustering scheme using boundary information. Pattern Recogn Lett 89(1):1–7
van der Loo M (2014) The stringdist package for approximate string matching. The R J 6:111–122
Wickham H, Francois R (2016) dplyr: A Grammar of Data Manipulation. R package version 0.5.0
Yuhua J, Liang B, Peng W, Jinlin G, Yuxiang X, Tianyuan Y (2017) Utilizing locality-sensitive hash learning for cross-media retrieval. In: International conference on multimedia modeling, pp 550–561, Reykjavik, Iceland
Acknowledgements
This research work was supported by Sungshin Women’s University. In addition, L.J.G.V. and A.L.S.O. thank the RAMSES project, which has received funding from the European Union’s Horizon 2020 research and innovation programme under grant agreement No 700326. Website: http://ramses2020.eu.
Additional information
Publisher’s Note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Appendix: R Source Code
This appendix contains the R snippet used to compute the buckets of candidates (using Locality Sensitive Hashing), the Jaccard distance and the Damerau-Levenshtein distance.
The R snippet requires the following libraries: base R [16], dplyr [23], igraph [2], stringdist [22] and textreuse [13].
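The steps named above (LSH buckets of candidates, then the Jaccard and Damerau-Levenshtein checks) can be illustrated with a minimal sketch. It is written in standard-library Python rather than R, so it is not the authors' snippet and does not use textreuse, stringdist or igraph. Every function name, the character bigram tokenisation, the MinHash banding parameters and both thresholds are assumptions chosen for illustration. Grouping the accepted pairs into connected components is likewise an assumption about how a graph step could be applied.

```python
"""Illustrative sketch only: all names, parameters and thresholds are assumptions."""
import hashlib
from itertools import combinations


def shingles(name, k=2):
    """Set of character k-grams of the lower-cased name (whole name if shorter than k)."""
    s = name.lower()
    return {s[i:i + k] for i in range(len(s) - k + 1)} or {s}


def jaccard(a, b):
    """Jaccard similarity of two sets."""
    union = a | b
    return len(a & b) / len(union) if union else 1.0


def damerau_levenshtein(s, t):
    """Full (unrestricted) Damerau-Levenshtein distance between two strings."""
    maxdist = len(s) + len(t)
    d = [[0] * (len(t) + 2) for _ in range(len(s) + 2)]
    d[0][0] = maxdist
    for i in range(len(s) + 1):
        d[i + 1][0], d[i + 1][1] = maxdist, i
    for j in range(len(t) + 1):
        d[0][j + 1], d[1][j + 1] = maxdist, j
    last_row = {}  # last row in which each character of s was seen
    for i in range(1, len(s) + 1):
        last_col = 0  # last column in this row where the characters matched
        for j in range(1, len(t) + 1):
            k, l = last_row.get(t[j - 1], 0), last_col
            cost = 0 if s[i - 1] == t[j - 1] else 1
            if cost == 0:
                last_col = j
            d[i + 1][j + 1] = min(
                d[i][j] + cost,                           # substitution or match
                d[i + 1][j] + 1,                          # insertion
                d[i][j + 1] + 1,                          # deletion
                d[k][l] + (i - k - 1) + 1 + (j - l - 1),  # transposition
            )
        last_row[s[i - 1]] = i
    return d[len(s) + 1][len(t) + 1]


def minhash(sh, n_hashes):
    """MinHash signature: the minimum seeded MD5 value over the shingles, per seed."""
    return [min(int(hashlib.md5(f"{seed}:{g}".encode()).hexdigest(), 16) for g in sh)
            for seed in range(n_hashes)]


def lsh_candidates(names, n_hashes=60, bands=30):
    """Index pairs (i, j), i < j, whose signatures agree on at least one band."""
    rows = n_hashes // bands
    buckets = {}
    for idx, name in enumerate(names):
        sig = minhash(shingles(name), n_hashes)
        for b in range(bands):
            key = (b, tuple(sig[b * rows:(b + 1) * rows]))
            buckets.setdefault(key, set()).add(idx)
    pairs = set()
    for members in buckets.values():
        pairs.update(combinations(sorted(members), 2))
    return pairs


def group_similar(names, min_jaccard=0.5, max_dl=2):
    """Keep candidate pairs passing both checks; return connected components of size > 1."""
    parent = list(range(len(names)))

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path compression
            x = parent[x]
        return x

    for i, j in lsh_candidates(names):
        a, b = names[i].lower(), names[j].lower()
        if jaccard(shingles(a), shingles(b)) >= min_jaccard and damerau_levenshtein(a, b) <= max_dl:
            parent[find(i)] = find(j)
    groups = {}
    for i, name in enumerate(names):
        groups.setdefault(find(i), []).append(name)
    return [g for g in groups.values() if len(g) > 1]


if __name__ == "__main__":
    # Made-up names, used only to exercise the functions.
    print(group_similar(["Maria Garcia Lopez", "Maria Garica Lopez", "John Smith"]))
```

Requiring both checks reflects the observation in the abstract that, for texts this short, two similarity functions worked better than one. Because LSH is probabilistic, a near-duplicate pair can occasionally miss every shared bucket; using more bands with fewer rows per band makes such misses less likely, at the cost of more candidate pairs to verify.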
Cite this article
Turrado García, F., García Villalba, L.J., Sandoval Orozco, A.L. et al. Locating similar names through locality sensitive hashing and graph theory. Multimed Tools Appl 78, 29853–29866 (2019). https://doi.org/10.1007/s11042-018-6375-9
DOI: https://doi.org/10.1007/s11042-018-6375-9