Locating similar names through locality sensitive hashing and graph theory

Turrado García, Fernando; García Villalba, Luis Javier; Sandoval Orozco, Ana Lucila; Aranda Ruiz, Francisco Damián; Aguirre Juárez, Andrés; Kim, Tai-Hoon

doi:10.1007/s11042-018-6375-9

Published: 31 July 2018

Volume 78, pages 29853–29866, (2019)

Multimedia Tools and Applications

Abstract

Locality Sensitive Hashing is a well-known technique for finding similar texts; it has been applied to plagiarism detection, mirror page identification and identifying the original source of a news article. In this paper we show how Locality Sensitive Hashing can be applied to identify misspelled personal names (first name, middle name and last name) or near duplicates. Because the texts are so short, using two similarity functions (the Jaccard Similarity and the Full Damerau-Levenshtein Distance) to measure the similarity of the names allowed us to obtain better results than using a single one. All the experimental work was carried out using the statistical software R and the libraries textreuse and stringdist.
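
To make the role of the two measures concrete, here is a minimal Python sketch. It is not the authors' code: the experiments in the paper use R with textreuse and stringdist. The sketch assumes that names are compared as sets of character bigrams for the Jaccard Similarity, which is an illustrative choice not stated in the source, and it implements the unrestricted (full) Damerau-Levenshtein Distance. The function names `bigrams`, `jaccard_similarity` and `damerau_levenshtein` are hypothetical.

```python
def bigrams(name):
    """Set of character bigrams of a lowercased name (assumed tokenisation)."""
    s = name.lower()
    return {s[i:i + 2] for i in range(len(s) - 1)}


def jaccard_similarity(a, b):
    """|A ∩ B| / |A ∪ B| over the bigram sets of the two names."""
    x, y = bigrams(a), bigrams(b)
    if not x and not y:
        return 1.0  # two names with no bigrams are treated as identical
    return len(x & y) / len(x | y)


def damerau_levenshtein(a, b):
    """Full Damerau-Levenshtein distance: insertions, deletions,
    substitutions and transpositions of adjacent characters, where a
    substring may be edited more than once (unlike the restricted OSA
    variant)."""
    last_row = {}                      # last row in which each character of a was seen
    maxdist = len(a) + len(b)
    d = [[0] * (len(b) + 2) for _ in range(len(a) + 2)]
    d[0][0] = maxdist
    for i in range(len(a) + 1):
        d[i + 1][0] = maxdist
        d[i + 1][1] = i
    for j in range(len(b) + 1):
        d[0][j + 1] = maxdist
        d[1][j + 1] = j
    for i in range(1, len(a) + 1):
        last_match_col = 0             # last column in this row where characters matched
        for j in range(1, len(b) + 1):
            k = last_row.get(b[j - 1], 0)
            l = last_match_col
            if a[i - 1] == b[j - 1]:
                cost = 0
                last_match_col = j
            else:
                cost = 1
            d[i + 1][j + 1] = min(
                d[i][j] + cost,                          # substitution or match
                d[i + 1][j] + 1,                         # insertion
                d[i][j + 1] + 1,                         # deletion
                d[k][l] + (i - k - 1) + 1 + (j - l - 1), # transposition
            )
        last_row[a[i - 1]] = i
    return d[len(a) + 1][len(b) + 1]


print(jaccard_similarity("john", "jonh"))   # 0.2
print(damerau_levenshtein("john", "jonh"))  # 1
print(damerau_levenshtein("ca", "abc"))     # 2
```

In this example a single transposition keeps the Damerau-Levenshtein Distance at 1 while the bigram Jaccard Similarity drops to 0.2, which suggests why one measure alone can miss misspellings in very short strings.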

References

[1] Chollampatt S, Hwee Tou N (2018) A multilayer convolutional encoder-decoder neural network for grammatical error correction. In: Proceedings of the Association for the Advancement of Artificial Intelligence. New Orleans, Louisiana, USA
[2] Csardi G, Nepusz T (2006) The IGraph software package for complex network research. Int J Complex Syst 1695:1–9
[3] Damerau FJ (1964) A technique for computer detection and correction of spelling errors. Commun ACM 7(3):171–176
[4] Datar M, Immorlica N, Indyk P, Mirrokni VS (2004) Locality-sensitive hashing scheme based on P-stable distributions. In: Proceedings of the Twentieth Annual Symposium on Computational Geometry, pp 253–262, New York, NY, USA
[5] Fivez P, Suster S, Daelemans W (2017) Unsupervised context-sensitive spelling correction of English and Dutch clinical free-text with word and character N-Gram embeddings. Comput Linguistics Netherlands 7:39–52
[6] Hopcroft J, Tarjan R (1973) Algorithm 447: efficient algorithms for graph manipulation. Commun ACM 16(6):372–378
[7] Karp R (1972) Reducibility among combinatorial problems. Complex Comput Comput 40:85–103
[8] Koga H, Ishibashi T, Watanabe T (2007) Fast agglomerative hierarchical clustering algorithm using locality-sensitive hashing. Knowl Inf Syst 12(1):25–53
[9] Lai KH, Topaz M, Goss F, Zhou L (2015) Automated misspelling detection and correction in clinical free-text records. J Biomed Inform 55:188–195
[10] Levenshtein V (1966) Binary codes capable of correcting deletions, insertions and reversals. Soviet Phys Doklady 10(8)
[11] Malhotra P, Agarwal P, Shroff G (2014) Graph-parallel entity resolution using LSH and IMM. In: Proceedings of the Workshops of the EDBT/ICDT, vol 1133, pp 41–49
[12] Morris MR, Fourney A, Ali A, Vonessen L (2018) Understanding the needs of searchers with dyslexia. In: Proceedings of the 2018 CHI Conference on Human Factors in Computing Systems, vol 32. Communications of the ACM, New York, pp 1–35
[13] Mullen L (2016) textreuse: Detect Text Reuse and Document Similarity. R package version 0.1.4
[14] Paulevé L, Jégou H, Amsaleg L (2010) Locality sensitive hashing: a comparison of hash function types and querying mechanisms. Pattern Recogn Lett 31(11):1348–1358
[15] Pérez A, Atutxa A, Casillas A, Gojenola K, Sellart A (2018) Inferred joint multigram models for medical term normalization according to ICD. Int J Med Inform 110:111–117
[16] R Core Team (2015) R: a language and environment for statistical computing. R Foundation for Statistical Computing, Vienna, Austria
[17] Rajaraman A, Ullman JD (2011) Mining of massive datasets. Cambridge University Press, New York
[18] Satuluri V, Parthasarathy S (2012) Bayesian locality sensitive hashing for fast similarity search. In: Proceedings of the 38th International Conference on Very Large Data Bases, vol 5, pp 430–441, Istanbul, Turkey
[19] Shapiro D, Japkowicz N, Lemay M, Bolic M (2018) Fuzzy string matching with a deep neural network. Appl Artif Intell 32(1):1–12
[20] Subhashree VK, Tharini C (2017) An energy efficient routing and fault tolerant data aggregation (EERFTDA) algorithm for wireless sensor networks. J High Speed Netw 23:15–32
[21] Tong Q, Li X, Yuan B (2017) A highly scalable clustering scheme using boundary information. Pattern Recogn Lett 89(1):1–7
[22] van der Loo M (2014) The stringdist package for approximate string matching. The R J 6:111–122
[23] Wickham H, Francois R (2016) dplyr: A Grammar of Data Manipulation. R package version 0.5.0
[24] Yuhua J, Liang B, Peng W, Jinlin G, Yuxiang X, Tianyuan Y (2017) Utilizing locality-sensitive hash learning for cross-media retrieval. In: International conference on multimedia modeling, pp 550–561, Reykjavik, Iceland

Acknowledgements

This research work was supported by Sungshin Women’s University. In addition, L.J.G.V. and A.L.S.O. thank the RAMSES project, which has received funding from the European Union’s Horizon 2020 research and innovation programme under grant agreement No 700326. Website: http://ramses2020.eu.

Author information

Authors and Affiliations

Group of Analysis, Security and Systems (GASS), Department of Software Engineering and Artificial Intelligence (DISIA), Faculty of Information Technology and Computer Science, Office 431, Universidad Complutense de Madrid (UCM), Calle Profesor José García Santesmases, 9, Ciudad Universitaria, 28040, Madrid, Spain
Fernando Turrado García, Luis Javier García Villalba, Ana Lucila Sandoval Orozco, Francisco Damián Aranda Ruiz & Andrés Aguirre Juárez
Department of Convergence Security, Sungshin Women’s University, 249-1 Dongseon-Dong 3-ga, Seoul, 136-742, Korea
Tai-Hoon Kim

Corresponding author

Correspondence to Luis Javier García Villalba.

Additional information

Publisher’s Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Appendix: R Source Code

This appendix presents the R snippet used to calculate the buckets of candidates (using Locality Sensitive Hashing), the Jaccard distance and the Damerau-Levenshtein distance.

The R snippet requires the following libraries: base R [16], dplyr [23], igraph [2], stringdist [22] and textreuse [13].
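
The R source itself is not included in this excerpt. As a rough illustration of the bucketing step, a minimal Python sketch (not the authors' code, and not the textreuse or igraph API) might look as follows. It assumes MinHash signatures over character bigrams split into bands, and it assumes that candidate pairs are treated as graph edges whose connected components form groups of similar names; the source does not spell out either detail. All function names and the parameters `num_hashes` and `bands` are hypothetical.

```python
import hashlib
from collections import defaultdict
from itertools import combinations


def shingles(name, k=2):
    """Character k-grams of a lowercased name (assumed tokenisation)."""
    s = name.lower()
    return {s[i:i + k] for i in range(len(s) - k + 1)} or {s}


def minhash_signature(tokens, num_hashes=20):
    """One minimum per seeded hash function; similar token sets tend to
    share many signature positions."""
    signature = []
    for seed in range(num_hashes):
        signature.append(min(
            int.from_bytes(
                hashlib.blake2b(f"{seed}:{t}".encode(), digest_size=8).digest(),
                "big",
            )
            for t in tokens
        ))
    return signature


def lsh_candidates(names, num_hashes=20, bands=10):
    """Index pairs (i, j), i < j, whose signatures agree on at least one band."""
    if num_hashes % bands != 0:
        raise ValueError("num_hashes must be a multiple of bands")
    rows = num_hashes // bands
    buckets = defaultdict(set)
    for idx, name in enumerate(names):
        sig = minhash_signature(shingles(name), num_hashes)
        for b in range(bands):
            key = (b, tuple(sig[b * rows:(b + 1) * rows]))
            buckets[key].add(idx)
    pairs = set()
    for members in buckets.values():
        pairs.update(combinations(sorted(members), 2))
    return pairs


def connected_components(n, edges):
    """Group node indices 0..n-1 into components with a union-find."""
    parent = list(range(n))

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    for a, b in edges:
        parent[find(a)] = find(b)
    groups = defaultdict(list)
    for i in range(n):
        groups[find(i)].append(i)
    return list(groups.values())


names = ["john smith", "jonh smith", "john smith", "maria garcia"]
candidates = lsh_candidates(names)
for group in connected_components(len(names), candidates):
    print([names[i] for i in group])
```

In a fuller pipeline each candidate pair would presumably be checked with the Jaccard and Damerau-Levenshtein measures before being kept as an edge; that filtering step is left out here to keep the sketch short.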

About this article

Cite this article

Turrado García, F., García Villalba, L.J., Sandoval Orozco, A.L. et al. Locating similar names through locality sensitive hashing and graph theory. Multimed Tools Appl 78, 29853–29866 (2019). https://doi.org/10.1007/s11042-018-6375-9

Received: 20 May 2018
Revised: 27 June 2018
Accepted: 03 July 2018
Published: 31 July 2018
Issue Date: November 2019
DOI: https://doi.org/10.1007/s11042-018-6375-9
