Definition
Record linkage refers to the task of extracting record information from various input data sources and combining it in such a way that each output record corresponds to a distinct real-world entity.
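To make the definition concrete, the following is a minimal sketch of one naive way to produce one output group per entity: compare every pair of records and merge matching records transitively. The match rule (`similar`, a string-similarity threshold of 0.85) and the function name `link_records` are assumptions chosen for illustration, not a method prescribed by this entry.

```python
from difflib import SequenceMatcher
from itertools import combinations


def similar(a: str, b: str, threshold: float = 0.85) -> bool:
    """Hypothetical match rule: case-insensitive similarity ratio above a threshold."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio() >= threshold


def link_records(records: list[str], threshold: float = 0.85) -> list[list[str]]:
    """Group records so that each group is assumed to describe one entity."""
    parent = list(range(len(records)))  # union-find forest over record indices

    def find(i: int) -> int:
        # Follow parent pointers to the root, compressing the path on the way.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    # Naive step: compare all pairs, which is quadratic in the number of records.
    for i, j in combinations(range(len(records)), 2):
        if similar(records[i], records[j], threshold):
            parent[find(i)] = find(j)  # merge the two groups

    clusters: dict[int, list[str]] = {}
    for i, rec in enumerate(records):
        clusters.setdefault(find(i), []).append(rec)
    return list(clusters.values())


records = ["John Smith", "john smith", "Jon Smith", "Mary Jones"]
print(link_records(records))
# → [['John Smith', 'john smith', 'Jon Smith'], ['Mary Jones']]
```

The all-pairs loop is what makes this approach impractical for large inputs; it is shown here only because it is the simplest correct baseline.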
Overview
Record linkage is part of the broader area of data integration and, more specifically, data cleaning. It is most commonly used to identify duplicates within a dataset or across multiple datasets. The biggest challenge when executing record linkage algorithms is the trade-off between quality and performance: record linkage is often run on high-volume datasets for which even sophisticated algorithms cannot provide high-quality results in a suitable timeframe. Thus, techniques such as blocking or incremental computation are applied to improve performance at the cost of decreased result quality. Additional challenges in record linkage include various types of input sources that are not necessarily...
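The trade-off that blocking introduces can be sketched in a few lines. The example below assumes a hypothetical blocking key (the first letter of a `last_name` field) and hypothetical function names; it only counts candidate pairs and does not implement any particular published blocking scheme.

```python
from collections import defaultdict
from itertools import combinations


def blocking_key(record: dict) -> str:
    """Hypothetical blocking key: lowercased first letter of the last name."""
    return record["last_name"][:1].lower()


def candidate_pairs_all(records: list[dict]) -> list[tuple[int, int]]:
    """Without blocking, every pair of records is a candidate for comparison."""
    return list(combinations(range(len(records)), 2))


def candidate_pairs_blocked(records: list[dict]) -> list[tuple[int, int]]:
    """With blocking, only records that share a key are compared."""
    blocks: dict[str, list[int]] = defaultdict(list)
    for idx, rec in enumerate(records):
        blocks[blocking_key(rec)].append(idx)
    pairs: list[tuple[int, int]] = []
    for members in blocks.values():
        pairs.extend(combinations(members, 2))
    return pairs


records = [
    {"last_name": "Smith"},
    {"last_name": "Smyth"},
    {"last_name": "Jones"},
    {"last_name": "Johns"},
    {"last_name": "Zmith"},  # assumed typo of "Smith"
]
print(len(candidate_pairs_all(records)))      # → 10
print(len(candidate_pairs_blocked(records)))  # → 2
```

In this toy input, blocking cuts the comparisons from 10 to 2, but the pair ("Smith", "Zmith") is never compared because the typo changes the key. That missed pair is the loss in result quality that is traded for speed.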
Copyright information
© 2018 Springer International Publishing AG
About this entry
Cite this entry
Gruenheid, A. (2018). Record Linkage. In: Sakr, S., Zomaya, A. (eds) Encyclopedia of Big Data Technologies. Springer, Cham. https://doi.org/10.1007/978-3-319-63962-8_19-1
DOI: https://doi.org/10.1007/978-3-319-63962-8_19-1
Publisher Name: Springer, Cham
Print ISBN: 978-3-319-63962-8
Online ISBN: 978-3-319-63962-8
eBook Packages: Springer Reference Mathematics; Reference Module Computer Science and Engineering