Record Linkage

Gruenheid, Anja

doi:10.1007/978-3-319-63962-8_19-1

Living reference work entry
First Online: 05 February 2018

Synonyms

Duplicate detection; Entity resolution

Definition

Record linkage refers to the task of extracting record information from various input data sources and combining it in such a way that each output record corresponds to a distinct real-world entity.

Overview

Record linkage is part of the broader area of data integration and, more specifically, data cleaning. It is most commonly used to identify duplicates within a dataset or across multiple datasets. The biggest challenge when executing record linkage algorithms is the trade-off between quality and performance: record linkage is often run on high-volume datasets for which even sophisticated algorithms cannot provide high-quality results in a suitable timeframe. Thus, techniques such as blocking or incremental computation are applied to improve performance at the cost of decreased result quality. Additional challenges in record linkage include various types of input sources that are not necessarily...
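The quality/performance trade-off of blocking mentioned above can be illustrated with a minimal Python sketch. It is not taken from the entry: the toy records, the choice of city as the blocking key, the use of difflib string similarity as the match rule, and the 0.85 threshold are all assumptions made for illustration.

```python
from collections import defaultdict
from difflib import SequenceMatcher
from itertools import combinations

# Hypothetical toy records: (record id, name, city).
RECORDS = [
    (1, "Anna Schmidt", "Berlin"),
    (2, "Ana Schmidt", "Berlin"),
    (3, "John Miller", "Madison"),
    (4, "Jon Miller", "Madison"),
    (5, "Anna Schmitt", "Dresden"),
]


def similar(a, b, threshold=0.85):
    """Assumed match rule: character-level similarity of two names."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio() >= threshold


def all_pairs(records):
    """Naive candidate generation: compare every record with every other."""
    return list(combinations(records, 2))


def blocked_pairs(records, key=lambda r: r[2].lower()):
    """Blocking: only records sharing a blocking key (here: city) are compared."""
    blocks = defaultdict(list)
    for record in records:
        blocks[key(record)].append(record)
    pairs = []
    for block in blocks.values():
        pairs.extend(combinations(block, 2))
    return pairs


def matches(pairs):
    """Return the id pairs whose names pass the assumed match rule."""
    return {(a[0], b[0]) for a, b in pairs if similar(a[1], b[1])}


if __name__ == "__main__":
    full = all_pairs(RECORDS)
    blocked = blocked_pairs(RECORDS)
    print("pairs compared without blocking:", len(full))
    print("pairs compared with blocking:   ", len(blocked))
    print("matches lost by blocking:       ", matches(full) - matches(blocked))
```

On this toy data, blocking by city cuts the candidate pairs from 10 to 2, but records 1 and 5 are never compared because their cities differ, so that likely duplicate is missed. Real systems typically choose blocking keys more carefully or combine several keys to limit such losses.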

Author information

Authors and Affiliations

Google Inc., Madison, WI, USA
Anja Gruenheid

Corresponding author

Correspondence to Anja Gruenheid.

Editor information

Editors and Affiliations

School of Comp. Sci. and Engineering, University of New South Wales, Eveleigh, New South Wales, Australia
Sherif Sakr
Sch of Info Techno, Building J12, University of Sydney, Sydney, Australia
Albert Zomaya

Section Editor information

Database Systems Group, Technische Universität Dresden, 01062, Dresden, Saxony, Germany
Maik Thiele

About this entry

Cite this entry

Gruenheid, A. (2018). Record Linkage. In: Sakr, S., Zomaya, A. (eds) Encyclopedia of Big Data Technologies. Springer, Cham. https://doi.org/10.1007/978-3-319-63962-8_19-1

DOI: https://doi.org/10.1007/978-3-319-63962-8_19-1
Published: 05 February 2018
Publisher Name: Springer, Cham
Print ISBN: 978-3-319-63962-8
Online ISBN: 978-3-319-63962-8
eBook Packages: Springer Reference Mathematics; Reference Module Computer Science and Engineering
