Record Linkage

  • Living reference work entry

Synonyms

Duplicate detection; Entity resolution

Definition

Record linkage refers to the task of extracting record information from various input data sources and combining it in such a way that each output record corresponds to a distinct real-world entity.

Overview

Record linkage is part of the broader area of data integration and, more specifically, data cleaning. It is most commonly used to identify duplicates within a dataset or across multiple datasets. The biggest challenge when executing record linkage algorithms is the trade-off between quality and performance. That is, record linkage is often run on high-volume datasets for which even sophisticated algorithms cannot provide high-quality results in a suitable timeframe. Thus, techniques such as blocking or incremental computation are applied to improve performance at the cost of decreased result quality. Additional challenges in record linkage include various types of input sources that are not necessarily...
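The quality/performance trade-off of blocking can be illustrated with a minimal sketch. It is not taken from the entry: the sample names, the first-letter blocking key, the 0.85 threshold, and the use of Python's `difflib.SequenceMatcher` as the similarity measure are all assumptions chosen for illustration. Blocking restricts comparisons to records that share a key, which reduces the number of candidate pairs but can miss true duplicates that land in different blocks.

```python
from collections import defaultdict
from difflib import SequenceMatcher
from itertools import combinations


def similarity(a, b):
    """Case-insensitive string similarity in [0, 1] (illustrative stand-in)."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def all_pairs(records):
    """Every unordered pair of record indices: quadratic in the input size."""
    return list(combinations(range(len(records)), 2))


def blocked_pairs(records, key):
    """Only pairs of record indices whose records share the same blocking key."""
    blocks = defaultdict(list)
    for idx, rec in enumerate(records):
        blocks[key(rec)].append(idx)
    pairs = []
    for members in blocks.values():
        pairs.extend(combinations(members, 2))
    return pairs


def match(records, pairs, threshold=0.85):
    """Keep the candidate pairs whose similarity reaches the threshold."""
    return [(i, j) for i, j in pairs
            if similarity(records[i], records[j]) >= threshold]


def first_letter(name):
    """Hypothetical blocking key: lowercased first character of the record."""
    return name[0].lower()


# Hypothetical sample data, invented for this sketch.
records = ["Jon Smith", "John Smith", "Catherine Jones",
           "Katherine Jones", "Maria Garcia"]

exhaustive = match(records, all_pairs(records))
blocked = match(records, blocked_pairs(records, first_letter))

print(len(all_pairs(records)), "pairs compared without blocking:", exhaustive)
print(len(blocked_pairs(records, first_letter)), "pairs compared with blocking:", blocked)
```

In this toy input, blocking compares far fewer pairs, but "Catherine Jones" and "Katherine Jones" fall into different blocks and are never compared, which is the kind of quality loss the paragraph above describes. A more forgiving key (for example a phonetic code) would trade some of the speedup back for recall.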

Author information

Corresponding author

Correspondence to Anja Gruenheid.

Copyright information

© 2018 Springer International Publishing AG

About this entry

Cite this entry

Gruenheid, A. (2018). Record Linkage. In: Sakr, S., Zomaya, A. (eds) Encyclopedia of Big Data Technologies. Springer, Cham. https://doi.org/10.1007/978-3-319-63962-8_19-1

  • DOI: https://doi.org/10.1007/978-3-319-63962-8_19-1

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-319-63962-8

  • Online ISBN: 978-3-319-63962-8

  • eBook Packages: Springer Reference Mathematics; Reference Module Computer Science and Engineering
