Blind Data Linkage Using n-gram Similarity Comparisons

  • Conference paper
Advances in Knowledge Discovery and Data Mining (PAKDD 2004)

Part of the book series: Lecture Notes in Computer Science (LNAI, volume 3056)

Abstract

Integrating or linking data from different sources is an increasingly important task in the preprocessing stage of many data mining projects. The aim of such linkages is to merge all records relating to the same entity, such as a patient or a customer. If no common unique entity identifiers (keys) are available in all data sources, the linkage needs to be performed using the available identifying attributes, like names and addresses. Data confidentiality often limits or even prohibits successful data linkage, as either no consent can be gained (for example in biomedical studies) or the data holders are not willing to release their data for linkage by other parties. We present methods for confidential data linkage based on hash encoding, public key encryption and n-gram similarity comparison techniques, and show how blind data linkage can be performed.
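
To make the combination of hash encoding and n-gram similarity concrete, the following is a minimal sketch and not the protocol described in the paper. It assumes bigrams (n = 2), the Dice coefficient as the similarity measure, and HMAC-SHA256 with a shared secret key as the hash encoding; the function names `bigrams`, `dice` and `encode` and the key value are hypothetical and chosen only for illustration.

```python
import hashlib
import hmac


def bigrams(value):
    """Return the set of 2-character substrings of a normalised string."""
    text = value.strip().lower()
    return {text[i:i + 2] for i in range(len(text) - 1)}


def dice(a, b):
    """Dice coefficient of two sets: 2 * |A & B| / (|A| + |B|)."""
    if not a and not b:
        return 1.0  # two empty sets are treated as identical
    return 2 * len(a & b) / (len(a) + len(b))


def encode(grams, key):
    """Replace each bigram with its keyed hash (HMAC-SHA256) hex digest."""
    return {
        hmac.new(key, g.encode("utf-8"), hashlib.sha256).hexdigest()
        for g in grams
    }


# Hypothetical secret shared by the two data holders, unknown to the linker.
SHARED_KEY = b"example-shared-secret"

plain_score = dice(bigrams("peter"), bigrams("pete"))
blind_score = dice(
    encode(bigrams("peter"), SHARED_KEY),
    encode(bigrams("pete"), SHARED_KEY),
)
```

Because equal bigrams map to equal digests under the same key, the score computed on the encoded sets matches the score on the plain sets, so a third party could compare records without seeing the names. Hashing each bigram separately, as done here for brevity, leaks bigram frequencies, so this sketch should not be read as a secure scheme.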

Copyright information

© 2004 Springer-Verlag Berlin Heidelberg

About this paper

Cite this paper

Churches, T., Christen, P. (2004). Blind Data Linkage Using n-gram Similarity Comparisons. In: Dai, H., Srikant, R., Zhang, C. (eds) Advances in Knowledge Discovery and Data Mining. PAKDD 2004. Lecture Notes in Computer Science, vol 3056. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-540-24775-3_15

  • DOI: https://doi.org/10.1007/978-3-540-24775-3_15

  • Publisher Name: Springer, Berlin, Heidelberg

  • Print ISBN: 978-3-540-22064-0

  • Online ISBN: 978-3-540-24775-3

  • eBook Packages: Springer Book Archive
