
ScaDS Research on Scalable Privacy-preserving Record Linkage

  • Focus article (Schwerpunktbeitrag)
Datenbank-Spektrum

Abstract

Privacy-preserving record linkage (PPRL) supports the matching and integration of person-related data, e.g., on patients or customers, without compromising privacy. It relies on encoding the sensitive attribute values needed for matching and often involves trusted parties for linkage. We report on recent research results from the Big Data center ScaDS Dresden/Leipzig that improve the efficiency, scalability and quality of PPRL and apply PPRL in the medical domain. In particular, we present pivot-based filtering techniques and blocking based on locality-sensitive hashing (LSH) to reduce the number of comparisons. Furthermore, we report on parallel linkage implementations based on Apache Flink that scale to millions of records.
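
To make the idea of LSH-based blocking on encoded values more concrete, the following is a minimal single-machine sketch and not the implementation evaluated in the article. It assumes a Bloom-filter encoding of character bigrams (a common PPRL encoding that the abstract does not name explicitly) and Hamming LSH, where each blocking key is the record's bit pattern at a fixed random sample of positions. All names and parameters (`M`, `K`, `encode`, `candidate_pairs`, the shared secret) are hypothetical and chosen only for illustration.

```python
import hashlib
import random
from collections import defaultdict
from itertools import combinations

M = 256  # assumed Bloom filter length in bits
K = 4    # assumed number of hash functions per bigram


def bigrams(value):
    """Split a padded, lower-cased string into its set of character bigrams."""
    padded = f"_{value.lower()}_"
    return {padded[i:i + 2] for i in range(len(padded) - 1)}


def encode(value, secret=b"shared-secret"):
    """Encode a value as the set of Bloom filter positions that are set to 1.

    The secret stands in for a key shared by the data owners (an assumption).
    """
    bits = set()
    for gram in bigrams(value):
        for i in range(K):
            digest = hashlib.sha256(secret + bytes([i]) + gram.encode()).digest()
            bits.add(int.from_bytes(digest[:4], "big") % M)
    return frozenset(bits)


def jaccard(a, b):
    """Jaccard similarity of two encodings given as sets of bit positions."""
    union = a | b
    return len(a & b) / len(union) if union else 1.0


def candidate_pairs(encoded, n_keys=10, key_len=12, seed=42):
    """Hamming LSH blocking: records sharing at least one key become candidates."""
    rng = random.Random(seed)
    # Each blocking key looks at one fixed random sample of bit positions.
    samples = [rng.sample(range(M), key_len) for _ in range(n_keys)]
    blocks = defaultdict(list)
    for rid, bits in encoded.items():
        for t, positions in enumerate(samples):
            key = (t, tuple(p in bits for p in positions))
            blocks[key].append(rid)
    pairs = set()
    for ids in blocks.values():
        pairs.update(combinations(sorted(ids), 2))
    return pairs


if __name__ == "__main__":
    records = {"r1": "smith", "r2": "smyth", "r3": "nakamura"}
    encoded = {rid: encode(name) for rid, name in records.items()}
    for a, b in sorted(candidate_pairs(encoded)):
        print(a, b, round(jaccard(encoded[a], encoded[b]), 2))
```

Only the candidate pairs are compared in detail, which is how blocking reduces the number of comparisons. In this sketch, longer keys (`key_len`) yield smaller blocks but miss more true matches, while more keys (`n_keys`) recover some of those matches at the cost of additional candidates.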

Acknowledgements

This work was partially funded by the German Federal Ministry of Education and Research within the project Competence Center for Scalable Data Services and Solutions (ScaDS) Dresden/Leipzig (BMBF 01IS14014B).

Author information

Corresponding author

Correspondence to Ziad Sehili.

Cite this article

Franke, M., Gladbach, M., Sehili, Z. et al. ScaDS Research on Scalable Privacy-preserving Record Linkage. Datenbank Spektrum 19, 31–40 (2019). https://doi.org/10.1007/s13222-019-00305-y

  • DOI: https://doi.org/10.1007/s13222-019-00305-y
