Speeding up Privacy Preserving Record Linkage for Metric Space Similarity Measures

  • Fachbeitrag (technical contribution)
  • Datenbank-Spektrum

Abstract

The analysis of person-related data in Big Data applications faces a tradeoff between finding useful results and preserving a high degree of privacy. This is especially challenging when person-related data from multiple sources need to be integrated and analyzed. Privacy-preserving record linkage (PPRL) addresses this problem by encoding sensitive attribute values such that persons cannot be identified but records can still be matched. In this paper, we study how to improve the efficiency and scalability of PPRL by restricting the search space for matching encoded records. We focus on similarity measures for metric spaces and investigate the use of M-trees as well as pivot-based solutions. Our evaluation shows that the new schemes outperform previous filter approaches by an order of magnitude.
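
The search-space restriction mentioned in the abstract can be illustrated with a generic pivot-based filter. The following is a minimal sketch, not the authors' implementation: it assumes encoded records are bit vectors stored as Python integers and compared by Hamming distance (a metric), and the names `hamming`, `PivotIndex`, and `range_query` are hypothetical. Distances from each record to a few pivots are precomputed; at query time the triangle inequality rules out records without computing their distance to the query.

```python
from typing import List, Tuple


def hamming(a: int, b: int) -> int:
    """Hamming distance between two bit vectors stored as Python ints."""
    return bin(a ^ b).count("1")


class PivotIndex:
    """Hypothetical pivot table for range queries in a metric space."""

    def __init__(self, records: List[int], pivots: List[int]) -> None:
        self.records = records
        self.pivots = pivots
        # Precompute the distance from every record to every pivot.
        self.table = [[hamming(r, p) for p in pivots] for r in records]

    def range_query(self, query: int, radius: int) -> Tuple[List[int], int]:
        """Return (indices of records within `radius`, distances computed)."""
        q_dists = [hamming(query, p) for p in self.pivots]
        matches: List[int] = []
        computed = 0
        for i, row in enumerate(self.table):
            # Triangle inequality: |d(q, p) - d(r, p)| <= d(q, r).
            # If the left side exceeds the radius for any pivot p,
            # record r cannot match and is skipped without a comparison.
            if any(abs(qd - rd) > radius for qd, rd in zip(q_dists, row)):
                continue
            computed += 1
            if hamming(query, self.records[i]) <= radius:
                matches.append(i)
        return matches, computed


if __name__ == "__main__":
    records = [0b11110000, 0b11110001, 0b00000001, 0b11111111]
    index = PivotIndex(records, pivots=[0b00000000])
    print(index.range_query(0b11110000, radius=1))
```

The filter never discards a true match, because the bound follows from the metric axioms alone; how many comparisons it saves depends on the choice of pivots and on the data distribution.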

Notes

  1. Since the number of queries also grows linearly with the data volume, runtimes increase almost quadratically.

Author information

Correspondence to Ziad Sehili.

Additional information

This work was funded by the German Federal Ministry of Education and Research within the project Competence Center for Scalable Data Services and Solutions (ScaDS) Dresden/Leipzig (BMBF 01IS14014B).

About this article

Cite this article

Sehili, Z., Rahm, E. Speeding up Privacy Preserving Record Linkage for Metric Space Similarity Measures. Datenbank Spektrum 16, 227–236 (2016). https://doi.org/10.1007/s13222-016-0222-9

  • DOI: https://doi.org/10.1007/s13222-016-0222-9
