Speeding up Privacy Preserving Record Linkage for Metric Space Similarity Measures

Sehili, Ziad; Rahm, Erhard

doi:10.1007/s13222-016-0222-9

Technical contribution (Fachbeitrag)
Published: 01 June 2016

Volume 16, pages 227–236, (2016)

Datenbank-Spektrum

Abstract

The analysis of person-related data in Big Data applications faces the tradeoff of finding useful results while preserving a high degree of privacy. This is especially challenging when person-related data from multiple sources need to be integrated and analyzed. Privacy-preserving record linkage (PPRL) addresses this problem by encoding sensitive attribute values such that the identification of persons is prevented but records can still be matched. In this paper we study how to improve the efficiency and scalability of PPRL by restricting the search space for matching encoded records. We focus on similarity measures for metric spaces and investigate the use of M‑trees as well as pivot-based solutions. Our evaluation shows that the new schemes outperform previous filter approaches by an order of magnitude.
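
To illustrate the general idea behind pivot-based filtering in a metric space, here is a minimal sketch. It is not the authors' implementation: the abstract does not specify the encoding or the distance function, so the sketch assumes records are encoded as fixed-length bit vectors (stored as Python integers) and compared with the Hamming distance, which is a metric. The names `hamming`, `build_pivot_table`, and `range_query` are hypothetical. The triangle inequality gives |d(q, p) - d(r, p)| <= d(q, r) for any pivot p, so a record r can be skipped without computing d(q, r) whenever that lower bound already exceeds the search radius.

```python
import random


def hamming(a: int, b: int) -> int:
    """Hamming distance between two bit vectors stored as integers."""
    return bin(a ^ b).count("1")


def build_pivot_table(records, pivots):
    """Precompute the distance from every record to every pivot."""
    return [[hamming(rec, p) for p in pivots] for rec in records]


def range_query(query, radius, records, pivots, table):
    """Return (matches, number of full distance computations).

    A record is pruned if, for some pivot p,
    |d(query, p) - d(record, p)| > radius; by the triangle
    inequality such a record cannot lie within the radius.
    """
    query_dists = [hamming(query, p) for p in pivots]
    matches = []
    computed = 0
    for rec, rec_dists in zip(records, table):
        if any(abs(qd - rd) > radius
               for qd, rd in zip(query_dists, rec_dists)):
            continue  # pruned without comparing query and record
        computed += 1
        if hamming(query, rec) <= radius:
            matches.append(rec)
    return matches, computed


if __name__ == "__main__":
    rng = random.Random(0)  # fixed seed, illustration only
    records = [rng.getrandbits(64) for _ in range(1000)]
    pivots = records[:4]    # assumption: first records serve as pivots
    table = build_pivot_table(records, pivots)
    query = records[42] ^ 0b11  # a record with two bits flipped
    matches, computed = range_query(query, 4, records, pivots, table)
    print(len(matches), "matches;", computed, "of", len(records),
          "records needed a full comparison")
```

The filter never discards a true match, so the result equals that of an exhaustive scan; how many comparisons it saves depends on the data distribution, the radius, and the choice of pivots.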

Notes

Since the number of queries also grows linearly with the data volume, runtimes increase almost quadratically.

Author information

Authors and Affiliations

Institut für Informatik, Universität Leipzig, PF 100920, 04009, Leipzig, Germany
Ziad Sehili & Erhard Rahm

Corresponding author

Correspondence to Ziad Sehili.

Additional information

This work was funded by the German Federal Ministry of Education and Research within the project Competence Center for Scalable Data Services and Solutions (ScaDS) Dresden/Leipzig (BMBF 01IS14014B).

About this article

Cite this article

Sehili, Z., Rahm, E. Speeding up Privacy Preserving Record Linkage for Metric Space Similarity Measures. Datenbank Spektrum 16, 227–236 (2016). https://doi.org/10.1007/s13222-016-0222-9

Received: 05 February 2016
Accepted: 28 April 2016
Published: 01 June 2016
Issue Date: November 2016
DOI: https://doi.org/10.1007/s13222-016-0222-9
