Abstract
The analysis of person-related data in Big Data applications faces the tradeoff of finding useful results while preserving a high degree of privacy. This is especially challenging when person-related data from multiple sources need to be integrated and analyzed. Privacy-preserving record linkage (PPRL) addresses this problem by encoding sensitive attribute values such that the identification of persons is prevented but records can still be matched. In this paper we study how to improve the efficiency and scalability of PPRL by restricting the search space for matching encoded records. We focus on similarity measures for metric spaces and investigate the use of M‑trees as well as pivot-based solutions. Our evaluation shows that the new schemes outperform previous filter approaches by an order of magnitude.
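As a rough illustration of how a pivot-based scheme can restrict the search space in a metric space, the sketch below filters candidates with the triangle inequality before computing any exact distance. It is a minimal sketch under stated assumptions, not the method evaluated in the paper: encoded records are assumed to be bit vectors stored as Python integers, Hamming distance is assumed as the metric, and the helper names (`hamming`, `build_index`, `range_query`) and the single-pivot layout are hypothetical.

```python
from typing import List, Tuple


def hamming(a: int, b: int) -> int:
    """Hamming distance between two bit vectors stored as Python ints."""
    return bin(a ^ b).count("1")


def build_index(records: List[int], pivot: int) -> List[Tuple[int, int]]:
    """Precompute each record's distance to one pivot (hypothetical index layout)."""
    return [(rec, hamming(rec, pivot)) for rec in records]


def range_query(
    index: List[Tuple[int, int]], pivot: int, query: int, radius: int
) -> Tuple[List[int], int]:
    """Return the records within `radius` of `query` and the number of
    exact record distances that had to be computed."""
    d_query_pivot = hamming(query, pivot)
    matches: List[int] = []
    computed = 0
    for rec, d_rec_pivot in index:
        # Triangle inequality: |d(q, p) - d(r, p)| <= d(q, r).
        # If the lower bound already exceeds the radius, r cannot match.
        if abs(d_query_pivot - d_rec_pivot) > radius:
            continue
        computed += 1
        if hamming(query, rec) <= radius:
            matches.append(rec)
    return matches, computed


if __name__ == "__main__":
    records = [0b11110000, 0b11110001, 0b00001111, 0b00000000]
    pivot = 0b00000000
    index = build_index(records, pivot)
    print(range_query(index, pivot, query=0b11110000, radius=1))
```

The filter never discards a true match, because the skipped records have a lower bound on their distance that already exceeds the radius. How many exact computations it saves depends on the choice of pivot and on the data distribution.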
Notes
Since the number of queries also grows linearly with the data volume, runtimes increase almost quadratically.
Additional information
This work was funded by the German Federal Ministry of Education and Research within the project Competence Center for Scalable Data Services and Solutions (ScaDS) Dresden/Leipzig (BMBF 01IS14014B).
Cite this article
Sehili, Z., Rahm, E. Speeding up Privacy Preserving Record Linkage for Metric Space Similarity Measures. Datenbank Spektrum 16, 227–236 (2016). https://doi.org/10.1007/s13222-016-0222-9