Accurate privacy-preserving record linkage for databases with missing values
Introduction
Organisations in many domains increasingly construct large databases containing millions of records, where these records contain detailed information about people, such as customers, patients, taxpayers, or travellers. Often such databases need to be shared and integrated to facilitate advanced analytics and processing [1]. Record linkage [2] is one major task that needs to be conducted when databases are to be integrated. Record linkage aims to identify and match records that refer to the same entities in different databases [1], [3]. In many application domains, unique entity identifiers (such as social security numbers) are not available in the databases to be linked, and therefore the linkage needs to be conducted based on the available quasi-identifying attributes. These commonly contain personal identifying information, such as names, addresses, and dates of birth. Data quality problems such as typographical errors, variations, and changes of values over time are common in most quasi-identifying attributes used for linkage. As a result, approximate comparison functions are required when records are being compared [1].
One major data quality aspect that has so far received only limited attention in record linkage is missing values [4], [5], [6], [7], [8]. Missing values can occur for a variety of reasons, and they can have different characteristics, as we discuss in Section 3. If records have missing values in the quasi-identifying attributes used for linkage, then the similarities calculated between records will likely be lower, potentially affecting the final linkage quality. If, alternatively, the records or quasi-identifying attributes that have missing values are excluded from a linkage, then linkage quality will again likely suffer [7].
Besides linkage quality, privacy is increasingly of concern when databases that contain personal information are to be linked across organisations [9]. Certain types of linkages might not be possible if the databases contain sensitive information about individuals (such as patients or taxpayers) that cannot be shared with other organisations. Privacy-preserving record linkage (PPRL) is concerned with the development of techniques that facilitate the linkage of sensitive databases across organisations without the need for any sensitive plaintext data to be exchanged or shared [9].
PPRL techniques generally encode the values in the sensitive quasi-identifying attributes used for linkage before they are sent to the organisation that conducts the linkage [9]. The outcome of such a PPRL protocol will only be the set of linked record identifiers, but no sensitive identifying information of these records. No database owner involved in a PPRL protocol should learn anything about the records of any of the other databases that are being linked, and the organisation conducting the linkage (or any external party, even a malicious adversary) must not be able to learn anything about the databases that are being linked [9].
One popular PPRL technique that is now being used in practical applications world-wide [10], [11], [12] is Bloom filter (BF) encoding [13]. As we describe in Section 3, BFs are bit vectors into which the elements of a set are hashed [14] to facilitate efficient similarity calculations between the values encoded into BFs.
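To make the idea of BF encoding concrete, the following is a minimal sketch and not the encoding used in our experiments. It assumes character bigrams as the set elements, double hashing with SHA-1 and SHA-256, and the Dice coefficient as the similarity function; the function names and the parameters num_bits and num_hashes are illustrative choices.

```python
import hashlib

def qgrams(value, q=2):
    """Return the set of character q-grams of a string."""
    value = value.lower()
    return {value[i:i + q] for i in range(len(value) - q + 1)}

def bloom_encode(value, num_bits=64, num_hashes=2, q=2):
    """Hash each q-gram of value into a bit vector using double hashing."""
    bits = [0] * num_bits
    for gram in qgrams(value, q):
        h1 = int(hashlib.sha1(gram.encode()).hexdigest(), 16)
        h2 = int(hashlib.sha256(gram.encode()).hexdigest(), 16)
        for i in range(num_hashes):
            bits[(h1 + i * h2) % num_bits] = 1
    return bits

def dice_similarity(bf1, bf2):
    """Dice coefficient: twice the common 1-bits over the total 1-bits."""
    common = sum(a & b for a, b in zip(bf1, bf2))
    total = sum(bf1) + sum(bf2)
    return 2.0 * common / total if total else 0.0

# Two spelling variants share the bigrams 'sm' and 'th', so their
# Bloom filters share 1-bits and the similarity is above zero.
sim_same = dice_similarity(bloom_encode("smith"), bloom_encode("smith"))
sim_variant = dice_similarity(bloom_encode("smith"), bloom_encode("smyth"))
```

Because shared q-grams always set the same bit positions, similar strings yield BFs with many common 1-bits, which is what allows approximate matching on encoded values.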
To the best of our knowledge, only one approach has so far investigated how to overcome the challenge of missing values in the PPRL process. The basic idea of the approach by Chi et al. [5] is to find the records that are closest (k-nearest neighbours) to a given record that has a missing value, and to use similarities calculated between the available attribute values to estimate the similarity a missing value would have had with the corresponding value in another record. This approach works best if there are groups of records that have similar attribute values, such as people in a household who have the same last name and address. While this approach has been shown to improve linkage quality in the presence of missing values in such situations, its drawbacks are that it is sensitive to the blocking technique employed, and that its use of attribute-level BFs makes it vulnerable to cryptanalysis attacks [15], [16], [17], as we show in Section 9.3.
Contributions: We propose a novel PPRL technique based on BF encoding for linking databases that contain missing values in the quasi-identifying attributes used for linking records. We first identify the patterns of occurrences of missing values across all records in a database, and use these patterns to generate a lattice structure that represents the different patterns of missingness, as illustrated in Fig. 1. Using a secure set intersection protocol [18] we then find the missingness patterns that are common across the databases to be linked, and we group lattice nodes to generate partitions of records that need to be compared.
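As an illustration of the first step, the sketch below groups records by their missingness pattern and connects patterns that differ by exactly one missing attribute. It is a simplified sketch under stated assumptions: the attribute names and example records are hypothetical, a value is treated as missing if it is None or empty, and a plaintext set intersection stands in for the secure set intersection protocol [18]. How lattice nodes are grouped into partitions is not shown.

```python
from collections import defaultdict

# Hypothetical quasi-identifying attributes, used for illustration only.
ATTRS = ("first_name", "last_name", "city")

def missing_pattern(record):
    """Return the set of attributes whose value is missing (None or empty)."""
    return frozenset(a for a in ATTRS if not record.get(a))

def group_by_pattern(records):
    """Map each missingness pattern to the identifiers of records having it."""
    groups = defaultdict(list)
    for rec_id, rec in records.items():
        groups[missing_pattern(rec)].append(rec_id)
    return dict(groups)

def lattice_edges(patterns):
    """Connect pattern p to pattern c if c misses exactly one more attribute."""
    return {(p, c) for p in patterns for c in patterns
            if p < c and len(c) == len(p) + 1}

db_a = {
    "a1": {"first_name": "ann", "last_name": "lee", "city": "perth"},
    "a2": {"first_name": "bob", "last_name": None, "city": "perth"},
    "a3": {"first_name": None, "last_name": None, "city": "perth"},
}
db_b = {
    "b1": {"first_name": "anne", "last_name": "lee", "city": "perth"},
    "b2": {"first_name": "rob", "last_name": "", "city": "perth"},
}

groups_a = group_by_pattern(db_a)
groups_b = group_by_pattern(db_b)
edges_a = lattice_edges(groups_a)

# Plaintext stand-in (assumption) for the secure set intersection step.
common = set(groups_a) & set(groups_b)
```

In this example both databases contain complete records and records that miss only the last name, so these two patterns are the common ones.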
We then present two methods in a three-party scenario to link these partitions in a privacy-preserving way: (1) an iterative method that requires multiple communication steps but needs fewer record pairs to be compared, and (2) a batch method that requires only one communication step but leads to a larger number of record pairs to be compared. To improve the privacy of our approach, we employ bit sampling from BFs based on matching weights calculated for the quasi-identifying attributes being compared [19], and different random permutations for the different partitions of BFs.
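The combination of weighted bit sampling and partition-specific permutation can be sketched as follows. This is an illustrative simplification in the spirit of record-level BF encoding [19], not our exact procedure: the helper names, the weights, the secret string, and the partition identifier are all hypothetical. The sketch assumes that both DOs derive the same sampling plan from a shared secret and a partition identifier, and that a plan only covers the attributes that are not missing in that partition.

```python
import random

def make_plan(attr_lens, weights, rbf_len, secret_seed):
    """Build a shared plan of (attribute, bit position) pairs.

    Each attribute contributes a number of sampled positions proportional
    to its weight; the plan is then shuffled, which acts as the permutation.
    Rounding can make the plan slightly shorter or longer than rbf_len.
    """
    rng = random.Random(secret_seed)
    total = sum(weights[a] for a in attr_lens)
    plan = []
    for attr in sorted(attr_lens):
        n = round(rbf_len * weights[attr] / total)
        plan += [(attr, rng.randrange(attr_lens[attr])) for _ in range(n)]
    rng.shuffle(plan)
    return plan

def encode_record(attr_bfs, plan):
    """Read the sampled bits of a record's attribute-level BFs in plan order."""
    return [attr_bfs[attr][pos] for attr, pos in plan]

# Hypothetical weights: the last name is three times as informative.
weights = {"first_name": 1.0, "last_name": 3.0}
attr_lens = {"first_name": 16, "last_name": 16}

# A different seed per partition yields a different sampling and permutation.
plan_p1 = make_plan(attr_lens, weights, 20, "shared-secret:partition-1")
plan_p1_again = make_plan(attr_lens, weights, 20, "shared-secret:partition-1")

record_bfs = {"first_name": [1] * 16, "last_name": [0] * 16}
rbf = encode_record(record_bfs, plan_p1)
```

Because the plan depends only on the shared seed and not on any record, both DOs produce comparable bit vectors for the same partition, while bit vectors from different partitions cannot be aligned position by position.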
Using an extensive evaluation on large real-world databases, we show that our approach can achieve high linkage quality even with significant amounts of missing values. Our approach substantially outperforms both traditional BF encoding [19] (which does not consider missing values) and the k-nearest neighbour missingness approach for PPRL proposed by Chi et al. [5].
Related work
A first computer-based technique for record linkage was proposed by Newcombe et al. [20]. The approach compared records and classified record pairs into matches and non-matches using probabilities calculated based on the likelihood that two records refer to the same person. This idea was formalised by Fellegi and Sunter in 1969 [21], and their method to calculate match and non-match weights is still in use today [2].
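The weights in this probabilistic model are commonly written as log-likelihood ratios. The sketch below uses the standard formulation, where m is the probability that an attribute agrees given that two records are a match, and u is the probability that it agrees given a non-match. The attribute names and probability values are made up for illustration.

```python
import math

def match_weights(m, u):
    """Return the (agreement, disagreement) weights for one attribute."""
    agree = math.log2(m / u)
    disagree = math.log2((1 - m) / (1 - u))
    return agree, disagree

def pair_score(agreements, m_probs, u_probs):
    """Sum the weights over all compared attributes of a record pair."""
    score = 0.0
    for attr, agrees in agreements.items():
        agree_w, disagree_w = match_weights(m_probs[attr], u_probs[attr])
        score += agree_w if agrees else disagree_w
    return score

# Illustrative probabilities only (assumptions, not estimated from data).
m_probs = {"last_name": 0.9, "city": 0.8}
u_probs = {"last_name": 0.1, "city": 0.2}

score = pair_score({"last_name": True, "city": False}, m_probs, u_probs)
```

A record pair is then classified as a match if its summed score lies above an upper threshold, and as a non-match if it lies below a lower threshold.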
Missing data has always been a challenge for linking records across databases
Background
In this section we first describe different types of missing data, and introduce the concept of a lattice structure which is a key component of our approach. We then describe BF encoding as it has been used in the context of PPRL.
For notation, throughout this paper we use italic letters for integers, strings, and BFs; bold lowercase letters for lists and sets; and bold uppercase letters for lattices, and for lists and sets of lists and sets. Lists are shown with square and sets with curly
Protocol overview
Our proposed PPRL approach for missing data involves three parties: the two database owners (DOs) and a linkage unit (LU). Each DO holds its own database, and the two DOs aim to link these databases without having to share the sensitive quasi-identifying attribute values in their own database with any other party. The LU conducts the linkage and only identifies the matching record pairs between the two databases.
As outlined in Fig. 4, the DOs first agree on the parameters to be used
Encoding and missing pattern processing steps
Each DO first performs the common steps outlined in Algorithm 1 and described in the following subsections.
Linkage steps
Based on the partitions generated, we now describe two different methods of how the DOs communicate with the LU, and how the LU compares the record pairs in the partitions it receives from the DOs.
Analysis
We now analyse our approach with regard to privacy, linkage quality, and scalability. We focus on the specific aspects of grouping records with different missingness patterns into partitions, encoding these partitions using different weights assigned to non-missing attributes, and applying partition specific permutations of bit positions. For general discussions about the privacy of BF encoding for PPRL we refer the reader to Durham et al. [19] and Christen et al. [15], [25].
Experimental evaluation
We first describe the setup and then the data sets we used to evaluate our proposed approach to deal with missing values in the context of PPRL.
Results and discussion
We now present and discuss our results, starting with the linkage quality results we obtained with our approach compared to the baseline approaches, followed by a discussion of runtime results. We then assess the privacy of our approach compared to the baselines using a recently proposed attack method [15].
Conclusions and future work
We have presented a novel approach to consider missing values in privacy-preserving record linkage (PPRL) using record-level Bloom filter (BF) encoding. Our approach is based on a lattice of missingness patterns, from which partitions of BFs are generated. Each record can be inserted into multiple partitions using different grouping methods, which ensures records with different missingness patterns will be appropriately compared. We conducted an extensive analysis and experimental evaluation of
Declaration of Competing Interest
The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.
Acknowledgement
This research has been partially funded by the Australian Research Council under project DP160101934. We would like to thank A. Vidanage for his assistance.
References (40)
- et al., Privacy preserving record linkage in the presence of missing values, Elsevier Inf. Syst. (2017)
- et al., A new computationally efficient algorithm for record linkage with field dependency and missing data imputation, Elsevier Int. J. Med. Inform. (2018)
- et al., Improving record linkage performance in the presence of missing linkage data, Elsevier J. Biomed. Inform. (2014)
- et al., A taxonomy of privacy-preserving record linkage techniques, Elsevier Inf. Syst. (2013)
- et al., Privacy-preserving record linkage on large real world datasets, Elsevier J. Biomed. Inform. (2014)
- et al., Formal anonymity models for efficient privacy-preserving joins, Elsevier Data Knowl. Eng. (2009)
- Data Matching – Concepts and Techniques for Record Linkage, Entity Resolution, and Duplicate Detection (2012)
- et al., Data Quality and Record Linkage Techniques (2007)
- et al.
- I.C. Anindya, M. Kantarcioglu, B. Malin, Determining the impact of missing values on blocking in record linkage, in:...
- Record linkage: a missing data problem
- On the accuracy and scalability of probabilistic data linkage over the Brazilian 114 million cohort, IEEE J. Biomed. Health Inform.
- Privacy preserving probabilistic record linkage (P3RL): a novel method for linking existing health-related data and maintaining participant confidentiality, BMC Med Res Methodol.
- Privacy-preserving record linkage using Bloom filters, BMC Med. Inform. Decis. Mak.
- Space/time trade-offs in hash coding with allowable errors, ACM Commun. ACM
- Precise and fast cryptanalysis for Bloom filter based privacy-preserving record linkage, IEEE Trans. Knowl. Data Eng.
- Cryptanalysis of basic Bloom filters used for privacy preserving record linkage, J. Priv. Conf.
- Composite Bloom filters for secure record linkage, IEEE Trans. Knowl. Data Eng.