Accurate privacy-preserving record linkage for databases with missing values

https://doi.org/10.1016/j.is.2021.101959

Highlights

  • Privacy-preserving record linkage (PPRL) aims to link sensitive data across databases.

  • The PPRL process can be challenged by missing data, leading to poor linkage quality.

  • We propose a novel Bloom filter based PPRL approach to improve linkage quality.

  • We develop two methods to group encoded records based on their missingness patterns.

  • The methods have trade-offs between the number of communication and comparison steps.

  • We compare our methods to existing PPRL approaches in terms of quality and privacy.

  • Experiments on real databases show that our methods outperform existing approaches.

Abstract

Privacy-preserving record linkage is the process of matching records that refer to the same entity across sensitive databases held by different organisations. This process is often challenging because no unique entity identifiers, such as social security numbers, are available in the databases to be linked. Therefore, quasi-identifying attributes, such as names and addresses, are required to identify records that are similar and likely refer to the same entity. Such quasi-identifiers are, however, often not allowed to be shared between organisations due to privacy and confidentiality concerns. Besides variations and errors in the values used for linking, quasi-identifiers can have missing values. A popular approach to link sensitive data in a privacy-preserving way is to encode quasi-identifying values into Bloom filters, bit vectors that allow approximate similarities between values to be calculated. However, with existing Bloom filter encoding approaches, missing values can lead to missed true matches because they affect the similarities calculated between Bloom filters. In this paper we propose a novel approach to consider missing values in privacy-preserving record linkage by adapting Bloom filter encoding based on the patterns of missingness identified in the databases to be linked. We build a lattice structure of missingness patterns, and then generate partitions of Bloom filters over this lattice. In each partition the non-missing encoded quasi-identifying attributes are assigned different weights during the Bloom filter generation process. This results in more accurate similarity calculations and better linkage quality. To improve the privacy of our approach, each partition is encoded independently, which prevents both dictionary and frequency-based attacks. We evaluate our approach on large databases that contain different amounts and patterns of missing values, showing that it can substantially outperform both Bloom filter encoding that does not consider missing values, and an earlier Bloom filter based approach for linking sensitive databases that contain missing values.

Introduction

Organisations in many domains increasingly construct large databases containing millions of records, where these records contain detailed information about people, such as customers, patients, tax payers, or travellers. Often such databases need to be shared and integrated to facilitate advanced analytics and processing [1]. Record linkage [2] is one major task that needs to be conducted when databases are to be integrated. Record linkage aims to identify and match records that refer to the same entities in different databases [1], [3]. In many application domains, unique entity identifiers (such as social security numbers) are not available in the databases to be linked, and therefore the linkage needs to be conducted based on the available quasi-identifying attributes. These commonly contain personal identifying information, such as names, addresses, dates of birth, and so on. Data quality aspects such as typographical errors, variations, and changes of values over time, are common in most quasi-identifying attributes used for linkage. As a result, approximate comparison functions are required when records are being compared [1].
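
For instance, a q-gram based Dice coefficient allows two name values that differ by a typographical error to still obtain a high similarity. The following minimal Python sketch (illustrative only; the function names and the choice of character bigrams are assumptions, not taken from this paper) shows such an approximate comparison:

    # Illustrative approximate string comparison: the Dice coefficient
    # over character bigrams (q = 2). Function names are assumptions.

    def bigrams(s):
        """Return the set of character bigrams of a lowercased string."""
        s = s.lower()
        return {s[i:i + 2] for i in range(len(s) - 1)}

    def dice_similarity(a, b):
        """Dice coefficient between the bigram sets of two strings."""
        qa, qb = bigrams(a), bigrams(b)
        if not qa or not qb:
            return 0.0
        return 2.0 * len(qa & qb) / (len(qa) + len(qb))

    print(dice_similarity('christina', 'christine'))  # 0.875, a likely match
    print(dice_similarity('peter', 'maria'))          # 0.0, a clear non-match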

One major data quality aspect that so far has only seen limited attention in record linkage is missing values [4], [5], [6], [7], [8]. Missing values can occur for a variety of reasons, and they can have different characteristics, as we discuss in Section 3. If records have missing values in the quasi-identifying attributes used for linkage, then the similarities calculated between records will likely be lower, potentially affecting the final linkage quality. If alternatively those records or quasi-identifying attributes that have missing values are not used for a linkage, then linkage quality will again likely suffer [7].

Besides linkage quality, privacy is increasingly also of concern when databases that contain personal information are to be linked across organisations [9]. Certain types of linkages might not be possible if the databases contain sensitive information about individuals (such as patients or tax payers) that cannot be shared with other organisations. Privacy-preserving record linkage (PPRL) is concerned with the development of techniques that facilitate the linkage of sensitive databases across organisations without the need of any sensitive plaintext data to be exchanged or shared [9].

PPRL techniques generally encode the values in the sensitive quasi-identifying attributes used for linkage before they are sent to the organisation that conducts the linkage [9]. The outcome of such a PPRL protocol is only the set of linked record identifiers, not any sensitive identifying information of these records. No database owner involved in a PPRL protocol should learn anything about the records of any of the other databases that are being linked, and the organisation conducting the linkage (or any external party, even a malicious adversary) must not be able to learn anything about the databases that are being linked [9].

One popular PPRL technique that is now being used in practical applications world-wide [10], [11], [12] is Bloom filter (BF) encoding [13]. As we describe in Section 3, BFs are bit vectors into which the elements of a set are hashed [14], which facilitates efficient similarity calculations between values encoded into BFs.
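
The following minimal Python sketch illustrates the general idea of BF encoding for PPRL (the parameter values, the keyed double hashing scheme, and the function names are illustrative assumptions, not the exact encoding used in our approach):

    # Minimal sketch of BF encoding: each q-gram is hashed into k bit
    # positions of a bit vector of length l using keyed double hashing,
    # and two BFs are compared with the Dice coefficient. All parameter
    # values and keys below are illustrative assumptions.

    import hashlib
    import hmac

    BF_LEN = 1000    # bit vector length l
    NUM_HASH = 20    # number of hash functions k
    KEYS = [b'secret-key-1', b'secret-key-2']  # known only to the database owners

    def encode_bf(qgrams):
        """Hash each q-gram of a value into NUM_HASH positions of a BF."""
        bf = [0] * BF_LEN
        for q in qgrams:
            h1 = int(hmac.new(KEYS[0], q.encode(), hashlib.sha256).hexdigest(), 16)
            h2 = int(hmac.new(KEYS[1], q.encode(), hashlib.sha256).hexdigest(), 16)
            for i in range(NUM_HASH):
                bf[(h1 + i * h2) % BF_LEN] = 1   # double hashing scheme
        return bf

    def dice_bf(bf1, bf2):
        """Dice similarity between two Bloom filters."""
        ones1, ones2 = sum(bf1), sum(bf2)
        if ones1 + ones2 == 0:
            return 0.0
        common = sum(a & b for a, b in zip(bf1, bf2))
        return 2.0 * common / (ones1 + ones2)

    bf_a = encode_bf({'pe', 'et', 'te', 'er'})  # bigrams of 'peter'
    bf_b = encode_bf({'pe', 'et', 'ta'})        # bigrams of 'peta'
    print(dice_bf(bf_a, bf_b))  # close to the plaintext bigram Dice of about 0.57

Because the similarity is calculated directly on the bit vectors, the party conducting the linkage never needs to see the plaintext quasi-identifying values.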

To the best of our knowledge, only one approach has so far investigated how to overcome the challenge of missing values in the PPRL process. The basic idea of the approach by Chi et al. [5] is to find the records that are closest (k-nearest neighbours) to a given record that has a missing value, and to use the similarities calculated between the available attribute values to estimate the similarity a missing value would have had with the corresponding value in another record. This approach works best if there are groups of records that have similar attribute values, such as people in a household who have the same last name and address. While this approach has been shown to improve linkage quality in the presence of missing values in such situations, its drawbacks are that it is sensitive to the blocking technique employed, and that its use of attribute-level BFs makes it vulnerable to cryptanalysis attacks [15], [16], [17], as we show in Section 9.3.
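
In simplified form, the similarity estimation at the core of this idea can be sketched as follows (an illustrative simplification of the k-nearest neighbour idea, not the implementation of Chi et al. [5]):

    def estimate_missing_similarity(neighbours, attr, other_value, sim):
        """Average the similarity between 'other_value' and the 'attr'
        values of the neighbouring records that do have that attribute,
        as a stand-in for the comparison the missing value would have
        contributed. 'sim' is any approximate similarity function."""
        sims = [sim(n[attr], other_value) for n in neighbours if n.get(attr)]
        return sum(sims) / len(sims) if sims else 0.0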

Contributions: We propose a novel PPRL technique based on BF encoding for linking databases that contain missing values in the quasi-identifying attributes used for linking records. We first identify the patterns of occurrences of missing values across all records in a database, and use these patterns to generate a lattice structure that represents the different patterns of missingness, as illustrated in Fig. 1. Using a secure set intersection protocol [18] we then find the missingness patterns that are common across the databases to be linked, and we group lattice nodes to generate partitions of records that need to be compared.
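
The following simplified Python sketch illustrates how missingness patterns can be identified and records grouped by pattern (attribute and function names are illustrative assumptions; the actual approach additionally uses a secure set intersection across the database owners, which is omitted here):

    # Sketch of identifying missingness patterns: each record is mapped
    # to the set of quasi-identifying attributes that have a non-missing
    # value, and records are grouped by pattern. The distinct patterns
    # are the nodes of the lattice illustrated in Fig. 1.

    from collections import defaultdict

    ATTRIBUTES = ['first_name', 'last_name', 'city', 'postcode']  # example attributes

    def missingness_pattern(record):
        """Return the frozenset of attributes that have a non-missing value."""
        return frozenset(a for a in ATTRIBUTES if record.get(a) not in (None, ''))

    def group_by_pattern(database):
        """Map each missingness pattern to the identifiers of the records
        that exhibit it."""
        groups = defaultdict(list)
        for rec_id, record in database.items():
            groups[missingness_pattern(record)].append(rec_id)
        return groups

    def is_ancestor(p, q):
        """Pattern p lies below q in the lattice if p is a proper subset of q."""
        return p < q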

We then present two methods in a three-party scenario to link these partitions in a privacy-preserving way: (1) an iterative method that requires multiple communication steps but needs fewer record pairs to be compared, and (2) a batch method that requires only one communication step but leads to a larger number of record pairs to be compared. To improve the privacy of our approach, we employ bit sampling from BFs based on matching weights calculated for the quasi-identifying attributes being compared [19], and different random permutations for the different partitions of BFs.
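
These two privacy measures can be sketched as follows (a simplified illustration with assumed parameter values and helper names, not our exact procedure):

    # Sketch of (1) sampling bits from attribute-level BFs in proportion
    # to attribute matching weights, and (2) permuting the resulting
    # record-level BF with a partition-specific permutation. Both DOs
    # must use the same seeds, secret, and attribute order.

    import hashlib
    import random

    def sample_record_bf(attr_bfs, attr_weights, rbf_len=1000, seed=42):
        """Build a record-level BF by sampling bit positions from each
        attribute-level BF in proportion to its matching weight."""
        rng = random.Random(seed)
        total = sum(attr_weights[a] for a in attr_bfs)
        rbf = []
        for attr, bf in attr_bfs.items():
            num_bits = min(len(bf), int(round(rbf_len * attr_weights[attr] / total)))
            positions = rng.sample(range(len(bf)), num_bits)
            rbf.extend(bf[p] for p in positions)
        return rbf

    def permute_partition(rbf, partition_id, secret='shared-secret'):
        """Apply a partition-specific permutation of bit positions,
        derived deterministically from a secret shared by the DOs."""
        seed = int(hashlib.sha256((secret + str(partition_id)).encode()).hexdigest(), 16)
        rng = random.Random(seed)
        perm = list(range(len(rbf)))
        rng.shuffle(perm)
        return [rbf[p] for p in perm]

Using a different permutation per partition means that the same bit pattern appears at different positions in different partitions, which hampers frequency-based analysis across partitions.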

Using an extensive evaluation on large real-world databases we show that our approach can achieve high linkage quality even with significant amounts of missing values. Our approach substantially outperforms both traditional BF encoding [19] (that does not consider missing values) as well as the k-nearest neighbour missingness approach for PPRL proposed by Chi et al. [5].

Section snippets

Related work

The first computer-based technique for record linkage was proposed by Newcombe et al. [20]. Their approach compared records and classified record pairs into matches and non-matches using probabilities calculated from the likelihood that two records refer to the same person. This idea was formalised by Fellegi and Sunter in 1969 [21], and their method to calculate match and non-match weights is still in use today [2].
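
For reference, the agreement and disagreement weights of the Fellegi and Sunter model [21] can be computed as follows (a minimal sketch of the standard formulas, with illustrative m and u values):

    import math

    def fs_weights(m, u):
        """Return the (agreement, disagreement) log2 weights for an
        attribute, where m is the probability the attribute agrees for
        a true match and u the probability it agrees for a non-match."""
        return math.log2(m / u), math.log2((1.0 - m) / (1.0 - u))

    # An attribute that agrees for 95% of matches but only 1% of
    # non-matches contributes a strong positive agreement weight.
    print(fs_weights(0.95, 0.01))  # approximately (6.57, -4.31)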

Missing data has always been a challenge for linking records across databases

Background

In this section we first describe different types of missing data, and introduce the concept of a lattice structure which is a key component of our approach. We then describe BF encoding as it has been used in the context of PPRL.

For notation, throughout this paper we use italics type letters for integers, strings, and BFs; bold lowercase letters for lists and sets; and uppercase bold letters for lattices, and for lists and sets of lists and sets. Lists are shown with square and sets with curly

Protocol overview

Our proposed PPRL approach for missing data involves three parties, the two database owners (DOs) and a linkage unit (LU). Each DO has a database, DA and DB, respectively, which they aim to link without having to share the sensitive quasi-identifying attribute values in their own database with any other party. The LU is used to conduct the linkage where it only identifies the matching record pairs between the two databases.

As outlined in Fig. 4, the DOs first agree on the parameters to be used

Encoding and missing pattern processing steps

Each DO first performs the common steps outlined in Algorithm 1 and described in the following subsections.

Linkage steps

Based on the partitions generated, we now describe two different methods of how the DOs communicate with the LU, and how the LU compares the record pairs in the partitions it receives from the DOs.

Analysis

We now analyse our approach with regard to privacy, linkage quality, and scalability. We focus on the specific aspects of grouping records with different missingness patterns into partitions, encoding these partitions using different weights assigned to non-missing attributes, and applying partition specific permutations of bit positions. For general discussions about the privacy of BF encoding for PPRL we refer the reader to Durham et al. [19] and Christen et al. [15], [25].

Experimental evaluation

We now first describe the setup and then the data sets we used to evaluate our proposed approach to deal with missing values in the context of PPRL.

Results and discussion

We now present and discuss our results, starting with the linkage quality results we obtained with our approach compared to the baseline approaches, followed by a discussion of runtime results. We then assess the privacy of our approach compared to the baselines using a recently proposed attack method [15].

Conclusions and future work

We have presented a novel approach to consider missing values in privacy-preserving record linkage (PPRL) using record-level Bloom filter (BF) encoding. Our approach is based on a lattice of missingness patterns, from which partitions of BFs are generated. Each record can be inserted into multiple partitions using different grouping methods, which ensures records with different missingness patterns will be appropriately compared. We conducted an extensive analysis and experimental evaluation of

Declaration of Competing Interest

The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.

Acknowledgement

This research has been partially funded by the Australian Research Council under project DP160101934. We would like to thank A. Vidanage for his assistance.

References

  • Goldstein, H., et al., Record linkage: a missing data problem.

  • Pita, R., et al., On the accuracy and scalability of probabilistic data linkage over the Brazilian 114 million cohort, IEEE J. Biomed. Health Inform. (2018).

  • Schmidlin, K., et al., Privacy preserving probabilistic record linkage (P3RL): a novel method for linking existing health-related data and maintaining participant confidentiality, BMC Med. Res. Methodol. (2015).

  • Schnell, R., et al., Privacy-preserving record linkage using Bloom filters, BMC Med. Inform. Decis. Mak. (2009).

  • Bloom, B., Space/time trade-offs in hash coding with allowable errors, Commun. ACM (1970).

  • Christen, P., et al., Precise and fast cryptanalysis for Bloom filter based privacy-preserving record linkage, IEEE Trans. Knowl. Data Eng. (2018).

  • Kuzu, M., Kantarcioglu, M., Durham, E., Malin, B., A constraint satisfaction cryptanalysis of Bloom filters in private...

  • Niedermeyer, F., et al., Cryptanalysis of basic Bloom filters used for privacy preserving record linkage, J. Priv. Confid. (2014).

  • Dong, C., Chen, L., Wen, Z., When private set intersection meets big data: an efficient and scalable protocol, in: ACM...

  • Durham, E.A., et al., Composite Bloom filters for secure record linkage, IEEE Trans. Knowl. Data Eng. (2014).