Blockchain-based Privacy-Preserving Record Linkage: enhancing data privacy in an untrusted environment

doi:10.1016/j.is.2021.101826

Information Systems

Volume 102, December 2021, 101826

https://doi.org/10.1016/j.is.2021.101826

Highlights

• Blockchain networks can be employed to audit Privacy-Preserving Record Linkage.
• The Bloom Filter is divided into multiple splits to reduce the amount of information available.
• The auditability of the PPRL process reduces the trust needed by the PPRL parties.

Abstract

Privacy-Preserving Record Linkage (PPRL) intends to integrate private data from several data sources held by different parties. Due to recent laws and regulations (e.g., the General Data Protection Regulation), PPRL approaches are increasingly demanded in real-world application areas such as healthcare, credit analysis, public policy evaluation, and national security. However, the majority of PPRL approaches consider an unrealistic adversary model, particularly the Honest but Curious (HBC) model, which assumes that all PPRL parties will follow a pre-agreed data integration protocol and will not try to break the confidentiality of the data handled during the process. The HBC model is hard to employ in real-world applications, mainly because of the need to trust other parties fully. To overcome the limitations associated with the majority of the adversary models considered by PPRL approaches, we propose a protocol that considers covert adversaries, i.e., adversaries that may deviate arbitrarily from the protocol specification in an attempt to cheat. In such a protocol, however, the honest parties are able to detect this misbehavior with high probability. To provide a proof-of-concept implementation of this protocol, we employ Blockchain technology and propose an improvement to the most used anonymization technique for PPRL, the Bloom Filter. The evaluation, carried out using several real-world data sources, demonstrates the effectiveness (linkage quality) of our contributions, as well as the ability to detect the misbehavior of a malicious adversary during the PPRL execution.

Introduction

Governments and private organizations collect and process large data sources to extract knowledge and assist in decision-making, data mining, and data analytics tasks. These organizations usually need to integrate and link matching records (entities) that correspond to the same real-world entity from various sources (held by different parties) with the purpose of increasing quality or enriching information [1]. In this scenario, the Record Linkage (RL) task is commonly employed to identify matching entities across multiple data sources [2]. It is important to remark that the data sources used as input in RL are usually designed independently, using different data models (with distinct representations for the same real-world concepts) and technologies (e.g., relational databases and graph-oriented databases).

In general, RL needs to be applied in scenarios that hold sensitive personal information about the entities to be linked. However, sharing or exchanging such values among different organizations is often prohibited due to privacy and confidentiality concerns [1], [3], [4]. Thus, in scenarios where the privacy of the entities matters, Privacy-Preserving Record Linkage (PPRL) is employed. PPRL is the task of performing RL over data sources that belong to different parties (suppliers) without revealing any information that could compromise the privacy of the parties’ data [1].

Examples of private data usage include banks that seek to reduce credit fraud, national security applications that aim to identify terrorists, and medical data that need to be integrated for a variety of purposes, such as patient profiles, disease outbreak prediction, public health policy evaluation, and new drug studies [1], [5].

According to recent PPRL surveys [1], [3], [4], [6], most of the state-of-the-art PPRL techniques assume an Honest But Curious (HBC) security model during the matching process. This model considers that the parties will follow a pre-agreed protocol, but will attempt to learn all possible information from legitimately received messages [4]. In contrast, the malicious security model assumes that the parties may not follow the pre-agreed protocol, for example by refusing to participate, performing arbitrary computation on the input data, or aborting the protocol at any time [3].

The HBC model requires the parties to fully trust each other, which is not realistic for real-world applications [1], [4], whereas the malicious model is computationally expensive because of its anonymization techniques (e.g., homomorphic encryption) and communication cost [1]. Consequently, the need for new security models that enable the auditability of the similarity computations (i.e., models that lie between the HBC and the malicious models) is reported as an open problem [1], [3], [4], [6], [7].

In this context, this paper introduces a novel security protocol that enables the auditability of the computations performed during a PPRL process, a setting also known as the covert adversary model [8]. In other words, the proposed protocol allows the parties to detect, with high probability, the misbehavior of a malicious party during the computation of the entities’ similarity. To implement the protocol, we employ a decentralized environment where untrusted (or semi-trusted) parties perform the computations required by the PPRL.
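The phrase "with high probability" can be made concrete under a simple sampling model. The sketch below is an illustration only and is not the analysis given in the paper: it assumes that a hypothetical auditor recomputes a uniformly random sample of the published similarity results, and the function name and parameters (`total`, `tampered`, `audited`) are invented for this example.

```python
from math import comb


def detection_probability(total: int, tampered: int, audited: int) -> float:
    """Chance that a random audit sample contains at least one tampered result.

    Hypothetical model (an assumption, not taken from the paper): `total`
    similarity results are published, `tampered` of them are wrong, and the
    auditor recomputes `audited` of them, drawn uniformly at random without
    replacement.
    """
    if not 0 <= tampered <= total or not 0 <= audited <= total:
        raise ValueError("tampered and audited must lie between 0 and total")
    # Probability that every audited result comes from the untampered ones.
    miss = comb(total - tampered, audited) / comb(total, audited)
    return 1.0 - miss
```

Under this assumed model, the detection probability grows with both the number of tampered results and the size of the audit sample, which is the deterrent effect that a covert adversary model relies on.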

As the decentralized computing environment, we use (public and private) Blockchain networks to provide auditability to PPRL. Blockchain technology aims to provide shared data access and computation for parties that do not trust each other [9]. The computations and the data sharing are possible due to an immutable, cryptographically secured append-only log, which is replicated and managed in a decentralized environment composed of untrusted parties [10].
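To illustrate why an append-only log of this kind is tamper-evident, the following minimal sketch chains entries by hash. It is a toy model under stated assumptions, not the Blockchain platform used in the paper: it omits replication, consensus, and Smart Contracts, and the helper names (`block_hash`, `append`, `verify`) and the payload layout are hypothetical.

```python
import hashlib
import json


def block_hash(index: int, prev_hash: str, payload: dict) -> str:
    """Hash an entry's contents together with the hash of its predecessor."""
    body = json.dumps({"index": index, "prev": prev_hash, "payload": payload},
                      sort_keys=True)
    return hashlib.sha256(body.encode("utf-8")).hexdigest()


def append(log: list, payload: dict) -> None:
    """Append a new entry that commits to the current tail of the log."""
    prev_hash = log[-1]["hash"] if log else "0" * 64
    index = len(log)
    log.append({"index": index, "prev": prev_hash, "payload": payload,
                "hash": block_hash(index, prev_hash, payload)})


def verify(log: list) -> bool:
    """Recompute every hash; an edit to any earlier entry breaks the chain."""
    prev_hash = "0" * 64
    for index, entry in enumerate(log):
        if entry["index"] != index or entry["prev"] != prev_hash:
            return False
        if entry["hash"] != block_hash(index, prev_hash, entry["payload"]):
            return False
        prev_hash = entry["hash"]
    return True
```

Because each entry's hash covers the previous hash, any party holding a replica can call `verify` and detect a retroactive change to a recorded similarity value without trusting whoever wrote it.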

Blockchain platforms are being used by applications of different domains [9], [10], [11], [12], including government, healthcare, and IoT. In such applications, the Blockchain is treated as a shared database or data processing platform which is employed when the participants do not trust each other. For this reason, we use the Blockchain (Smart Contract) as a Semi-Trusted Third Party (STTP) in PPRL.

However, the Blockchain does not provide a mechanism to preserve the privacy of the entities during the PPRL process. In fact, the Blockchain reduces the privacy of the PPRL by replicating all the data amongst the untrusted parties. To overcome this limitation, we also propose an improvement for the most prevalent anonymization technique used in PPRL applications: the Bloom Filter [13]. This improvement, named Splitting Bloom Filter (SBF), offers a lower risk of privacy leakage and, additionally, reduces the possibility of collusion between malicious parties.
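The idea of encoding a value into a Bloom Filter and then working on splits of it can be sketched as follows. This is a minimal illustration under assumptions, not the SBF construction defined in the paper: the splits are assumed to be contiguous, equally sized segments, the Dice coefficient is assumed as the similarity measure, and unkeyed SHA-256 is used only for brevity (a real deployment would typically use secret-keyed hash functions). All function names and parameters are hypothetical.

```python
import hashlib


def qgrams(value: str, q: int = 2) -> set:
    """Split a string into its overlapping substrings of length q."""
    text = value.lower()
    return {text[i:i + q] for i in range(len(text) - q + 1)}


def encode(value: str, length: int = 64, num_hashes: int = 2) -> list:
    """Encode the q-grams of `value` into a Bloom Filter (list of 0/1 bits)."""
    bits = [0] * length
    for gram in qgrams(value):
        for seed in range(num_hashes):
            digest = hashlib.sha256(f"{seed}:{gram}".encode("utf-8")).digest()
            bits[int.from_bytes(digest[:8], "big") % length] = 1
    return bits


def dice(a: list, b: list) -> float:
    """Dice coefficient of two bit lists: 2 * |a AND b| / (|a| + |b|)."""
    common = sum(x & y for x, y in zip(a, b))
    total = sum(a) + sum(b)
    return 2 * common / total if total else 0.0


def split(bits: list, num_splits: int) -> list:
    """Cut a filter into contiguous, equally sized splits (assumed scheme)."""
    if len(bits) % num_splits:
        raise ValueError("filter length must be divisible by num_splits")
    size = len(bits) // num_splits
    return [bits[i:i + size] for i in range(0, len(bits), size)]


def dice_from_splits(splits_a: list, splits_b: list) -> float:
    """Rebuild the full Dice coefficient from per-split bit counts only."""
    common = sum(sum(x & y for x, y in zip(sa, sb))
                 for sa, sb in zip(splits_a, splits_b))
    total = sum(map(sum, splits_a)) + sum(map(sum, splits_b))
    return 2 * common / total if total else 0.0
```

The point of the sketch is that the bit counts needed for the similarity are additive over splits, so a similarity can be assembled piece by piece without any single step seeing a whole filter. Whether a given splitting scheme actually limits leakage is a separate question that this example does not address.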

Contributions. We summarize the major contributions of this work as follows:

• We propose a novel privacy-preserving protocol that considers a covert security model, which lies between the HBC and Malicious models;
• We present a Proof-of-Concept implementation of our privacy-preserving protocol using the Blockchain technology;
• We propose an improvement to the Bloom Filter (SBF) that increases the privacy guarantees of the entity comparisons performed in the PPRL, by reducing the amount of information shared amongst the PPRL parties; and
• We perform theoretical and empirical evaluations using real-world data (18 data sources from nine domains) to assess the efficacy, efficiency, and privacy capabilities of our contributions.

Outline. The remainder of this paper is organized as follows. In Section 2, we present the concepts required for a better understanding of our work. In Section 3, we formalize the problem tackled in this paper. Related work is discussed in Section 4. In Sections 5, 6, and 7, we describe the SBF, the proposed Privacy-Preserving Protocol, and its Proof-of-Concept implementation, respectively. In Section 8, we evaluate (theoretically and empirically) the efficacy, efficiency, and privacy capabilities of our contributions using real-world data sources. Finally, in Section 9, we conclude the paper with an outlook on future research directions.

Section snippets

Preliminaries and background

For a better understanding of our approach, this section provides an overview of the techniques and concepts commonly used in PPRL and Blockchain.

Problem formulation

With the basic concepts of PPRL, anonymization, and Blockchain defined in Section 2, we can now formalize the problem tackled in this paper. To address the privacy guarantees of the PPRL problem, we assume that multiple parties and an STTP engage in a PPRL process that considers an accountable security model (covert adversaries).

Problem Statement 1 (Decentralized Execution of the Decision Model)

Given multiple participants ($p \geq 2$) with their respective anonymized data sources ($D^{\tau}$), and a Semi-Trusted Third-Party (STTP) implemented as a Blockchain Smart Contract

Splitting bloom filter

As presented in Section 2, Bloom Filters may fail to preserve the privacy of the data in two cases. The first one occurs when an adversary has a list of possible values assumed by the entities [7], [25], [26], [27]. The second case happens when an adversary is able to access several anonymized entities and execute a pattern mining attack [7], [25]. Both vulnerabilities may endanger the privacy of the entities employed in PPRL. Considering that the entities involved in PPRL belong to a unique

An auditable protocol for PPRL

In this section, we present a novel privacy-preserving linkage protocol, which considers covert adversaries to audit the comparison and classification steps of PPRL. This protocol uses SBF to reduce the amount of information shared during the PPRL execution. Moreover, this protocol uses the decentralized SBF characteristic (presented in Eq. (5)) to provide auditability in the similarity computations performed by the PPRL parties. The protocol also considers a semi-trusted third party (STTP),

Auditable blockchain-based PPRL

In this section, we describe the Auditable Blockchain-based Privacy-preserving Record Linkage (ABEL), a Proof-of-Concept implementation of the 3PAC protocol presented in Section 5.

Unlike traditional HBC protocols, which consider an external (neutral) party that all participants trust [4], [14], [35], ABEL considers a semi-trusted third party (STTP), which the parties do not fully trust. Because they do not fully trust the STTP, the parties need to audit its computations. In order to enable

Evaluation

The main goal of this section is to evaluate the efficacy (quality), efficiency, and privacy capabilities of our contributions, the SBF and ABEL approaches. To this end, we present both an empirical and a theoretical evaluation.
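Linkage quality is commonly summarized by comparing the set of pairs classified as matches against a ground truth. The sketch below assumes precision, recall, and F-measure as the quality measures; this excerpt does not state which measures the evaluation uses, so the choice, the function name `linkage_quality`, and the pair representation are assumptions made for illustration.

```python
def linkage_quality(predicted: set, truth: set) -> dict:
    """Precision, recall, and F-measure of a set of predicted matching pairs.

    Both arguments are sets of (id_in_source_a, id_in_source_b) tuples;
    this representation is an assumption made for the example.
    """
    true_positives = len(predicted & truth)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(truth) if truth else 0.0
    denom = precision + recall
    f_measure = 2 * precision * recall / denom if denom else 0.0
    return {"precision": precision, "recall": recall, "f_measure": f_measure}
```

For instance, if three pairs are predicted as matches, two of them are correct, and the ground truth holds four matching pairs, the function reports a precision of 2/3 and a recall of 1/2.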

Related work

This research is closely related to PPRL and decentralized data management. In this sense, we can list the following works as related to the problem tackled in this work.

Vatsalan and Christen [15] proposed a protocol to identify matching sets of entities (records) held by multiple parties that have a similarity above a certain threshold under the semi-honest adversary model. The protocol divides and distributes the BF chunks amongst the parties. To calculate the entities’ overall similarity,

Conclusion and future work

In this paper, we have presented a PPRL protocol that considers a covert adversary model. This protocol considers a decentralized computing environment (Blockchain) and uses a novel implementation of the Bloom Filter (SBF), which reduces the information available to the PPRL parties. Specifically, SBF decreases the available information in case of collusion amongst malicious parties. The Blockchain technology was employed to implement a semi-trusted party and, consequently, provides

CRediT authorship contribution statement

Thiago Nóbrega: Conceptualization, Methodology, Software, Writing - original draft, Writing - review & editing. Carlos Eduardo S. Pires: Methodology, Writing - original draft, Writing - review & editing. Dimas Cassimiro Nascimento: Methodology, Writing - original draft, Writing - review & editing.

Declaration of Competing Interest

The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.

Acknowledgments

This research was supported by the Brazilian National Council of Technological and Scientific Development (CNPq).

References (49)

Vatsalan D. et al. A taxonomy of privacy-preserving record linkage techniques. Inf. Syst. (2013)
Ding X. et al. Privacy preserving similarity joins using mapreduce. Inform. Sci. (2019)
Jiang W. et al. Transforming semi-honest protocols to ensure accountability. Data Knowl. Eng. (2008)
Miklau G. et al. A formal analysis of information disclosure in data exchange. J. Comput. System Sci. (2007)
Christen P. et al. A flexible data generator for privacy-preserving data mining and record linkage. (2012)
Christen P. et al. Challenges for privacy preservation in data integration. J. Data Inf. Qual. (2014)
Vatsalan D. et al. An Overview of Big Data Issues in Privacy-Preserving Record Linkage, Vol. 2. (2019)
Pita R. et al. A spark-based workflow for probabilistic record linkage of healthcare data.
Papadakis G.A. et al. A survey of blocking and filtering techniques for entity resolution. (2019)
Vidanage A. et al. Efficient pattern mining based cryptanalysis for privacy-preserving record linkage.
Aumann Y. et al. Security against covert adversaries: Efficient protocols for realistic adversaries. J. Cryptol. (2010)
El-Hindi M. et al. Blockchaindb - towards a shared database on blockchains.
Nathan S. et al. Blockchain meets database. Proc. VLDB Endow. (2019)
Dasu T. et al. Unchain your blockchain.
Dinh T.T.A. et al. Blockbench: A framework for analyzing private blockchains.
Schnell R. et al. Privacy-preserving record linkage using bloom filters. BMC Med. Inform. Decis. Mak. (2009)
Nóbrega T.P. et al. Blind attribute pairing for privacy-preserving record linkage.
Vatsalan D. et al. Multi-party privacy-preserving record linkage using bloom filters. (2016)
Araujo T.B. et al. Spark-based streamlined metablocking.
Batini C. et al.
Ranbaduge T. et al. Privacy-preserving temporal record linkage.
Vatsalan D. et al. Scalable multi-database privacy-preserving record linkage using counting bloom filters. (2017)
Vatsalan D. et al. Privacy-preserving record linkage. (2018)
Vatsalan D. et al. Privacy-preserving record linkage for big data: Current approaches and research challenges.

Cited by (15)

Towards automatic Privacy-Preserving Record Linkage: A Transfer Learning based classification step
2023, Data and Knowledge Engineering
Privacy-Preserving Record Linkage (PPRL) intends to identify records that match the same real-world entities across disparate data sources while preserving the privacy of the individual entities. To identify matching records across different data sources and still preserve the privacy of the information, PPRL needs to consider several restrictions due to privacy limitations. For instance, PPRL is executed over anonymized (or encrypted) data to avoid re-identification. Moreover, the classification step of PPRL does not have access to labeled information (indicating if a pair of records is a match) and an oracle (specialist) to label a few instances. These limitations make it hard to employ automatic classification techniques. Most PPRL techniques use a simple threshold (defined by a specialist) to define whether a pair of records represent the same real-world entity or not. To overcome these problems, we present a Transfer Learning-based unsupervised classification step to PPRL, which leverages the information available in public (or synthetic) datasets to train accurate classifiers in a privacy-preserving context. We evaluate our approach using real-world and synthetic data, and the results demonstrate that our unsupervised classification step is able to overcome the most used classification strategies in PPRL.
Blockchain technology for cybersecurity: A text mining literature analysis
2022, International Journal of Information Management Data Insights
Blockchain, the technology infrastructure behind the famous cryptocurrency bitcoin, can take away the notion of trust from centralized organizations to a decentralized platform that is mathematically verifiable and cryptographically secure. It is gaining more significant momentum exponentially and disrupts the way businesses function beyond the digital currency aspects. This work presents a text mining literature analysis of research articles published in major digital libraries on blockchain technology and cybersecurity. This literature analysis employs automated text mining approaches such as topic modeling and keyphrase extraction for unearthing the themes from a vast body of literature. This analysis highlights the multidisciplinary nature of blockchain technology within the cybersecurity domain. The findings also show the cyber threats and vulnerabilities that evolve with blockchain technology developments. This analysis also showcases the computer security research community’s vulnerabilities and provides future research dimensions that are crucial for designing secure blockchain applications and platforms.
Explanation and answers to critiques on: Blockchain-based Privacy-Preserving Record Linkage
2022, Information Systems
Citation excerpt:
Furthermore, the PPRL solution needs to provide a compromise among privacy, efficiency, and quality according to the PPRL parties’ requirements. In this context, the Blockchain-based Privacy-Preserving Record Linkage (BC-PPRL) [5] introduces a novel protocol that enables the auditability of the computations performed by the PPRL parties. Furthermore, the BC-PPRL compares the records iteratively (using a fraction of the original information) to audit the computation performed by different parties and reduce the information shared during the PPRL process.
The “Blockchain-based Privacy-Preserving Record Linkage: Enhancing Data Privacy in an Untrusted Environment” (BC-PPRL) uses Blockchain technology to provide accountability to the computation performed during the comparison step of PPRL. The BC-PPRL utilizes small fragments (splits) of the encoded records to iteratively compute the similarity of the records and classify them into matches and non-matches, without sharing the complete information of the encoded records. Christen et al. propose a novel attack that leverages the information exchanged by the BC-PPRL. In this work, we acknowledge the findings of Christen et al. and provide a detailed explanation of how the privacy of BC-PPRL could be compromised. We also make available a simplified version of the BC-PPRL, the datasets, and a version (ported to Python 3) of the attack that can be executed in the Google Cloud environment at: https://github.com/thiagonobrega/bcpprl-simplified.
A critique and attack on “Blockchain-based privacy-preserving record linkage”
2022, Information Systems
Citation excerpt:
We experimentally assess these concerns and develop a simple attack method to validate our concerns on real-world data in Section 4. We conclude our paper in Section 5 with a summary of our findings, and provide further issues we identified in the paper by Nóbrega et al. [19] in the Appendix. The central privacy assumption of the BC-PPRL protocol, besides the blockchain mechanism which is aimed at detecting cheating parties, is that each party in the protocol has access to less information compared to existing PPRL protocols where full BFs are sent from the database owners to the linkage unit.
Privacy-preserving record linkage (PPRL) is the process of identifying records in sensitive databases that refer to the same entities in applications where no private or confidential data can be shared by the owners of the databases being linked. In their paper “Blockchain-based Privacy-Preserving Record Linkage — Enhancing Data Privacy in an Untrusted Environment” (Nóbrega et al., 2021) (named BC-PPRL in the following), Nóbrega et al. (2021) proposed the use of blockchain technologies to provide accountability of the parties involved in a PPRL protocol and thereby allow the detection of misbehaving parties. While the use of blockchain techniques is an interesting and novel contribution to the research area of PPRL, as we show in this paper both theoretically and practically using a simple attack method, the BC-PPRL approach has some serious privacy weaknesses. We specifically highlight that one key aspect of the proposed approach, the exchange of Bloom filter segments between the database owners, can reveal substantially more sensitive information compared to what is stated in the paper by Nóbrega et al. (2021). Using a real-world data set we show how our attack can allow a database owner to reidentify with high accuracy a large number of the sensitive values that were encoded in the Bloom filter segments they receive from another database owner. We make the code and data sets of our attack available at: https://github.com/anushkavidanage/bc-pprlSegmentAtomAttack/.
Design Principle of an Automatic Engagement Estimation System in a Synchronous Distance Learning Practice
2024, IEEE Access
Privacy Preserving Large Language Models: ChatGPT Case Study Based Vision and Framework
2023, arXiv
