Elsevier

Information Systems

Volume 102, December 2021, 101826

Blockchain-based Privacy-Preserving Record Linkage: enhancing data privacy in an untrusted environment

https://doi.org/10.1016/j.is.2021.101826

Highlights

  • Blockchain networks can be employed to audit Privacy-Preserving Record Linkage.

  • The Bloom Filter is divided into multiple splits to reduce the amount of information available.

  • The auditability of the PPRL process reduces the trust needed by the PPRL parties.

Abstract

Privacy-Preserving Record Linkage (PPRL) aims to integrate private data from several data sources held by different parties. Due to recent laws and regulations (e.g., the General Data Protection Regulation), PPRL approaches are increasingly demanded in real-world application areas such as health care, credit analysis, public policy evaluation, and national security. However, the majority of PPRL approaches consider an unrealistic adversary model, particularly the Honest but Curious (HBC) model, which assumes that all PPRL parties will follow a pre-agreed data integration protocol and will not try to break the confidentiality of the data handled during the process. The HBC model is hard to employ in real-world applications, mainly because of the need to trust other parties fully. To overcome the limitations associated with the majority of the adversary models considered by PPRL approaches, we propose a protocol that considers covert adversaries, i.e., adversaries that may deviate arbitrarily from the protocol specification in an attempt to cheat. In such a protocol, however, the honest parties are able to detect this misbehavior with high probability. To provide a proof-of-concept implementation of this protocol, we employ Blockchain technology and propose an improvement to the most widely used anonymization technique for PPRL, the Bloom Filter. The evaluation, carried out using several real-world data sources, demonstrates the effectiveness (linkage quality) obtained by our contributions, as well as the ability to detect the misbehavior of a malicious adversary during the PPRL execution.

Introduction

Governments and private organizations collect and process large data sources to extract knowledge and assist in decision-making, data mining, and data analytics tasks. These organizations usually need to integrate and link matching records (entities) that correspond to the same real-world entity from various sources (held by different parties) with the purpose of increasing quality or enriching information [1]. In this scenario, the Record Linkage (RL) task is commonly employed to identify matching entities across multiple data sources [2]. It is important to remark that the data sources used as input in RL are usually designed independently, using different data models (with distinct representations for the same real-world concepts) and technologies (e.g., relational databases, graph-oriented databases, and so on).

In general, RL needs to be applied in scenarios that hold sensitive personal information about the entities to be linked. However, sharing or exchanging such values among different organizations is often prohibited due to privacy and confidentiality concerns [1], [3], [4]. Thus, in scenarios where the privacy of the entities matters, Privacy-Preserving Record Linkage is employed. PPRL is the task of performing RL over data sources that belong to different parties (suppliers) without revealing any information that could compromise the privacy of the parties’ data [1].

Examples of applications that rely on private data include banks that seek to reduce credit fraud, national security applications that aim to identify terrorists, and medical applications that need to integrate data for a variety of purposes such as patient profiling, disease outbreak prediction, public health policy evaluation, and new drug studies, to name a few [1], [5].

According to recent PPRL surveys [1], [3], [4], [6], most of the state-of-the-art PPRL techniques assume an Honest But Curious (HBC) security model during the matching process. This model considers that the parties will follow a pre-agreed protocol, but will attempt to learn all possible information from legitimately received messages [4]. In contrast to the HBC model, the malicious security model assumes that the parties may deviate from the pre-agreed protocol by refusing to participate, performing arbitrary computations on the input data, or aborting the protocol at any time [3].

The HBC model requires the parties to fully trust each other, which is not realistic for real-world applications [1], [4], whereas the malicious model is computationally expensive due to its anonymization techniques (e.g., homomorphic encryption) and communication cost [1]. Consequently, the need for new security models that enable the auditability of the similarity computations (i.e., models that lie in between the HBC and the malicious models) is reported as an open problem [1], [3], [4], [6], [7].

In this context, this paper introduces a novel security protocol, based on the covert adversary model [8], that enables the auditability of the computations performed during a PPRL process. In other words, the proposed protocol allows the parties to detect, with high probability, the misbehavior of a malicious party during the computation of the entities’ similarity. To implement the protocol, we employ a decentralized environment where untrusted (or semi-trusted) parties perform the computations required by PPRL.

As the decentralized computing environment, we use (public and private) Blockchain networks in order to provide auditability to PPRL. The Blockchain technology aims to provide shared data access and computation for parties that do not trust each other [9]. The computations and the data sharing are possible due to an immutable, cryptographically secured, append-only log, which is replicated and managed in a decentralized environment composed of untrusted parties [10].
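The append-only log mentioned above can be illustrated with a hash chain, in which each entry commits to the hash of the previous one. The following is only a minimal sketch under stated assumptions: the function names (`block_hash`, `append`, `verify`) and the payload fields are hypothetical, and the sketch leaves out consensus, replication, and Smart Contracts, which a real Blockchain platform provides.

```python
import hashlib
import json


def block_hash(index, payload, prev_hash):
    """Hash an entry's contents together with the previous entry's hash."""
    body = json.dumps(
        {"index": index, "payload": payload, "prev": prev_hash},
        sort_keys=True,
    )
    return hashlib.sha256(body.encode("utf-8")).hexdigest()


def append(log, payload):
    """Append an entry that commits to the entire history before it."""
    prev_hash = log[-1]["hash"] if log else "0" * 64
    index = len(log)
    log.append({
        "index": index,
        "payload": payload,
        "prev": prev_hash,
        "hash": block_hash(index, payload, prev_hash),
    })


def verify(log):
    """Return True only if every entry matches its recorded hash and link."""
    prev_hash = "0" * 64
    for position, entry in enumerate(log):
        if entry["index"] != position or entry["prev"] != prev_hash:
            return False
        expected = block_hash(entry["index"], entry["payload"], entry["prev"])
        if entry["hash"] != expected:
            return False
        prev_hash = entry["hash"]
    return True


# Hypothetical payloads: two parties record a similarity value for a split.
log = []
append(log, {"party": "A", "split": 0, "similarity": 0.8})
append(log, {"party": "B", "split": 0, "similarity": 0.8})
intact = verify(log)

# A party later rewrites an earlier entry; the stored hash no longer matches.
log[0]["payload"]["similarity"] = 0.1
tampering_detected = not verify(log)
```

Because every entry includes the hash of its predecessor, changing any past entry invalidates the chain from that point on, which is the property that makes the log useful for auditing.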

Blockchain platforms are being used by applications of different domains [9], [10], [11], [12], including government, healthcare, and IoT. In such applications, the Blockchain is treated as a shared database or data processing platform which is employed when the participants do not trust each other. For this reason, we use the Blockchain (Smart Contract) as a Semi-Trusted Third Party (STTP) in PPRL.

However, the Blockchain does not provide a mechanism to preserve the privacy of the entities during the PPRL process. In fact, the Blockchain reduces the privacy of PPRL by replicating all the data amongst the untrusted parties. To overcome this limitation, we also propose an improvement to the most prevalent anonymization technique used in PPRL applications: the Bloom Filter [13]. This improvement, named Splitting Bloom Filter (SBF), offers a lower risk of privacy leakage and, additionally, reduces the possibility of collusion between malicious parties.
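To make the idea of splitting a Bloom Filter concrete, the sketch below encodes a string's q-grams into a bit list and cuts that list into equally sized pieces. It is an illustration under assumptions, not the SBF construction itself, which this excerpt does not specify: the contiguous splitting scheme, the double-hashing construction, the parameters (`length`, `num_hashes`, `q`), and all function names are hypothetical choices made for the example.

```python
import hashlib


def qgrams(value, q=2):
    """Return the set of overlapping q-grams of a padded, lowercased string."""
    padded = "_" + value.lower() + "_"
    return {padded[i:i + q] for i in range(len(padded) - q + 1)}


def bloom_filter(value, length=64, num_hashes=2):
    """Encode a string's q-grams as a list of bits using double hashing."""
    bits = [0] * length
    for gram in qgrams(value):
        data = gram.encode("utf-8")
        h1 = int(hashlib.sha1(data).hexdigest(), 16)
        h2 = int(hashlib.sha256(data).hexdigest(), 16)
        for i in range(num_hashes):
            bits[(h1 + i * h2) % length] = 1
    return bits


def split_filter(bits, num_splits):
    """Cut a filter into equally sized contiguous splits (assumed scheme)."""
    if len(bits) % num_splits != 0:
        raise ValueError("filter length must be divisible by num_splits")
    size = len(bits) // num_splits
    return [bits[i * size:(i + 1) * size] for i in range(num_splits)]


def dice(a, b):
    """Dice coefficient of two equal-length bit lists."""
    common = sum(x & y for x, y in zip(a, b))
    total = sum(a) + sum(b)
    return 2 * common / total if total else 0.0


bf_a = bloom_filter("smith")
bf_b = bloom_filter("smyth")
splits_a = split_filter(bf_a, 4)
splits_b = split_filter(bf_b, 4)

# Similarity computed on the full filters and on a single split.
full_similarity = dice(bf_a, bf_b)
first_split_similarity = dice(splits_a[0], splits_b[0])
```

A party that receives only one split sees a quarter of the bits in this example, which is the intuition behind reducing the information available to each participant.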

Contributions. We summarize the major contributions of this work as follows:

  • We propose a novel privacy-preserving protocol that considers a covert security model, which lies between the HBC and Malicious models;

  • We present a Proof-of-Concept implementation of our privacy-preserving protocol using the Blockchain technology;

  • We propose an improvement to the Bloom Filter (SBF) that increases the privacy guarantees of the entity comparisons performed in the PPRL, by reducing the amount of information shared amongst the PPRL parties; and

  • We perform theoretical and empirical evaluations using real-world data (18 data sources from nine domains) to assess the efficacy, efficiency, and privacy capabilities of our contributions.

Outline. The remainder of this paper is organized as follows. In Section 2, we present the required concepts for a better understanding of our work. In Section 3, we formalize the problem tackled in this paper. Related work is discussed in Section 4. In Sections 5, 6, and 7, we describe the SBF, the proposed Privacy-Preserving Protocol, and its Proof-of-Concept implementation, respectively. In Section 8, we evaluate (theoretically and empirically) the efficacy, efficiency, and privacy capabilities of our contributions using real-world data sources. Finally, in Section 9, we conclude the paper with an outlook on future research directions.

Section snippets

Preliminaries and background

For a better understanding of our approach, this section provides an overview of the techniques and concepts commonly used in PPRL and Blockchain.

Problem formulation

With the basic concepts of PPRL, anonymization, and Blockchain defined in Section 2, we can now formalize the problem tackled in this paper. To address the privacy guarantees of the PPRL problem, we assume that multiple parties and an STTP engage in a PPRL process that considers an accountable security model (covert adversaries).

Problem Statement 1 Decentralized Execution of the Decision Model

Given multiple participants (p2) with their respective anonymized data sources (Dτ), and a Semi-Trusted Third-Party (STTP) implemented as a Blockchain Smart Contract

Splitting bloom filter

As presented in Section 2, Bloom Filters may fail to preserve the privacy of the data in two cases. The first one occurs when an adversary has a list of possible values assumed by the entities [7], [25], [26], [27]. The second case happens when an adversary is able to access several anonymized entities and execute a pattern mining attack [7], [25]. Both vulnerabilities may endanger the privacy of the entities employed in PPRL. Considering that the entities involved in PPRL belong to a unique

An auditable protocol for PPRL

In this section, we present a novel privacy-preserving linkage protocol, which considers covert adversaries to audit the comparison and classification steps of PPRL. This protocol uses SBF to reduce the amount of information shared during the PPRL execution. Moreover, this protocol uses the decentralized SBF characteristic (presented in Eq. (5)) to provide auditability in the similarity computations performed by the PPRL parties. The protocol also considers a semi-trusted third party (STTP),
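One way to picture how split-wise similarity computations could be audited is sketched below. Eq. (5) is not reproduced in this excerpt, so the sketch assumes a property that holds for the Dice coefficient in general: the similarity of two full filters can be recovered from per-split bit counts. The function names, the reported tuple layout, and the assumption that the auditor can access the split being checked are all hypothetical.

```python
def split_counts(split_a, split_b):
    """Return (common set bits, set bits in a, set bits in b) for one split."""
    common = sum(x & y for x, y in zip(split_a, split_b))
    return common, sum(split_a), sum(split_b)


def combined_dice(counts):
    """Combine per-split counts into the Dice coefficient of the full filters."""
    common = sum(c for c, _, _ in counts)
    total = sum(a + b for _, a, b in counts)
    return 2 * common / total if total else 0.0


def audit(reported, split_a, split_b):
    """Recompute one split's counts and compare them with the reported ones."""
    return tuple(reported) == split_counts(split_a, split_b)


# Two toy 8-bit filters, each cut into two 4-bit splits.
splits_a = [[1, 0, 1, 1], [0, 1, 0, 0]]
splits_b = [[1, 0, 0, 1], [0, 1, 1, 0]]

# Honest parties report the true counts for the split they compared.
honest = [split_counts(a, b) for a, b in zip(splits_a, splits_b)]
overall = combined_dice(honest)

# A cheating party inflates the number of common bits in split 0.
forged = (3, 3, 2)
honest_passes = audit(honest[0], splits_a[0], splits_b[0])
forgery_caught = not audit(forged, splits_a[0], splits_b[0])
```

In this toy setting, recomputing a single split is enough to expose a forged report, while the honest per-split counts combine to the same value that a comparison of the full filters would give.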

Auditable blockchain-based PPRL

In this section, we describe the Auditable Blockchain-based Privacy-preserving Record Linkage (ABEL), a Proof-of-Concept implementation of the 3PAC protocol presented in Section 5.

Unlike the traditional HBC protocols, which consider an external (neutral) party that all participants trust [4], [14], [35], ABEL considers a semi-trusted third party (STTP), which the parties do not fully trust. Because they do not fully trust the STTP, the parties need to audit its computations. In order to enable

Evaluation

The main goal of this section is to evaluate the efficacy (quality), efficiency, and the privacy capabilities introduced by our contributions, the SBF and ABEL approaches. To this end, we present an empirical and theoretical evaluation of them.

Related work

This research is closely related to PPRL and decentralized data management. In this sense, we can list the following works as related to the problem tackled in this work.

Vatsalan and Christen [15] proposed a protocol to identify matching sets of entities (records) held by multiple parties that have a similarity above a certain threshold under the semi-honest adversary model. The protocol divides and distributes the BF chunks amongst the parties. To calculate the entities’ overall similarity,

Conclusion and future work

In this paper, we have presented a PPRL protocol that considers a covert adversary model. This protocol considers a decentralized computing environment (Blockchain) and uses a novel implementation of the Bloom Filter (SBF), which reduces the information available to the PPRL parties. Specifically, SBF decreases the available information in case of collusion amongst malicious parties. The Blockchain technology was employed to implement a semi-trusted party and, consequently, provides

CRediT authorship contribution statement

Thiago Nóbrega: Conceptualization, Methodology, Software, Writing - original draft, Writing - review & editing. Carlos Eduardo S. Pires: Methodology, Writing - original draft, Writing - review & editing. Dimas Cassimiro Nascimento: Methodology, Writing - original draft, Writing - review & editing.

Declaration of Competing Interest

The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.

Acknowledgments

This research was supported by the Brazilian National Council of Technological and Scientific Development (CNPq).

References (49)

  • Aumann Y. et al. Security against covert adversaries: Efficient protocols for realistic adversaries. J. Cryptol. (2010)

  • El-Hindi M. et al. BlockchainDB - towards a shared database on blockchains

  • Nathan S. et al. Blockchain meets database. Proc. VLDB Endow. (2019)

  • Dasu T. et al. Unchain your blockchain

  • Dinh T.T.A. et al. Blockbench: A framework for analyzing private blockchains

  • Schnell R. et al. Privacy-preserving record linkage using bloom filters. BMC Med. Inform. Decis. Mak. (2009)

  • Nóbrega T.P. et al. Blind attribute pairing for privacy-preserving record linkage

  • Vatsalan D. et al. Multi-party privacy-preserving record linkage using bloom filters (2016)

  • Araujo T.B. et al. Spark-based streamlined metablocking

  • Batini C. et al.

  • Ranbaduge T. et al. Privacy-preserving temporal record linkage

  • Vatsalan D. et al. Scalable multi-database privacy-preserving record linkage using counting bloom filters (2017)

  • Vatsalan D. et al. Privacy-preserving record linkage (2018)

  • Vatsalan D. et al. Privacy-preserving record linkage for big data: Current approaches and research challenges

Cited by (15)

    • Blockchain technology for cybersecurity: A text mining literature analysis

      2022, International Journal of Information Management Data Insights
    • Explanation and answers to critiques on: Blockchain-based Privacy-Preserving Record Linkage

      2022, Information Systems
      Citation Excerpt :

      Furthermore, the PPRL solution needs to provide a comprised among privacy, efficiency, and quality according to the needs of the PPRL parties’ requirements. In this context, the Blockchain-based Privacy-Preserving Record Linkage (BC-PPRL) [5] introduces a novel protocol that enables the auditability of the computations performed by the PPRL parties. Furthermore, the BC-PPRL compares the records iteratively (using a fraction of the original information) to audit the computation performed by different parties and reduce the information shared during the PPRL process.

    • A critique and attack on “Blockchain-based privacy-preserving record linkage”

      2022, Information Systems
      Citation Excerpt :

      We experimentally assess these concerns and develop a simple attack method to validate our concerns on real-world data in Section 4. We conclude our paper in Section 5 with a summary of our findings, and provide further issues we identified in the paper by Nóbrega et al. [19] in the Appendix. The central privacy assumption of the BC-PPRL protocol, besides the blockchain mechanism which is aimed at detecting cheating parties, is that each party in the protocol has access to less information compared to existing PPRL protocols where full BFs are sent from the database owners to the linkage unit.
