
Privacy-Preserving Record Linkage Using Local Sensitive Hash and Private Set Intersection

  • Conference paper
Applied Cryptography and Network Security Workshops (ACNS 2022)

Abstract

The amount of data stored in data repositories increases every year. This makes it challenging to link records between different datasets across companies and even internally, while adhering to privacy regulations. Address or name changes, and even different spelling used for entity data, can prevent companies from using private deduplication or record-linking solutions such as private set intersection (PSI). To this end, we propose a new and efficient privacy-preserving record linkage (PPRL) protocol that combines PSI and local sensitive hash (LSH) functions, and runs in linear time. We explain the privacy guarantees that our protocol provides and demonstrate its practicality by executing the protocol over two datasets with \(2^{20}\) records each in 11–45 min, depending on network settings.

M. Mirkin—The work for this paper was done while Michael Mirkin was with IBM Research.


Notes

  1.

    https://www.ncsbe.gov/results-data/voter-registration-data, last accessed Mar 2022.

  2.

    In practice, if \(P_{s}\) learns that both parties share a record with the same SSN and at a later stage learns that the other record fields do not match, then it may deduce that \(D_{r}\) contains a record with a very close SSN that leaks information. Following previous studies, we only consider leaks that occur as a result of the protocol itself.

References

  1. Baker, D.B., et al.: Privacy-preserving linkage of genomic and clinical data sets. IEEE/ACM Trans. Comput. Biol. Bioinf. 16(4), 1342–1348 (2019). https://doi.org/10.1109/TCBB.2018.2855125


  2. Barker, E., Chen, L., Moody, D.: Recommendation for Pair-Wise Key-Establishment Schemes Using Integer Factorization Cryptography (Revision 1) (2014). https://doi.org/10.6028/NIST.SP.800-56Br1

  3. Carter, J.L., Wegman, M.N.: Universal classes of hash functions. J. Comput. Syst. Sci. 18(2), 143–154 (1979)


  4. Chen, F., et al.: Perfectly secure and efficient two-party electronic-health-record linkage. IEEE Internet Comput. 22(2), 32–41 (2018). https://doi.org/10.1109/MIC.2018.112102542


  5. Chen, H., Huang, Z., Laine, K., Rindal, P.: Labeled PSI from fully homomorphic encryption with malicious security. In: Proceedings of the 2018 ACM SIGSAC Conference on Computer and Communications Security, CCS 2018, pp. 1223–1237. Association for Computing Machinery, New York (2018). https://doi.org/10.1145/3243734.3243836

  6. Chen, H., Laine, K., Rindal, P.: Fast private set intersection from homomorphic encryption. In: Proceedings of the 2017 ACM SIGSAC Conference on Computer and Communications Security, CCS 2017, pp. 1243–1255. Association for Computing Machinery, New York (2017). https://doi.org/10.1145/3133956.3134061

  7. Chen, Y.: Current approaches and challenges for the two-party privacy-preserving record linkage (PPRL). In: Collaborative Technologies and Data Science in Artificial Intelligence Applications, pp. 108–116 (2020). https://codassca2020.aua.am/wp-content/uploads/2020/09/2020_Codassca_Chen.pdf

  8. Christen, P., Ranbaduge, T., Vatsalan, D., Schnell, R.: Precise and fast cryptanalysis for bloom filter based privacy-preserving record linkage. IEEE Trans. Knowl. Data Eng. 31(11), 2164–2177 (2019). https://doi.org/10.1109/TKDE.2018.2874004


  9. Christen, P., Schnell, R., Vatsalan, D., Ranbaduge, T.: Efficient cryptanalysis of bloom filters for privacy-preserving record linkage. In: Kim, J., Shim, K., Cao, L., Lee, J.-G., Lin, X., Moon, Y.-S. (eds.) PAKDD 2017. LNCS (LNAI), vol. 10234, pp. 628–640. Springer, Cham (2017). https://doi.org/10.1007/978-3-319-57454-7_49


  10. Churches, T., Christen, P.: Blind data linkage using n-gram similarity comparisons. In: Dai, H., Srikant, R., Zhang, C. (eds.) PAKDD 2004. LNCS (LNAI), vol. 3056, pp. 121–126. Springer, Heidelberg (2004). https://doi.org/10.1007/978-3-540-24775-3_15


  11. Clifton, C., et al.: Privacy-preserving data integration and sharing. In: Proceedings of the 9th ACM SIGMOD Workshop on Research Issues in Data Mining and Knowledge Discovery, DMKD 2004, pp. 19–26. Association for Computing Machinery, New York (2004). https://doi.org/10.1145/1008694.1008698

  12. Cong, K., et al.: Labeled PSI from homomorphic encryption with reduced computation and communication. In: Proceedings of the 2021 ACM SIGSAC Conference on Computer and Communications Security, CCS 2021, pp. 1135–1150. Association for Computing Machinery, New York (2021). https://doi.org/10.1145/3460120.3484760

  13. Cui, H., Yu, Y.: A Not-So-Trival Replay Attack Against DH-PSI. Cryptology ePrint Archive, Report 2020/901 (2020). https://ia.cr/2020/901

  14. Essex, A.: Secure approximate string matching for privacy-preserving record linkage. IEEE Trans. Inf. Forensics Secur. 14(10) (2019). https://doi.org/10.1109/TIFS.2019.2903651

  15. Franke, M., Rahm, E.: Evaluation of Hardening Techniques for Privacy-Preserving Record Linkage (2021)


  16. Franke, M., Sehili, Z., Rahm, E.: Parallel privacy-preserving record linkage using LSH-based blocking. In: International Conference on Internet of Things, Big Data and Security (IoTBDS) (2018). https://www.scitepress.org/Papers/2018/66827/66827.pdf

  17. Freeman, D.: Pairing-based identification schemes. Cryptology ePrint Archive, Report 2005/336 (2005). https://ia.cr/2005/336

  18. Gkoulalas-Divanis, A., Vatsalan, D., Karapiperis, D., Kantarcioglu, M.: Modern privacy-preserving record linkage techniques: an overview. IEEE Trans. Inf. Forensics Secur. 16, 4966–4987 (2021). https://doi.org/10.1109/TIFS.2021.3114026


  19. Gyawali, B., Anastasiou, L., Knoth, P.: Deduplication of scholarly documents using locality sensitive hashing and word embeddings. In: Proceedings of The 12th Language Resources and Evaluation Conference, Marseille, France, pp. 894–903. European Language Resources Association (2020). https://oro.open.ac.uk/70519/

  20. He, X., Machanavajjhala, A., Flynn, C., Srivastava, D.: Composing differential privacy and secure computation: a case study on scaling private record linkage. In: Proceedings of the 2017 ACM SIGSAC Conference on Computer and Communications Security, CCS 2017, pp. 1389–1406. Association for Computing Machinery, New York (2017). https://doi.org/10.1145/3133956.3134030

  21. Huberman, B.A., Franklin, M., Hogg, T.: Enhancing privacy and trust in electronic communities. In: Proceedings of the 1st ACM Conference on Electronic Commerce, EC 1999, pp. 78–86. Association for Computing Machinery (1999). https://doi.org/10.1145/336992.337012

  22. IBM: IBM InfoSphere® Optim™ Test Data Fabrication (2022). https://www.ibm.com/products/infosphere-optim-test-data-fabrication

  23. IBM Research: Helayers (2022). https://hub.docker.com/r/ibmcom/helayers-pylab

  24. Ioffe, S.: Improved consistent sampling, weighted MinHash and L1 sketching. In: 2010 IEEE International Conference on Data Mining, pp. 246–255 (2010)


  25. Karapiperis, D., Gkoulalas-Divanis, A., Verykios, V.S.: FEDERAL: a framework for distance-aware privacy-preserving record linkage. IEEE Trans. Knowl. Data Eng. 30(2), 292–304 (2018). https://doi.org/10.1109/TKDE.2017.2761759


  26. Karapiperis, D., Verykios, V.S.: A distributed near-optimal LSH-based framework for privacy-preserving record linkage. Comput. Sci. Inf. Syst. 11(2), 745–763 (2014). https://doi.org/10.2298/CSIS140215040K


  27. Kargupta, H., Datta, S., Wang, Q., Sivakumar, K.: Random-data perturbation techniques and privacy-preserving data mining. Knowl. Inf. Syst. 7(4), 387–414 (2005). https://doi.org/10.1007/s10115-004-0173-6


  28. Khurram, B., Kerschbaum, F.: SFour: a protocol for cryptographically secure record linkage at scale. In: 2020 IEEE 36th International Conference on Data Engineering (ICDE), pp. 277–288 (2020). https://doi.org/10.1109/ICDE48307.2020.00031

  29. Kroll, M., Steinmetzer, S.: Automated cryptanalysis of bloom filter encryptions of health records. In: Proceedings of the International Joint Conference on Biomedical Engineering Systems and Technologies, BIOSTEC 2015, vol. 5, pp. 5–13. SCITEPRESS - Science and Technology Publications, Lda, Setubal, PRT (2015). https://doi.org/10.5220/0005176000050013

  30. Kuzu, M., Kantarcioglu, M., Durham, E., Malin, B.: A constraint satisfaction cryptanalysis of bloom filters in private record linkage. In: Fischer-Hübner, S., Hopper, N. (eds.) PETS 2011. LNCS, vol. 6794, pp. 226–245. Springer, Heidelberg (2011). https://doi.org/10.1007/978-3-642-22263-4_13


  31. Leskovec, J., Rajaraman, A., Ullman, J.D.: Finding similar items. In: Mining of Massive Datasets, pp. 73–130 (2014). https://infolab.stanford.edu/~ullman/mmds/ch3a.pdf

  32. Li, Y., Xia, K.: Fast video deduplication via locality sensitive hashing with similarity ranking. In: Proceedings of the International Conference on Internet Multimedia Computing and Service, ICIMCS 2016, pp. 94–98. Association for Computing Machinery, New York (2016). https://doi.org/10.1145/3007669.3007725

  33. Meadows, C.: A more efficient cryptographic matchmaking protocol for use in the absence of a continuously available third party. In: 1986 IEEE Symposium on Security and Privacy, p. 134 (1986). https://doi.org/10.1109/SP.1986.10022

  34. Mullaymeri, X., Karakasidis, A.: A two-party private string matching fuzzy vault scheme. In: Proceedings of the 36th Annual ACM Symposium on Applied Computing, pp. 340–343. Association for Computing Machinery (2021). https://doi.org/10.1145/3412841.3442079

  35. Pinkas, B., Schneider, T., Zohner, M.: Faster private set intersection based on OT extension. In: 23rd USENIX Security Symposium (USENIX Security 2014), San Diego, CA, pp. 797–812. USENIX Association (2014). https://www.usenix.org/conference/usenixsecurity14/technical-sessions/presentation/pinkas

  36. Rao, F.Y., Cao, J., Bertino, E., Kantarcioglu, M.: Hybrid private record linkage: separating differentially private synopses from matching records. ACM Trans. Priv. Secur. 22(3) (2019). https://doi.org/10.1145/3318462

  37. Ravikumar, P., Cohen, W.W., Fienberg, S.E.: A secure protocol for computing string distance metrics. PSDM held at ICDM (2004). https://www.cs.cmu.edu/afs/cs.cmu.edu/Web/People/wcohen/postscript/psdm-2004.pdf

  38. Saleem, A., Khan, A., Shahid, F., Masoom Alam, M., Khan, M.K.: Recent advancements in garbled computing: how far have we come towards achieving secure, efficient and reusable garbled circuits. J. Netw. Comput. Appl. 108(January), 1–19 (2018). https://doi.org/10.1016/j.jnca.2018.02.006


  39. Schnell, R., Bachteler, T., Reiher, J.: Privacy-preserving record linkage using Bloom filters. BMC Med. Inform. Decis. Mak. 9(1), 41 (2009). https://doi.org/10.1186/1472-6947-9-41


  40. Vatsalan, D., Sehili, Z., Christen, P., Rahm, E.: Privacy-preserving record linkage for big data: current approaches and research challenges. In: Zomaya, A.Y., Sakr, S. (eds.) Handbook of Big Data Technologies, pp. 851–895. Springer, Cham (2017). https://doi.org/10.1007/978-3-319-49340-4_25


  41. Wong, K.S.S., Kim, M.H.: Privacy-preserving similarity coefficients for binary data. Comput. Math. Appl. 65(9), 1280–1290 (2013). https://doi.org/10.1016/j.camwa.2012.02.028



Author information


Corresponding author

Correspondence to Nir Drucker.


Appendices

A Related Work

To demonstrate our solution, we use a PSI instantiation based on public-key cryptography; specifically, one that leverages the commutative properties of the DH key agreement scheme. This PSI construction was introduced in [21], with a similar construction appearing even earlier in [33]. Subsequent PSI works consider other, more complex cryptographic primitives such as homomorphic encryption (HE) [6] and oblivious transfer (OT) [35]. While the latter solutions may offer an interesting tradeoff in terms of performance and security, we decided to stick with the basic DH-style protocol due to its simplicity and the fact that its primitives are already standardized [2]. Because we use PSI as a black box, we can also benefit from most of the advantages that the other methods provide, such as their performance and security guarantees.
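
To make the DH-style construction concrete, the following is a minimal Python sketch of the commutative-exponentiation PSI, in the spirit of [21, 33]. It is an illustrative simulation under simplifying assumptions (a Mersenne-prime modulus, SHA-256 as the hash-to-group map, toy datasets); a real deployment would use a standardized group and parameters as in [2].

# Minimal sketch of a DH-style PSI (in the spirit of [21, 33]).
# Illustrative simulation only: the modulus, the hash-to-group map, and the key
# sampling are simplifying assumptions, not the parameters used by the protocol.
import hashlib
import secrets

P = 2**127 - 1  # a Mersenne prime used here for brevity; deployments use standardized groups [2]

def hash_to_group(item: bytes) -> int:
    """Hash an item to a nonzero element modulo P (simplified hash-to-group map)."""
    return int.from_bytes(hashlib.sha256(item).digest(), "big") % (P - 1) + 1

def blind(items, sk):
    """Raise every hashed item to the party's secret exponent modulo P."""
    return [pow(hash_to_group(x), sk, P) for x in items]

sk_s = secrets.randbelow(P - 2) + 1  # P_s's secret exponent
sk_r = secrets.randbelow(P - 2) + 1  # P_r's secret exponent

D_s = [b"alice", b"bob", b"carol"]   # P_s's items
D_r = [b"bob", b"dave", b"carol"]    # P_r's items

# Round 1: the parties exchange their once-blinded sets.
s_once = blind(D_s, sk_s)            # sent by P_s to P_r
r_once = blind(D_r, sk_r)            # sent by P_r to P_s

# Round 2: each side raises the received values to its own key. Exponentiation
# commutes, so equal items map to the same double-blinded value H(x)^(sk_s*sk_r).
s_twice = [pow(v, sk_r, P) for v in s_once]  # computed by P_r, order preserved, returned to P_s
r_twice = {pow(v, sk_s, P) for v in r_once}  # computed by P_s locally

# P_s learns which of its own items lie in the intersection.
print([D_s[i] for i, v in enumerate(s_twice) if v in r_twice])  # -> [b'bob', b'carol']

In the LSH-PSI protocol, the items fed into such a PSI are the per-record band signatures rather than the raw records (see Appendix D).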

Our solution follows previous works in considering the balanced case, where the two datasets are roughly equal in size. PSI over unbalanced sets, for example, was studied in [5]. There were attempts to use PSI for PPRL before this paper; however, they were either noted to be inefficient [40] or relied on different techniques such as term frequency-inverse document frequency (TF-IDF) [37], which is more appropriate for comparing documents than short record fields (such as names or addresses). Furthermore, the protocol of [37] can only compare given record pairs. This implies the need for \(\mathcal {O}(n^2)\) operations, in contrast to our method, which requires \(\mathcal {O}(n)\) operations.

A complete survey of PPRL techniques and challenges is available in [18, 40], covering solutions that use different cryptographic primitives. For example, [14, 41] rely on HE, which is known for its high computational cost: [14] reports that it took somewhat less than two hours to evaluate 20,000 patient records, which is fewer records than in our evaluation by several orders of magnitude. Other works [4, 38] use garbled circuits, which can still be inefficient, while other multi-party computation (MPC) solutions such as [28] can incur high communication costs [7]. Another example is the fuzzy vaults approach, which uses secure polynomial interpolation [34], but only reports results for around 1,000 records.

Other solutions [20, 36] overcome the privacy issue by using differential privacy (DP), which provides some level of anonymization. In [36], the two parties partition the dataset into blocks of records and compare only records in corresponding blocks via an MPC process that computes the distance function. In contrast, in [20], for every block, the parties compute a private “synopsis” and send it to a third party, which uses this information to identify when blocks are too far from each other to justify a comparison of their records. In both [20, 36], the schemes' privacy comes from DP, while their security comes from the MPC process used to compare the pairs of records. The two solutions use MPC protocols for comparing integers, while our record matching metric relies on LSH, which is a more appropriate comparison method for longer texts such as addresses. In addition, the complexity of our solution is linear in the total number of records, since we do not separately compare every pair of records in the two datasets, or even every pair within pre-arranged blocks, an approach that still incurs sub-quadratic complexity. Unlike [20], our solution does not require the presence of a third party. Finally, it is possible to enhance the privacy of our scheme by adding a preprocessing DP layer as in [20, 36]. Thus, we view the usage of DP as orthogonal to our approach.

Many PPRL works use Bloom filter encodings [39], which use a locality preserving hash (LPH) function over the data. The main advantage of the Bloom filter is speed. The difference between LPH and LSH is that LPH is data-dependent, i.e., for three records \(p, q, r\), a metric d, and an LPH function \(l_p\):

$$ d(p, q)< d(q,r) \Longrightarrow d(l_p(p), l_p(q)) < d(l_p(q),l_p(r)) $$

This relation complicates the evaluation of the protocol leakage. The lack of a formal analysis for Bloom filter based solutions has led to several attacks on them [8, 9, 29, 30]. A survey of attacks and countermeasures for this method can be found in [15]. Our solution's use of LSH has an advantage over Bloom filters because LSH is data-independent and more robust against the above attacks. Methods that combine Bloom filters and LSH were presented in [16, 26]. In contrast, our solution uses only LSH, which simplifies the privacy analysis. Moreover, our use of PSI hides the LSH output and thus prevents offline attacks. In addition, [26] requires a third party and demonstrates a solution that took more than an hour to match 300K records. Another recent example is [28], which runs in \(\mathcal {O}(n \cdot polylog(n))\) and is proven cryptographically secure in the semi-honest security model. However, that method analyzed 4,096 records in 88 min, and it is not clear whether it can scale to handle more than 100K records.

B Security Assumptions

Definition 3 (Decisional DH (DDH))

For a cyclic group \(\mathbb {G}\), a generator g, and integers \(a,b,c \in \mathbb {Z}\), the decisional DH problem is hard if for every probabilistic polynomial-time (PPT) adversary \(\mathcal {A}\)

$$\begin{aligned} |Pr[\mathcal {A}(g,&g^a, g^b, g^{ab}) = 1] - \\&Pr[\mathcal {A}(g, g^a, g^b, g^c) = 1]| < negl(), \end{aligned}$$

where the probability is taken over (g, a, b, c).

Definition 4 (Computational DH (CDH))

For a cyclic group \(\mathbb {G}\), a generator g, and integers \(a,b \in \mathbb {Z}\), the computational DH problem is hard if for every PPT adversary \(\mathcal {A}\)

$$ Pr[\mathcal {A}(g, g^a, g^b) = g^{ab}] < negl(), $$

where the probability is taken over (g, a, b).

Definition 5

(One-more-DH (OMDH) [17]). Let \(\mathbb {G}\) be a cyclic group. The one-more-DH problem is hard if for every PPT adversary \(\mathcal {A}\) that is given a generator \(g \in \mathbb {G}\) together with a power \(g^a\), and that has access to two oracles: \(CDH_{g,g^a}(h) = h^a\) for \(h \in \mathbb {G}\), and a challenge oracle \(C()\) that returns a random point \(r \xleftarrow {\$} \mathbb {G}\) and can only be invoked after all calls to \(CDH_{g, g^a}\), it holds that

$$ Pr[\mathcal {A}(g, g^a, r \leftarrow C()) = r^a] < negl(), $$

where the probability is taken over (g, a).

C Privacy-Preserving Record Linkage

PPRL [11] is an ER protocol between two parties \(P_{s}\) and \(P_{r}\), with private datasets \(D_{s}\) and \(D_{r}\) of sizes \(N_{s}\) and \(N_{r}\), respectively; these records have a similarity measure \(\mu (\cdot , \cdot )\), and some additional privacy requirements. These requirements may lead to several security models and several formal definitions of PPRL.

The most intuitive way to define privacy for PPRL is by following the PSI privacy notion: \(P_{s}\) only learns \(N_{r} \) and the intersection \(D_{s} \cap D_{r} \), i.e., all records that exactly match in all fields while \(P_{r}\) only learns \(N_{s}\). Note that in both PSI and PPRL, \(P_{s}\) and \(P_{r}\) need to share the nature of the information contained in their datasets with each other to decide which QIDs they can validly compare.

The difference between PSI and PPRL is that PSI only returns exact matches according to some uniquely identifying QIDs, while PPRL returns matching records up to some similarity indicator and according to non-unique QIDs. For example, a PSI protocol may rely on users’ SSNs, while a PPRL protocol may compare first and last names. Thus, a PPRL may inadvertently match “David Doe” with “Davy Don” even if they represent different entities (users).

Fig. 6.

A Venn diagram of different ER outputs applied on two datasets \(D_{s}\) and \(D_{r}\). The ER methods are: matching only identical pairs of records (purple), matching pairs of records with a Jaccard index above some threshold (green), and matching pairs of records with matching LSH indicators (yellow). (Color figure online)

Figure 6 shows a Venn diagram of the output of different ER solutions on the \(D_{s}\) and \(D_{r}\) datasets. With the exact matching method (\(D_{s} \cap D_{r} \)), no privacy risks occur since it only reveals the agreed-upon intersection (see footnote 2). In contrast, when using the Jaccard similarity to compute the matches, the parties learn: a) records in \(D_{s} \cap D_{r} \), which is acceptable; b) records outside \(D_{s} \cap D_{r} \) that represent the same entity (true positives), which is also acceptable; and c) records outside \(D_{s} \cap D_{r} \) that represent different entities (false positives), which may break the privacy of the parties. In general, any PPRL protocol must assume this kind of leakage and should do its best to quantify it, e.g., by assuming the existence of a bound \(\tau \) on the similarity false-positive rate.

Definition 6 (PPRL)

A PPRL protocol \(\mathcal {P}\) between two parties \(P_{s} \), \(P_{r} \) with datasets \(D_{s} \), \(D_{r} \), respectively, a similarity measure \(\mu \), a measure indicator \(I^{\mu }_{t}\) for a fixed threshold t with a false-positive rate bounded by \(\tau \), has the following properties.

  • Correctness: \(\mathcal {P}\) is correct if it outputs to \(P_{s}\) the set

    $$ res = \{ (s , Enc(r )) ~|~ s \in D_{s}, r \in D_{r}, I^{\mu }_{t}(s , r ) =1\}, $$

    where \(Enc(r )\) is an encryption of \(r \) under a secret key of \(P_{r}\).

  • Privacy: \(\mathcal {P}\) maintains privacy if \(P_{s}\) only learns res and \(N_{r} \), and \(P_{r}\) only learns \(N_{s} \).

Corollary 1

The leaked information of \(P_{r}\) in \(\mathcal {P}\) is bounded by \(\tau \cdot \frac{|res|}{N_{r}}\).

Definition 6 assumes the existence of \(\tau \) but only implicitly uses it. The reason is that \(\tau \) does not always exist. In many cases, it can be empirically estimated based on prior data or based on perturbed synthetic data. However, relying solely on empirical estimates increases the ambiguity of the privacy definition for such protocols. Moreover, in many cases, \(\tau \) depends on data from the two datasets that have different distributions, which none of the parties know in advance. Another reason for only implicitly relying on \(\tau \) is that the leaked information in Corollary 1 depends on res and can only be computed after running the protocol.

While \(\tau \) bounds the privacy leak from above, there is still the issue of quantifying the exact leakage after the protocol ends. It is not clear how the parties can verify the number of false-positive cases without revealing private data. Usually, an ER protocol is used when the compared records do not include uniquely identifying fields (such as an SSN) and thus the parties cannot compute the exact matches using PSI. Consequently, their only way to verify matches is by revealing their private data. To assist in this task, we define a protocol called a revealing PPRL.

Definition 7 (Revealing PPRL)

A revealing PPRL protocol \(\mathcal {P}\) is a PPRL protocol \(\mathcal {P}'\), where \(P_{r}\) also learns \( u = \{Enc(r ) ~|~ (s , Enc(r )) \in \mathcal {P}'.res\} \) and \(P_{s}\) also learns

$$ res' = \{ (r , Enc(r )) ~|~ (s , Enc(r )) \in \mathcal {P}'.res\}, $$

In words, \(P_{r}\) learns which of its own records are matched, and \(P_{s}\) learns the field content of the matched records of the other party. The simplest way to achieve a revealing PPRL is for \(P_{s}\) to send u to \(P_{r}\), who will then decrypt its values and hand them back to \(P_{s}\). The difference between Definitions 6 and 7 is that in the latter, \(P_{s}\) learns the values of \(P_{r}\) ’s records instead of just their encryption. While this definition leaks more data from \(P_{r}\) to \(P_{s}\), it is easier to analyze because now \(P_{s}\) can verify the matches with some probability and learn the estimated number of false-positives. We also consider the definitions of the associated mutual PPRL and the mutual revealing PPRL.

Definition 8 (Mutual PPRL)

A PPRL protocol \(\mathcal {P}\) between two parties \(P_{s} \), \(P_{r} \) with datasets \(D_{s} \), \(D_{r} \), respectively, a similarity measure \(\mu \), a measure indicator \(I^{\mu }_{t}\) for a fixed threshold t with a false-positive rate bounded by \(\tau \), has the following properties.

  • Correctness: \(\mathcal {P}\) is correct if it outputs \(res_s \) (resp. \(res_r \)) to \(P_{s}\) (resp. \(P_{r}\)), where

    $$ res_s = \{ (s , Enc(r )) ~|~ s \in D_{s}, r \in D_{r}, I^{\mu }_{t}(s , r ) =1\} $$
    $$ res_r = \{ (r , Enc(s )) ~|~ s \in D_{s}, r \in D_{r}, I^{\mu }_{t}(s , r ) =1\}, $$

    and \(Enc(r )\) (resp. \(Enc(s )\)) is an encryption of \(r \) (resp. \(s \)) under a secret key of \(P_{r}\) (resp. \(P_{s}\)).

  • Privacy: \(\mathcal {P}\) maintains privacy if \(P_{s}\) only learns \(res_s \) and \(N_{r} \), and \(P_{r}\) only learns \(res_r \) and \(N_{s} \).

The mutual revealing PPRL is similarly defined. The difference between the mutual PPRL and the revealing PPRL in terms of privacy is that in the mutual PPRL, \(P_{r}\) can match the encryption of \(P_{s}\) records to its records and therefore gains more information while \(P_{s}\) only learns the encryption of \(P_{r}\) records.

In the PPRL protocols described above, the two parties learn the intersection of their datasets. However, in some scenarios, the parties merely need to learn the number of matches and do not wish to reveal the identity of the matched records to the other party. To this end, we define an N-PPRL protocol.

Definition 9 (N-PPRL)

A PPRL protocol \(\mathcal {P}\) between two parties \(P_{s} \), \(P_{r} \) with datasets \(D_{s} \), \(D_{r} \), respectively, a similarity measure \(\mu \), a measure indicator \(I^{\mu }_{t}\) for a fixed threshold t with a false-positive rate bounded by \(\tau \), has the following properties.

  • Correctness: \(\mathcal {P}\) is correct if it outputs to \(P_{s}\) the value

    $$ N_{s \cap r } = |\{ (s , r ) ~|~ s \in D_{s}, r \in D_{r}, I^{\mu }_{t}(s , r ) =1\}|, $$
  • Privacy: \(\mathcal {P}\) maintains privacy if \(P_{s}\) (resp. \(P_{r}\)) only learns \(N_{s \cap r }, N_{r} \) (resp. \(N_{s} \)).

The mutual N-PPRL protocol is similarly defined.

D Example of the LSH-PSI Protocol

A concrete example of Steps 1.b - 3 of the LSH-PSI PPRL protocol (Fig. 5) is given in Fig. 7. Suppose that \(v = H(455)^{sk_s sk_r}\); then \(P_{s}\) learns via the PSI process that \(P_{r}\) also has a band signature with the same value, 455. \(P_{r}\) took care to preserve the order of \(P_{s}\)’s encrypted band signatures during the PSI, so \(P_{s}\) can map the shared value v back to the band signature for Band 1 of record \(N_s \), and deduce that \(P_{r}\) has some unknown record that is similar to her own record \(N_s \).

Fig. 7.

Steps 1.b - 3 of our protocol. \(P_{s}\) learns via the PSI protocol that the signature for Record \(N_{s}\) Band 1 is shared with \(P_{r}\).
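
As a companion to this example, the sketch below shows how a record could be turned into band signatures (shingling, R Min-Hashes per band, B bands) before entering the PSI. The shingle length, the seeded SHA-256 Min-Hashes, and the parameter values are illustrative assumptions, not the paper's exact construction.

# Sketch of LSH banding: a record becomes B band signatures, each summarizing R Min-Hashes.
# The shingle size, hash construction, and (B, R) values below are illustrative only.
import hashlib

def shingles(text: str, k: int = 3) -> set:
    """Character k-shingles of a normalized record string."""
    s = text.lower()
    return {s[i:i + k] for i in range(len(s) - k + 1)}

def minhash(shingle_set: set, seed: int) -> int:
    """Min-Hash under one seeded hash function (simulating one random permutation)."""
    return min(int.from_bytes(hashlib.sha256(f"{seed}|{sh}".encode()).digest()[:8], "big")
               for sh in shingle_set)

def band_signatures(record: str, B: int, R: int) -> list:
    """Return B band signatures; band b hashes together Min-Hashes b*R, ..., b*R + R - 1."""
    sh = shingles(record)
    sigs = []
    for b in range(B):
        mins = [minhash(sh, seed=b * R + r) for r in range(R)]
        sigs.append(hashlib.sha256("|".join(map(str, mins)).encode()).hexdigest())
    return sigs

# Near-duplicate records share most shingles, so some of their band signatures
# collide with noticeable probability; only the signatures are fed into the PSI.
sig_a = band_signatures("John Doe, 12 Main St.", B=20, R=5)
sig_b = band_signatures("Jon Doe, 12 Main St.", B=20, R=5)
print(sum(a == b for a, b in zip(sig_a, sig_b)), "matching bands out of", len(sig_a))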

E Using the Jaccard Indicator

Theorem 1 shows that the LSH-PSI PPRL protocol follows Definition 6 when considering the LSH as the similarity indicator. This means that security reviewers need to accept the privacy leakage that occurs when using an LSH, something that is already done by many organizations that perform RL. However, some reviewers may instead prefer to trust the Jaccard index due to its wide acceptance.

Figure 6 shows two ways to define LSH false-positive events: in relation to exact matches of entire records, as in the LSH-PSI PPRL, or in relation to the method of matching pairs of records with a high enough Jaccard index. Thus, according to the latter definition, an LSH false-positive happens only when a pair of records is matched due to having at least one shared LSH band, and yet the pair does not have a high enough Jaccard index to justify a claim of similarity. Bounding the false-positive event rate \(\tau '\) based on the latter definition allows us to define an LSH-PSI PPRL related to the Jaccard index metric but with a different bound \(\tau \cdot \tau '\), where \(\tau \) is the original Jaccard false-positive bound. In this section, we further discuss the relation between the LSH and the Jaccard index.

For two records \(s , r \) with Jaccard index J, Fig. 8 shows the probability for an \(\texttt {LSHMatch} =1\) event according to Eq. 2 with \(R=20\) and \(B=200\). In standard ER solutions, it is the role of the domain expert to decide which Jaccard index indicates enough similarity between two records. For example, in the figure, the targeted Jaccard index is 0.78. The figure shows the cumulative probability of getting true-positives (\(J(s , r ) > 0.78\) and \(\texttt {LSHMatch}(s , r ) =1\)), true-negatives (\(J(s , r ) \le 0.78\) and \(\texttt {LSHMatch}(s , r ) =0\)), and the corresponding false-positive and false-negative cumulative probabilities.

Fig. 8.

The function \(F(J) = 1-(1-J^R)^B\) from Eq. 2, where \(R=20\) and \(B=200\). The black vertical line is the Jaccard index threshold.
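
As a numeric companion to Eq. 2 and Fig. 8, the short sketch below evaluates \(F(J) = 1-(1-J^R)^B\) together with the usual rule-of-thumb transition point \((1/B)^{1/R}\) (cf. [31]); the parameter pair and the sample Jaccard values passed in are illustrative inputs, not a statement about the paper's chosen setup.

# Evaluate the LSH match probability F(J) = 1 - (1 - J^R)^B from Eq. 2, and the
# standard approximation of the curve's transition point, (1/B)^(1/R) (cf. [31]).
# The (B, R) pair and the sample Jaccard values below are illustrative inputs.

def lsh_match_probability(jaccard: float, B: int, R: int) -> float:
    """Probability that two records with the given Jaccard index share >= 1 band."""
    return 1.0 - (1.0 - jaccard ** R) ** B

def approx_threshold(B: int, R: int) -> float:
    """Jaccard index around which F(J) rises steeply (rule of thumb from [31])."""
    return (1.0 / B) ** (1.0 / R)

B, R = 200, 20  # illustrative parameters
print(f"approximate transition point: {approx_threshold(B, R):.3f}")
for j in (0.5, 0.7, 0.78, 0.9):
    print(f"J = {j:.2f} -> Pr[LSHMatch = 1] = {lsh_match_probability(j, B, R):.6f}")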

The above example shows that when \(B=200\) and \(R=20\), it is possible to close the gap between the Jaccard index and the LSH by choosing the Jaccard threshold to be below 0.5. In that case, the probability of a false-positive event is less than 0.0001, which means that roughly one in every ten thousand records leaks. However, using such a Jaccard threshold will yield many false-positive cases relative to exact record matching, which is less desirable in terms of privacy.

It turns out that it is possible to tune the slope of the cumulative probability function. Figure 9 compares the probability functions in four different setups: \(B=200, R=20\) (setup 1), \(B=100, R=100\) (setup 2), \(B=14, R=30\) (setup 3), and \(B=120, R=18\) (setup 4). Here, we see that replacing setup 1 with setup 2 allows us to set the Jaccard threshold at 0.78 while reducing the LSH false-positive rate to as low as \(10^{-8}\). However, setup 2 dramatically increases the LSH false-negative rate. Note, however, that false negatives affect the security less than false positives, and in addition, users are often much more averse to receiving false-positive reports than to missing reports due to false negatives. Setup 2 may also improve the overall performance of the protocol relative to setup 1 because there are fewer bands to encrypt and communicate, as described in the following section.

E.1 Optimizing the Protocol

Setup 4 in Fig. 9 probably results in more false-positive and false-negative cases than setup 1, and the low slope of its curve implies a larger region of uncertainty. However, the PSI for setup 4 runs more than 6 times faster than the PSI for setup 1, because there are just 20 rather than 180 band signatures that need to be encrypted and communicated. The change in the R parameter does not affect the performance as much, since it merely determines the number of Min-Hashes that need to be computed locally. It turns out that computing a Min-Hash (e.g., with the highly optimized SHA-256 operation) is much faster than computing an exponentiation in the underlying group of the DH protocol. Moreover, there are known methods for quickly producing R different permutations out of a single SHA-256 call, such as the Mersenne twister [3]. Finally, the value of R does not affect the size of the communication.

Fig. 9.

A comparison of four probability functions \(F(J) = 1-(1-J^R)^B\) (see Eq. 2) with different B and R values.

We use the B and R parameters to control the curve, which in turn affects the protocol’s accuracy and performance. Reducing B makes it less likely to find a matching band signature, thus increasing the false-negative probability, but improving performance. The rate of false-negatives can be reduced by decreasing R, thus making it more probable for two bands to match. Conversely, if the false-positive rate is too high, then one can increase R with little performance penalty. We therefore optimize the process by searching for values of B and R that have the minimal B value (for best performance) while more or less preserving the targeted curve shape.
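
One possible way to carry out this search is sketched below; it is an assumed implementation, not the authors' procedure. It scans candidate values of B in increasing order and, for each, picks the R whose curve stays closest (in maximum absolute deviation over a Jaccard grid) to the target curve, stopping at the first B within a chosen tolerance.

# Hypothetical sketch of the B/R search described above: find a small number of
# bands B whose curve F(J) = 1 - (1 - J^R)^B stays close to a target curve.
# The target setup, the grid, and the tolerance are illustrative assumptions.

def curve(B: int, R: int, grid):
    return [1.0 - (1.0 - j ** R) ** B for j in grid]

def max_deviation(c1, c2) -> float:
    return max(abs(a - b) for a, b in zip(c1, c2))

def minimize_bands(target_B: int, target_R: int, tol: float = 0.05,
                   max_B: int = 250, max_R: int = 100):
    grid = [i / 50.0 for i in range(1, 50)]        # Jaccard values in (0, 1)
    target = curve(target_B, target_R, grid)
    for B in range(1, max_B + 1):                  # smallest B first = best performance
        dev, best_R = min((max_deviation(curve(B, R, grid), target), R)
                          for R in range(1, max_R + 1))
        if dev <= tol:
            return B, best_R, dev
    return None

print(minimize_bands(target_B=200, target_R=20))   # -> some (B, R, deviation) triple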

Suppose, for example, that setup 1 has the targeted probability function. The figure shows the probability function for setup 3, which runs almost twice as fast as setup 1 and has an almost identical probability function. Setup 4 has an almost identical curve to setup 3, so it gives almost identical accuracy, but it runs much slower because it requires almost 8 times more bands.

E.2 Scoring the Reported Matches

When a PPRL protocol relies on the Jaccard index but its implementation uses LSH, it may be in the users’ interest to quantify the number of false-positive events. To this end, we present a way to estimate the Jaccard index based on the LSH results.

Estimating the Jaccard Index for Matching Pairs. When using LSH with B band signatures, it is possible to estimate the actual Jaccard index J by using a binomial confidence interval. By Observation 1, the probability for a matching band (i.e. the probability for a match in all R Min-Hashes of the band) is \(p=J^R\). Suppose that \(P_{s}\) learns that there are h matching band signatures and \(t=B-h\) non-matching band signatures. Using a \(95\%\) confidence interval, the Jaccard index lies in the range

$$\begin{aligned} \left[ \root R \of {\left| \frac{h}{B} - 1.96 \sqrt{t \frac{h}{B^3}}\right| }, \root R \of {\left| \frac{h}{B} + 1.96 \sqrt{t \frac{h}{B^3}}\right| } \right] \end{aligned}$$
(3)
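
The interval in Eq. 3 is straightforward to compute; a sketch follows (the counts h, B, and R below are made-up example values, and the upper end is additionally clipped at 1 as a practical guard).

# Estimate the Jaccard index from LSH observations via Eq. 3: with h matching and
# t = B - h non-matching band signatures, the per-band match probability p = J^R gets
# a 95% normal-approximation confidence interval, which is mapped back by an R-th root.
# The example values of h, B, and R are illustrative.

def jaccard_interval(h: int, B: int, R: int, z: float = 1.96):
    t = B - h
    half_width = z * (t * h / B ** 3) ** 0.5
    low = abs(h / B - half_width) ** (1.0 / R)
    high = min(1.0, h / B + half_width) ** (1.0 / R)  # clipped at 1 as a practical guard
    return low, high

low, high = jaccard_interval(h=12, B=200, R=20)
print(f"estimated Jaccard index in [{low:.3f}, {high:.3f}]")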

In some cases this interval is too wide, and the users may prefer a different approach, such as a revealing PPRL. In a revealing PPRL, the two parties learn the intersection of their datasets as in a standard PPRL, but they also learn the other party's records that are involved in the intersection. Thus, the leaked information in a revealing PPRL is higher than in a PPRL. Below, we propose an approach whose privacy leakage lies between that of a revealing PPRL and that of a PPRL, where we compute the Jaccard index only for matching pairs, without revealing the exact shingles.

Computing the Precise Jaccard Index for Matching Pairs. Suppose that at the end of the LSH-PSI PPRL protocol, \(P_{s}\) learns the matching pair \((s , Enc(r ))\). \(P_{s}\) can ask \(P_{r}\) to participate in another PSI process over the set of shingles of \((s , r )\), where \(P_{s}\) knows \(s \) and \(P_{r}\) knows \(r \). In this PSI, \(P_{s}\) only learns the intersection size of the associated shingles \(|S \cap R|\) and the size |R|, so it can compute \(J(s , r ) = \frac{|S \cap R|}{|S| + |R| - |S \cap R|}\). Note that learning only the intersection size and not the intersection itself makes it harder for \(P_{s}\) to guess \(P_{r}\) ’s record.
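
The arithmetic of this step is simple once the intersection size is known. The sketch below computes the shingle sets and \(|S \cap R|\) in the clear purely to illustrate the formula; in the protocol, \(P_{s}\) would learn only \(|S \cap R|\) and |R| from the additional PSI, without seeing \(P_{r}\)'s shingles.

# Compute J(s, r) = |S ∩ R| / (|S| + |R| - |S ∩ R|) from set sizes only.
# The shingle sets and the intersection size are computed in the clear here purely
# to illustrate the formula; in the protocol, P_s would learn |S ∩ R| and |R| from
# the additional PSI without seeing P_r's shingles.

def char_shingles(text: str, k: int = 3) -> set:
    s = text.lower()
    return {s[i:i + k] for i in range(len(s) - k + 1)}

S = char_shingles("John Doe, 12 Main St.")  # known to P_s
R = char_shingles("Jon Doe, 12 Main St.")   # known to P_r

intersection_size = len(S & R)              # in the protocol: learned via the PSI
jaccard = intersection_size / (len(S) + len(R) - intersection_size)
print(f"J(s, r) = {jaccard:.3f}")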

These additional PSIs are relatively expensive in terms of performance, but we only need to carry them out for the reported matches, which are presumably only a very small fraction of all possible pairs of records. \(P_{s}\) and \(P_{r}\) can decide to perform such PSIs for every matching pair, for selected pairs of special interest, or for pairs selected after estimating the Jaccard index as described above. As mentioned in Sect. C, performing selective PSIs leaks the size of the selection to an eavesdropper, and this should be taken into account in the application's threat model.


Copyright information

© 2022 The Author(s), under exclusive license to Springer Nature Switzerland AG

About this paper


Cite this paper

Adir, A. et al. (2022). Privacy-Preserving Record Linkage Using Local Sensitive Hash and Private Set Intersection. In: Zhou, J., et al. Applied Cryptography and Network Security Workshops. ACNS 2022. Lecture Notes in Computer Science, vol 13285. Springer, Cham. https://doi.org/10.1007/978-3-031-16815-4_22


  • DOI: https://doi.org/10.1007/978-3-031-16815-4_22


  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-031-16814-7

  • Online ISBN: 978-3-031-16815-4

  • eBook Packages: Computer Science, Computer Science (R0)
