We present a new efficient protocol for computing private set union (PSU). Here two semi-honest parties, each holding a dataset of known size (or of a known upper bound), wish to compute the union of their sets without revealing anything else to either party. Our protocol is in the OT hybrid model. Beyond OT extension, it is fully based on symmetric-key primitives. We motivate the PSU primitive by its direct application to network security and other areas.
At the technical core of our PSU construction is the reverse private membership test (RPMT) protocol. In RPMT, the sender with input \(x^*\) interacts with a receiver holding a set X. As a result, the receiver learns (only) the bit indicating whether \(x^* \in X\), while the sender learns nothing about the set X. (Previous similar protocols provide output to the opposite party, hence the term “reverse” private membership.) We believe our RPMT abstraction and constructions may be a building block in other applications as well.
We demonstrate the practicality of our proposed protocol with an implementation. For input sets of size \(2^{20}\) and using a single thread, our protocol requires 238 s to securely compute the set union, regardless of the bit length of the items. Our protocol is amenable to parallelization. Increasing the number of threads from 1 to 32, our protocol requires only 13.1 s, a factor of \(18.25{\times }\) improvement.
To the best of our knowledge, ours is the first protocol that reports on large-size experiments, makes code available, and avoids extensive use of computationally expensive public-key operations. (No PSU code is publicly available for prior work, and the only prior symmetric-key-based work reports on small experiments and focuses on the simpler 3-party, 1-corruption setting.) Our work improves the reported PSU state of the art by a factor of up to \(7,600{\times }\) for large instances.
Of course, \(x\in \{0,1\}^*\) needs to be “hashed down” to an element of the field we are working with. This can be done, e.g., by applying a collision resistant hash function. For simplicity, here we mention, but don’t formalize this step.
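As a minimal sketch of this "hashing down" step, the snippet below truncates a SHA-256 digest to \(\sigma\) bits. The choice of SHA-256, the value \(\sigma = 128\), and the name `hash_to_field` are illustrative assumptions, not details taken from the protocol description.

```python
import hashlib

SIGMA = 128  # assumed bit length of a field element (hypothetical choice)


def hash_to_field(x: bytes, sigma: int = SIGMA) -> int:
    """Map an arbitrary byte string to a sigma-bit integer.

    The integer is read as the bit-string encoding of an element of
    GF(2^sigma). SHA-256 stands in for the collision resistant hash.
    """
    if sigma % 8 != 0 or not 0 < sigma <= 256:
        raise ValueError("sigma must be a multiple of 8 in (0, 256]")
    digest = hashlib.sha256(x).digest()
    # Keep only the first sigma bits of the digest.
    return int.from_bytes(digest[: sigma // 8], "big")


element = hash_to_field(b"198.51.100.7")
```

Truncation keeps the sketch short; collision resistance then rests on \(\sigma\) being large enough for the intended set sizes.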
We thank all anonymous reviewers and Brice Minaud for insightful feedback.
Vladimir Kolesnikov was supported in part by Sandia National Laboratories, a multimission laboratory managed and operated by National Technology and Engineering Solutions of Sandia, LLC., a wholly owned subsidiary of Honeywell International, Inc., for the U.S. Department of Energy’s National Nuclear Security Administration under contract DE-NA-0003525. He was also supported in part by the Office of the Director of National Intelligence (ODNI), Intelligence Advanced Research Projects Activity (IARPA), via 2019-1902070008. The views and conclusions contained herein are those of the authors and should not be interpreted as necessarily representing the official policies, either expressed or implied, of ODNI, IARPA, or the U.S. Government. The U.S. Government is authorized to reproduce and distribute reprints for governmental purposes notwithstanding any copyright annotation therein.
Mike Rosulek and Ni Trieu were partially supported by NSF awards #1617197, a Google faculty award, and a Visa faculty award.
A RPMT Optimization
In the RPMT protocol, the receiver computes a polynomial P with special output s. The sender computes \(s^* = P(h(x^*)) \oplus q^*\), where \(q^*\) is its OPRF output. Then the parties use PEQT to securely compare s to \(s^*\).
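To see why comparing \(s\) with \(s^*\) tests membership, assume (as in the simulation of Step 3 in the proof of Theorem 3) that \(P\) is interpolated over the points \((h(x_i), s \oplus q_i)\), where \(q_i = F(k,x_i)\) for each \(x_i \in X\). For a sender input that lies in \(X\), a short check gives:

\[
x^* = x_i \;\Longrightarrow\; q^* = F(k,x^*) = q_i \;\Longrightarrow\; s^* = P(h(x_i)) \oplus q^* = (s \oplus q_i) \oplus q_i = s .
\]

For \(x^* \not \in X\), \(h(x^*)\) is not an interpolation point (barring a collision in \(h\)) and \(q^*\) is a pseudorandom value unrelated to the \(q_i\), so one would expect \(s^* = s\) only with negligible probability.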
In the context of PSU, it is not necessary to use PEQT for this step. Instead, the sender can simply send \(s^*\) to the receiver. The logic is as follows: If \(x^* \in X\), the receiver should learn only this fact (and nothing about \(x^*\)). This is still the case after the optimization because the sender will compute the same polynomial output \(s^*\) for any such \(x^* \in X\). If \(x^* \not \in X\), it means that the receiver will eventually learn \(x^*\) as part of the PSU output (and the receiver can infer that \(x^*\) was contributed by the sender). The PSU simulator will therefore have the value \(x^*\), and it can perfectly simulate the polynomial output \(s^* = P(h(x^*)) \oplus q^*\).
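The optimized step can be sketched in a few lines of Python. This is a toy model under stated assumptions, not the authors' implementation: the field is taken to be \(GF(2^{128})\) with the reduction polynomial \(x^{128}+x^7+x^2+x+1\), SHA-256 plays the role of \(h\), and HMAC-SHA256 is a non-oblivious stand-in for the OPRF \(F\) (a real execution would deliver \(q^*\) to the sender without revealing \(k\)). All function names are hypothetical.

```python
import hashlib
import hmac
import secrets

SIGMA = 128                    # assumed field size in bits
MODULUS = (1 << 128) | 0x87    # x^128 + x^7 + x^2 + x + 1


def gf_mul(a: int, b: int) -> int:
    """Multiply two elements of GF(2^128) (carry-less, reduced)."""
    r = 0
    while b:
        if b & 1:
            r ^= a
        b >>= 1
        a <<= 1
        if a >> SIGMA:
            a ^= MODULUS
    return r


def gf_inv(a: int) -> int:
    """Invert a nonzero element by computing a^(2^128 - 2)."""
    r, e = 1, (1 << SIGMA) - 2
    while e:
        if e & 1:
            r = gf_mul(r, a)
        a = gf_mul(a, a)
        e >>= 1
    return r


def h(x: bytes) -> int:
    """Stand-in for the hash h into the field."""
    return int.from_bytes(hashlib.sha256(x).digest()[:16], "big")


def prf(k: bytes, x: bytes) -> int:
    """Stand-in for F(k, x); NOT oblivious, for illustration only."""
    return int.from_bytes(hmac.new(k, x, hashlib.sha256).digest()[:16], "big")


def interpolate(points):
    """Lagrange interpolation; returns coefficients, lowest degree first."""
    coeffs = [0] * len(points)
    for i, (xi, yi) in enumerate(points):
        num, denom = [1], 1
        for j, (xj, _) in enumerate(points):
            if j == i:
                continue
            nxt = [0] * (len(num) + 1)     # num * (X + xj); minus is XOR
            for d, c in enumerate(num):
                nxt[d] ^= gf_mul(c, xj)
                nxt[d + 1] ^= c
            num = nxt
            denom = gf_mul(denom, xi ^ xj)
        scale = gf_mul(yi, gf_inv(denom))
        for d, c in enumerate(num):
            coeffs[d] ^= gf_mul(c, scale)
    return coeffs


def evaluate(coeffs, x: int) -> int:
    """Horner evaluation of the polynomial at x."""
    r = 0
    for c in reversed(coeffs):
        r = gf_mul(r, x) ^ c
    return r


def receiver_build(X, k: bytes):
    """Receiver: pick s and interpolate P over (h(x_i), s XOR F(k, x_i))."""
    s = secrets.randbits(SIGMA)
    return s, interpolate([(h(x), s ^ prf(k, x)) for x in X])


def sender_respond(coeffs, x_star: bytes, q_star: int) -> int:
    """Sender: s* = P(h(x*)) XOR q*, sent in the clear instead of via PEQT."""
    return evaluate(coeffs, h(x_star)) ^ q_star


k = secrets.token_bytes(16)
X = [b"10.0.0.1", b"10.0.0.2", b"10.0.0.3"]
s, P = receiver_build(X, k)
member = sender_respond(P, b"10.0.0.2", prf(k, b"10.0.0.2")) == s
non_member = sender_respond(P, b"10.0.0.9", prf(k, b"10.0.0.9")) == s
```

Here `member` is true, and `non_member` is false except with negligible probability. The receiver's comparison `s_star == s` replaces the PEQT call of the unoptimized protocol.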
We now formalize the details of this modification. Rather than define a weaker/leaky version of RPMT, we instead introduce a protocol for 1-vs-n PSU. Such a functionality is quite similar to RPMT, which can be thought of as revealing only the cardinality \(|\{x^*\} \cup X|\), which is equivalent to revealing \(|\{x^*\} \setminus X|\) (either 0 or 1).
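The relationship between the two functionalities can be illustrated with plain set operations. The sketch below models only the ideal input/output behavior (no cryptography); the function names are hypothetical.

```python
def psu_1_vs_n(x_star, X):
    """Ideal 1-vs-n PSU: the receiver learns the union, the sender nothing."""
    sender_out = None
    receiver_out = set(X) | {x_star}
    return sender_out, receiver_out


def rpmt(x_star, X):
    """Ideal RPMT: the receiver learns only the membership bit."""
    return x_star in set(X)


X = {"a", "b", "c"}
for x_star in ("b", "z"):
    _, union = psu_1_vs_n(x_star, X)
    difference = {x_star} - X
    # |{x*} union X| = |X| + |{x*} minus X|, and the RPMT bit is its complement.
    assert len(union) == len(X) + len(difference)
    assert rpmt(x_star, X) == (len(difference) == 0)
```

The union's cardinality carries exactly the RPMT bit; 1-vs-n PSU additionally reveals \(x^*\) itself when \(x^* \not \in X\).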
The details of the 1-vs-n PSU protocol are given in Fig. 9. Now, using 1-vs-n PSU as a building block instead of RPMT, our full-fledged PSU protocol can be written as in Fig. 10.
The security proof of the full-fledged PSU protocol is essentially the same as in the pre-optimization protocol. The security of the 1-vs-n protocol is given below:
Theorem 3
The construction of Fig. 9 securely implements functionality \(\mathcal {F}^{1,n}_{\textsf {psu}}\) in the semi-honest model, given the OPRF primitive defined in Fig. 4.
We exhibit simulators \(\mathsf {Sim}_{\mathcal {R}}\) and \(\mathsf {Sim}_{\mathcal {S}}\) for simulating corrupt \(\mathcal {R}\) and \(\mathcal {S}\) respectively, and argue the indistinguishability of the produced transcript from the real execution.
Corrupt Sender. \(\mathsf {Sim}_{\mathcal {S}}(x^*)\) simulates the view of corrupt \(\mathcal {S}\), which consists of \(\mathcal {S}\)’s randomness, input, output and received messages. \(\mathsf {Sim}_{\mathcal {S}}\) proceeds as follows. It first chooses \(q'\in _R \{0,1\}^\sigma \), calls OPRF simulator \(\mathsf {Sim}_{S_\mathsf{OPRF}} (x^*, q')\), and appends its output to the view.
\(\mathsf {Sim}_{\mathcal {S}}\) simulates Step 3 as follows. It generates random \(s' \in \{0,1\}^\sigma \), and n random points \((x'_i, q'_i) \in _R (\{0,1\}^\star ,\{0,1\}^\sigma )\). \(\mathsf {Sim}_{\mathcal {S}}\) then interpolates the polynomial P over these points \(\{h(x'_i),s' \oplus q'_i\}\) and appends its coefficients to the generated view.
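The simulated Step 3 can be sketched as follows. This is an illustrative toy under stated assumptions: the field is \(GF(2^{128})\) with reduction polynomial \(x^{128}+x^7+x^2+x+1\), SHA-256 stands in for \(h\), and fixed-length random byte strings stand in for the random \(x'_i \in \{0,1\}^\star\). The function names are hypothetical.

```python
import hashlib
import secrets

SIGMA = 128                    # assumed field size in bits
MODULUS = (1 << 128) | 0x87    # x^128 + x^7 + x^2 + x + 1


def gf_mul(a: int, b: int) -> int:
    """Multiply two elements of GF(2^128)."""
    r = 0
    while b:
        if b & 1:
            r ^= a
        b >>= 1
        a <<= 1
        if a >> SIGMA:
            a ^= MODULUS
    return r


def gf_inv(a: int) -> int:
    """Invert a nonzero element by computing a^(2^128 - 2)."""
    r, e = 1, (1 << SIGMA) - 2
    while e:
        if e & 1:
            r = gf_mul(r, a)
        a = gf_mul(a, a)
        e >>= 1
    return r


def h(x: bytes) -> int:
    """Stand-in for the hash h into the field."""
    return int.from_bytes(hashlib.sha256(x).digest()[:16], "big")


def interpolate(points):
    """Lagrange interpolation; returns coefficients, lowest degree first."""
    coeffs = [0] * len(points)
    for i, (xi, yi) in enumerate(points):
        num, denom = [1], 1
        for j, (xj, _) in enumerate(points):
            if j == i:
                continue
            nxt = [0] * (len(num) + 1)     # num * (X + xj)
            for d, c in enumerate(num):
                nxt[d] ^= gf_mul(c, xj)
                nxt[d + 1] ^= c
            num = nxt
            denom = gf_mul(denom, xi ^ xj)
        scale = gf_mul(yi, gf_inv(denom))
        for d, c in enumerate(num):
            coeffs[d] ^= gf_mul(c, scale)
    return coeffs


def evaluate(coeffs, x: int) -> int:
    """Horner evaluation of the polynomial at x."""
    r = 0
    for c in reversed(coeffs):
        r = gf_mul(r, x) ^ c
    return r


def sim_sender_step3(n: int):
    """Sample s', n random points (x'_i, q'_i), and interpolate P over
    (h(x'_i), s' XOR q'_i). Returns the coefficients and the points."""
    s_prime = secrets.randbits(SIGMA)
    points = []
    for _ in range(n):
        x_i = secrets.token_bytes(16)      # random x'_i (fixed length here)
        q_i = secrets.randbits(SIGMA)      # random q'_i
        points.append((h(x_i), s_prime ^ q_i))
    return interpolate(points), points


coeffs, points = sim_sender_step3(4)
```

The returned coefficient list is what the simulator appends to the view; no input of \(\mathcal {R}\) is used to produce it.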
We argue that the output of \(\mathsf {Sim}_{\mathcal {S}}\) is indistinguishable from the real execution. For this, we formally show the simulation by proceeding through a sequence of hybrid transcripts \(T_0,T_1, T_2\), where \(T_0\) is the real view of \(\mathcal {S}\), and \(T_2\) is the output of \(\mathsf {Sim}_{\mathcal {S}}\).
Hybrid 1. Let \(T_1\) be the same as \(T_0\), except that the OPRF execution is replaced as follows. By the OPRF/BaRK-OPRF pseudorandomness guarantee and the indistinguishability of the output of \(\mathsf {Sim}_{S_\mathsf{OPRF}}\), we replace \(F(k,x^*)\) and \(F(k,x_i), \forall i \in [n],\) with \(q'\) and \(q'_i, \forall i \in [n]\), respectively. We note that if \(x^* = x_i\), then \(q'=q'_i\). It is easy to see that \(T_0\) and \(T_1\) are indistinguishable.
Hybrid 2. Let \(T_2\) be the same as \(T_1\), except that the polynomial is a uniform polynomial of degree \(n-1\). Consider the following two cases:
\(x^* \not \in X\): Since all values \(q'_i\) are uniformly random from \(\mathcal {S}\)’s point of view, so are the \(s \oplus q'_i\).
\(x^* = x_i\) (consequently, \(q'=q'_i\)): Since the other values \(q'_j\), \(j \in [n]\), \(j \ne i\), are uniformly random from \(\mathcal {S}\)’s point of view, we replace these \(s \oplus q'_j\) with random values. Then s is used only in the expression \(s\,\oplus \,q'_i\). Since s is uniform, \(s\,\oplus \,q'_i\) is also uniformly random from \(\mathcal {S}\)’s point of view, even though the adversary knows \(q'=q'_i\).
In summary, the polynomial from the real execution can be replaced with a polynomial P over random points. \(T_1\) and \(T_2\) are indistinguishable.
Corrupt Receiver. \(\mathsf {Sim}_{\mathcal {R}}(x_1,...,x_n, out)\) simulates \(\mathcal {R}\)’s view, which includes \(\mathcal {R}\)’s randomness, input, output and received messages. \(\mathsf {Sim}_{\mathcal {R}}\) proceeds as follows.
First, if \(out = \{x_1, \ldots , x_n, x^*\}\) for some \(x^*\), then the simulator knows \(\mathcal {S}\)’s input \(x^*\) and can trivially simulate all of \(\mathcal {S}\)’s actions honestly. This case of simulation is clearly perfect.
Otherwise, \(\mathsf {Sim}_{\mathcal {R}}\) chooses a random \(k'\in _R \{0,1\}^\kappa \), calls OPRF simulator \(\mathsf {Sim}_{S_\mathsf{OPRF}} (\bot ,k')\), and appends its output to the view. It simulates a message \(s^*=s\) from \(\mathcal {S}\) in Step 4. Finally, to simulate Step 5, \(\mathsf {Sim}_{\mathcal {R}}\) runs simulator \(\mathsf {Sim}_\mathsf{OT}\) on input \((1,\bot )\) and appends the output of \(\mathsf {Sim}_\mathsf{OT}\) to the view.
The view generated by \(\mathsf {Sim}_{\mathcal {R}}\) is indistinguishable from a real view because of the indistinguishability of the transcripts of the underlying simulators.
