1 Introduction

Secure Multiparty Computation (MPC) enables mutually untrusting parties to compute a function of their private inputs while revealing only the function output. The Goldreich-Micali-Wigderson (GMW) protocol is a foundational technique that realizes MPC for p players and that tolerates up to \(p-1\) semi-honest corruptions. In GMW, the players jointly evaluate a circuit \(C\) by (1) randomly secret sharing their private input values, (2) privately evaluating \(C\) gate-by-gate, ensuring that the random secret shares encode the correct value for each wire, and (3) reconstructing secret shares on the output wires.

While \(\mathtt {XOR}\) gates are evaluated without interaction, \(\mathtt {AND}\) gates require communication in the form of oblivious transfer (OT). The bottleneck in GMW performance is communication incurred by OTs, both in terms of bandwidth consumption and latency.

In this work, we improve the bandwidth consumption of the GMW protocol for circuits that include conditional branching. In particular, we improve by up to the branching factor: for a circuit with b branches, we reduce bandwidth consumption by up to \(b\times \).

The Cost of Round Complexity and GMW Use Cases. GMW requires a round of communication for each of the circuit’s layers of \(\mathtt {AND}\) gates. Because in many scenarios the network latency is substantial, constant-round protocols, such as Garbled Circuit (GC), are often preferred.

Nevertheless, there are a number of scenarios where GMW is preferable to GC and other protocols:

  • GMW efficiently supports multiparty computation and is resilient against a dishonest majority. While multiparty GC protocols exist, they are expensive: the GC is generated jointly among players such that no small subset of players can decrypt wire labels. Thus, the GC must be generated inside MPC, which is expensive.

  • Many useful circuits are low-depth or have low-depth variants. GMW’s multi-round nature is less impactful for low-depth circuits, and prior work has shown that the protocol can outperform GC in these cases  [SZ13].

  • It is possible to front-load most of GMW’s bandwidth consumption to a pre-computation phase. When pre-computation is allowed, GMW can perform useful work even before the computed function is known. Indeed, given precomputed random OTs, GMW consumes only 6 bits per \(\mathtt {AND}\) gate in the 2PC setting (1-out-of-2 bit OT can be done by transferring a single one-bit secret and a single two-bit secret as introduced in  [Bea95]); this holds for arbitrary \(C\). In contrast, GC protocols cannot perform useful work until the circuit is known.
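To make the 6-bits-per-gate figure concrete, the following is a minimal sketch of the  [Bea95] derandomization trick as we understand it: a precomputed random OT is turned into a chosen 1-out-of-2 bit OT by sending one bit from the receiver and two bits from the sender (3 bits per OT, two OTs per \(\mathtt {AND}\) gate). Function names are ours, for illustration only.

```python
import secrets

def random_ot():
    """Precomputation: a random OT correlation. The sender gets two
    random bits (r0, r1); the receiver gets a random choice bit d
    together with the corresponding r_d."""
    r0, r1 = secrets.randbelow(2), secrets.randbelow(2)
    d = secrets.randbelow(2)
    return (r0, r1), (d, (r0, r1)[d])

def derandomized_ot(m0, m1, c, sender_corr, receiver_corr):
    """Online phase: turn the random OT into a chosen 1-out-of-2 OT.
    Online traffic: 1 bit (e) plus 2 bits (y0, y1) = 3 bits."""
    r0, r1 = sender_corr
    d, r_d = receiver_corr
    e = c ^ d                      # receiver -> sender: 1 bit
    y0 = m0 ^ (r0, r1)[e]          # sender -> receiver: 2 bits
    y1 = m1 ^ (r0, r1)[1 ^ e]
    return (y0, y1)[c] ^ r_d       # receiver recovers m_c

# sanity check over all sender secrets and choice bits
for m0 in (0, 1):
    for m1 in (0, 1):
        for c in (0, 1):
            s, r = random_ot()
            assert derandomized_ot(m0, m1, c, s, r) == (m0, m1)[c]
```

Since the correlation is consumed independently of the transferred secrets, the expensive OT-extension work can run before the function is known, as claimed above.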

In sum, GMW is suitable for a number of practical scenarios, and its improvement benefits many applications.

Goal: (Almost) Free Branching in GMW. GMW is a circuit-based protocol, and as such, all of \(C\)’s branches must be evaluated by the players. Until the recent work of  [HK20] (whose improvement is not for MPC, but for the simpler zero-knowledge setting), it was widely believed that the cost of branching is unavoidable in circuit-based protocols. In this work, we show how to essentially eliminate the cost of branching for GMW. Our technique is wholly different from that of  [HK20]; their ‘stacking’ technique has no obvious analog in GMW due to the interactive nature of the protocol.

Semi-honest GMW requires two bit-OTs per \(\mathtt {AND}\) gate per pair of players. The cost of such an OT includes transferring the secrets (cheap, 3 bits using  [Bea95]) and consuming one row of the OT extension matrix (expensive, \(\kappa \) bits). Evaluation of all but one branch is ultimately discarded by the MPC, and our goal is to eliminate this waste.

We work in the semi-honest model, which is useful in many scenarios (e.g. protecting against players who may become corrupted in the future). Furthermore, advances in the semi-honest model often lead to similar advances in the malicious model. We leave exploring such improvements to future work.

1.1 Our Contributions

  • Efficient \(\mathtt {VS}\) gate. We extend the GMW protocol with gates that we call ‘vector-scalar’ gates (\(\mathtt {VS}\)). \(\mathtt {VS}\) gates allow p players to multiply a shared vector of b bits by a shared scalar bit for only \(p\cdot (p-1)\) OTs. Standard GMW computes each multiplication separately and thus requires \(b\cdot p \cdot (p-1)\) OTs. Thus, we reduce bandwidth consumption by \(b\times \) when evaluating the \(\mathtt {VS}\) gate.

  • (Almost) free conditional branching. We show how to use \(\mathtt {VS}\) to essentially eliminate the communication cost of inactive branches. Precisely, we amortize random OTs needed to securely compute \(\mathtt {AND}\) gates across a conditional. The players must still broadcast several bits per \(\mathtt {AND}\) gate, but this cost is small compared to the expensive \(\kappa \)-bit random OTs which we amortize. For a circuit with b branches, we improve communication by up to \(b\times \) as compared to state-of-the-art GMW. Our computation costs are also slightly lower than standard GMW because we process fewer OTs.

  • Implementation and evaluation. We implemented our approach in C++ and report performance (see Sect. 9). For 2PC and a circuit with 16 branches, we improve communication by 9.4\(\times \) and total wall-clock time by 5.1\(\times \) on a LAN and 9.2\(\times \) on a LAN with shared traffic (i.e. lower bandwidth).

1.2 Presentation Outline

We motivated our work in Sect. 1 and summarized the contributions in Sect. 1.1. We present related work in Sect. 2, review the basic GMW protocol in Sect. 3, and introduce notation in Sect. 4.

We present a technical summary of our approach in Sect. 5. We formally specify our protocols in Sect. 6 and provide proofs in Sect. 7. We discuss implementation details and evaluate performance in Sects. 8 and 9.

2 Related Work

We improve the state-of-the-art Goldreich-Micali-Wigderson (GMW) protocol [GMW87] by adding an efficient vector-scalar multiplication gate (\(\mathtt {VS}\)) that is notably useful for executing conditional branches. We therefore review related work that improves (1) secure computation of conditional branches and (2) the classic GMW protocol.

Stacked Garbling. A recent line of work improves communication of GC with conditional branching in settings where one player knows the evaluated branch  [Kol18, HK20]. [Kol18] is motivated by the use case where the GC generator knows the taken branch, e.g. while evaluating one of several DB queries. [HK20] is motivated by ZK proofs.

Prior to these works, it was generally believed that all circuit branches must be processed and transmitted according to the underlying protocol. [Kol18, HK20] break this assumption by using communication proportional to only the longest branch, given that one of the players knows which branch is taken.

Our research direction was inspired by these prior works: we show that communication reduction via conditional branching efficiently carries to GMW as well. In particular, the OTs used to compute \(\mathtt {AND}\) gates can be amortized across branches. Unlike [Kol18, HK20], we do not require any player to know which branch is taken.

Universal Circuits. Our work improves conditional branching by adding a new gate primitive that amortizes OTs across branches. Another approach instead recompiles branches into a new form. Universal circuits (UCs) are programmable constructions that can evaluate arbitrary circuits up to a given size n. Thus, a single UC can be programmed to compute any single branch in a conditional, amortizing the gate costs of the individual branches.

Unfortunately, a UC representing circuits of size n incurs significant overhead in the number of gates. Decades after Valiant’s original construction [Val76], UC enjoyed a renewed interest due to its use in MPC, and UC size has steadily improved  [KS08, LMS16, GKS17, AGKS19, KS16, ZYZL18]. The state-of-the-art UC construction has size \(3n \log n\)  [LYZ+20]. Even with these improvements, representing conditional branches with UCs is often impractical. For example, if we consider branches of size \(n=2^{10}\) gates, the state-of-the-art UC construction has factor \(3\cdot \log (2^{10}) = 30\times \) overhead. In addition, programming the UC based on branch conditions known only to the MPC player is a difficult and expensive process. Thus, in use cases arising in evaluation of typical programs, UC-based branch evaluation is slower than naïve circuit evaluation.

[KKW17] observed that UCs are overly general for conditional branching: a UC can represent any circuit up to size n, while a conditional has a fixed and often small set of publicly known circuits. Correspondingly, [KKW17] generalized UCs to Set Universal Circuits (\(\mathcal {S}\)-UCs). An \(\mathcal {S}\)-UC can be programmed to implement any circuit in a fixed set \(\mathcal {S}\), rather than the entire universe of circuits of size n. By constraining the problem to smaller sets, the authors improved UC overhead. [KKW17] used heuristics to exploit common sub-structures in the topologies of the circuits in \(\mathcal {S}\) by overlaying the circuits with one another. For a specific set of 32 circuits, the authors achieved 6.1\(\times \) size reduction compared to separately representing each circuit. For 32 circuits, our approach can improve by up to 32\(\times \). Additionally, we do not face the expensive problem of programming the conditional based on conditions known only to the MPC player. Finally,  [KKW17] is a heuristic whose performance depends on the specific circuits. Our approach is much more general.

Oblivious Transfer (OT) Extension and Silent OT. Since OT requires expensive public-key primitives, efficient GMW relies on OT extension  [Bea96, IKNP03]. Our implementation uses the highly performant 1-out-of-2 OT extension of [IKNP03] as implemented by the EMP-toolkit  [WMK16]. More specifically, we precompute 1-out-of-2 random OTs in a precomputation phase and use the standard trick  [Bea95] to cheaply construct 1-out-of-2 OT from random OT.

With  [IKNP03], each 1-out-of-2 OT requires transmission of a \(\kappa \)-bit (e.g. 128-bit) OT matrix row, regardless of the length of the sent secrets. Reducing the number of consumed OT matrix rows is the source of our improvement: our \(\mathtt {VS}\) gate takes advantage of the fact that a single 1-out-of-2 OT of b-bit strings is much cheaper than b 1-out-of-2 OTs of 1-bit strings, since in the former case only one \(\kappa \)-bit OT matrix row is consumed.

Silent OT is an exciting recent primitive that generates large numbers of random OTs from relatively short pseudorandom correlation generators  [BCG+19]. It largely removes the communication overhead of random OT when a large batch is executed. Currently, [IKNP03] remains more efficient than Silent OT in many contexts because Silent OT incurs expensive computation and involves operations with high RAM consumption  [BCG+19]. We stress that although we emphasize communication improvement via amortizing OTs, Silent OT does not replace our approach. Indeed, our approach yields improvement even if we use Silent OT, because we reduce the number of needed random OTs, thus allowing us to run a smaller Silent OT instance. Therefore, our approach significantly reduces the computation overhead of Silent OT, both in terms of RAM consumption and wall-clock time.

GMW with Multi-input/Multi-output Gates. Prior work  [KK13, KKW17, DKS+17] noticed that the cost of OTs associated with GMW gate evaluation could be amortized across several gates.   [KK13] improved OT for short secrets by extending  [IKNP03] 1-out-of-2 OT to a 1-out-of-n OT at only double the cost.   [KKW17, DKS+17] applied the  [KK13] OT to larger gates with more than the standard two inputs/one output, thus amortizing the OT matrix cost across several gates. As a secondary benefit, merging several gates into larger gates reduces the circuit depth and latency overhead.

Unfortunately, the above multi-input gate constructions encounter two significant problems. First, the size of the truth table, and thus bandwidth consumption, grows exponentially in the number of inputs. Therefore, it is unrealistic to construct multi-gates with large numbers of inputs. Second, gates that encode arbitrary functions do not cleanly generalize from the two-party to the multi-party setting. To explain why, we contrast arbitrary gates with \(\mathtt {AND}\) gates. \(\mathtt {AND}\) gates generalize to the multi-party setting because logical \(\mathtt {AND}\) distributes over \(\mathtt {XOR}\) secret shares. Therefore, the multiple players can construct \(\mathtt {XOR}\) shares of the \(\mathtt {AND}\) gate truth table. In contrast, an arbitrary function does not distribute over shares, and thus players cannot construct shares of the table.

Our \(\mathtt {VS}\) gate can be viewed as a particularly useful multi-input/multi-output gate that \(\mathtt {AND}\)s (multiplies) any number of vector elements with a scalar. The advantage of our approach over prior multi-input/multi-output gates is that our approach is based on algebra, not on the brute-force encoding of truth-tables. This algebra scales well both to any number of inputs/outputs and to any number of players. Of course, the most important difference is the key application of our approach – efficient branching – which was not achievable with prior work.

Arithmetic MPC and Vector OLE. A number of works presented arithmetic generalizations of MPC in the GMW style, e.g.  [IPS09, ADI+17]. Modern works in this area can efficiently multiply arbitrary field elements using a generalization of 1-out-of-2 string OT called ‘vector oblivious linear function evaluation’ (vOLE)  [ADI+17, BCGI18, DGN+17]. In addition, these works point out that field scalar-vector multiplication can be efficiently achieved with two vOLEs, and emphasize the usefulness of this technique for efficient linear algebra operations (e.g., matrix multiplication). Because we work with Boolean circuits, we do not need generalized vOLEs, and instead more efficiently base our vectorization directly on the efficient OT extension technique  [IKNP03]. Importantly, our branching application benefits from multiplication of relatively small vectors (of size equal to the branching factors), while break-even points of prior constructions imply their usefulness with much longer vectors.

Our work applies efficient scalar-vector multiplication to the unobvious and important use case of conditional branching.

Constant-Overhead MPC. Ishai et al.  [IKOS08] proposed a constant-overhead GMW-based MPC. They observe that once sufficiently many random OTs are available to the players, the remainder of the protocol can be done with constant overhead per Boolean gate. They exhibit a construction of such a pool of OTs with constant cost per OT. For this,  [IKOS08] relies on Beaver’s non-black-box OT extension  [Bea96], decomposable randomized encoding and an NC\(^0\) PRG. While asymptotically  [IKOS08]’s cost is optimal, in concrete terms, it is impractically high. Our work does not achieve constant factor overhead, but similarly improves OT utilization and is concretely efficient.

GMW Optimizations. [CHK+12] showed that GMW is particularly suitable in low-latency network settings and that it outperforms GCs in certain scenarios. [CHK+12] further showed applications in online marketplaces, such as a mobile social network where a provider helps its users connect according to mutual interests. Their implementation used multi-threaded programming to take advantage of the inherent parallelism available in executing OTs and in evaluating \(\mathtt {AND}\) gates of the same depth.

[SZ13] introduced several low-level computation improvements, such as using SIMD instructions and performing load-balancing, and circuit representation improvements, such as choosing low-depth circuits even at the cost of larger overall circuits. [SZ13] also elaborated on a number of examples where GMW is suitable, including privacy-preserving face recognition with Eigenfaces [EFG+09, HKS+10, SSW10] or Hamming distance [OPJM10]. We draw our key evaluation benchmark, a \(\log \)-depth bitstring comparison circuit, from  [SZ13].

3 GMW Protocol Review

The GMW protocol allows p semi-honest players to securely compute a Boolean function of their private inputs. The key invariant is that on each wire, the p players together hold an \(\mathtt {XOR}\) secret share of the truth value.

Consider p players \(P_1, ..., P_p\) who together evaluate a Boolean circuit \(C\). For a wire a, we denote \(P_i\)’s share of a as \(a_i\). The players step through \(C\) gate-by-gate:

  • For each wire a corresponding to an input bit from player \(P_i\), \(P_i\) uniformly samples a p-bit \(\mathtt {XOR}\) secret sharing of a and sends one share to each player.

  • To compute an \(\mathtt {XOR}\) gate \(c = a \oplus b\), the players locally add their shares:

    $$ (a_1 \oplus ... \oplus a_p) \oplus (b_1 \oplus ... \oplus b_p) = (a_1 \oplus b_1) \oplus ... \oplus (a_p \oplus b_p) $$
  • To compute an \(\mathtt {AND}\) gate, the players communicate. Consider an \(\mathtt {AND}\) gate \(c = ab\) and the following equality:

    $$ c = ab = (a_1 \oplus ... \oplus a_p)(b_1 \oplus ... \oplus b_p) = \left( \bigoplus _{i,j \in 1..p} a_i b_j\right) $$

    That is, to compute an \(\mathtt {AND}\) gate it suffices for each pair of players to multiply together their respective shares and then for the players to locally \(\mathtt {XOR}\) the results. Consider two players \(P_i\) and \(P_j\). The players compute shares of \(a_ib_j\) and \(a_jb_i\) via 1-out-of-2 OT: To compute \(a_ib_j\), \(P_i\) first samples a uniform bit \(x_i\). Then, the players perform 1-out-of-2 OT where \(P_j\) inputs \(b_j\) as her choice bit and \(P_i\) submits as input \(x_i\) and \(x_i \oplus a_i\). Let \(x_j\) be \(P_j\)’s OT output and note that \(x_i \oplus x_j = a_ib_j\). \(P_i\) \(\mathtt {XOR}\)s together her OT outputs with \(a_ib_i\) (which is computed locally) and outputs the sum.

  • For each output wire a, the players reconstruct the cleartext output by broadcasting their share and then locally \(\mathtt {XOR}\)ing all shares.

Thus, the GMW protocol securely computes an arbitrary function by consuming \(p(p-1)\) OTs per \(\mathtt {AND}\) gate. Our construction uses this same protocol, except that we replace \(\mathtt {AND}\) gates by a generalized \(\mathtt {VS}\) gate that \(\mathtt {AND}\)s an entire vector of bits with a scalar bit for \(p(p-1)\) OTs. As our key use-case, we show that this improves conditional branching.
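The two-party \(\mathtt {AND}\)-gate computation described above can be sketched as follows; this is a minimal simulation for illustration, with an idealized `ot` function standing in for a real 1-out-of-2 OT.

```python
import secrets

def ot(m0, m1, c):
    """Idealized 1-out-of-2 OT: the receiver with choice bit c learns m_c."""
    return (m0, m1)[c]

def gmw_and(a1, b1, a2, b2):
    """Two-party GMW AND gate: from XOR shares of a and b, produce
    XOR shares of c = a & b using two OTs (one per cross term)."""
    # Cross term a1*b2: P1 is the OT sender with mask x1, P2 the receiver.
    x1 = secrets.randbelow(2)
    x2 = ot(x1, x1 ^ a1, b2)           # x1 ^ x2 == a1 & b2
    # Cross term a2*b1: roles reversed, mask y2 drawn by P2.
    y2 = secrets.randbelow(2)
    y1 = ot(y2, y2 ^ a2, b1)           # y1 ^ y2 == a2 & b1
    # Each player XORs her local product with her masks/OT outputs.
    c1 = (a1 & b1) ^ x1 ^ y1
    c2 = (a2 & b2) ^ x2 ^ y2
    return c1, c2

# check correctness for all 16 share combinations
for v in range(16):
    a1, b1, a2, b2 = v & 1, (v >> 1) & 1, (v >> 2) & 1, (v >> 3) & 1
    c1, c2 = gmw_and(a1, b1, a2, b2)
    assert (c1 ^ c2) == ((a1 ^ a2) & (b1 ^ b2))
```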

4 Notation

  • We use p to denote the number of players.

  • We use subscript notation to associate a variable with a player. E.g., \(a_i\) is the share of wire a held by player \(P_i\).

  • t denotes the ‘active’ branch in a conditional, i.e., the branch that is taken during the oblivious execution. \(\bar{t}\) denotes an ‘inactive’ branch.

  • In this work, we manipulate strings of bits as vectors:

    • Superscript notation denotes vector indexes. E.g., \(a^i\) refers to the i-th index of a vector a.

    • We denote a vector of bits by writing parenthesized comma-separated values. E.g., (a, b, c) is a vector of a, b,  and c.

    • We use n to denote the length of a vector.

    • When two vectors are known to have the same length, we use \(\oplus \) to denote the bitwise XOR sum:

      $$\begin{aligned} (a^1, \ldots , a^n) \oplus (b^1, \ldots , b^n) = (a^1\oplus b^1,\ldots , a^n\oplus b^n) \end{aligned}$$
    • We indicate a vector scalar Boolean product by writing the scalar to the left of the vector:

      $$\begin{aligned} a(b^1, \ldots , b^n) = (ab^1,\ldots , ab^n) \end{aligned}$$

5 Technical Overview

Our approach amortizes OTs across conditional branches. Section 6 formalizes this approach in technical detail. In this section, we explain at a high level.

Recall that GMW computes \(\mathtt {AND}\) (Boolean multiplication) gates via 1-out-of-2 OT. Suppose that we wish to multiply an entire vector of bits \((b^1,\ldots ,b^n)\) by the same scalar a, i.e., we wish to compute \((ab^1,\ldots ,ab^n)\). \(\mathtt {MOTIF}\) amortizes the expensive 1-out-of-2 OTs needed to multiply each shared vector element by a shared scalar (hence the notation \(\mathtt {VS}\) for vector-scalar). Namely, to evaluate n \(\mathtt {AND}\) gates of this form, instead of using \(n\cdot p\cdot (p-1)\) OTs of length-1 secrets, we use only \(p \cdot (p-1)\) OTs of length-n secrets. This reduces consumption of OT extension matrix rows, the most expensive resource in the GMW evaluation.

We first show how we achieve this cheap vector scalar multiplication. Then, we show how this tool is used to reduce the cost of conditional branching.

In this section, for simplicity, we focus on the case of \(b=2\) branches and \(p=2\) players. Our approach naturally generalizes to arbitrary b and p, and we formally present our constructions in full generality in Sect. 6.

5.1 \(\mathtt {VS}\) Gates

As we showed in Sect. 3, a single \(\mathtt {AND}\) gate computed amongst p players requires \(p(p-1)\) 1-out-of-2 OTs. Our \(\mathtt {VS}\) gate construction consumes the same number of OTs, but multiplies an entire vector of bits by a scalar bit. Suppose two players \(P_1, P_2\) wish to compute the following vector operation:

$$ a (b, c) = (a b, a c) $$

where \(a = a_1\oplus a_2, b = b_1\oplus b_2\), and \(c=c_1\oplus c_2\) are GMW secret shared between \(P_1,P_2\). Note the following equality:

$$ a(b, c) = (a_1 \oplus a_2)(b_1 \oplus b_2, c_1 \oplus c_2) = a_1(b_1, c_1) \oplus a_1(b_2, c_2) \oplus a_2(b_1, c_1) \oplus a_2(b_2, c_2) $$

The first and fourth summands can be computed locally by the respective players. Thus, we need only show how to compute the second summand \(a_1(b_2, c_2)\) (the remaining third summand is computed symmetrically). To compute this vector \(\mathtt {AND}\), the players perform a single 1-out-of-2 OT of length-2 secrets. Here, \(P_2\) plays the OT sender and \(P_1\) the receiver. \(P_2\) draws two uniform bits x and y and allows \(P_1\) to choose between the following two secrets:

$$\begin{aligned} (x, y)\qquad (x \oplus b_2, y \oplus c_2) \end{aligned}$$

\(P_1\) chooses based on \(a_1\) and hence receives \((x \oplus a_1b_2, y \oplus a_1c_2)\). \(P_2\) uses the vector (xy) as her secret share of this summand. Thus, the players successfully hold shares of \(a_1(b_2, c_2)\).

Put together, the full vector multiplication a(bc) uses only two 1-out-of-2 OTs of length-2 secrets. Our \(\mathtt {VS}\) gate generalizes to arbitrary numbers of players and vector lengths: a vector scaling of b elements between p players requires \(p(p-1)\) 1-out-of-2 OTs of length b secrets.
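The two-party \(\mathtt {VS}\) gate just described can be sketched as follows; `ot_vec` is an idealized stand-in for a single 1-out-of-2 OT of length-n secrets, and the helper names are ours.

```python
import secrets

def ot_vec(m0, m1, c):
    """Idealized 1-out-of-2 OT of bit-vector secrets."""
    return (m0, m1)[c]

def xor_vec(u, v):
    return [x ^ y for x, y in zip(u, v)]

def scale(a, v):
    return [a & x for x in v]

def vs_gate(a1, v1, a2, v2):
    """Two-party VS gate: from XOR shares of scalar a and vector v,
    produce XOR shares of a*v using two OTs of length-n secrets."""
    n = len(v1)
    # Summand a1 * v2: P2 (sender) draws mask w2; P1 chooses by a1.
    w2 = [secrets.randbelow(2) for _ in range(n)]
    r1 = ot_vec(w2, xor_vec(w2, v2), a1)      # r1 = w2 ^ a1*v2
    # Summand a2 * v1: symmetric, with P1 as sender.
    w1 = [secrets.randbelow(2) for _ in range(n)]
    r2 = ot_vec(w1, xor_vec(w1, v1), a2)      # r2 = w1 ^ a2*v1
    s1 = xor_vec(xor_vec(scale(a1, v1), w1), r1)  # P1's share
    s2 = xor_vec(xor_vec(scale(a2, v2), w2), r2)  # P2's share
    return s1, s2

# the shares reconstruct to (a1 ^ a2) * (v1 ^ v2)
for _ in range(32):
    a1, a2 = secrets.randbelow(2), secrets.randbelow(2)
    v1 = [secrets.randbelow(2) for _ in range(4)]
    v2 = [secrets.randbelow(2) for _ in range(4)]
    s1, s2 = vs_gate(a1, v1, a2, v2)
    assert xor_vec(s1, s2) == scale(a1 ^ a2, xor_vec(v1, v2))
```

Note that the number of OT invocations (two) is independent of the vector length n; only the secret length grows.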

5.2 \(\mathtt {MOTIF}\): (Almost) Free Conditional Branching in GMW

We now show how \(\mathtt {VS}\) gates allow improved conditional branching. We amortize OTs used by \(\mathtt {AND}\) gates across conditional branches. Branches may be arbitrary, having different topologies and operating on independent wires.

For simplicity, consider a circuit that has only two branches and that is computed by only two players; our approach generalizes to b branches and p players. Since the two branches are conditionally composed, one branch is ‘active’ (i.e. taken) and one is ‘inactive’.

Our key invariant is that on all wires of the inactive branch the players hold a share of 0, whereas on the active branch they hold valid shares. We begin by showing how \(\mathtt {AND}\) gates interact with this invariant. In particular, the invariant allows \(\mathtt {AND}\) gates across different conditional branches to be simultaneously computed by a single \(\mathtt {VS}\) gate. Then we show how all gates maintain the invariant and how we enter/leave branches.

\(\mathtt {AND}\) Gates. Our key optimization allows the players to consider simultaneously one \(\mathtt {AND}\) gate from each branch. For example, suppose the players wish to compute both \(x^1y^1\) and \(x^2y^2\), where \(x^1,y^1\) are wires in branch 1 and \(x^2,y^2\) are wires in branch 2. Despite the fact that the players compute two gates, they need only two 1-out-of-2 OTs. Let t be the taken branch. Hence \(x^t,y^t\) are active wires and \(x^{\bar{t}},y^{\bar{t}}\) are both 0. Observe the following equalities:

$$\begin{aligned} (x^t \oplus x^{\bar{t}})y^t&= (x^t \oplus 0)y^t = x^ty^t\\ (x^t \oplus x^{\bar{t}})y^{\bar{t}}&= (x^t\oplus 0)\cdot 0 = 0 \end{aligned}$$

Thus if we efficiently compute both \((x^t \oplus x^{\bar{t}})y^t\) and \((x^t \oplus x^{\bar{t}})y^{\bar{t}}\), then we propagate the invariant: the active branch’s \(\mathtt {AND}\) output wire receives the correct value while the inactive branch’s wire receives 0. These products reduce to a vector-scalar product computed by our \(\mathtt {VS}\) gate:

$$ (x^t \oplus x^{\bar{t}})(y^t, y^{\bar{t}}) $$

Thus, we compute two \(\mathtt {AND}\) gates for the price of one. This technique generalizes to arbitrary numbers of branches: to compute b \(\mathtt {AND}\) gates across b branches, our approach consumes two OTs of length b secrets.
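This cross-branch amortization can be illustrated at the cleartext level (secret sharing omitted; in the protocol the product is computed by a \(\mathtt {VS}\) gate over shares). The function name is ours.

```python
def branch_ands(x, y, t):
    """Compute one AND gate per branch under the all-zeros invariant:
    x[i], y[i] are branch i's wire values, which are 0 on every
    inactive branch. All b products cost a single VS product."""
    b = len(x)
    # the invariant: every branch other than t carries zeros
    assert all(x[i] == 0 and y[i] == 0 for i in range(b) if i != t)
    scalar = 0
    for xi in x:
        scalar ^= xi               # XOR over branches collapses to x[t]
    return [scalar & yi for yi in y]   # (x^t XOR x^t-bar)(y^1, ..., y^b)

# two branches, branch t = 1 active: the active AND is computed,
# the inactive branch's output stays 0, preserving the invariant
out = branch_ands([0, 1], [0, 1], t=1)
assert out == [0, 1]
```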

Additional Details. Our optimization relies on ensuring all inactive wires hold 0. We now show how we establish this invariant upon entering a branch, how non-\(\mathtt {AND}\) gates maintain the invariant, and how we leave conditionals.

  • Demultiplexing. ‘Entering’ a conditional is controlled by a condition bit, a single bit whose value determines which of the two branches should be taken. To enter a conditional with two branches, we demultiplex the input values based on the condition bit. That is, we \(\mathtt {AND}\) the branch inputs with the condition bit. More precisely, for the input to branch 1, i.e. the branch taken if the condition bit holds 1, we \(\mathtt {AND}\) the input bits with the condition bit. Symmetrically, for branch 0, we \(\mathtt {AND}\) each input bit with the \(\mathtt {NOT}\) of the condition bit. Thus, we obtain a vector of valid inputs for the active branch and a vector of all 0s for the inactive branch. Because we multiply all inputs by the same two bits, we can use \(\mathtt {VS}\) gates to efficiently implement the demultiplexer. In order to implement more than two branches, we nest conditionals.

  • \(\mathtt {XOR}\) gates. \(\mathtt {XOR}\) gates trivially maintain our invariant: an \(\mathtt {XOR}\) gate with two 0 inputs outputs 0.

  • \(\mathtt {NOT}\) gates. Native \(\mathtt {NOT}\) gates would break our invariant: a \(\mathtt {NOT}\) gate with input 0 outputs 1. Thus, we do not natively support \(\mathtt {NOT}\) gates. Fortunately, we can construct \(\mathtt {NOT}\) gates from \(\mathtt {XOR}\) gates. To do so, we maintain a distinguished ‘true’ wire in each branch. We ensure, by demultiplexing, that the ‘true’ wire holds logical 1 on all active branches and logical 0 on all inactive branches. A \(\mathtt {NOT}\) gate of a wire can thus be achieved by \(\mathtt {XOR}\)ing the wire with ‘true’.

  • Multiplexing. To ‘leave’ a conditional, we resolve the output wires of the two branches: we propagate the output values on the active branch and discard the output of the inactive branch. Fortunately, our invariant means that this operation is extremely cheap: to multiplex the output values of wires on the active and inactive branches, we simply \(\mathtt {XOR}\) corresponding wires together.
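The demultiplexer and multiplexer described in the bullets above can be sketched at the cleartext level (in the protocol, the demultiplexer’s ANDs are computed as \(\mathtt {VS}\) gates over shares; function names are ours):

```python
def demux(inputs, s):
    """Enter a two-branch conditional: branch 1 is taken iff s == 1.
    Each branch's inputs are ANDed with its selector bit, so the
    inactive branch holds all zeros (our key invariant). Each AND
    column is one VS gate: one scalar (s or NOT s) times the inputs."""
    branch1 = [s & x for x in inputs]
    branch0 = [(1 ^ s) & x for x in inputs]
    return branch0, branch1

def mux(out0, out1):
    """Leave the conditional: because the inactive branch's outputs
    are all zero, XORing corresponding wires selects the active
    branch's outputs without any communication."""
    return [a ^ b for a, b in zip(out0, out1)]

ins = [1, 0, 1]
b0, b1 = demux(ins, s=1)           # branch 1 active
assert b1 == ins and b0 == [0, 0, 0]
assert mux(b0, b1) == ins          # active outputs propagate
```

The distinguished ‘true’ wire for \(\mathtt {NOT}\) gates is established the same way: demultiplexing the constant 1 yields 1 on the active branch and 0 on inactive branches.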

Branch Layer Alignment. As GMW is an interactive scheme, at any time we can only evaluate gates whose input shares have already been computed (ready gates), and thus we cannot include ‘future round’ \(\mathtt {AND}\) gates into the current \(\mathtt {VS}\) computation. In each round of GMW computation, we can only amortize OTs over the ready gates.

That is, in p-party GMW, in each round our technique eliminates all OTs except for a total of \(p(p-1)\cdot \max _i(w_i)\) OTs, where \(w_i\) is the number of \(\mathtt {AND}\) gates in the current layer of branch i. Clearly, the more aligned the circuit branches are (i.e., the more similar their numbers of \(\mathtt {AND}\) gates in each circuit layer), the higher the performance improvement.

In our experiments, we demonstrate the maximum achievable benefit of our construction by evaluating perfectly aligned circuits. While typical circuits will not have perfectly aligned branches, we do not expect them to have a poor alignment either, particularly if the branching factor is high. We leave improving alignment, perhaps via compilation techniques, as future work.
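The per-round accounting above can be illustrated with a small calculation; `motif_ots` is a hypothetical helper (our naming) comparing our per-round cost \(p(p-1)\cdot \max _i(w_i)\) to standard GMW’s \(p(p-1)\cdot \sum _i w_i\).

```python
def motif_ots(p, layers):
    """Total OT counts across rounds. layers[i][r] is the number of
    AND gates in round r of branch i (missing rounds count as 0)."""
    rounds = max(len(l) for l in layers)
    width = lambda l, r: l[r] if r < len(l) else 0
    # our cost: amortize across branches, pay for the widest layer
    ours = sum(p * (p - 1) * max(width(l, r) for l in layers)
               for r in range(rounds))
    # standard GMW: pay for every AND gate in every branch
    standard = sum(p * (p - 1) * width(l, r)
                   for l in layers for r in range(rounds))
    return ours, standard

# two perfectly aligned branches, 3 layers of 4 AND gates each:
# the full b = 2 improvement is realized
ours, std = motif_ots(2, [[4, 4, 4], [4, 4, 4]])
assert (ours, std) == (24, 48)
```

With misaligned branches the improvement degrades toward the ratio of total width to maximum width, matching the discussion above.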

6 \(\mathtt {MOTIF}\): Formalization and Protocol Construction

We now formalize \(\mathtt {MOTIF}\), our GMW extension that supports efficient branching. As in the standard GMW protocol, our approach represents functions as circuits composed from a collection of low-level gates. We presented the core technical ideas of our approach in Sect. 5; the following discussion assumes a familiarity with Sect. 5.

Underlying Idea. We implement efficient branching by simultaneous evaluation of multiple independent \(\mathtt {AND}\) gates, one gate from each mutually exclusive branch, by representing them as a single cheap \(\mathtt {VS}\) gate.

Presentation Roadmap. Our formalization involves intertwined low-level cryptographic, programming language, and circuit technical details.

In Sect. 6.1 we motivate our compilation sequence, which takes a program with if branches written in a high-level language and outputs a straight-line circuit that uses \(\mathtt {VS}\) gates. We do not yet explain in detail how it is achieved, absent a necessary formalization of circuits and gates, which we provide in Sect. 6.2. Armed with the formalization, we explain in Sect. 6.3 how vectorized \(\mathtt {VS}\) gates facilitate branching in a straight-line circuit: we provide a formal algorithm (Fig. 1) that generates a straight-line circuit with \(\mathtt {VS}\) gates implementing branching over two circuits \(C_0,C_1\).

Then, having converted a program/circuit with branching into a \(\mathtt {VS}\) circuit defined in Sect. 6.2, we focus on efficient secure evaluation of the latter. In Sect. 6.4, we complete our formalization by defining cleartext semantics. In Sect. 6.5, we present our complete protocol, with proofs in Sect. 7.

6.1 Compiling Conditionals to Straight-Line \(\mathtt {VS}\) Circuits

Our approach is concerned primarily with the efficient handling of conditional branching. Therefore, we begin our formalization by discussing how conditional branches can be efficiently represented in terms of only \(\mathtt {XOR}\) and \(\mathtt {VS}\) gates.

Assume that the user’s MPC functionality is encoded in some high-level language as a program with branching. The user hands this high-level functionality to a compiler which translates the high-level-language program into a low-level collection of gates. To interface with our approach, the compiler should output a circuit that contains \(\mathtt {XOR}\) and \(\mathtt {VS}\) gates.

It is thus the job of the compiler to translate conditionals into the \(\mathtt {VS}\) circuit. Recall (from Sect. 5.2) that our key branching invariant requires that all inactive branches hold 0 values on all wires. Consider b branches, where each branch i computes the conjunction \(x^iy^i\), and where \(x^i,y^i\) are independent values carried by i-th branch’s wires. Due to the key invariant, and as discussed in detail in Sect. 5.2, the following vector-scalar product simultaneously computes these b \(\mathtt {AND}\)s:

$$ (x^1 \oplus \ldots \oplus x^b)(y^1, \ldots , y^b) $$

The compiler’s job is to output \(\mathtt {VS}\) gates that simultaneously compute \(\mathtt {AND}\) gates in this manner. In Sect. 6.3 we show how a compiler can merge the gates of two branches in order to amortize \(\mathtt {AND}\) gates as just described. First, we describe the syntax needed for this compiler algorithm and for our protocol.

6.2 Circuit Formal Syntax

Because we add a new gate primitive, we cannot use the community-held implicit syntax of Boolean circuits. Thus, we formalize the syntax and semantics of our modified circuits such that we can prove correctness and security.

Gate Syntax. Our approach handles two kinds of gates: \(\mathtt {XOR}\) gates, which can be evaluated locally, and vector-scalar gates (\(\mathtt {VS}\)), a new type of gate, which multiplies a vector of bits by a scalar for the cost of only \(p(p-1)\) OTs. An \(\mathtt {XOR}\) gate has two input wires ab and an output wire c and computes \(c \leftarrow a \oplus b\). We denote an \(\mathtt {XOR}\) gate by writing \(\mathtt {XOR}(c, a, b)\). A vector-scalar gate \(\mathtt {VS}\) takes as input a scalar a and a vector \((b^1, \ldots , b^n)\) and computes:

$$ (c^1, \ldots , c^n) \leftarrow a (b^1, \ldots , b^n) $$

We denote a vector-scalar gate by writing \(\mathtt {VS}((c^1, \ldots , c^n), a, (b^1, \ldots , b^n))\). We also formalize the input/output wires of the circuit. We denote an input wire a whose value is given by player P by writing \(\mathtt {INPUT}(P, a)\). Finally, we indicate that wire a is an output wire by writing \(\mathtt {OUTPUT}(a)\). Formally, let variables \(a, b, c,\ldots \) be arbitrary wires and let P be an arbitrary player. The space of gates is denoted:

\(\mathtt {NOT}\) Gates. Typically, Boolean techniques support gates that perform logical \(\mathtt {NOT}\). As discussed in Sect. 5, we do not natively support \(\mathtt {NOT}\) gates, as they would break the correctness of the \(\mathtt {VS}\) implementation of conditional branches: our invariant requires all inactive wires to hold shares of 0, and \(\mathtt {NOT}\) gates flip 0 to 1. Accordingly, our formal syntax does not include \(\mathtt {NOT}\) gates. Instead, we build \(\mathtt {NOT}\) gates from \(\mathtt {XOR}\) gates and a per-branch auxiliary distinguished wire aux, which is set to \(\mathtt{aux} = 1\) in the active branch and to \(\mathtt{aux} = 0\) in all inactive branches. Then \(\lnot a = a \oplus \mathtt{aux}\), which implements \(\mathtt {NOT}\) in the active branch and preserves the all-zeros invariant in the inactive branches.
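A minimal Python sketch of this construction (the function name is illustrative):

```python
def not_gate(a, aux):
    # aux = 1 in the active branch, 0 in all inactive branches
    return a ^ aux

# Active branch: NOT is computed as usual.
assert not_gate(0, 1) == 1 and not_gate(1, 1) == 0
# Inactive branch: the wire holds 0 (key invariant) and stays 0.
assert not_gate(0, 0) == 0
```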

Circuit Syntax. A circuit is a list of gates. We do not need to “connect” the gates in the circuit, since gates already refer to specific wire ids. Formally, let \(g_1, \ldots , g_k \in \mathcal {G}\) be arbitrary gates. The space of circuits with k gates is denoted:

We consider a circuit to be valid only if the gates are in a topological order: i.e., a wire must appear as a gate output before it is used as a subsequent gate input. In upcoming discussion, we assume circuits are valid.

Circuit Layers. In our implementation, our circuit syntax groups collections of gates into layers, such that all \(\mathtt {VS}\) gates of the same depth can be computed in constant communication rounds. We omit this layering from our formalization to keep notation simple, but emphasize that the required change is straightforward.

Fig. 1.

\(\mathtt {merge}\), a compiler algorithm, demonstrates how two branch circuits can be merged into one while joining together \(\mathtt {VS}\) gates. By using an algorithm like \(\mathtt {merge}\), a compiler can use our approach to amortize the cost of OTs across conditional branches.

6.3 Merging Conditional Branches

As discussed in Sect. 6.1, we view the problem of translating from programs with conditional branches to circuits in our syntax as a problem for a compiler. In this section, we specify an algorithm \(\mathtt {merge}\) (Fig. 1) that demonstrates how a compiler can combine \(\mathtt {VS}\) gates from each branch into a single \(\mathtt {VS}\) gate (of course, the standard \(\mathtt {AND}\) gate is a special case of the \(\mathtt {VS}\) gate).

For simplicity, assume that the high-level source language contains only binary branching, perhaps through if statements. Even in this simplified model, the programmer can nest if statements to achieve arbitrary branching. We also assume that the compiler can translate low-level program statements into circuits (e.g., assignment statements are converted into circuits).

Consider two branches of an if statement, and suppose that the compiler already recursively compiled the body of both branches into two circuits \(C_0\) and \(C_1\). To finish translating the if statement while taking advantage of our approach, the compiler should merge together \(\mathtt {VS}\) gates in \(C_0\) and \(C_1\). \(\mathtt {merge}\)  is one technique for performing this combining operation. \(\mathtt {merge}\)  takes \(C_0\) and \(C_1\) as arguments and outputs a single circuit that computes both input circuits, but that uses fewer \(\mathtt {VS}\) gates than simply concatenating \(C_0\) and \(C_1\). At a high level, \(\mathtt {merge}\)  walks the two input circuits gate-by-gate. It eagerly moves \(\mathtt {XOR}\) gates from the input circuits to the output circuit until the next gate in both circuits is a \(\mathtt {VS}\) gate. \(\mathtt {merge}\)  combines these two \(\mathtt {VS}\) gates into one by concatenating the two vectors and by \(\mathtt {XOR}\)ing the two scalars. \(\mathtt {merge}\)  assumes that circuits inside of conditionals do not contain \(\mathtt {INPUT}\) or \(\mathtt {OUTPUT}\) wires.

By recursively applying \(\mathtt {merge}\)  across many conditional branches, a compiler can achieve up to \(b\times \) reduction in the number of \(\mathtt {VS}\) gates.
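The following Python sketch captures the core of an algorithm like \(\mathtt {merge}\); the gate encodings and the fresh-wire source are illustrative assumptions, not our exact formalization:

```python
# A gate is ('XOR', c, a, b) or ('VS', outs, a, vec), with wires as ids.
def merge(c0, c1, fresh):
    """Merge two branch circuits, pairing up their VS gates.
    `fresh` is an iterator of unused wire ids for the XORed scalars."""
    out, i, j = [], 0, 0
    while i < len(c0) and j < len(c1):
        if c0[i][0] == 'XOR':            # eagerly emit XOR gates
            out.append(c0[i]); i += 1
        elif c1[j][0] == 'XOR':
            out.append(c1[j]); j += 1
        else:                            # both heads are VS gates: combine them
            _, outs0, a0, vec0 = c0[i]
            _, outs1, a1, vec1 = c1[j]
            s = next(fresh)              # XOR the two scalars onto a fresh wire
            out.append(('XOR', s, a0, a1))
            out.append(('VS', outs0 + outs1, s, vec0 + vec1))
            i += 1; j += 1
    out.extend(c0[i:]); out.extend(c1[j:])
    return out
```

Because inactive branches carry 0 on all wires, XORing the two scalars recovers the active branch's scalar, so the single concatenated \(\mathtt {VS}\) gate serves both branches.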

Merging Layers. As discussed in Sect. 6.2, our formalization omits circuit layers (i.e., \(\mathtt {VS}\) gates that occur at the same multiplicative depth) for simplicity. To avoid increasing latency, merging must take care to preserve layers: merging \(\mathtt {VS}\) gates across layers can increase the overall multiplicative depth and hence add communication rounds.

One straightforward technique, which we implemented, is to only merge together \(\mathtt {VS}\) gates of the same depth. That is, our implementation introduces an extra loop which combines all \(\mathtt {VS}\) gates that are grouped in the same layer instead of handling \(\mathtt {VS}\) gates one at a time. Even this straightforward strategy is likely to yield large improvements, particularly if the branching factor is high.

Better approaches exist, and maximally amortizing OTs across branches is an interesting compilers problem in its own right. An intelligent compiler could allocate gates to layers so as to maximally match up \(\mathtt {VS}\) gates across branches without increasing depth. An even more sophisticated compiler could account for network settings when deciding whether increased multiplicative depth is worth better layer alignment.

Fig. 2.

The cleartext semantics for a circuit \(C\in \mathcal {C}\) run between p players. Each player i’s input is modeled as a string of bits \(\mathtt {inp}_i\). The method \(\mathtt {pop}\) pops the first value from the string. Each gate manipulates a wiring, which is a map from wire indexes to values. The output of evaluation is a string of bits \(\mathtt {out}\).

6.4 Circuit Cleartext Semantics

Prior discussion showed that a Boolean circuit with branches can be represented as a straight-line \(\mathtt {VS}\) circuit. We present our MPC protocol for evaluating such circuits in formal detail in Sect. 6.5.

In order to demonstrate that our protocol is correct, we require a formal semantics. I.e., we require the functionality that the protocol achieves. In this section, we specify the formal semantics of circuits as the algorithm \(\mathtt {eval}\) listed in Fig. 2. \(\mathtt {eval}\) maintains a circuit wiring: a map from wire indexes to Boolean values. Each gate reads values from the wiring for input wires and/or writes values to the wiring for output wires.
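As a concrete reading of Fig. 2, a Python sketch of \(\mathtt {eval}\) might look as follows (the gate encodings are illustrative):

```python
def eval_circuit(circuit, inps):
    """Cleartext semantics: `inps` maps each player id to her list of input bits."""
    wiring, out = {}, []               # wiring: wire id -> Boolean value
    for gate in circuit:
        if gate[0] == 'INPUT':         # pop the next bit of the named player
            _, player, a = gate
            wiring[a] = inps[player].pop(0)
        elif gate[0] == 'XOR':
            _, c, a, b = gate
            wiring[c] = wiring[a] ^ wiring[b]
        elif gate[0] == 'VS':          # multiply each vector element by the scalar
            _, outs, a, vec = gate
            for c, b in zip(outs, vec):
                wiring[c] = wiring[a] & wiring[b]
        elif gate[0] == 'OUTPUT':
            out.append(wiring[gate[1]])
    return out
```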

Fig. 3.

Our protocol from the perspective of player i. It performs the same tasks as the classic GMW protocol except for \(\mathtt {VS}\) gates, which it delegates to our vector-scalar sub-protocol.

6.5 Our Protocol

In this section, we formalize our protocol, which securely implements the semantics of \(\mathtt {eval}\) (Fig. 2):

Construction 1

(Protocol). Our protocol is defined in Figs. 3 and 4.

Theorems in Sect. 7 imply the following:

Theorem 1

Construction 1 implements the functionality \(\mathtt {eval} \) (Fig. 2) and is secure against up to \(p-1\) semi-honest corruptions in the OT-hybrid model.

Figure 3 lists our high-level protocol from the perspective of an arbitrary player \(P_i\). For the reader familiar with the details of the classic GMW protocol, the only essential difference between the classic protocol and ours is that we handle \(\mathtt {VS}\) gates by invoking an instance of our vector-scalar sub-protocol.

Our protocol ensures that the p players hold random \(\mathtt {XOR}\) secret shares of the truth values on the already-computed wires. This invariant ensures both correctness and security: the protocol is correct because the output wires’ secret shares can be reconstructed to the correct truth value, and it is secure because the \(\mathtt {XOR}\) secret shares are uniformly random, so no player’s share (or any strict subset’s shares) gives any information about the truth value on a particular wire. We argue these facts in detail in our proofs (Sect. 7).
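The invariant rests on plain \(\mathtt {XOR}\) secret sharing, which a short Python sketch makes concrete (the function names loosely mirror \(\mathtt {sendShares}\) and \(\mathtt {reconstruct}\) and are illustrative):

```python
import secrets

def share(value, p):
    """Split a bit into p XOR shares; any p-1 shares are uniformly random."""
    shares = [secrets.randbelow(2) for _ in range(p - 1)]
    last = value
    for s in shares:
        last ^= s                     # force the shares to XOR to `value`
    return shares + [last]

def reconstruct(shares):
    """Broadcast-and-XOR reconstruction."""
    v = 0
    for s in shares:
        v ^= s
    return v

assert reconstruct(share(1, 5)) == 1 and reconstruct(share(0, 5)) == 0
```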

Like the functionality \(\mathtt {eval}\), our protocol proceeds by case analysis on gates:

  • \(\mathtt {XOR}\). The players locally \(\mathtt {XOR}\) their shares. Because \(\mathtt {XOR}\) is commutative and associative, this local computation correctly implements the functionality.

  • \(\mathtt {VS}\). We delegate \(\mathtt {VS}\) gates to a separate sub-protocol (Fig. 4). Recall, \(\mathtt {VS}\) simultaneously multiplies an entire n-element Boolean vector \((x^1,\ldots ,x^n)\) by a Boolean scalar a. Let p be the number of players holding \(\mathtt {XOR}\) shares of a and \(x^1,\ldots ,x^n\), and consider an arbitrary k-th vector element \(x^k\). The sub-protocol is based on the following equivalence:

    $$\begin{aligned} ax^k = (a_1 \oplus \ldots \oplus a_p) (x^k_1 \oplus \ldots \oplus x^k_p) = \bigoplus _{i=1}^p \left( \bigoplus _{j=1}^p a_ix^k_j\right) \end{aligned}$$
    (1)

    Now, the sums \(\bigoplus _{j=1}^p a_ix^k_j\) can be delivered to player \(P_i\) simultaneously for all \(k\in [1,\ldots ,n]\) via only \(p-1\) 1-out-of-2 OTs of n-bit strings, executed with the \(p-1\) other players. Once this is done for all p players (a total of \(p(p-1)\) OTs of n-bit strings), the result is a secret sharing of the vector \((ax^1,\ldots ,ax^n)\). OT senders introduce uniform masks to protect the secrecy of their shares \(x_j^k\). The \(\mathtt {VS}\) sub-protocol is formalized in Fig. 4.

  • \(\mathtt {INPUT}\). Each input wire has a designated player who provides the input value. In our protocol, this player distributes shares of a single bit from her input. Our formalization assumes two procedures: (1) \(\mathtt {sendShares}\) constructs a uniform \(\mathtt {XOR}\) secret sharing of a given value and sends the shares to all p players, and (2) \(\mathtt {recvShare}\) is the symmetric procedure that receives a single share from the sending player.

  • \(\mathtt {OUTPUT}\). For output wires, the players simply reconstruct their \(\mathtt {XOR}\) secret shares. Our formalization assumes a protocol \(\mathtt {reconstruct}\) which handles these details. \(\mathtt {reconstruct}\) instructs each player to broadcast their share to all other players. Then, each player locally \(\mathtt {XOR}\)s together all shares.
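To make Equation (1) concrete, here is a Python sketch of vector-scalar multiplication over XOR shares, with the OT oracle modeled as a local function (all names are illustrative; this is a simulation sketch, not a network implementation):

```python
import secrets

def ot(m0, m1, choice):
    """1-out-of-2 OT oracle: the receiver learns exactly one message."""
    return m1 if choice else m0

def vs_shares(a_shares, x_shares):
    """a_shares[i]: player i's share of scalar a.
    x_shares[i][k]: player i's share of vector element x^k.
    Returns each player's XOR share of a * (x^1, ..., x^n)."""
    p, n = len(a_shares), len(x_shares[0])
    # Local terms a_i * x_i^k; cross terms a_i * x_j^k come from OTs.
    res = [[a_shares[i] & x for x in x_shares[i]] for i in range(p)]
    for i in range(p):                    # receiver, with choice bit a_i
        for j in range(p):                # sender, who masks her vector shares
            if i == j:
                continue
            r = [secrets.randbelow(2) for _ in range(n)]
            masked = [rk ^ xk for rk, xk in zip(r, x_shares[j])]
            got = ot(r, masked, a_shares[i])   # = r xor (a_i * x_j)
            res[i] = [c ^ g for c, g in zip(res[i], got)]
            res[j] = [c ^ rk for c, rk in zip(res[j], r)]  # sender keeps the mask
    return res
```

The double loop performs exactly \(p(p-1)\) OTs of n-bit strings, matching the cost stated above; XORing all players' result shares yields \((ax^1,\ldots ,ax^n)\).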

Fig. 4.

Our vector-scalar sub-protocol from the perspective of player i, showing how the players perform a vector-scalar multiplication. \(\mathtt {draw}\) uniformly draws a random bit-vector of the specified length. \(\mathtt {OTSend}\) and \(\mathtt {OTRecv}\) respectively send and receive a 1-out-of-2 OT of n-bit secrets. In practice, we precompute all random OTs at the start of the protocol.

7 Proofs

Now that we have formalized our protocol, we prove that it is correct and secure.

7.1 Proof of Correctness

Our protocol implements the functionality \(\mathtt {eval} \) (Fig. 2):

Theorem 2

(Correctness). For all circuits \(C\in \mathcal {C}\) and all input bitstrings \(\mathtt {inp}_1, \ldots , \mathtt {inp}_p\), the players’ joint output equals \(\mathtt {eval} (C, \mathtt {inp}_1, \ldots , \mathtt {inp}_p)\).

Proof

By induction on \(C\). The invariant is that gate input wires hold \(\mathtt {XOR}\) secret shares of corresponding cleartext values.

We proceed by case analysis of an individual gate g, showing that the invariant is propagated from input wires to output wires.

  • Suppose g is an input gate \(\mathtt {INPUT}(i, a)\). Then \(P_i\) secret shares her input bit and distributes the shares amongst the players, trivially establishing the invariant on wire a.

  • Suppose g is an \(\mathtt {XOR}\) gate \(\mathtt {XOR}(c, a, b)\). By induction, the input wires a and b hold correct shares. In , the players locally sum their shares. Thus, the output wire c holds a correct sharing of the \(\mathtt {XOR}\) of the input shares:

    $$ (a_1 \oplus \ldots \oplus a_p) \oplus (b_1 \oplus \ldots \oplus b_p) = (a_1 \oplus b_1) \oplus \ldots \oplus (a_p \oplus b_p) $$
  • Suppose g is a vector-scalar gate \(\mathtt {VS}((c^1,\ldots ,c^n), a, (b^1,\ldots ,b^n))\). By induction, \(a, b^1,\ldots , b^n\) hold correct shares. Consider an arbitrary vector element \(b^k\). The specification \(\mathtt {eval} \) requires that the corresponding output wire \(c^k\) obtains a secret sharing of \(ab^k\). Recall the crucial \(\mathtt {AND}\) equality given by Equation (1):

    $$ ab^k = (a_1 \oplus \ldots \oplus a_p) (b^k_1 \oplus \ldots \oplus b^k_p) = \bigoplus _{i=1}^p \left( \bigoplus _{j=1}^p a_ib^k_j\right) $$

    The protocol (Fig. 4) uses local computation and OTs to simultaneously compute a secret sharing of the above \(\mathtt {XOR}\) sum for each vector element. In particular, for each element \(b^k\), each player \(P_i\) computes a share \(\bigoplus _{j=1}^p a_ib^k_j\) (with added random masks). Thus, for each vector element \(b^k\), the players hold correct \(\mathtt {XOR}\) secret shares, which they store on the wire \(c^k\).

  • Suppose g is an output \(\mathtt {OUTPUT}(a)\). By induction, wire a holds correct secret shares. Thus, when the players reconstruct their shares they obtain the correct truth value for wire a.

Thus, our protocol is correct.

   \(\square \)

7.2 Proof of Security

We now prove our protocol secure in the OT-hybrid model, in which the players use 1-out-of-2 OT as an oracle functionality.

Our proof is nearly identical to that of classic GMW. The difference between the two proofs is that our protocol uses \(\mathtt {VS}\) gates whereas classic GMW uses \(\mathtt {AND}\) gates. Both proofs show that interactions involving \(\mathtt {AND}\)/\(\mathtt {VS}\) gates can be simulated by uniform bits.

Theorem 3

(Security). Our protocol is secure against semi-honest corruption of up to \(p - 1\) players in the OT-hybrid model.

Proof

By construction of a simulator \(\mathcal {S}\) that simulates the view of a player \(P_1\), and an argument that \(\mathcal {S}\) generalizes to arbitrary strict subsets of players.

At a high level, \(\mathcal {S}\) computes simulated secret shares on all circuit wires and adds simulated messages to \(P_1\)’s simulated view. The crucial property is that all wire values, except outputs and inputs belonging to \(P_1\), are indistinguishable from uniform bits.

  • Consider an input wire. First, suppose that this wire belongs to \(P_1\). In this case, \(P_1\) receives no messages. Hence, \(\mathcal {S}\) need not modify \(P_1\)’s view. Instead, \(\mathcal {S}\) samples a uniform bit as an \(\mathtt {XOR}\) secret share of \(P_1\)’s input and adds it to the circuit wiring.

    Next, suppose that the input wire belongs to some other player \(P_{i\ne 1}\). Recall that \(P_{i\ne 1}\) uniformly samples an \(\mathtt {XOR}\) secret share of her input and sends one share to \(P_1\). Thus, \(\mathcal {S}\) simulates an input wire by drawing a uniform bit. \(\mathcal {S}\) adds this bit to \(P_1\)’s view and to the circuit wiring.

  • \(\mathtt {XOR}\) gates are computed locally. Hence, \(\mathcal {S}\) need not modify \(P_1\)’s view. Instead, \(\mathcal {S}\) simply \(\mathtt {XOR}\)s the gate’s simulated input shares and adds the output share to the wiring.

  • Consider a \(\mathtt {VS}\) gate. In the real world, \(P_1\) interacts with OT twice per every other player (once as a sender and once as a receiver). On send interactions, \(P_1\) receives no output, so the interaction is trivially simulated. Receiving OTs is more complex. Recall that for a \(\mathtt {VS}\) gate (see Fig. 4), each player \(P_{i \ne 1}\) sends via OT either a random string x or \(x \oplus b\) where b is \(P_{i \ne 1}\)’s shares for all of the scaled wires. Note that in this second message, b is masked by x. Since \(P_1\) obtains only one of these messages from the OT oracle, both are indistinguishable from uniform bits. Thus, \(\mathcal {S}\) simulates each OT output by drawing uniform bits. Now, \(\mathcal {S}\) updates the simulated wiring by \(\mathtt {XOR}\)ing the simulated input shares with the simulated OT messages (see Fig. 4, Equation (1) for the required computation) and places the results on the \(\mathtt {VS}\) gate output wires.

  • Consider an output wire. In the real world, \(P_1\) receives all other players’ shares and \(\mathtt {XOR}\)s them with her own share. \(\mathcal {S}\) must take care that \(P_1\)’s view is consistent with this \(\mathtt {XOR}\)ed output value. In particular, \(\mathcal {S}\) draws uniform bits to simulate messages for all uncorrupted players except for one. For this last player, \(\mathcal {S}\) simulates a message by \(\mathtt {XOR}\)ing these drawn bits with \(P_1\)’s simulated share (stored in the wiring) and the desired output.

Thus, \(\mathcal {S}\) simulates \(P_1\)’s view.
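The output-wire step is the only place where \(\mathcal {S}\)'s simulated messages must be correlated rather than independently uniform. A Python sketch of that consistency trick (names are illustrative):

```python
import secrets

def sim_output_msgs(p1_share, output, p):
    """Simulate the p-1 broadcast shares P_1 sees on an output wire so that
    reconstruction from P_1's view yields exactly `output`."""
    msgs = [secrets.randbelow(2) for _ in range(p - 2)]  # uniform, all but one
    last = p1_share ^ output       # the final message forces the XOR to `output`
    for m in msgs:
        last ^= m
    return msgs + [last]

# Reconstruction from P_1's simulated view yields the desired output.
p1_share, output = 1, 0
v = p1_share
for m in sim_output_msgs(p1_share, output, 4):
    v ^= m
assert v == output
```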

Now, we argue that \(\mathcal {S}\) is generalizable to any strict subset of players. Because of the symmetry of the protocol, \(\mathcal {S}\) is clearly applicable to any one player. Generalizing to more than one player relies on the fact that players’ values are \(\mathtt {XOR}\) secret shares. Thus, holding k player shares gives no information about the other players’ views. \(\mathcal {S}\) is easily modified to simulate more messages, i.e. to simulate the messages received by all simulated players.

Thus, our protocol is secure against semi-honest corruption of up to \(p-1\) players.

   \(\square \)

8 Implementation

We implemented \(\mathtt {MOTIF}\) in C++ using GCC’s experimental support for C++20. Our implementation consists of a circuit compiler, which converts code with conditionals into circuits, and a circuit evaluator, which implements our protocol.

Our compiler accepts a C++ program written in a stylized vocabulary. This vocabulary overloads C++ Boolean operators so that programs construct Boolean circuits (from the programmer’s perspective, this stylized vocabulary is similar to that of EMP’s circuit generation library). We add a special IF/THEN/ELSE branching syntax that constructs circuits with two-branch conditionals. Higher branching factors are achieved by nesting.

The compiler outputs \(\mathtt {XOR}\) and \(\mathtt {VS}\) gates listed in order of depth. The compiler also optionally outputs standard GMW circuits (i.e., without our conditional optimization) for benchmarking purposes.

Our implementation of the MPC protocol is natural, but we point out some of its more interesting aspects. We use 1-out-of-2 OT  [IKNP03] as implemented by EMP  [WMK16]. Each pair of players precomputes enough OT matrix rows for the MPC evaluation. Players evaluate circuits layer-by-layer as specified by the compiler output. In the case of standard GMW, players evaluate each \(\mathtt {AND}\) gate by consuming two OT matrix rows per pair of players. In our protocol, players consume the same number of OT matrix rows, but evaluate our more expressive \(\mathtt {VS}\) gates. The benefit of our approach is that up to \(b\times \) fewer \(\mathtt {VS}\) gates (vs. \(\mathtt {AND}\) gates) are needed to implement b branches, reducing the number of consumed OT rows. In both the reference protocol and our optimized protocol, we parallelize OTs for \(\mathtt {AND}\)/\(\mathtt {VS}\) gates in the same circuit layer. Thus, communication rounds are proportional to the circuit’s multiplicative depth.

9 Performance Evaluation

We compare our protocol to the standard GMW protocol [GMW87]. All experiments were run on a commodity laptop running Ubuntu 19.04 with an Intel(R) Core(TM) i5-8350U CPU @ 1.70 GHz and 16 GB RAM. All players were run on the same machine, and network settings were configured with the tc command. We sampled data points over 200 runs, averaging the middle 100 results.

In our experiments, the computed circuit consists of b branches, each implementing the same \(\log \)-depth string-comparison circuit, which checks the equality of two length-65000 bitstrings. The active branch is selected based on private variables chosen by the players. In more realistic circuits, each conditional branch would have a different topology. We use the same circuit across branches so that it is easy to understand branching improvement: all branches have the same size.

We emphasize that our compiler does not ‘optimize away’ conditionals: even though each branch is the same circuit, all branches are still evaluated by both protocols. We use a string-comparison circuit because it is indicative of the kinds of circuits where GMW excels: the string-comparison circuit has low depth. This circuit was suggested as a useful application of GMW by  [SZ13].

Choice of Benchmark Circuit and Layering. As discussed in Sect. 5.2, our approach cannot always fully amortize OTs across branches because we must preserve the circuit’s multiplicative depth. Thus, in p-party GMW, in each round our technique eliminates all OTs except for a total of \(p(p-1)\cdot \max _i(w_i)\) OTs, where \(w_i\) is the number of \(\mathtt {AND}\) gates in the current layer of branch i. The effectiveness of our approach thus varies with the relative alignment of branch layers. Branches that are highly aligned (i.e., have similar numbers of \(\mathtt {AND}\) gates in each layer) enjoy significant improvement.

Because our experiment uses the same circuit in each branch, we achieve perfect alignment. Thus, our experiments show the maximum benefit that our technique can provide. We emphasize that our approach always reduces the number of required OTs, because each circuit layer of each branch must have at least 1 \(\mathtt {AND}\) gate that can be combined into a \(\mathtt {VS}\) gate. Additionally, as we discuss in Sect. 6.3, compiler technologies can be applied to improve the alignment of misaligned circuits, further improving the benefit of our approach.
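As a back-of-the-envelope sketch of this per-layer accounting (the branch widths below are made-up numbers, not measurements):

```python
def ots_per_layer_standard(p, widths):
    """Standard GMW: every AND gate of every branch costs p(p-1) OTs."""
    return p * (p - 1) * sum(widths)

def ots_per_layer_merged(p, widths):
    """Merged VS gates: only the widest branch's AND count matters."""
    return p * (p - 1) * max(widths)

# Perfectly aligned: 2 players, 4 branches, 10 AND gates each in this layer.
assert ots_per_layer_standard(2, [10, 10, 10, 10]) == 80
assert ots_per_layer_merged(2, [10, 10, 10, 10]) == 20   # full 4x reduction

# Misaligned widths still improve, just by less than the branching factor.
assert ots_per_layer_standard(2, [10, 2, 1, 1]) == 28
assert ots_per_layer_merged(2, [10, 2, 1, 1]) == 20
```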

Fig. 5.

2PC comparison of our protocol against standard GMW. We plot the following metrics as functions of the branching factor (i.e., the number of branches in the overall conditional): the overall per-player communication (top-left), the wall-clock time to complete the protocol on a LAN (top-right), the wall-clock time on a LAN where other processes share bandwidth (bottom-left), and the wall-clock time on a WAN (bottom-right).

9.1 2PC Improvement over Standard GMW

We first compare the performance of our protocol to that of standard GMW in the 2PC setting. Specifically, we run the branching string-comparison circuit between two players on three different simulated network settings:

Fig. 6.

Per-player communication improvement for our 2PC string comparison experiment as a function of the number of branches.

  1. LAN: A simulated gigabit ethernet connection with 1 Gbps bandwidth and 2 ms round-trip latency.

  2. Shared LAN: A simulated shared local area network connection where the protocol shares network bandwidth with a number of other processes. The connection features 50 Mbps bandwidth and 2 ms round-trip latency.

  3. WAN: A simulated wide area network connection with 100 Mbps bandwidth and 20 ms round-trip latency.

Figure 5 plots the total protocol wall-clock time in each network setting and the total per-player communication. For further reference, Fig. 6 tabulates our communication improvement as a function of branching factor. Note that total communication is independent of the network settings.

Discussion. In all metrics, our approach significantly improves performance:

  • Communication. Our approach improves communication by up to 9.4\(\times \). There are several reasons we do not achieve the full 16\(\times \) improvement at branching factor 16. First, both the standard GMW approach and ours must perform the same number of base OTs to set up an OT extension matrix  [IKNP03]. This adds a small amount of communication (around 20 KB) common to both approaches, which cuts slightly into our advantage. Second, the online communication for the body of each branch is the same in both approaches. That is, although we amortize the \(\kappa \)-bit strings sent for random OTs, we do not amortize the six bits per \(\mathtt {AND}\) gate needed in the ‘online’ phase of the protocol. Finally, we pay communication cost for the demultiplexer at the start of each branch. Recall that we \(\mathtt {AND}\) branch inputs with the branch condition to ensure that all inactive branches have 0 on each wire. Although the demultiplexer is achieved using only one \(\mathtt {VS}\) gate (and hence two OTs) per branch, the ‘online’ cost of multiplying 65000 wires by the branch condition is significant. The relative cost of demultiplexers varies with the number of inputs to each branch: circuits with small inputs incur less demultiplexer overhead. The string comparison circuit has a particularly costly demultiplexer because the circuit has a large number of input bits relative to the number of gates in the circuit.

  • LAN wall-clock time. On a fast LAN network, our approach’s improvement is diminished compared to our communication improvement. Even so, we improve by approximately 5.1\(\times \) over standard GMW at 16 branches. A 1Gbps network is very fast, and our modest hardware struggles to fill the communication pipe. With better hardware and low-level implementation improvements, our wall-clock improvement would approach 9.4\(\times \).

  • Shared LAN wall-clock time. On the more constrained shared LAN network, our approach excels. We achieve an approximate 9.2\(\times \) speedup compared to standard GMW at 16 branches. On this slower network, our hardware and implementation easily keep up with the network, and hence we very nearly match the 9.4\(\times \) communication improvement.

  • WAN wall-clock time. On this high-latency network our advantage is less pronounced. Still, we achieve a 4.1\(\times \) speedup compared to standard GMW at 16 branches. This high-latency network highlights the weakness of GMW’s multi-round nature. Because we do not reduce the number of rounds, our approach incurs the same total latency as standard GMW, and hence our improvement is diminished.

Fig. 7.

MPC per-player communication of both our protocol and of standard GMW as a function of the number of players. Note that, like standard GMW, our approach uses per-player communication linear in the number of players.

9.2 Scaling to MPC

For our second experiment, we emphasize our approach’s efficient scaling to the multiparty setting. This experiment uses the same branching string-comparison circuit as the first, but fixes the number of branches to 16. We run this 16-branch circuit among varying numbers of MPC players. We plot the results of this experiment in Fig. 7.

Discussion. The key takeaway of this second experiment is that \(\mathtt {MOTIF}\) works well in the multiparty setting. In particular, our approach’s branching optimization adds no extra cost compared to standard GMW: both techniques use total communication quadratic in the number of players.