1 Introduction

1.1 Background on k-Means Clustering

The k-means clustering problem can be described as follows: A database \({\mathcal {D}}\) holds information about n different objects, each object having d attributes. The information regarding each object is viewed as a coordinate in \({\mathbb {R}}^d\), and hence the objects are interpreted as data points living in d-dimensional Euclidean space. Informally, k-means clustering algorithms consist of two steps. First, k initial centers are chosen in some manner, either at random or using some other “seeding” procedure. The second step (known as the “Lloyd Step”) is iterative and does the following: Partition the n data points into k clusters based on which current cluster center they are closest to. Then reset the new cluster centers to be the center of mass (in Euclidean space) of each cluster. This process is either iterated a fixed number of times or until the new cluster centers are sufficiently close to the previous ones (based on a pre-determined measure of “sufficiently close”).
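The two steps above can be sketched in a few lines of ordinary (single-database, non-private) Python. The uniform random seeding and the fixed iteration count here are simplifying assumptions for illustration; the template protocol of [24] uses a weighted seeding procedure instead.

```python
import random

def lloyd_kmeans(points, k, iters=20, seed=0):
    """Sketch of the two-step k-means procedure described above.
    points: list of d-dimensional tuples; k: number of clusters."""
    rng = random.Random(seed)
    centers = rng.sample(points, k)          # step 1: choose k initial centers
    for _ in range(iters):                   # step 2: the Lloyd Step, iterated
        # partition points by nearest current center
        clusters = [[] for _ in range(k)]
        for p in points:
            dists = [sum((a - b) ** 2 for a, b in zip(p, c)) for c in centers]
            clusters[dists.index(min(dists))].append(p)
        # reset each center to its cluster's center of mass
        for j, cl in enumerate(clusters):
            if cl:  # note the division by the cluster size -- the crux of Sect. 3
                centers[j] = tuple(sum(x) / len(cl) for x in zip(*cl))
    return centers
```

On well-separated data, the centers converge in a handful of iterations; in the two-party setting, it is precisely the divisions by the (secret) cluster sizes above that must be performed obliviously.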

The k-means clustering method is enormously popular among practitioners as an effective way to find a geometric partitioning of data points into k clusters, from which general trends or tendencies can be observed. In particular, k-means clustering is widely used in information retrieval, machine learning, and data mining research (see, e.g. [24] for further discussion about the enormous popularity of k-means clustering).

The question of finding efficient algorithms for solving the k-means clustering problem has been greatly explored and is not investigated in this paper. Rather, we wish to extend an existing algorithm (which solves the k-means problem for a single database without any concern for privacy) to an algorithm that works in the two-database setting (in accordance with multiparty computation literature, we refer to the databases as “parties”). In particular, if two parties each hold partial data describing the d attributes of n objects, then we would like to apply this k-means algorithm to the aggregate data (e.g. imagining a virtual database that contains all the data) in a way that protects the privacy of each party’s data. In this paper, we will work in the most general setting, where we assume the data is arbitrarily partitioned between the two databases. This means that there is no assumption on how the attributes of each data point are distributed among the parties (and in particular, this subsumes the cases of vertically and horizontally partitioned data).

1.2 Previous Work

The k-means clustering problem is one of the functions most studied in the more general class of data-mining problems. Data-mining problems have received much attention in recent years. Due to the sheer volume of inputs that are often involved in data-mining problems, generic multiparty computation (MPC) protocols become infeasible in terms of communication cost. This has led to constructions of function-specific multiparty protocols that attempt to handle a specific functionality in an efficient manner, while still providing privacy to the parties (see, e.g. [1, 2, 21]).

The problem of extending single database k-means clustering protocols to the multiparty setting has been explored by numerous authors, whose approaches have varied widely. The main challenge in designing such a protocol is to prevent intermediate values from being leaked during the Lloyd Step. In particular, each iteration of the Lloyd Step requires k new cluster centers to be found, a process that requires division (the new cluster centers are calculated using a weighted average, which in turn requires dividing by the number of data points in a given cluster). However, the divisors should remain unknown to the parties, as leaking intermediate cluster sizes may reveal excessive information. Additionally, many current protocols for solving the single database k-means clustering problem improve efficiency by choosing data points according to a weighted distribution, which will then serve as preliminary “guesses” to the cluster center (e.g. [5, 24]). Choosing data points in this manner will also likely involve division.

A subtle issue that may not be obvious at first glance is how to perform these divisions in light of current cryptographic tools. In particular, most encryption schemes specify a message space that is a finite group (or field or ring). This means that an algorithm that attempts to solve the multiparty k-means problem in the cryptographic setting (as opposed to the information-theoretic setting) will view the data points not as elements of Euclidean space \({\mathbb {R}}^d\), but rather as elements in \({\mathbb {G}}^d\) (for some ring \({\mathbb {G}}\)) in order to share encryptions of these data points with the other party members. But this then complicates the notion of “division”, which we wish to mean “division in \({\mathbb {R}}\)” as opposed to “multiplication by the inverse.” (The latter interpretation not only fails to perform the desired task of finding an average, but additionally may not even exist if not all elements in the ring \({\mathbb {G}}\) have a multiplicative inverse).

Previous authors attempting to solve the multiparty k-means problem have incorporated various ideas to combat this obstacle. The “data perturbation” technique (e.g. [1, 2, 23]) avoids the issue altogether by addressing the multiparty k-means problem from an information-theoretic standpoint. These algorithms attempt to protect party members’ privacy by having each member first “perturb” their data (in some regulated manner), and then the perturbed data is made public to all members. Thus, the division (and all other computations) can be performed locally by each party member (on the perturbed data), and the division problem is completely avoided. Unfortunately, all current algorithms utilizing this method do not protect the privacy of the party members in the cryptographic definition of privacy protection. Indeed, these protocols provide some privacy guarantee in terms of hiding the exact values of the database entries, but do not make the more general guarantee that (with overwhelming probability) no information can be obtained about any party’s inputs (other than what follows from the output of the function, i.e. the final cluster centers).

Another solution to the division problem (see e.g. [28]) is to have each party member perform the division locally on their own data. The problem with this method is that it requires each party to know all intermediate cluster assignments (in order to know what they should divide by). The same problem is encountered in [27], which also requires each party to know intermediate cluster assignments. The extra information obtained by the parties in these two papers would not be available to the parties in the ideal model, and thus these solutions fail to provide complete privacy protection (as per Definition 1; see Sect. 2.3). A similar problem is encountered in [19], which describes a way to privately perform division, but whose protocol relies on the fact that both parties will learn the output value of the division (which is again more information than is revealed in the ideal model). Another approach, suggested by Jagannathan and Wright [18], is to interpret division as multiplication by the inverse. However, a simple example shows that this method does not satisfy correctness, i.e. does not correctly implement a k-means algorithm. (Consider, e.g. dividing 11 by 5 in \({\mathbb {Z}}_{21}\). One would expect to round this to 2, but \(11*5^{-1}= 11*17 = 19\)).
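The counterexample above can be checked directly, e.g. using Python's built-in modular inverse:

```python
# In Z_21, "dividing 11 by 5" interpreted as multiplication by the inverse:
inv5 = pow(5, -1, 21)            # modular inverse of 5 mod 21
assert inv5 == 17                # since 5 * 17 = 85 = 1 (mod 21)
assert (11 * inv5) % 21 == 19    # far from the expected quotient
assert round(11 / 5) == 2        # the answer real division would give
```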

There are a number of works in the multiparty computation literature that focus on the problem of secure division. A number of these works (e.g. [3, 7, 20]) seek to implement fixed-point division, outputting an estimate of the quotient (with bounded error). While these protocols can be (and have been) extended to integer division (e.g. [9, 13]), the resulting protocols are more involved than the one presented here (e.g. [3, 13] implement a version of Newton–Raphson method, [7] invokes Goldschmidt’s Division Algorithm, and [9] utilize Taylor Polynomials), and often require additional assumptions on the inputs or setup that are not appropriate when used as a subprotocol of k-means clustering. Furthermore, the Division Protocol presented in Sect. 3.1 is comparable to the above-mentioned protocols in terms of communication complexity, so that (ignoring correctness/compatibility issues) swapping our protocol for any of the above protocols would not result in an (asymptotic) improvement of overall communication complexity of the k-means clustering protocol. There are also results [16, 29] for performing secure division if the denominator is known to both parties, but utilizing such a subprotocol in the context of k-means clustering will require leaking the size (number of data points) of each cluster during each iteration of the Lloyd Step.

One final approach encountered in the literature (see, e.g. [4, 10,11,12]) protects against leaking information about specific data in a different context. In this setting, the data is not distributed among many parties, but rather exists in a single database that is maintained by a trusted third party. The goal now is to have clients send requests to this third party for k-means clustering information on the data, and to ensure that the response from the server does not reveal too much information about the data. In the model we consider in this paper, these techniques cannot be applied since there is no central database or trusted third party.

To summarize, none of the existing “privacy-preserving” k-means clustering protocols provide cryptographically-acceptable security against an “honest-but-curious” adversary. We will present a formal notion of security in Sect. 2.3 (see, e.g. [15]). Informally, the security of a multiparty protocol is defined by comparing the real-life interaction between the parties to an “ideal” scenario where a trusted third party exists. In this ideal setting, the trusted third party receives the private inputs from each of the parties, runs a k-means clustering algorithm on the aggregate data, and returns as output the final cluster centers to each party. (Note: depending on a pre-determined arrangement between the parties, the third party may also give each party the additional information of which cluster each data point belongs to.) The goal of multiparty computation is to achieve in the “real” world (where no trusted third party is assumed to exist) the same level of data privacy protection that can be obtained in the “ideal” model.

One final obstacle in designing a perfectly secure k-means clustering protocol comes from the iterative nature of the Lloyd Step. In the ideal model, the individual parties do not learn any information regarding the number of iterations that were necessary to reach the stopping condition. In the body of this paper, our main protocol will reveal this information to the parties (it is our belief that in practice, this privacy breach is unlikely to reveal meaningful information about the other party’s database). However, we discuss more fully in “Appendix A” alternative methods of controlling the number of iterations without revealing this extra information.

1.3 Our Results

We describe in Sect. 5 of this paper the first protocol for two-party k-means clustering that is secure against an honest-but-curious adversary (as mentioned above, general MPC protocols could in theory be applied to k-means, but for large datasets with many attributes, such protocols require excessive communication and may be infeasible to use in practice; see Sect. 5.4 for a comparison). Moreover, we demonstrate that our protocol is performance-competitive with other protocols (which fail to protect privacy against an honest-but-curious adversary). Let k denote the number of clusters, \(\lambda \) the security parameter, n the number of data points, d the number of attributes of each data point, and \(O(\xi _s)\) the communication cost of (securely) finding the minimum of two numbers. The exact efficiency bounds that we achieve are as follows:

Communication Complexity Result

Our two-party secure k-means clustering protocol has a one-time communication cost of \(O(\lambda nd)\), followed by \(O(\xi _s kn) \le O(\lambda ^2 kn)\) for each iteration of the Lloyd Step.

A complete discussion on the bounds achieved above can be found in Sect. 5.4. Table 1 compares our protocol with other existing k-means clustering protocols.

Table 1 Comparison of communication complexity of various k-means clustering protocols

Our protocol takes as a template the single-database protocol of Ostrovsky et al. [24], and extends it to the two-party setting. We chose the particular protocol of [24] because it has two advantages over conventional single-database protocols: First, it provides a provable guarantee as to the correctness of its output (assuming moderate conditions on the data); and second, it reduces the number of iterations in the Lloyd Step. However, the techniques we use to extend the single-database protocol of [24] are general, and can likely be applied to any single-database protocol to achieve security in the two-party setting.

In order to extend the single database protocol of [24] to a two-party protocol, we follow the setup and some of the ideas discussed by Jagannathan and Wright in [18]. In that paper, the authors attempt to perform secure two-party k-means clustering, but (as they remark) fall short of perfect privacy due to leakage of information (including the number of data points in each cluster) that arises from an insecure division algorithm.

To solve the multiparty division problem, we define division in the ring \({\mathbb {Z}}_N\) in a natural way, namely as the quotient Q from the Division Algorithm in the integers: \(P=QD+R\). From this definition, we demonstrate how two parties can perform multiparty division in a secure manner. Additionally, we describe how two parties can select initial data points according to a weighted distribution. To accomplish this, we introduce a new protocol, the Random Value Protocol, which is described in Sect. 4. We note that the Random Value Protocol may be of independent interest as a subprotocol for other protocols that require random, oblivious sampling.

Our results utilize many existing tools and subprotocols developed in the multiparty computation literature. As such, the security guarantee of our result relies on cryptographic assumptions concerning the difficulty of inverting certain functions. In particular, we will assume the existence of a semantically secure homomorphic encryption scheme, and for ease of discussion, we use the homomorphic encryption scheme of Paillier [25].

1.4 Notation

In the following sections, we will adopt the convention of writing [a..b] to represent the integers from a to b (inclusive), i.e. \([a..b] = [a, b] \cap {\mathbb {Z}}\).

1.5 Overview

In the next section, we briefly introduce the cryptographic tools and methods of proving privacy that we will need to guarantee security in the honest-but-curious adversary model. We also include in Sect. 2.2 a complete list of the subprotocols that will be used in this paper. Because most of the subprotocols that we use are general and have been described in previous MPC papers, we provide in Sect. 2.2 only a list of these protocols (possible implementations are included in “Appendix C” for completeness). The exceptions are our new Random Value Protocol, for which we provide full details and a proof of security in Sect. 4, and our two-party Division Protocol, which is described in Sect. 3. Finally, in Sect. 5 we describe our secure 2-party k-means clustering protocol, which extends the single database (non-secure) k-means clustering protocol of [24].

2 Achieving Privacy

In multiparty computation (MPC) literature, devious behavior is modeled by the existence of an adversary who can corrupt one or more of the parties. In this paper, we will assume that the adversary is honest-but-curious, which means the corrupted parties must follow the protocol as specified, but the adversary sees the entire view (inputs, outputs, and all messages received) of the corrupted parties and may attempt to infer additional information from it. We include in Sect. 2.3 a formal definition of what it means for a protocol to “protect privacy” in the honest-but-curious adversary model (see also, e.g. [15] for definitions of security against an honest-but-curious adversary).

In order to construct a private two-party k-means clustering protocol, we begin by presenting a secure Division Protocol (Sect. 3) and Random Value Protocol (Sect. 4) that will be used as subprotocols. The overall k-means clustering protocol will then utilize these subprotocols, as well as a handful of standard subprotocols for which there exist secure (privacy-preserving) instantiations. We utilize the fact that the composition of secure subprotocols results in a secure protocol [6], so that the overall privacy of our k-means clustering protocol reduces to the privacy of each subprotocol. Proof sketches for privacy of the Division Protocol and Random Value Protocol are included with the description of these protocols (Sects. 3 and 4). A list of all other subprotocols used is provided in Sect. 2.2; for all such protocols, we either provide a reference to an existing secure instantiation of the protocol, or in cases where the protocol can be instantiated via a simple reduction to a secure subprotocol (whose security is already known), we sketch such a reduction in “Appendix C”.

In Sect. 2.3, we classify protocols that have a specified generic form, and prove that such protocols will be secure in the honest-but-curious adversary model. Privacy of our Division Protocol and Random Value Protocol will then follow because they have this generic form. In Sect. 2.1, we first introduce the cryptographic tools we will need to guarantee privacy. The casual reader may wish to skip the description of the cryptographic tools in Sect. 2.1 and read only the high-level arguments of security in the first paragraph of Sect. 2.3, omitting the formal definitions and proofs of privacy in the rest of that section.

2.1 Cryptographic Tools

We will utilize standard cryptographic tools to maintain privacy in our two-party k-means clustering protocol. It will be convenient to name our two participating parties, and we adopt the standard names of “Alice” and “Bob.” We will first utilize an additively homomorphic encryption scheme, e.g. Paillier [25]. Thus, for encryptions we assume a message space \({\mathbb {Z}}_N\), where \(N=pq\) is the product of two \(\lambda \)-bit primes and \(\lambda \) is the security parameter. In the protocols that follow, there is no public setup/dealer; rather, one of the parties will be responsible for choosing the modulus N (we use the convention that Alice plays this role), and will publish the corresponding public key (but keep the decryption key private). The opposite party (Bob) will be responsible for performing the requisite computations on encrypted data. The encryption scheme is a map \(E:{\mathbb {Z}}_N \times {\mathbb {H}} \rightarrow {\mathbb {G}}\), where \({\mathbb {H}}\) represents some group from which we obtain randomness, and \({\mathbb {G}}\) is some other group. For notational convenience, we will write \(E(m) \in {\mathbb {G}}\) rather than E(mr). This encryption scheme is additively homomorphic, so that: \(E(m_1, r_1) + E(m_2, r_2) = E(m_1+m_2, r_1+r_2)\), where each addition refers to the appropriate group operation in \({\mathbb {G}}, {\mathbb {Z}}_N,\) or \({\mathbb {H}}\). (For Paillier, \({\mathbb {G}} ={\mathbb {Z}}^{\times }_{N^2}\) and thus the group operation is multiplication). Additionally, the encryption scheme allows a user to (efficiently) multiply by a constant, i.e. for \(c \in {\mathbb {Z}}_N\), anyone can compute: \(cE(m, r) = E(cm, r')\). (For Paillier, if (Ng) is the public key, then \(cE(m,r) := (g^mr^N)^c = g^{mc}r^{cN} = g^{mc}(r^c)^N = E(cm, r')\), where \(r' = r^c\)).
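The homomorphic properties above can be illustrated with a toy Paillier implementation. The tiny hard-coded primes and all parameter choices below are for illustration only (a sketch, not an implementation suitable for actual use, which requires \(\lambda \)-bit primes and a vetted library):

```python
import math
import random

def paillier_keygen(p=1009, q=1013):
    # Toy parameters; real use needs large random primes.
    n = p * q
    assert math.gcd(n, (p - 1) * (q - 1)) == 1
    g = n + 1                            # standard choice of generator
    lam = math.lcm(p - 1, q - 1)
    mu = pow(lam, -1, n)                 # valid since g = n + 1
    return (n, g), (lam, mu)

def encrypt(pk, m):
    n, g = pk
    r = random.randrange(1, n)
    while math.gcd(r, n) != 1:           # randomness r must be a unit mod n
        r = random.randrange(1, n)
    return (pow(g, m, n * n) * pow(r, n, n * n)) % (n * n)

def decrypt(pk, sk, c):
    n, _ = pk
    lam, mu = sk
    L = (pow(c, lam, n * n) - 1) // n    # L(x) = (x - 1) / n
    return (L * mu) % n

pk, sk = paillier_keygen()
n = pk[0]
c1, c2 = encrypt(pk, 42), encrypt(pk, 58)
# Additive homomorphism: the group operation in G (multiplication mod n^2)
# adds the underlying plaintexts.
assert decrypt(pk, sk, (c1 * c2) % (n * n)) == 100
# Multiplication by a constant: exponentiating a ciphertext by c
# multiplies the plaintext by c.
assert decrypt(pk, sk, pow(c1, 3, n * n)) == 126
```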

2.2 Privacy Protecting Protocols

We list here the generic subprotocols that will be used by our two-party k-means clustering protocol. All of the below protocols can be readily implemented using only the Scalar Product Protocol, and we include possible implementations in “Appendix C”. The Scalar Product Protocol is a standard protocol that has been studied extensively by other authors; we will not include an implementation of this protocol in this paper, but refer the reader to a number of possible references.

  • Scalar Product Protocol (SPP) This protocol takes in \(\mathbf {x} \in {\mathbb {Z}}^t_N\) and \(\mathbf {y} \in {\mathbb {Z}}^t_N\), and returns (shares of) some pre-determined degree two function \(f(\mathbf {x}, \mathbf {y}) = \sum _{i=1}^t c_i\mathbf {x}_i \mathbf {y}_i\) for public constants \(c_i\), and where all arithmetic is modulo N. This encompasses degree-zero and degree-one terms as well, e.g. by taking \(\mathbf {x}_i\) and/or \(\mathbf {y}_i\) to be one, as appropriate. (See, e.g. [14], where they describe a protocol that achieves \(O(t\lambda )\) communication complexity, \(\lambda \) the security parameter. Other implementations can be found in [22, 30, 32].)

  • Find Minimum of 2 Numbers Protocol (FM2NP) Alice and Bob share two numbers. This protocol returns shares of the location of the smaller number (0 or 1).

  • Find Minimum of k Numbers Protocol (FMkNP) An extension of the above protocol, where this time as output they receive shares of the vector \((0, \dots , 1, \dots , 0)\), where the ‘1’ appears in the mth coordinate if the mth number is smallest.

  • Change Modulus Protocol Let \(N_1, N_2\) be two publicly known integers and let \(Q < \min (N_1, N_2)\). Suppose Alice and Bob share \(Q = Q^A + Q^B \in {\mathbb {Z}}_{N_1}\). Then as output, Alice and Bob should share Q in \({\mathbb {Z}}_{N_2}\), i.e. Alice gets \(\widetilde{Q}^A\) and Bob gets \(\widetilde{Q}^B\) such that \(Q = \widetilde{Q}^A +\widetilde{Q}^B \in {\mathbb {Z}}_{N_2}\).

  • Addition Modulo Unknown Value Protocol Let \(S = S^A + S^B (\)mod N) and \(T = T^A + T^B (\)mod N) be two values shared by Alice and Bob, and let \(Q \in [1+\max (S, T)..N-1]\) be arbitrary. Then as output, Alice and Bob share \(R := S + T (\)mod Q).

  • Nested Product Protocol (NPP) Alice and Bob share values \(\{x_i\}_{i=1}^m = \{x^A_i + x^B_i (\)mod \(N)\}\). This protocol returns shares (in \({\mathbb {Z}}_N\)) of the m nested products; that is, for each \(1 \le i \le m\), Alice and Bob share the value \(y_i := \prod _{j=1}^i x_j\).

  • To Binary Protocol (TBP) Alice and Bob have shares of some value \(X \in {\mathbb {Z}}_N\). If \(X = x_\lambda \dots x_1 x_0\) is the binary representation of X, then this protocol returns shares of \(x_i\) for each \(0 \le i \le \lambda \). In other words, \(x_i = x^A_i + x^B_i\) (mod N).

  • Distance Protocol Alice and Bob share two points in \({\mathbb {Z}}^d_N\). As output, they share the distance squared between these points. Since the computation for distance (squared) can be expressed as a product of Alice and Bob’s inputs, the distance protocol can be instantiated via SPP. One such possible reduction of the Distance Protocol to a SPP can be found in [18], which has communication complexity \(O(d\lambda )\).

  • Compute Modulus Mask Protocol, Compute \(\mathbf {e}_i\) Protocol and Choose \({\varvec{\mu }}_1\) Protocol These will be discussed when they arise in Sects. , 4.1, and 5.3.1.
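The share interface common to all of the subprotocols above can be made concrete without any cryptography. The following sketch computes only the functionality of the SPP in the clear (it is emphatically not the secure implementation): Bob's share is uniform randomness of his choosing, and Alice's share is the result blinded by it. The modulus and all names here are ours, for illustration only:

```python
import random

# Stand-in modulus; in the paper, N is Alice's Paillier modulus.
N = 2**64 + 13

def scalar_product_shares(x, y, c, rng=None):
    # Functionality of the SPP: output additive shares
    # f_A + f_B = sum_i c_i * x_i * y_i (mod N).
    # In the real protocol, Bob computes the sum under Alice's
    # encryptions and blinds it with his uniformly random share f_B.
    rng = rng or random.Random(1)
    f = sum(ci * xi * yi for ci, xi, yi in zip(c, x, y)) % N
    f_B = rng.randrange(N)       # Bob's share: uniform, chosen by Bob
    f_A = (f - f_B) % N          # Alice's share: the blinded result
    return f_A, f_B

f_A, f_B = scalar_product_shares([3, 1, 4], [2, 7, 1], [1, 1, 1])
assert (f_A + f_B) % N == 3*2 + 1*7 + 4*1   # shares recombine to f(x, y)
```

Every subprotocol in the list presents this same interface (each party holds one additive share of the output), which is what allows them to be composed freely.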

2.3 Definition of Privacy in the Honest-But-Curious Model

We present first the high-level argument for how our protocols will protect each party’s data. We have one of the parties (Alice) choose the encryption key (i.e. we do not rely on a trusted setup to distribute public/secret key pairs), and encrypt all of her data using this key before sending it to the other party (Bob). Thus, Alice’s privacy will be guaranteed by the semantic security assumption of the encryption scheme. Meanwhile, Bob will also encrypt his data using Alice’s key, utilize the homomorphic properties of the encryption scheme to perform the requisite computations, and then blind all of the outputs he sends to Alice with randomness of his choosing, ensuring that Alice can learn nothing about his data.

We now make these notions precise by first providing a formal definition of privacy protection in the honest-but-curious adversary model, and a formal proof of privacy for the class of protocols that attempt to protect privacy in the above described manner (both the definition and technique of providing privacy in this manner are standard tools used in MPC literature, see, e.g. [15]).

Definition 1

Suppose that protocol X has Alice compute (and output) the function \(f^A(\mathbf {x}, \mathbf {y})\), and has Bob compute (and output) \(f^B(\mathbf {x}, \mathbf {y})\), where \((\mathbf {x}, \mathbf {y})\) denotes the inputs for Alice and Bob (respectively). Let VIEW\(^A(\mathbf {x},\mathbf {y})\) (resp. VIEW\(^B(\mathbf {x},\mathbf {y})\)) represent Alice’s (resp. Bob’s) view of the transcript. In other words, if \((\mathbf {x}, \mathbf {r}^A)\) (resp. \((\mathbf {y}, \mathbf {r}^B)\)) denotes Alice’s (resp. Bob’s) input and randomness, then:

$$\begin{aligned} \text{ VIEW }^A(\mathbf {x},\mathbf {y})&= (\mathbf {x}, \mathbf {r}^A, m_1, \dots , m_t), \quad \text{ and }\\ \text{ VIEW }^B(\mathbf {x},\mathbf {y})&= (\mathbf {y}, \mathbf {r}^B, m_1, \dots , m_t), \end{aligned}$$

where the \(\{ m_i \}\) denote the messages passed between the parties. Also let \(O^A(\mathbf {x}, \mathbf {y})\) and \(O^B(\mathbf {x}, \mathbf {y})\) denote Alice's (resp. Bob's) output. Then we say that protocol X protects privacy (or is secure) against an honest-but-curious adversary if there exist probabilistic polynomial time simulators \(S_1\) and \(S_2\) such that:

$$\begin{aligned} \{ (S_1(\mathbf {x}, f^A(\mathbf {x}, \mathbf {y})), f^B(\mathbf {x}, \mathbf {y})) \}&{\mathop {\equiv }\limits ^{c}}\{(\text{ VIEW }^A (\mathbf {x},\mathbf {y}), O^B(\mathbf {x},\mathbf {y})) \} \end{aligned}$$
(1)
$$\begin{aligned} \{ (f^A(\mathbf {x}, \mathbf {y}), S_2(\mathbf {y}, f^B(\mathbf {x}, \mathbf {y}))) \}&{\mathop {\equiv }\limits ^{c}} \{ (O^A(\mathbf {x}, \mathbf {y}), \text{ VIEW }^B(\mathbf {x},\mathbf {y})) \}, \end{aligned}$$
(2)

where \({\mathop {\equiv }\limits ^{c}}\) denotes computational indistinguishability.

With the above definition of privacy protection, we now prove the basic lemma that will allow us to argue that our two-party k-means clustering protocol is secure against an honest-but-curious adversary.

Lemma 2.1

Suppose that Alice has run the key generation algorithm for a semantically secure homomorphic public-key encryption scheme, and has given her public-key to Bob. Further suppose that Alice and Bob run Protocol X, for which all messages passed from Alice to Bob are encrypted using this scheme, and all messages passed from Bob to Alice are uniformly distributed (in the range of the ciphertext) and are independent of Bob’s inputs. Then Protocol X is secure in the honest-but-curious adversary model.

Proof

We prove the privacy protecting nature of Protocol X in two separate cases, depending on which party the adversary has corrupted. To prove privacy, we show that for all PPT Adversaries, the view of the adversary based on Alice and Bob’s interaction is indistinguishable to the adversary’s view when the corrupted party interacts instead with a simulator. In other words, we show that there exist simulators \(S_1\) and \(S_2\) that satisfy conditions (1) and (2).

Case 1:

Bob is Corrupted by Adversary We simulate Alice’s messages sent to Bob. For each encryption that Alice is supposed to send to Bob, we let the simulator \(S_2\) pick a random element from \({\mathbb {Z}}_N\), and send an encryption of this. Any adversary who can distinguish between interaction with Alice versus interaction with \(S_2\) can be used to break the security assumptions of E. Thus, no such PPT adversary exists, which means (2) holds.

Case 2:

Alice is Corrupted by Adversary We simulate Bob’s messages sent to Alice. To do this, every time Bob is to send an encryption to Alice, the simulator picks a random element of \({\mathbb {Z}}_N\) and returns an encryption of this. Again, equation (1) holds due to the fact that Alice cannot distinguish the simulator’s encryption of a random number from Bob’s encryption of the correct computation that has been shifted by randomness of Bob’s choice.

\(\square \)

Since every semantically secure homomorphic encryption scheme available today has a finite message space (e.g. \({\mathbb {Z}}_N\)), when our k-means protocol requires the data points (or attributes of the data points) to be encrypted, we must restrict the possible data values to a finite range. Therefore, instead of viewing the data points as living in \({\mathbb {R}}^d\), we “discretize” Euclidean space and approximate it via the lattice \({\mathbb {Z}}_N^d\), for some large N. All of the results of this paper are consequently restricted to the model where the data points live in \({\mathbb {Z}}_N^d\) (both in the “real” and “ideal” settings), and any function performing k-means clustering in this model is restricted to computations in \({\mathbb {Z}}_N\). Note that restricting to this “discretized” model is completely natural; indeed, due to memory constraints, calculations performed on computers are handled in this manner. As a consequence of working in the discretized space model, we also avoid privacy issues that arise from possible rounding errors (i.e. restricting input to be in \({\mathbb {Z}}_N^d\) avoids the necessity of approximating inputs in \({\mathbb {R}}\) by rounding up or down).

3 Private Division

As mentioned in Sect. 1.2, performing two-party division has been an obstacle to obtaining a secure two-party k-means clustering protocol. In this section, we discuss our methods for overcoming this obstacle. In particular, we make precise what we mean by division in the ring \({\mathbb {Z}}_N\), and show that this definition not only matches our intuition as to what division should be, but also allows us to perform division in a secure way (namely, so that the dividend and divisor remain hidden).

Let N be a positive integer (when secure division is used as a subprotocol, N may be, e.g. an RSA modulus), and let \(P, D \in {\mathbb {Z}}_N\). Then viewing P and D as integers, we may apply the Division Algorithm to find unique integers \(Q < N\) and \(0 \le R < D\) such that \(P = QD + R\). Viewing \(Q \in {\mathbb {Z}}_N\), we then define division (of P by D) to be the quotient Q. Note that this definition is the natural restriction of division in \({\mathbb {R}}\) to the integers, in that Q represents the actual quotient in \({\mathbb {R}}\) that has been rounded down to the nearest integer. Thus this definition coincides much more closely to real division (e.g. for purposes of finding averages) than other alternatives, such as defining division to be multiplication by the inverse.
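In code, this definition of division in \({\mathbb {Z}}_N\) is simply integer floor division (a quick sketch; Python's `divmod` plays the role of the Division Algorithm):

```python
def divide_zn(P, D):
    # Division in Z_N as defined above: view P and D as integers and
    # return the quotient Q from the Division Algorithm, P = Q*D + R.
    Q, R = divmod(P, D)
    assert P == Q * D + R and 0 <= R < D
    return Q

# Q is real division rounded down to the nearest integer...
assert divide_zn(11, 5) == 2
# ...unlike "multiplication by the inverse" (the example from Sect. 1.2):
assert (11 * pow(5, -1, 21)) % 21 == 19
```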

In defining what it means for a division protocol to be secure (see Sect. 2.3), one compares the information that could be obtained in an ideal model (where a trusted third party exists) versus what could be obtained in the real world (where no such third party exists, and the proposed protocol is employed). In terms of defining the function that is to be evaluated (which performs the k-means clustering), we force the definition of division to match the above definition. In other words, when the functions \(f^A(\mathbf {x},\mathbf {y})\) and \(f^B(\mathbf {x},\mathbf {y})\) (see notation of Sect. 2.3) call for division to be performed, these divisions are defined to mean division in the ring \({\mathbb {Z}}_N\) as defined here. This way, when our protocol is run and division is performed in this way, it matches the computations that the functions \(f^A\) and \(f^B\) are performing.

With these definitions in place, it remains to implement a secure division subprotocol, where two parties (Alice and Bob) share a numerator and denominator \(P,\ D \in {\mathbb {Z}}_N\), and as output they receive shares of the quotient \(Q \in {\mathbb {Z}}_N\). We describe below a possible implementation, which reduces to the Scalar Product Protocol combined with the Find Minimum of 2 Numbers Protocol (FM2NP), and consequently its security follows from the security of those subprotocols.

Before we present the protocol, we mention that if the divisor D is publicly known, then it is likely faster to utilize the equation below, and have Alice and Bob compute the division locally, with the appropriate calls to FM2NP to handle carry-over:

$$\begin{aligned}&\text{ Alice } \text{ runs } \text{ Division } \text{ Algorithm } \text{ locally } \text{ to } \text{ find }\ Q^A, R^A: \quad P^A = D \cdot Q^A + R^A \nonumber \\&\text{ Bob } \text{ runs } \text{ Division } \text{ Algorithm } \text{ locally } \text{ to } \text{ find }\ Q^B, R^B: \quad P^B = D \cdot Q^B + R^B \nonumber \\&\text{ Bob } \text{ runs } \text{ Division } \text{ Algorithm } \text{ locally } \text{ to } \text{ find }\ \widehat{Q}, \widehat{R}: \ \quad -N = D \cdot \widehat{Q} + \widehat{R} \nonumber \\&\text{ Alice } \text{ sets } \text{ final } \text{ share } \text{ of }\ Q\ \text{ to }\ Q^A \end{aligned}$$
(3)
$$\begin{aligned}&\text{ Bob } \text{ sets } \text{ final } \text{ share } \text{ of }\ Q\ \text{ to: } \quad Q^B + (1-\delta ) \cdot \alpha + \delta \cdot (\widehat{Q} + \beta ), \end{aligned}$$
(4)

where

$$\begin{aligned} \alpha&:= \left\{ \begin{array}{ll} 0 &{} \quad \text{ if } R^A + R^B< D \\ 1 &{} \quad \text{ if } R^A + R^B \ge D \end{array}\right. \\ \beta&:= \left\{ \begin{array}{ll} 0 &{} \quad \text{ if } \widehat{R} + R^A + R^B< D \\ 1 &{} \quad \text{ if } D \le \widehat{R} + R^A + R^B< 2D \\ 2 &{} \quad \text{ if } \widehat{R} + R^A + R^B \ge 2D\end{array}\right. \\ \delta&:= \left\{ \begin{array}{ll} 0 &{} \quad \text{ if } P^A + P^B < N \\ 1 &{} \quad \text{ if } P^A + P^B \ge N \end{array}\right. \end{aligned}$$

We leave it as an exercise to verify that (3) and (4) compute the appropriate shares of the quotient Q.
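
As a sanity check, the following Python sketch verifies these shares in the clear. Here we take \(\widehat{Q}, \widehat{R}\) to satisfy \(-N = D \cdot \widehat{Q} + \widehat{R}\) with \(0 \le \widehat{R} < D\) (the convention under which the \(\beta \) correction reconstructs Q correctly), and the comparisons defining \(\alpha , \beta , \delta \) are computed directly rather than via calls to FM2NP:

```python
def local_division_shares(PA, PB, D, N):
    """Shares of Q = floor(P / D) for P = PA + PB (mod N), public D.
    The comparisons alpha, beta, delta are computed in the clear here;
    the protocol would obtain them via FM2NP."""
    QA, RA = divmod(PA, D)       # Alice, locally
    QB, RB = divmod(PB, D)       # Bob, locally
    Qh, Rh = divmod(-N, D)       # Bob, locally: -N = D*Qh + Rh, 0 <= Rh < D
    delta = 1 if PA + PB >= N else 0
    alpha = 1 if RA + RB >= D else 0
    beta = (Rh + RA + RB) // D   # 0, 1, or 2, as in the case analysis
    shareA = QA
    shareB = (QB + (1 - delta) * alpha + delta * (Qh + beta)) % N
    return shareA, shareB
```

Reconstructing, \((Q^A + Q^B_{\text{final}})\) (mod N) equals \(\lfloor P / D \rfloor \) for every sharing of P.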

3.1 Implementation of the Division Protocol

Intuitively, our protocol attempts to mimic a natural way of performing division. Namely, when dividing P by D, we want to find the largest integer Q such that \(QD \le P\), but \((Q+1)D > P\). To find Q, we perform an exponential search: Viewing Q in its binary representation, we first try to find the highest power of 2 that will be in the binary expression of Q, and then work down. Namely, once the highest power of 2 (say \(\alpha \)) has been found such that \(2^{\alpha } \cdot D \le P\) but \(2^{\alpha +1} \cdot D > P\), then we subtract \(2^{\alpha } \cdot D\) from P and repeat the process on the difference. This approach is performed by the division protocol below, with appropriate modifications to allow two parties to keep their individual inputs private.

Input Alice and Bob share \(P = P^A+P^B \ (\text{ mod } N)\) and \(D = D^A+D^B \ (\text{ mod } N)\).

Output If \(P = QD + R\) (\(0 \le R < D\)) is the unique expression guaranteed by the Division Algorithm, then this protocol outputs shares of \(Q = Q^A+Q^B \ (\text{ mod } N)\) to Alice and Bob.

Cost The communication in this protocol is dominated by \(1 + \lambda \) calls to the Find Minimum of 2 Numbers Protocol, where \(\lambda = \lfloor \log _2 N \rfloor \). Denoting the communication cost of FM2NP by \(\xi _s\), the implementation of this protocol therefore has communication \(O(\lambda \xi _s)\).

Protocol Description

  1. 1.

    Alice and Bob run the Compute Modulus Mask Protocol to obtain shares of \(\mathbf {v} \in {\mathbb {Z}}_2^{1+\lambda }\), which is defined so that the ith-coordinate \(v_i\) of \(\mathbf {v}\) is ‘1’ if and only if \(2^i \cdot D < N\). Hence, \(\mathbf {v} = (1, 1, \dots , 1, 0, \dots , 0)\), with the last ‘1’ in the \(\lfloor \log _2 ((N - 1) / D)\rfloor \) coordinate (here, coordinates are 0-based, so that \(v_0\) denotes the first coordinate, and \(v_{\lambda }\) the last coordinate). Note: The mask \(\mathbf {v}\) is needed to account for the fact that both subprotocols being utilized perform arithmetic modulo N (to account for the fact that Alice and Bob share inputs in \({\mathbb {Z}}_N\)). Namely, the division protocol will need to determine if a given multiple \(2^i\) of D is less than or equal to some quantity C, where here “less than” means with respect to natural (integral) arithmetic. The protocol below will determine this by noting that if \(2^i \cdot D \le C\), then, when viewed as arithmetic modulo N, \(C - 2^i \cdot D\) will be less than C, because \(C - 2^i \cdot D\) did not “wrap-around;” i.e. \(C - 2^i \cdot D\) (mod N) = \(C - 2^i \cdot D\). On the other hand, if \(2^i \cdot D > C\), then \(C - 2^i \cdot D\) will “wrap-around”, and we want to detect this by finding that \(C - 2^i \cdot D\) (mod N) >C. This will always be true so long as we don’t “wrap-around” more than once, i.e. so long as \(-N < C - 2^i \cdot D\). The mask \(\mathbf {v}\) is used to ensure that the difference appearing in Step 2 below will “wrap-around” at most once, thus resulting in a true comparison of C versus \(2^i \cdot D\).

  2. 2.

    For each \(0 \le i \le \lambda \), Alice and Bob run the FM2NP on \((P_i, \ P_i-v_{\lambda -i} \cdot 2^{\lambda -i} \cdot D)\), where \(P_i :=P_{i-1} - O_{i-1} \cdot 2^{\lambda -i+1} \cdot D\) with \(O_{i-1}\) denoting the output of FM2NP on the previous iteration. The iterative formula is initialized with \(P_0 :=P\).

  3. 3.

    Notice that \(Q = \sum _{i=0}^\lambda O_i \cdot 2^{\lambda -i}\), which can be locally computed by Alice and Bob from their shares of \(O_i\) from each step.
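
Run in the clear (i.e. with no secret sharing), the three steps above amount to the following Python sketch; `fm2np` stands in for the Find Minimum of 2 Numbers Protocol, returning 1 exactly when its second argument is smaller (in the real protocol both its inputs and its output are additively shared):

```python
def fm2np(x, y):
    # stand-in for FM2NP: 1 iff the second argument is strictly smaller
    return 1 if y < x else 0

def division_protocol(P, D, N):
    """Clear-text sketch of the Sect. 3.1 division protocol."""
    lam = N.bit_length() - 1          # lam = floor(log2 N)
    # Step 1: modulus mask, v[i] = 1 iff 2^i * D < N
    v = [1 if (2**i) * D < N else 0 for i in range(lam + 1)]
    # Step 2: exponential search from the high-order bit down
    Pi, O = P, []
    for i in range(lam + 1):
        cand = (Pi - v[lam - i] * 2**(lam - i) * D) % N
        Oi = fm2np(Pi, cand)
        O.append(Oi)
        if Oi:                        # P_{i+1} = P_i - O_i * 2^(lam-i) * D
            Pi = cand
    # Step 3: Q = sum_i O_i * 2^(lam - i)
    return sum(Oi * 2**(lam - i) for i, Oi in enumerate(O))
```

On the example of Sect. 3.2, `division_protocol(52, 5, 64)` returns 10.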

3.2 Example of Division Protocol

In this section, we present an example to see the protocol in work: Let \(N=2^6 = 64\), so \(\lambda =6\). As inputs to the protocol, let \(P=52\) and \(D=5\). The Division Algorithm would write:

$$\begin{aligned} 52 = (5)(10) + 2, \end{aligned}$$
(5)

so our division protocol should output shares of 10.

  1. 1.

    Step 1 of our protocol finds \(\mathbf {v}\), which for the example values above is:

    $$\begin{aligned} \mathbf {v} = (1, 1, 1, 1, 0, 0, 0) \end{aligned}$$
    (6)

    since \((2^i)*5 < 64\) for \(i=0, 1,2,3\), but not for \(i= 4, 5, 6\).

  2. 2.

    This step is repeated for \(0 \le i \le \lambda (=6)\):

    1. (a)

      On the first iteration (\(i=0\)), FM2NP is run on \((52,\ (52-0*2^6*5)) = (52,\ 52)\), since \(P_0 = P = 52\) and \(v_{6} = 0\). By the remarks in “Appendix B.1”, this call will return shares of zero, i.e. \(O_0 = 0\).

    2. (b)

      On the next iteration (\(i=1\)), FM2NP is run on \((52,\ (52-0*2^5*5)) = (52,\ 52)\), since \(P_1 = P_0 - O_0*2^{6}*5 = 52 - 0 = 52\) and \(v_{5} = 0\). By the remarks in “Appendix B.1”, this call will return shares of zero, i.e. \(O_1 =0\).

    3. (c)

      On the next iteration (\(i=2\)), FM2NP is run on \((52,\ (52-0*2^4*5)) = (52,\ 52)\), since \(P_2 = P_1 - O_1*2^{5}*5 = 52 - 0 = 52\) (since \(O_1 = 0\)) and \(v_{4} =0\). By the remarks in “Appendix B.1”, this call will return shares of zero, i.e. \(O_2 = 0\).

    4. (d)

      On the next iteration (\(i=3\)), FM2NP is run on \((52,\ (52-1*2^3*5)) = (52,\ 12)\), since \(P_3 = P_2 - O_2*2^{4}*5 = 52 - 0 = 52\) (since \(O_2 = 0\)) and \(v_{3} =1\). Since \(52 > 12\), this call will return shares of one, i.e. \(O_3= 1\).

    5. (e)

      On the next iteration (\(i=4\)), FM2NP is run on \((12, (12-1*2^2*5)) = (12, -8) = (12, 56)\), since \(P_4 = P_3 -O_3*2^{3}*5 = 52 - 40 = 12\) (since \(O_3 = 1\)), \(v_{2} = 1\), and in \({\mathbb {Z}}_{64}\) we have that \(-8=56\). Since \(12 < 56\), this call will return shares of zero, i.e. \(O_4 = 0\).

    6. (f)

      On the next iteration (\(i=5\)), FM2NP is run on \((12, (12-1*2^1*5)) = (12, 2)\), since \(P_5 = P_4 - O_4*2^{2}*5 = 12 - 0 = 12\) (since \(O_4 = 0\)) and \(v_{1} = 1\). Since \(12 > 2\), this call will return shares of one, i.e. \(O_5 = 1\).

    7. (g)

      On the last iteration (\(i=6\)), FM2NP is run on \((2, (2-1*2^0*5)) = (2, -3) = (2, 61)\), since \(P_6 = P_5 - O_5*2^{1}*5 = 12 - 10 = 2\) (since \(O_5 = 1\)), and \(v_{0} = 1\) (by definition), and in \({\mathbb {Z}}_{64}\) we have that \(-3=61\). Since \(2 < 61\), this call will return shares of zero, i.e. \(O_6 = 0\).

    Therefore, \(O = (0, 0, 0, 1, 0, 1, 0)\), which is the binary representation of 10, as desired.

3.3 Correctness of the Division Protocol

We provide a short proof sketch that the output of the Division Protocol in Sect. 3.1 is correct. Note that if the quotient Q has binary representation \(Q=q_{\lambda }\dots q_1 q_0\), then the division protocol outputs (shares of) the binary digits as \(q_i = O_{\lambda -i}\). In other words, the output \(O_i\) on iteration i corresponds to the binary digit \(q_{\lambda -i}\).

The proof proceeds with a double-induction on the size of the dividend P and the domain size \(\lambda \), arguing that for a fixed dividend P, the ith coordinate \(O_i\) (which corresponds to binary digit \(q_{\lambda -i}\) of the output) is computed correctly as long as the division protocol correctly handles all dividends less than P and that all higher-order binary digits were computed correctly.

Base Case \(i = 0\). Note that \(v_{\lambda }\) is ‘1’ when \(2^{\lambda } \cdot D < N\) (arithmetic in \({\mathbb {Z}}\)). If \(v_{\lambda } = 0\), then \(2^{\lambda } \cdot D \ge N\) and thus \(2^{\lambda } \cdot D > P\). Notice that FM2NP will return \(O_0 = 0\) for this case, as desired. On the other hand, if \(v_{\lambda } =1\), then \(2^{\lambda } \cdot D < N\), and thus \(-N < -2^{\lambda } \cdot D\), so \(P - 2^{\lambda } \cdot D\) (mod N) will “wrap-around” at most once. In particular, \(2^{\lambda } \cdot D>P\) if and only if \(P \le P - 2^{\lambda } \cdot D\) (mod N), as desired.

Induction Step If \(v_{\lambda - i} = 0\), then the argument is the same as the Base Case above. Otherwise \(v_{\lambda - i} = 1\), and we proceed based on whether any of the higher-order bits of the quotient Q were 1:

  • Case 1: All earlier outputs \(O_j\) were ‘0’: \(\forall \,0 \le j < i: \ O_j = 0\).

    By the induction hypothesis (on i), outputs \(\{O_0, \dots , O_{i - 1}\}\) were all generated correctly, and since they were all zero, \(P < 2^{\lambda - i + 1} \cdot D\). Meanwhile, we are considering the case that \(v_{\lambda - i} = 1\), and so \(N > 2^{\lambda - i} \cdot D\), and thus \(P - 2^{\lambda - i} \cdot D\) (mod N) will “wrap-around” exactly once if and only if \(2^{\lambda -i} \cdot D > P\). Thus, \(O_i\) will equal ‘1’ if and only if \(P - 2^{\lambda -i} \cdot D\) is less than P, which will happen if and only if \(P - 2^{\lambda -i} \cdot D\) (mod N) does not wrap-around, which happens if and only if \(2^{\lambda -i} \cdot D \le P\), as desired.

  • Case 2: At least one earlier output was ‘1’. Let l denote the value whose higher-order bits are defined by the outputs \(\{O_0 \ O_1 \ \dots \ O_{i - 1}\}\), i.e. \(l = \sum _{j=0}^{i-1} 2^{\lambda - j} \cdot O_j\), and note that \(l >0\) since we are in the case that at least one of the outputs \(\{O_0, \dots , O_{i - 1}\}\) was ‘1’. By the induction hypothesis, the outputs \(\{O_0, \dots , O_{i - 1}\}\) were generated correctly (i.e. they represent the high-order bits of the quotient Q), and hence \(l \cdot D \le P\) but \((l + 2^{\lambda - i + 1}) \cdot D > P\) (all arithmetic in \({\mathbb {Z}}\)). On the ith iteration, \(P_i = P - l \cdot D\), which is less than P because \(l > 0\), and thus by induction (on the size of the dividend), the division protocol will correctly find the quotient of \(P_i\), i.e. all lower-order bits of Q will be correctly computed. \(\square \)

4 The Random Value Protocol (RVP)

In this section, we discuss how two parties can choose a value \(R \in {\mathbb {Z}}_Q\) uniformly at random, where \(Q \in {\mathbb {Z}}_N\) for a publicly known N but unknown Q (the two parties share Q (mod N)).

Definition 2

Let \(Q=Q^A + Q^B (\hbox {mod}\ N)\) be an arbitrary positive integer shared between Alice and Bob. A Random Value Protocol is a protocol run between Alice and Bob that outputs (shares of) a value \(R \in [0..Q-1]\), where R has been sampled uniformly at random from \([0..Q-1]\).

Before we describe the protocol, we provide motivation for why the problem is interesting. With a secure division protocol in hand (as in Sect. 3.1), a naive solution to the RVP is to have the parties choose (shares of) a random value \(R' \in {\mathbb {Z}}_N\), and then use the secure division protocol to compute \(R = R'\) (mod Q). The problem with this approach is that the resulting distribution will not represent a uniform distribution over \({\mathbb {Z}}_Q\), and its deviation from uniform may be statistically significant (in terms of satisfying (1) and (2)). In particular, if Q does not divide N, then the modulus \(\bar{N} \in [0..Q-1]\) of N in \({\mathbb {Z}}_Q\) is non-zero and \(R = R'\) (mod Q) will NOT be distributed uniformly in \([0..Q-1]\), as R will be slightly more likely to lie in \([0..\bar{N}-1]\) than in \([\bar{N}..Q-1]\). To quantify this, if \(N=DQ + \bar{N}\), then the probability that \(R \in [0..\bar{N}-1]\) is \(\bar{N}(D+1)/N\), while the probability that \(R \in [\bar{N}..Q-1]\) is \((Q-\bar{N})D/N\). To see that this may produce a non-negligible error in generating a uniformly distributed value for R in \([0..Q-1]\), consider as an example \(Q = 2N/7\). Then \(D = 3\) and \(\bar{N} = N/7\), so R lies in the first half of Q (i.e. in \([0..\bar{N}-1]\), since \(\bar{N} = Q/2\)) with probability \(4/7 =\bar{N}(D+1)/N\) versus a probability of 3/7 of lying in the second half of Q (i.e. \([Q/2..Q-1]\)).
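
This bias can be verified exactly by counting preimages, e.g. with the following small sketch (function name ours):

```python
def naive_mod_counts(N, Q):
    """For each residue r in [0..Q-1], count how many of the N equally
    likely values R' in Z_N satisfy R' mod Q == r."""
    counts = [0] * Q
    for Rp in range(N):
        counts[Rp % Q] += 1
    return counts
```

For instance, with \(N = 70\) and \(Q = 20\) (so \(Q = 2N/7\), \(D = 3\), \(\bar{N} = 10\)), every residue in \([0..9]\) has \(D+1 = 4\) preimages while every residue in \([10..19]\) has only \(D = 3\), so R lands in the first half of \([0..Q-1]\) with probability \(40/70 = 4/7\), matching the computation above.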

Returning to the question of security, if the functions \(f^A\) and \(f^B\) in Definition 1 involve drawing a value R uniformly from \({\mathbb {Z}}_Q\) (for some unknown, shared value Q), then having Alice and Bob generate R as described in the naive RVP above will make it impossible to find simulators as in (1) and (2). We therefore need to find a way to sample uniformly from \({\mathbb {Z}}_Q\) without revealing any information about Q to either party.

We begin by defining precisely the notion that Alice and Bob will not have any knowledge about the random value R that is selected by a Random Value Protocol.

Definition 3

Let \(\hbox {VIEW}^A\) denote Alice’s view of an execution of a RVP. We say that Alice and Bob have chosen R obliviously with respect to Alice’s view if:

$$\begin{aligned} \forall \widehat{Q} < N, \ \forall \alpha \in {\mathbb {Z}}_{\widehat{Q}}:\quad \text{ Pr }[R = \alpha | \text{ VIEW }^A] = \frac{1}{\widehat{Q}}. \end{aligned}$$
(7)

If (7) holds for both parties’ views, we say that the RVP samples R obliviously for both parties.

Notice that Definition 3 (and in particular (7)) is independent of any specific value of \(Q < N\). We will say that a RVP is secure if in addition to Definition 3, Q remains hidden from both parties (as in Definition 1).

4.1 Protocol Overview

The protocol proceeds in two symmetric phases. We first describe how Alice and Bob can generate \(S \in {\mathbb {Z}}_Q\) such that S is chosen uniformly at random (in \({\mathbb {Z}}_Q\)), where Bob may have partial knowledge of its value but Alice is oblivious to the value of S. The partial knowledge that Bob has about S does not reveal anything about Q (by contrast, knowing S exactly would already leak too much information about Q, since S is chosen uniformly from \([0..Q-1]\)). The two parties then form \(T \in {\mathbb {Z}}_Q\) in an analogous manner but with their roles reversed, so that it is Alice who may have partial knowledge about T, and Bob who is oblivious. From these they will set \(R=S+T\) (mod Q).

We present first a brief high-level description of how they generate \(S \in {\mathbb {Z}}_Q\). Imagine the integers 0 through \(Q-1\) to be partitioned into sets whose sizes are a power of two, as determined by the binary representation of Q. For example, if \(Q = 37 = 100101\), then [0..36] is partitioned into the sets of size 1, 4, and 32: \(\{ 0 \}, [1..4], [5..36]\). For each set, a value is chosen uniformly at random (amongst all numbers in that set), so that if there are m sets, then m random values \(\{ S_1, \dots , S_m \}\) are chosen. Finally, S will be set to one of these m values, according to a probability that depends on the size of each set. More specifically, if the ith set has size \(2^{j}\), then we set S to be \(S_i\) with probability \(\frac{2^j}{Q}\). Continuing with the above example for \(Q = 37\), then there are \(m = 3\) sets, and \(S_1\) is chosen uniformly at random from \(\{0\}\) (i.e. necessarily \(S_1 = 0\)), \(S_2\) is chosen uniformly from [1..4], and \(S_3\) uniformly from [5..36]. Then the final value S will be set to \(S_1\) with probability 1/37, or \(S_2\) with probability 4/37, or \(S_3\) with probability 32/37.

Claim The above described protocol samples values uniformly from \([0..Q-1]\).

Proof Sketch Of the Q values in \([0..Q-1]\), we want to argue that each is selected with probability 1/Q. Let \(Q=q_{\lambda }\dots q_1 q_0\) denote the binary representation of Q. Notice that the partitioning of the values \([0..Q-1]\) into the m sets \(\{S_1, S_2, \dots , S_m\}\) as described above is a complete and disjoint partitioning: every value in \([0..Q-1]\) lies in exactly one set \(S_i\) (this is based on the choice of sizes of the sets, in accordance to the binary representation of Q). Then for any value \(v \in [0..Q-1]\), the probability that v is selected by the above protocol is the probability that v is the chosen representative in its set, times the probability that its set is the one that is selected. If v’s set has size \(2^j\), then the former probability is \(1/2^j\), and the latter probability is \(2^j/Q\), and thus v gets selected with probability 1/Q, as desired. \(\square \)
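
The partition-based sampler just described can be sketched as follows (Python, run in the clear with full knowledge of Q; names ours):

```python
import random

def partition_sample(Q, rng=random):
    """Sample uniformly from [0..Q-1]: partition the range into sets of
    power-of-two size given by the binary digits of Q, then select a set
    with probability proportional to its size and return a uniform
    representative of it."""
    sets = []
    for i in range(Q.bit_length()):
        if (Q >> i) & 1:
            a_i = Q % (2**i)            # set covers [a_i .. a_i + 2^i - 1]
            sets.append((a_i, 2**i))
    a_i, size = rng.choices(sets, weights=[s for _, s in sets])[0]
    return a_i + rng.randrange(size)
```

For \(Q = 37\) this produces the sets \(\{0\}, [1..4], [5..36]\) with selection probabilities 1/37, 4/37, 32/37, exactly as in the example above.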

While the above protocol is a relatively straightforward way of sampling uniformly at random from an integral range, it requires knowledge of (the binary representation of) Q, which ultimately we want to keep hidden. We make the following two modifications that make the above protocol slightly more complex, but that will be more amenable to a secure extension that keeps Q hidden:

  1. 1.

    Translation For each set \(S_i = [a_i..b_i]\), rather than choosing an element randomly in \([a_i..b_i]\), it is equivalent to translate this interval to start at zero and choose a random value in \([0..(b_i - a_i)]\), and then add \(a_i\). Notice that the translation amount (\(a_i\)) is determined by the lower-order binary digits. For example, when \(Q = 37 = 100101\), the third set \(S_3\) (corresponding to the \(2^5\) digit) had the interval [5..36], so the translation amount (five) equals the binary number represented by the lower-order binary digits of Q.

  2. 2.

    Selection Above, we described the relevant sets \(\{S_1, \dots , S_m\}\) based on the m non-zero binary digits of Q. More generally, for every binary digit \(q_i \in \{q_0, q_1, \dots , q_{\lambda }\}\), independent of whether it is ‘0’ or ‘1’, we can form a corresponding set \(S_i := [0..2^i -1]\) (notice the domain of this interval has been translated to zero, as per Translation above) and choose a value \(v_i\) uniformly in \([0..2^i - 1]\). These sets still correspond to the binary digits of Q, so the above protocol (which created only m sets for the m non-zero binary digits of Q) extends to the present scenario by insisting that we select a set \(S_i \in \{S_0, S_1, \dots , S_{\lambda }\}\) with probability:

    $$\begin{aligned} \begin{array}{ll} 0 &{} \quad \text{ if } \ q_i = 0 \\ 2^i/Q &{} \quad \text{ if }\ q_i = 1 \end{array} \end{aligned}$$

We formalize the extended protocol described above.

(Insecure) Random Value Protocol Let Q be an arbitrary positive integer, and let \(Q=q_{\lambda }\dots q_1 q_0\) denote its binary representation. Sample a value from \([0..Q-1]\) uniformly at random by:

  1. 1.

    For each \(0 \le i \le \lambda \), set \(S_i = [0..2^i -1]\), and then choose a value \(v_i \leftarrow S_i\) uniformly at random. Let \(\mathbf {v} = (v_0, \dots , v_{\lambda }) \in {\mathbb {Z}}^{1 + \lambda }\) denote the vector whose coordinates are the chosen values of \(\{v_i\}\).

  2. 2.

    For each \(0 \le i \le \lambda \), set (the translation amount) \(a_i := q_{i -1} \dots q_1 q_0\) (define \(a_0 := 0\)). That is, \(a_i\) is the quantity described by the i lowest-order binary digits of Q. Let \(\mathbf {a} = (a_0, a_1, \dots , a_{\lambda }) \in {\mathbb {Z}}^{1 + \lambda }\) denote the vector whose coordinates are the values \(\{a_i\}\).

  3. 3.

    Select a digit \(i \in [0..\lambda ]\) with probability:

    $$\begin{aligned} \begin{array}{ll} 0 &{} \quad \text{ if } \ q_i = 0\\ 2^i/Q &{} \quad \text{ if }\ q_i = 1 \end{array} \end{aligned}$$
    (8)

    Let \(\mathbf {e}_i \in {\mathbb {Z}}_2^{1+\lambda }\) denote the characteristic vector with the ‘1’ in the ith coordinate.

  4. 4.

    Output value \(v_i + a_i\), which can be expressed as:

    $$\begin{aligned} v_i + a_i = \mathbf {e}_i \cdot (\mathbf {v} + \mathbf {a}), \end{aligned}$$
    (9)

    where \(\cdot \) denotes the inner-product (in \({\mathbb {Z}}^{1+\lambda }\)).

Notice that Steps 1–2 of the above protocol are extraneous: the digit i could have been selected (as in Step 3) first, and then Steps 1 and 2 could have been done just once for this value of i. However, it will be convenient to describe the protocol as above, as the secure version will follow this pattern.

Claim The above (Insecure) Random Value Protocol samples values uniformly from \([0..Q-1]\).

Proof Sketch Let \(v \in [0..Q-1]\). We want to argue v will be selected with probability 1/Q. Let \(j \in [0..\lambda ]\) denote the lowest index such that \(v < q_{j} \dots q_1 q_0\) (such a j necessarily exists because \(v < Q = q_{\lambda } \dots q_1 q_0\)), and notice that necessarily \(q_j = 1\) (otherwise minimality of choice of j is contradicted). Let \(a_j := q_{j-1} \dots q_1 q_0\) (define \(a_j = 0\) if \(j=0\)). Then it is straightforward to verify that v will be selected if and only if:

  1. A.

    In Step 3, binary digit j was selected.

  2. B.

    In Step 1, when \(i = j\), \(v_i = (v - a_j)\) was chosen.

  3. C.

    In Step 2, when \(i = j\), \(a_i = a_j\) was the translation amount.

Since \(q_j = 1\), the probability of (A) is \(2^j/Q\) by (8). The probability of (B) is \(1/2^j\) (all that must be argued here is that \((v - a_j)\) lies in the interval \([0..2^j - 1]\), which is immediate by choice of j and definition of \(a_j\)). The probability of (C) is 1, by definition of \(a_j\). Thus, the probability of v being selected is \(2^j/Q \cdot 1/2^j \cdot 1 = 1/Q\), as desired. \(\square \)
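
The four steps of the (Insecure) Random Value Protocol translate directly into code. The following sketch also accepts any \(\lambda \ge \lfloor \log _2 Q \rfloor \), anticipating the use of \(\lambda = \lfloor \log _2 N \rfloor \) discussed below:

```python
import random

def insecure_rvp(Q, lam=None, rng=random):
    """Steps 1-4 of the (Insecure) Random Value Protocol, in the clear."""
    if lam is None:
        lam = Q.bit_length() - 1                   # lam = floor(log2 Q)
    q = [(Q >> i) & 1 for i in range(lam + 1)]     # binary digits of Q
    # Step 1: v_i chosen uniformly from S_i = [0..2^i - 1]
    v = [rng.randrange(2**i) for i in range(lam + 1)]
    # Step 2: translation amounts a_i = sum_{j<i} 2^j q_j = Q mod 2^i
    a = [Q % (2**i) for i in range(lam + 1)]
    # Step 3: select digit i with probability 2^i/Q if q_i = 1, else 0
    i = rng.choices(range(lam + 1),
                    weights=[q[j] * 2**j for j in range(lam + 1)])[0]
    # Step 4: output v_i + a_i, i.e. the inner product e_i . (v + a)
    return v[i] + a[i]
```

Choosing a larger \(\lambda \) only adds digits with \(q_i = 0\), which are never selected in Step 3, illustrating why the approximation affects efficiency but not correctness.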

In terms of extending the Insecure Random Value Protocol to a secure version that hides Q, notice that the formation of the sets \(\{S_i\}\) in Step 1 can be done (mostly) independently from Q; namely, one only needs to know (an upper bound for) \(\lfloor \log _2 Q \rfloor \). Since we will be utilizing the RVP in a setting where \(Q<N\) for known N, we can take \(\lambda = \lfloor \log _2 N \rfloor \) (approximating \(\lambda \) this way will not affect correctness of the (Insecure) Random Value Protocol, only efficiency). Also, Step 2 can be done with a secure (scalar) multiplication protocol, since:

$$\begin{aligned} a_i = q_{i -1} \dots q_1 q_0 = \sum _{j=0}^{i-1}2^j \cdot q_j. \end{aligned}$$

Also, Step 4 can be achieved with a secure Scalar Product Protocol (SPP), provided the parties have (shares of) \(\mathbf {v}\), \(\mathbf {a}\), and \(\mathbf {e}_i\).

Thus, the difficult part of extending the Insecure Random Value Protocol to a secure version (that hides Q) is Step 3: Sampling an index \(i \in [0..\lambda ]\) according to the probabilities in (8). Indeed, such a sampling has a similar flavor to the original RVP itself (how to sample from a domain where the domain size is unknown), except that we’ve shifted the problem from an unknown domain size Q to an unknown sampling distribution; i.e. now the domain \([0..\lambda ]\) is known, but selection is no longer uniform but rather is according to a probability (0 or \(2^i/Q\)) that depends on Q. The ‘Q’ that appears in the denominator of (8) is independent of i, and can be thought of as a normalizing factor for the desired distribution. The challenge will be respecting the \(q_i\) values when deciding if an index i should be chosen with probability zero or probability proportional to \(2^i\).

To achieve Step 3 in a secure setting (where Q must be hidden), we first describe a Reordering Protocol (which is independent of Q, except for dependence on the publicly known \(\lambda \)) that reorders the integers \([0..\lambda ]\) according to some prudently chosen characteristic. This reordering can be viewed as allowing arbitrary normalization of a given distribution, which in particular will allow the “normalization by Q” that appears in (8). We first define the key characteristic of a Reordering Protocol, and then demonstrate how it can be used to sample index i from \([0..\lambda ]\) according to the distribution prescribed by (8).

Definition 4

A protocol that generates a reordering of the integers \([0..\lambda ]\) in a manner that satisfies the following property will be called a Reordering Protocol:

  • Reordering Property For any set of indices \({\mathcal {I}} \subseteq [0..\lambda ]\) and for any index \(i \in {\mathcal {I}}\), the probability that i appears in the reordered sequence before all other indices in \({\mathcal {I}}\) is given by:

    $$\begin{aligned} \text{ Probability }\ i\ \text{ appears } \text{ first } \text{ among } \text{ all } \text{ indices } \text{ in }\ {\mathcal {I}} = \ 2^i \ / \ \sum _{j \in \ {\mathcal {I}}} 2^{j} \end{aligned}$$
    (10)

The intuition as to why a protocol that satisfies the Reordering Property is useful as a subprotocol to achieve Step 3 of the (Insecure) RVP is that it enables “normalization” by an arbitrary constant Q. In particular, the set of indices \({\mathcal {I}}\) will be taken to be the indices of the binary digits of Q that equal ‘1’. Then the Reordering Property guarantees that if we sample from the indices \([0..\lambda ]\) by using a Reordering Protocol, and then output the first index in the reordered sequence that appears in \({\mathcal {I}}\), then this is equivalent to sampling from \([0..\lambda ]\) as in (8).

The specific implementation of a Reordering Protocol is not important, so long as it obeys the Reordering Property. One way of producing such a reordering comes from the classic “choosing balls from a bag” scenario from basic Probability Theory, as follows. A bag is filled with balls each marked with an index in \([0..\lambda ]\). For each index \(i \in [0..\lambda ]\), there will be \(2^i\) balls placed in the bag; i.e. one ball with index 0, two balls with index 1, etc. Then a Reordering Protocol can be achieved by selecting a ball at random from the bag, and using its index as the first number in the reordered sequence. Then, all balls with that index are removed from the bag, and the procedure continues. For completeness, we formalize this procedure in an Example Reordering Protocol (“Appendix B”), where we also provide a proof that it indeed satisfies the Reordering Property.
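
The “balls from a bag” procedure can be sketched directly: each draw selects a remaining index i with probability proportional to its \(2^i\) balls, after which all of that index’s balls are removed:

```python
import random

def reordering(lam, rng=random):
    """Balls-from-a-bag reordering of [0..lam]: index i has 2^i balls;
    repeatedly draw a ball, record its index, then remove all balls
    carrying that index."""
    remaining = list(range(lam + 1))
    order = []
    while remaining:
        i = rng.choices(remaining, weights=[2**j for j in remaining])[0]
        order.append(i)
        remaining.remove(i)
    return order
```

Each call returns a permutation of \([0..\lambda ]\); the proof that this distribution on permutations satisfies the Reordering Property is deferred to “Appendix B”.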

We now describe how a Reordering Protocol can be used as a subprotocol to instantiate Step 3 of the (Insecure) Random Value Protocol.

(Insecure) RVP Step 3 Let Q be an arbitrary positive integer, let \(\lambda \ge \lfloor \log _2 Q \rfloor \), and let \(Q=q_{\lambda } \dots q_1 q_0\) be its binary representation. This protocol outputs a value \(i \in [0..\lambda ]\) as follows:

  1. A.

    Run the Reordering Protocol to get a reordering of the integers in \([0..\lambda ]\). Let \(\tau : [0..\lambda ] \rightarrow [0..\lambda ]\) denote this reordering, and let \(\sigma = \tau ^{-1}\) denote the inverse mapping.

  2. B.

    Let i denote the first value that appears in the reordering such that \(q_i = 1\). Output the characteristic vector \(\mathbf {e}_i = (0, \dots , 0, 1, 0, \dots , 0)\), whose ‘1’ is in the ith coordinate (where coordinate labeling in this vector is 0-based, so that, e.g. \(\mathbf {e}_0 = (1, 0, \dots , 0)\)).

For example, for \(Q=37 = 100101\), we have \(\lambda = 5\), and so the Reordering Protocol will be applied to the integers [0..5]. Suppose the Reordering Protocol outputs: \(\{4, 1, 5, 3, 2, 0\}\). Then the first value appearing is ‘4’, which is rejected because \(q_4 = 0\). The next value ‘1’ is similarly rejected (\(q_1 = 0\)), and hence \(i=5\) is selected in Step B of the (Insecure) RVP Step 3 as the first value appearing such that the corresponding binary digit of Q (\(q_5\)) is ‘1’.

Claim The (Insecure) RVP Step 3 will output \(\mathbf {e}_i\), where \(i \in [0..\lambda ]\) has probability given by (8).

Proof Sketch We utilize the Reordering Property of the Reordering Protocol called in Step A, applied to the set of indices \({\mathcal {I}} := \{i \in [0..\lambda ] \ | \ q_i = 1\}\). Let \(i \in [0..\lambda ]\) be an arbitrary index; we want to show that the (Insecure) RVP Step 3 will output \(\mathbf {e}_i\) with probability given by (8). If \(q_i = 0\), then i is selected with probability zero (Step B only selects i with \(q_i = 1\)), as required. If \(q_i = 1\), then i is selected in Step B if and only if i is the first index in \({\mathcal {I}}\) to appear in the reordering. By the Reordering Property, this happens with probability \(2^i / \sum _{j \in {\mathcal {I}}} 2^j = 2^i / Q\), since the indices in \({\mathcal {I}}\) are exactly the non-zero binary digits of Q, as required. \(\square \)
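
For small \(\lambda \), the claim can be checked exactly by enumerating every ordering together with the probability that the balls-from-a-bag procedure produces it. A sketch using exact rational arithmetic (names ours):

```python
from itertools import permutations
from fractions import Fraction

def step3_distribution(Q, lam):
    """Exact output distribution of (Insecure) RVP Step 3: reorder
    [0..lam] via balls-from-a-bag, then output the first index i in the
    ordering whose binary digit q_i of Q equals 1."""
    q = [(Q >> i) & 1 for i in range(lam + 1)]
    dist = [Fraction(0)] * (lam + 1)
    for perm in permutations(range(lam + 1)):
        # exact probability that the bag yields precisely this ordering
        remaining, p = set(perm), Fraction(1)
        for i in perm:
            p *= Fraction(2**i, sum(2**j for j in remaining))
            remaining.remove(i)
        winner = next(i for i in perm if q[i] == 1)
        dist[winner] += p
    return dist
```

For \(Q = 37\) and \(\lambda = 5\), the computed distribution is \((1/37,\ 0,\ 4/37,\ 0,\ 0,\ 32/37)\), matching (8).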

Notice that Step A of the (Insecure) RVP Step 3 can be done independently from Q (except for \(\lambda \ge \lfloor \log _2 Q \rfloor \), which as mentioned earlier can be taken to be \(\lambda =\lfloor \log _2 N \rfloor \) for a public value \(N > Q\)). Meanwhile, together with a trick to convert shares of \(\{q_i\}\) to shares of \(\{q_{\sigma (i)}\}\) such that \(\sigma \) is hidden from one party, a secure version of Step B can be obtained directly from a secure Scalar Product Protocol (SPP) and a Nested Product Protocol (NPP). The reduction of Step B to these two subprotocols (plus the reindexing trick) is due to the fact that we can express \(\mathbf {e}_i\) as:

$$\begin{aligned} \mathbf {e}_i&= q_{\sigma (0)} \cdot \mathbf {e}_{\sigma (0)} + q_{\sigma (1)} \cdot (1 - q_{\sigma (0)}) \cdot \mathbf {e}_{\sigma (1)} + \cdots \nonumber \\&\quad +q_{\sigma (\lambda )} \cdot \left[ (1 - q_{\sigma (0)}) \dots \cdot (1 - q_{\sigma (\lambda - 1)}) \right] \cdot \mathbf {e}_{\sigma (\lambda )} \nonumber \\&= \sum _{j=0}^{\lambda } \left[ q_{\sigma (j)} \cdot \left( \prod _{k=0}^{j-1} (1 - q_{\sigma (k)}) \right) \cdot \mathbf {e}_{\sigma (j)} \right] \end{aligned}$$
(11)

See the Compute \(\mathbf {e}_i\) Protocol in “Appendix C” for details.
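
Formula (11) simply zeroes out every reordered index until the first one with \(q_{\sigma (j)} = 1\) is reached. A small sketch, where `sigma` lists the reordered sequence so that `sigma[j]` \(= \sigma (j)\):

```python
def compute_e_i(q, sigma):
    """Evaluate Eq. (11): the characteristic vector e_i of the first
    index in the reordered sequence sigma whose binary digit q is 1."""
    n = len(q)
    e = [0] * n
    for j in range(n):
        coeff = q[sigma[j]]
        for k in range(j):
            coeff *= 1 - q[sigma[k]]   # kills the term once a '1' digit has passed
        e[sigma[j]] += coeff           # contributes coeff * e_{sigma(j)}
    return e
```

On the running example (\(Q = 37\), with digits \(q_0, \dots , q_5 = 1,0,1,0,0,1\) and reordering \(\{4, 1, 5, 3, 2, 0\}\)), this yields \(\mathbf {e}_5\).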

We are now ready to put together these ideas and provide a formal description of our (Secure) Random Value Protocol.

4.2 Description of the Protocol

As mentioned at the start of Sect. 4.1, the protocol first has Alice and Bob sample a value \(S \leftarrow {\mathbb {Z}}_Q\) uniformly at random, such that Q remains hidden to both parties and S is sampled obliviously with respect to Alice’s view (as in Definition 3). That is, neither party learns anything about Q, and Alice knows nothing about S (Bob may have partial information about S, but this information will not leak any information about Q). Next, the parties run the same protocol with their roles reversed, generating a uniform \(T \leftarrow {\mathbb {Z}}_Q\) such that neither party learns anything about Q and Bob knows nothing about T. Next, they run the Addition Modulo Unknown Value Protocol to obtain the final output \(R := S + T \ (\text{ mod } Q)\). Thus, a secure RVP will follow from the following protocol, which generates S obliviously for one party’s view, and such that Q remains private (with respect to both parties’ views).

RVP Subprotocol for Generating \(S \leftarrow {\mathbb {Z}}_Q\) Obliviously (with respect to Alice’s view)

Input Public parameters N and \(\lambda =\lfloor \log _2 N \rfloor \) are known to both parties; Alice and Bob additively share an unknown value \(Q = Q^A + Q^B < N\).

Output Alice and Bob have (shares of) a value \(S \in {\mathbb {Z}}_Q\) such that:

  1.

    S is chosen uniformly at random in \({\mathbb {Z}}_Q\).

  2.

    Alice is oblivious to the value S chosen (as in Definition 3); no requirements are made regarding Bob’s knowledge of S (Bob will have partial knowledge).

  3.

    Alice and Bob learn nothing about Q.

Cost This protocol will add \(O(\lambda ^2)\) to communication.

Protocol Description Note that this protocol follows the framework of the (Insecure) RVP presented in Sect. 4.1, with the appropriate modifications to ensure Q remains hidden.

  0.

    Alice and Bob run the To Binary Protocol to convert their shares of Q to shares of each bit \(\{q_i\}\) in the binary representation of Q: \(q_i = q^A_i + q^B_i \ (\hbox {mod}\ N)\).

  1.

    For each \(0 \le i \le \lambda \), Bob sets \(S_i = [0..2^i - 1]\), and chooses a value \(v_i \leftarrow S_i\) uniformly at random. Let \(\mathbf {v} = (v_0, \dots , v_{\lambda }) \in {\mathbb {Z}}^{1 + \lambda }\) denote the vector whose coordinates are the chosen values of \(\{v_i\}\).

  2.

    For each \(0 \le i \le \lambda \), Alice and Bob locally compute shares of (the translation amount) \(a_i := q_{i -1} \dots q_1 q_0\) (define \(a_0 := 0\)). Namely, the parties set \(a^A_0 = 0 =a^B_0\), and then locally compute for each \(1 \le i \le \lambda \):

    $$\begin{aligned} \text{ Alice: } \quad a^A_i&= q^A_{i -1} \dots q^A_1 q^A_0 =\sum _{j=0}^{i-1}2^j \cdot q^A_j \\ \text{ Bob: } \quad a^B_i&= q^B_{i -1} \dots q^B_1 q^B_0 = \sum _{j=0}^{i-1}2^j \cdot q^B_j, \end{aligned}$$

    where all arithmetic above is computed modulo N. Note that \(a_i = a_i^A + a_i^B (\text{ mod } N)\). Let \(\mathbf {a} = (a_0, a_1, \dots , a_{\lambda }) \in {\mathbb {Z}}^{1 + \lambda }\) denote the vector whose coordinates are the values \(\{a_i\}\).

  3.

    Alice and Bob run the following subprotocols to obtain (shares of) the characteristic vector \(\mathbf {e}_i = (0, \dots , 0, 1, 0, \dots , 0) \in {\mathbb {Z}}_2^{1+\lambda }\) where the ‘1’ is in coordinate i with probability as in (8):

    A.

      Bob runs a Reordering Protocol to get a reordering of the integers in \([0..\lambda ]\). Let \(\tau : [0..\lambda ] \rightarrow [0..\lambda ]\) denote this reordering, and let \(\sigma = \tau ^{-1}\) denote the inverse mapping.

    B.

      Alice and Bob run the Compute \(\mathbf {e}_i\) Protocol to output (shares of) \(\mathbf {e}_i\).

  4.

    Alice and Bob run the Scalar Product Protocol to obtain (shares of) \(S := \mathbf {e}_i \cdot (\mathbf {v} + \mathbf {a})\), where “\(\cdot \)” denotes the inner-product (in \({\mathbb {Z}}^{1+\lambda }\)).
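For intuition, the whole subprotocol can be sketched in the clear, with all sharing and encryption stripped away. The sketch assumes (our reading of Eq. (8), which is not reproduced here) that index i is selected with probability \(2^i q_i / Q\); under that assumption the intervals \([a_i, a_i + 2^i)\) over the set bits i of Q partition \([0..Q-1]\), so \(S = a_i + v_i\) is uniform in \({\mathbb {Z}}_Q\):

```python
import random

def sample_S_plaintext(Q, rng=random):
    """Insecure reference for the RVP subprotocol: all values in the clear.
    Assumes the selection probability of Eq. (8) is Pr[i] = 2^i * q_i / Q."""
    lam = Q.bit_length() - 1
    q = [(Q >> i) & 1 for i in range(lam + 1)]     # Step 0: bits of Q
    a = [Q % (1 << i) for i in range(lam + 1)]     # Step 2: a_i = value of low i bits (a_0 = 0)
    set_bits = [i for i in range(lam + 1) if q[i]]
    i = rng.choices(set_bits, weights=[1 << b for b in set_bits])[0]  # Step 3 (idealized)
    v = rng.randrange(1 << i)                      # Step 1: v_i uniform in [0..2^i - 1]
    return a[i] + v                                # Step 4: S = e_i . (v + a) = a_i + v_i

def partition_check(Q):
    """The intervals [a_i, a_i + 2^i) over set bits i tile [0..Q-1] exactly."""
    vals = []
    for i in range(Q.bit_length()):
        if (Q >> i) & 1:
            a_i = Q % (1 << i)
            vals.extend(range(a_i, a_i + (1 << i)))
    return sorted(vals) == list(range(Q))

assert partition_check(5) and partition_check(11) and partition_check(12)
assert 0 <= sample_S_plaintext(37) < 37
```

For example, \(Q = 5 = 101_2\) yields the intervals \(\{0\}\) (from bit 0) and \(\{1,2,3,4\}\) (from bit 2, since \(a_2 = 1\)), which tile \({\mathbb {Z}}_5\).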

Lemma 4.1

The above RVP Subprotocol for Generating \(S \leftarrow {\mathbb {Z}}_Q\) satisfies the desired output properties 1–3:

  • Correctness \(S \leftarrow {\mathbb {Z}}_Q\) is chosen uniformly at random.

  • Obliviousness Alice is oblivious to the output value S, as in Definition 3.

  • Security Q is securely hidden from both parties, as in Definition 1.

Proof

Correctness follows from correctness of the (Insecure) Random Value Protocol (which was proven in Sect. 4.1). Obliviousness of Alice’s view (in the sense of Definition 3) follows from:

  1.

    Obliviousness of Step 0 follows from the security of the To Binary Protocol.

  2.

    Obliviousness of Step 1 is immediate, as Alice is not involved in this step.

  3.

    Obliviousness of Step 2 is immediate, since \(\{a^A_i\}\) can be computed from Alice’s shares of the binary digits \(\{q_i\}\) that she received from Step 0.

  4.

    Obliviousness of Step 3 follows from the security of the Compute \(\mathbf {e}_i\) Protocol.

  5.

    Obliviousness of Step 4 follows from the security of the Scalar Product Protocol.

Regarding security (that Q remains hidden from both parties), notice that while the choice of \(\{v_i\}\) in Step 1 gives Bob partial information about S, this step is independent of Q, and thus Bob can learn nothing about Q from it. Indeed, the only steps that pertain to Q are Steps 0, 2, 3B, and 4, and since each of these invokes a secure subprotocol, all knowledge of Q remains hidden. \(\square \)

4.3 Proof of Correctness, Obliviousness, and Security

Recall that the overall (secure) RVP proceeds by invoking the RVP Subprotocol for Generating \(S \leftarrow {\mathbb {Z}}_Q\) twice (once to generate S obliviously from Alice’s view and once to generate T obliviously from Bob’s view), followed by a single call to the Addition Modulo Unknown Value Protocol to add \(R = S + T \ (\hbox {mod}\ Q)\).

Theorem 4.2

The above described Random Value Protocol satisfies:

  1.

    Correctness \(R \leftarrow {\mathbb {Z}}_Q\) is chosen uniformly at random.

  2.

    Obliviousness Both parties are oblivious to the output value R, as in Definition 3.

  3.

    Security Q is securely hidden from both parties, as in Definition 1.

Proof

Correctness of the RVP Subprotocol for Generating \(S \leftarrow {\mathbb {Z}}_Q\) (that S is sampled uniformly from \({\mathbb {Z}}_Q\)) was demonstrated in Lemma 4.1. Correctness of the RVP then follows from:

  • Observation If Y is any fixed number in \({\mathbb {Z}}_Q\) and X is a uniform random variable in \({\mathbb {Z}}_Q\), then the random variable \(Z := Y + X\) (mod Q) is uniformly distributed in \({\mathbb {Z}}_Q\).

The following observation demonstrates how obliviousness of the RVP follows from obliviousness of the RVP Subprotocol for Generating \(S \leftarrow {\mathbb {Z}}_Q\) (which was proved in Lemma 4.1):

  • Observation If a party’s view includes knowledge of Y but no knowledge of X, then \(Z := Y + X\) (mod Q) is oblivious to that party.
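Both observations boil down to the fact that, for any fixed Y, the map \(X \mapsto Y + X \ (\hbox {mod}\ Q)\) is a bijection on \({\mathbb {Z}}_Q\): a uniform X yields a uniform Z, and the distribution of Z carries no information about Y. A short exhaustive check:

```python
from collections import Counter

Q = 7
for Y in range(Q):                                   # any fixed Y
    dist = Counter((Y + X) % Q for X in range(Q))    # X ranges uniformly over Z_Q
    assert all(dist[z] == 1 for z in range(Q))       # each value of Z is hit exactly once
```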

Based on composability of security [6], security follows from the security of the subprotocols used: the RVP Subprotocol for Generating \(S \leftarrow {\mathbb {Z}}_Q\) (security proved in Lemma 4.1), and the Addition Modulo Unknown Value Protocol, which itself consists of a secure SPP and FM2NP. \(\square \)

As an aside, we note that the above observations actually guarantee that this protocol chooses S obliviously and uniformly at random even if one of the parties is corrupted maliciously. The Random Value Protocol can therefore be used as a subprotocol in models allowing a malicious adversary, provided that the TBP, Compute \(\mathbf {e}_i\) Protocol, and SPP utilized by the RVP are all secure against a malicious adversary.

5 Two-Party k-Means Clustering Protocol

5.1 Notation and Preliminaries

Following the setup of [18], we assume that two parties, “Alice” and “Bob”, each hold (partial) data describing the d attributes of n objects (we assume Alice and Bob both know d and n). Their aggregate data comprises the (virtual) database \({\mathcal {D}}\), holding the complete information of each of the n objects. The goal is to design an efficient algorithm that allows Alice and Bob to perform k-means clustering on their aggregate data in a manner that protects their private data.

As mentioned in the Introduction, we are working in the model where our data points are viewed as living in \({\mathbb {Z}}^d_N\) for some large RSA modulus N chosen by Alice. Note that if Alice and Bob desire a lattice width of W and \(\mathtt {M}\) denotes the maximum Euclidean distance between points, then Alice will pick N sufficiently large to guarantee that \(N \ge \frac{n^2 \mathtt {M}^2}{W^2}\) (this inequality guarantees that the sum of all data points does not exceed N). Because Alice chooses the RSA modulus, Bob will be performing the bulk of the computation (on the encrypted data points).
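For concreteness, a small helper (the name is ours, purely illustrative) that computes the smallest admissible modulus and the corresponding \(\lambda = \lfloor \log _2 N \rfloor \) from the bound \(N \ge n^2 \mathtt {M}^2 / W^2\):

```python
import math

def min_modulus(n, M, W):
    """Return (N_min, lam) where N_min is the smallest integer satisfying
    N >= n^2 * M^2 / W^2 and lam = floor(log2 N_min).
    In the protocol, Alice would pick an RSA modulus N at least this large."""
    N_min = math.ceil((n * M) ** 2 / W ** 2)
    return N_min, N_min.bit_length() - 1

# e.g. 1000 points, maximum distance 100, lattice width 1:
assert min_modulus(1000, 100, 1) == (10**10, 33)
```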

We allow the data points to be arbitrarily partitioned between Alice and Bob (see [18]). This means that there is no assumed pattern to how Alice and Bob hold attributes of different data points (in particular, this subsumes the cases of vertically and horizontally partitioned data). We only demand that between them, each of the d attributes of all n data points is known by either Alice or Bob, but not both. As discussed in [27], attributes of the data points that are measured in units significantly larger than others will dominate distance calculations. Alice and Bob may therefore wish to standardize the data before running a k-means clustering protocol on it. The manner in which this standardization is achieved depends on the nature of the data and we do not explore the possibilities here. Rather, we note that any such standardization can likely be achieved with the Scalar Product Protocol and a private Division Protocol (e.g. the one presented in Sect. 3.1). For a given data point \(\mathbf {D}_i \in {\mathcal {D}}\), we denote Alice’s share of its attributes by \(\mathbf {D}^A_i\), and Bob’s share by \(\mathbf {D}^B_i\).

5.2 Single Database k-Means Algorithms

The single database k-means clustering algorithm that we extend to the two-party setting was introduced by [24] and is summarized below. We chose this algorithm because under appropriate conditions on the distribution of the data, the algorithm is provably correct (as opposed to most other algorithms that are used in practice which have no such provable guarantee of correctness). Additionally, the Initialization Phase (or “seeding process”) is done in an optimized manner, reducing the number of iterations required in the Lloyd Step. In general, the number of iterations required in the Lloyd Step depends on the nature of the data: number of data-points, number of attributes/dimensions, distribution (values) of data points, etc., and hence there is no hard bound on the number of iterations that may be required. However, the protocol of [24] argues that if the data points enjoy certain “nice” properties, then the number of iterations is extremely small (i.e. with high probability, only two iterations are necessary). The number of iterations of the Lloyd Step has both communication as well as privacy implications, see “Appendix A” for a discussion.

The single database k-means clustering algorithm is as follows (see [24] for details):

Step I: Initialization This procedure chooses the cluster centers \({\varvec{\mu }}_1, \dots ,{\varvec{\mu }}_k\) according to (an equivalent version of) the protocol described in [24]:

  A.

    Center of Gravity Compute the center of gravity of the n data points and denote this by \(\mathbf {C}\):

    $$\begin{aligned} \mathbf {C} = \frac{\sum _{i=1}^n \mathbf {D}_i}{n} \end{aligned}$$
    (12)
  B.

    Distance to Center of Gravity For each \(1 \le i \le n\), compute the distance (squared) between \(\mathbf {C}\) and \(\mathbf {D}_i\). Denote this as \(\widetilde{C}^0_i =\text{ Dist }^2(\mathbf {C}, \mathbf {D}_i)\).

  C.

    Average Squared Distance Compute \(\bar{C} :=\frac{\sum _{i=1}^n \widetilde{C}^0_i}{n}\), the average (squared) distance.

  D.

    Pick First Cluster Center Pick \({\varvec{\mu }}_1 = \mathbf {D}_i\) with probability:

    $$\begin{aligned} \text{ Pr }[{\varvec{\mu }}_1 = \mathbf {D}_i] =\frac{\bar{C} + \widetilde{C}^0_i}{2n \bar{C}}. \end{aligned}$$
    (13)
  E.

    Iterate to Pick the Remaining Cluster Centers Pick \({\varvec{\mu }}_2, \dots , {\varvec{\mu }}_k\) as follows: Suppose \({\varvec{\mu }}_1, \dots , {\varvec{\mu }}_{j-1}\) have already been chosen (initially \(j=2\)); then we pick \({\varvec{\mu }}_j\) by:

    1.

      For each \(1 \le i \le n\), calculate \(\widetilde{C}^{j-1}_i = \text{ Dist }^2(\mathbf {D}_i, {\varvec{\mu }}_{j-1})\).

    2.

      For each \(1 \le i \le n\), let \(\widetilde{C}_i\) denote the minimum of \(\{ \widetilde{C}^{l}_i \}_{l=0}^{j-1}\).

    3.

      Update \(\bar{C}\) to be the average of \(\widetilde{C}_i\) (over all \(1 \le i \le n\)).

    4.

      Set \({\varvec{\mu }}_j = \mathbf {D}_i\) with probability:

      $$\begin{aligned} \text{ Pr }[{\varvec{\mu }}_j = \mathbf {D}_i] = \frac{\widetilde{C}_i}{n \bar{C}}. \end{aligned}$$
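Step I above can be sketched in the clear (no sharing, no encryption). We use `random.choices`, which normalizes the weights, so dividing by the common denominators \(2n\bar{C}\) and \(n\bar{C}\) is implicit:

```python
import random

def initialize_centers(D, k, rng=random):
    """Plaintext sketch of Step I of [24]: seed k centers from data points D."""
    n, d = len(D), len(D[0])
    dist2 = lambda x, y: sum((a - b) ** 2 for a, b in zip(x, y))
    # A: center of gravity
    C = tuple(sum(p[t] for p in D) / n for t in range(d))
    # B: squared distances to the center of gravity
    Ct = [dist2(C, p) for p in D]
    # C: average squared distance (enters only through the normalization below)
    Cbar = sum(Ct) / n
    # D: Pr[mu_1 = D_i] = (Cbar + Ct_i) / (2 n Cbar); these weights sum to 2 n Cbar
    centers = [D[rng.choices(range(n), weights=[Cbar + c for c in Ct])[0]]]
    # E: Pr[mu_j = D_i] = Ctmin_i / (n Cbar), Ctmin_i = minimum squared distance so far
    Ctmin = Ct[:]
    for _ in range(1, k):
        Ctmin = [min(c, dist2(p, centers[-1])) for c, p in zip(Ctmin, D)]
        centers.append(D[rng.choices(range(n), weights=Ctmin)[0]])
    return centers

pts = [(0.0, 0.0), (0.0, 1.0), (10.0, 0.0), (10.0, 1.0)]
assert all(c in pts for c in initialize_centers(pts, 2))
```

Note that in Step E the updated \(\bar{C}\) only rescales all the weights, so the sketch can sample directly with weights \(\widetilde{C}_i\).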

Step II: Lloyd Step Repeat the following until \({\varvec{\nu }}_1, \dots , {\varvec{\nu }}_k\) is “sufficiently close” to \({\varvec{\mu }}_1, \dots , {\varvec{\mu }}_k\):

  A.

    Finding the Closest Cluster Centers For each data point \(\mathbf {D}_i \in {\mathcal {D}}\), find the closest cluster center \({\varvec{\mu }}_{j} \in \{{\varvec{\mu }}_1, \dots , {\varvec{\mu }}_k \}\), and assign data point \(\mathbf {D}_i\) to cluster j.

  B.

    Calculating the New Cluster Centers For each cluster j, calculate the new cluster center \({\varvec{\nu }}_j\) by finding the average position of all data points in cluster j. Share these new centers between Alice and Bob as \({\varvec{\nu }}^A_1, \dots , {\varvec{\nu }}^A_k\) and \({\varvec{\nu }}^B_1, \dots , {\varvec{\nu }}^B_k\), respectively.

  C.

    Checking the Stopping Criterion Compare the old cluster centers to the new ones. If they are “close enough”, then the algorithm returns the final cluster centers to Alice and Bob. Otherwise, Step II is repeated after Reassigning New Cluster Centers.

  D.

    Reassigning New Cluster Centers To reassign new cluster centers, set:

    $$\begin{aligned}&{\varvec{\mu }}^A_1, \dots , {\varvec{\mu }}^A_k = {\varvec{\nu }}^A_1, \dots , {\varvec{\nu }}^A_k, \quad \text{ and } \\&{\varvec{\mu }}^B_1, \dots , {\varvec{\mu }}^B_k = {\varvec{\nu }}^B_1, \dots , {\varvec{\nu }}^B_k. \end{aligned}$$
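Step II in the clear can be sketched as follows (a plaintext reference; the threshold `eps` plays the role of the “sufficiently close” criterion):

```python
def lloyd(D, centers, eps=1e-9, max_iter=100):
    """Plaintext sketch of Step II (the Lloyd Step): iterate assign/average
    until no center moves by eps (in squared distance) or max_iter is hit."""
    dist2 = lambda x, y: sum((a - b) ** 2 for a, b in zip(x, y))
    d = len(D[0])
    for _ in range(max_iter):
        # A: assign each point to its closest cluster center
        clusters = [[] for _ in centers]
        for p in D:
            j = min(range(len(centers)), key=lambda j: dist2(p, centers[j]))
            clusters[j].append(p)
        # B: new centers = centroid of each cluster (keep old center if cluster is empty)
        new = [tuple(sum(p[t] for p in cl) / len(cl) for t in range(d)) if cl else c
               for cl, c in zip(clusters, centers)]
        # C: stopping criterion
        if all(dist2(c, nc) < eps for c, nc in zip(centers, new)):
            return new
        centers = new                      # D: reassign and repeat
    return centers

pts = [(0.0, 0.0), (0.0, 1.0), (10.0, 0.0), (10.0, 1.0)]
assert lloyd(pts, [(0.0, 0.0), (10.0, 0.0)]) == [(0.0, 0.5), (10.0, 0.5)]
```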

5.3 Our Two-Party k-Means Clustering Protocol

We now extend the k-means algorithm of [24] to the two-party setting. Section 5.3.1 discusses how to implement Step I of the above algorithm (the Initialization), and Sect. 5.3.2 discusses how to implement Step II of the algorithm (the Lloyd Step). In “Appendix A” we discuss alternative approaches to bounding the number of iterations allowed in the Lloyd Step, and why this question has privacy implications.

5.3.1 Step I: Initialization

We now describe how to extend Step I of the above algorithm to the two-party setting. In particular, we need to explain how to perform the computations from Step I in a secure way. As output, Alice should have shares of the cluster centers \({\varvec{\mu }}^A_1, \dots ,{\varvec{\mu }}^A_k\), and Bob should have \({\varvec{\mu }}^B_1, \dots ,{\varvec{\mu }}^B_k\), such that \({\varvec{\mu }}^A_i + {\varvec{\mu }}^B_i ={\varvec{\mu }}_i\), for each \(1 \le i \le k\). Below we follow Step I of the algorithm from Sect. 5.2 and describe how to privately implement each step. At the outset of the protocol, we have Alice encrypt her data points once and for all, and send them to Bob. This entails a one-time communication cost of \(O(nd\lambda )\), and without explicit mention we assume that all other subprotocols that require Bob to perform computations on Alice’s encrypted data points do not repeat this communication transfer.

  A.

    Center of Gravity To implement Step A, Alice and Bob will need to compute and share:

    $$\begin{aligned} \mathbf {C} = \frac{1}{n}\sum _{i=1}^n \mathbf {D}_i = \frac{1}{n} \left( \sum _{i=1}^n \mathbf {D}^A_i + \sum _{i=1}^n \mathbf {D}^B_i \right) \end{aligned}$$
    (12)

    Since Bob has Alice’s encrypted data, Bob can locally compute (encryptions) of the above sums, and return a (randomized) share of the sum to Alice. To compute the final shares of \(\mathbf {C}\), they need to divide by n, where division is according to the division algorithm in \({\mathbb {Z}}_N\) (as per Sect. 3). One way to do this is to run a Private Division Protocol (e.g. the protocol presented in Sect. 3.1), but since the divisor n is publicly known, it may be cheaper (in terms of communication complexity) to just have each party perform the division locally, with a few calls to FM2NP, as in (3) and (4).

  B.

    Distance to Center of Gravity Alice and Bob can run a secure Distance Protocol (see, e.g. [18]) on the encrypted data such that Bob obtains as output an encryption of \(\widetilde{C}^0_i\), where \(\widetilde{C}^0_i\) is the distance (squared) between \(\mathbf {C}\) and \(\mathbf {D}_i\). He randomizes this encryption and returns it to Alice, so that they share \(\widetilde{C}^0_i = \widetilde{C}^{A,0}_i + \widetilde{C}^{B,0}_i\) for each i.

  C.

    Average Squared Distance Define the following sums:

    $$\begin{aligned} P := \sum _{i=1}^n \widetilde{C}^{A,0}_i \quad \text{ and } \quad P' := \sum _{i=1}^n \widetilde{C}^{B,0}_i, \end{aligned}$$

    which Alice and Bob can compute locally. They then compute the division of (\(P + P'\)) by n, which can be done via a Private Division Protocol (e.g. the protocol presented in Sect. 3.1) or, since the divisor n is public, they can compute this locally (with a few calls to FM2NP) using (3) and (4). As output, Alice and Bob will be sharing \(\bar{C}\) as desired.

  D.

    Pick First Cluster Center Notice that picking a data point \(\mathbf {D}_i\) with probability \(\frac{\bar{C} + \widetilde{C}^0_i}{2n \bar{C}}\) is equivalent to picking a random number \(R \in [0..2n \bar{C}-1]\) and finding the first i such that \(R \le \sum _{j=1}^i (\bar{C} + \widetilde{C}^0_j)\). We use this observation to pick data points according to weighted probabilities as follows:

    1.

      Picking a Random R In this step, Alice and Bob pick a random number in \([0..2n \bar{C}-1]\): they run the Random Value Protocol (RVP) with \(Q := 2n \bar{C} = 2n \bar{C}^{A} + 2n \bar{C}^{B}\) to generate and share a random number \(R = R^A + R^B \in {\mathbb {Z}}_{2n \bar{C}}\).

    2.

      Alice and Bob will next compare their random number R with the sums \(\sum _{j=1}^i (\bar{C} + \widetilde{C}^0_j)\), and find the first i such that \(R \le \sum _{j=1}^i (\bar{C} + \widetilde{C}^0_j)\). They will then set \({\varvec{\mu }}_1 = \mathbf {D}_i\). The actual implementation of this can be found in the Choose \({\varvec{\mu }}_1\) Protocol in “Appendix C”.

  E.

    Iterate to Pick the Remaining Cluster Centers

    1.

      This step is done analogously to Step I.B.

    2.

      This step outputs the minimum of \(\{ \widetilde{C}^l_i \}_{l=0}^{j-1}\). However, the parties do not have to take the minimum over all j numbers: from the previous iteration of this step, they already have \(\widetilde{C}_i = \text{ min } \{\widetilde{C}^l_i \}_{l=0}^{j-2}\). Thus, they need only compute the minimum of two numbers, that is, reset \(\widetilde{C}_i\) to be:

      $$\begin{aligned} \widetilde{C}_i = \min \{ \widetilde{C}_i, \widetilde{C}^{j-1}_i \} . \end{aligned}$$

      Therefore, Alice and Bob run the FM2NP on inputs \((\widetilde{C}^A_i, \widetilde{C}^{A,j-1}_i)\) and \((\widetilde{C}^B_i, \widetilde{C}^{B,j-1}_i)\) so that they share the location of (the new) \(\widetilde{C}_i\) (let \(L = L^A + L^B\) denote this location). They can then share the new \(\widetilde{C}_i = \min \{ \widetilde{C}_i, \widetilde{C}^{j-1}_i \}\) by running the SPP on inputs \(\mathbf {x} = (\widetilde{C}^A_i, \widetilde{C}^{A,j-1}_i, L^A)\) and \(\mathbf {y} =(\widetilde{C}^B_i, \widetilde{C}^{B,j-1}_i, L^B)\) and function \(f(\mathbf {x}, \mathbf {y}) = L\widetilde{C}^{j-1}_i + (1-L)\widetilde{C}_i\).

    3.

      This step is done analogously to Step I.C.

    4.

      This step is done analogously to Step I.D.
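The weighted selection underlying Steps D and E — draw one uniform R and scan prefix sums — looks as follows in the clear (integer weights; in Step D the weight of index i is \(\bar{C} + \widetilde{C}^0_i\), so the weights sum to \(2n\bar{C}\)):

```python
import random

def weighted_pick(weights, rng=random):
    """Pick index i with probability weights[i] / sum(weights):
    draw R uniform in [0 .. sum(weights) - 1] (the Random Value Protocol's job,
    done insecurely here) and return the first i with R < w_0 + ... + w_i."""
    R = rng.randrange(sum(weights))
    acc = 0
    for i, w in enumerate(weights):
        acc += w
        if R < acc:
            return i

# Exhausting every value of R shows the pick counts match the weights exactly:
class FixedR:
    def __init__(self, r): self.r = r
    def randrange(self, n): return self.r

assert [weighted_pick([1, 2, 3], FixedR(R)) for R in range(6)] == [0, 1, 1, 2, 2, 2]
```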

5.3.2 Step II: Lloyd Step

In this section, we discuss how to implement the Lloyd Step while maintaining privacy protection.

  A.

    Finding the Closest Cluster Centers Alice and Bob repeat the following for each \(\mathbf {D}_i \in {\mathcal {D}}\):

    1.

      Find the Distance (squared) to Each Cluster Center Note that because finding the minimum of all distances is equivalent to finding the minimum of the distances squared, and because the latter is easier to compute (no square root), we will calculate the latter. Since Bob has (encryptions of) Alice’s shares of the data points and the cluster centers, Bob can run a secure Distance Protocol (see, e.g. [18]) to obtain for each cluster center j the (encrypted) distance \(X_{i,j}\) of data point \(\mathbf {D}_i\) to cluster center j. As usual, Bob randomizes each distance and returns them to Alice, so that for each j, Alice and Bob share the vector \(\mathbf {X}_i = (X_{i,1}, \dots X_{i,k})\).

    2.

      Alice and Bob run the Find Minimum of k Numbers Protocol (FMkNP) on \(\mathbf {X}^A_i\) and \(\mathbf {X}^B_i\) to obtain a share of (a vector representation of) the location of the closest cluster center to \(\mathbf {D}_i\):

      $$\begin{aligned} \mathbf {C}_i := (0, \dots , 0, 1, 0, \dots , 0) \in {\mathbb {Z}}_2^k, \end{aligned}$$
      (14)

      where the 1 appears in the jth coordinate if cluster center \({\varvec{\mu }}_j\) is closest to \(\mathbf {D}_i\). Note that in actuality, \(\mathbf {C}_i\) is shared between Alice and Bob:

      $$\begin{aligned} \mathbf {C}_i = \mathbf {C}^A_i + \mathbf {C}^B_i, \end{aligned}$$

      and Alice encrypts her share and sends this to Bob.

  B.

    Calculating the New Cluster Centers The following will be done for each cluster \(1 \le j \le k\). We break the calculation into three steps: In Step 1, Bob will compute the sum of data points in cluster j, in Step 2 he will compute the total number of points in cluster j, and in Step 3 the result of Step 1 will be divided by the result of Step 2. To simplify the notation, by \(E(\mathbf {C}_i)\) we will mean \((E(\mathbf {C}_{i,1}), \dots , E(\mathbf {C}_{i,k}))\).

    1.

      Sum of Data Points in Cluster j In this step, Bob will compute the sum of all data points in cluster j. We denote this sum as:

      $$\begin{aligned} \mathbf {S}_j = \sum _{i=1}^{n} C_{i,j} \cdot \mathbf {D}_i \in {\mathbb {Z}}^d_N \qquad \text{ where } \ C_{i,j} = j\text{ th } \text{ coordinate } \text{ of } \mathbf {C}_{i} = {\left\{ \begin{array}{ll} 1 &{} \text{ if } \mathbf {D}_i \in \text{ cluster } j \\ 0 &{} \text{ otherwise } \end{array}\right. } \end{aligned}$$

      At the end of this step, Alice and Bob will share \(\mathbf {S}_j =\mathbf {S}^A_j + \mathbf {S}^B_j\) (here the addition is in \({\mathbb {Z}}^d_N\)). Recall from Step A above that for each data point \(\mathbf {D}_i\), Bob has \(E(\mathbf {C}^A_i)\) and \(\mathbf {C}^B_i\), where:

      $$\begin{aligned} \mathbf {C}^A_i + \mathbf {C}^B_i = \mathbf {C}_i = (0, \dots , 0, 1, 0, \dots , 0). \end{aligned}$$

      Utilizing the homomorphic and single multiplication properties of E, Bob can compute (an encryption) of \(\mathbf {S}_j\), returning a randomized share to Alice so that they share \(\mathbf {S}_j\) as desired.

    2.

      Number of Data Points in Cluster j Now Alice and Bob wish to share the total number of points in cluster j, denoted by \(T_j\). Notice that:

      $$\begin{aligned} T_j = \sum _{i=1}^n C_{i,j}, \end{aligned}$$

      i.e. \(T_j\) can be found by summing the jth coordinate of \(\mathbf {C}_i\) over all i. Bob can compute (an encryption of) \(T_j\) using his own shares of \(\mathbf {C}_i\) and Alice’s encrypted shares; after he randomizes this computation and returns a share to Alice, Alice and Bob share \(T_j = T^A_j + T^B_j\).

    3.

      Centroid of Data Points in Cluster j In this step Alice and Bob must divide \(\mathbf {S}^A_j + \mathbf {S}^B_j\) (from Step 1) by the total number of data points \(T_j\) in cluster j to obtain the new cluster center \({\varvec{\nu }}_j\):

      $$\begin{aligned} {\varvec{\nu }}_j = \frac{\mathbf {S}^A_j + \mathbf {S}^B_j}{T^A_j + T^B_j} \end{aligned}$$
      (15)

      Alice and Bob run a Private Division Protocol (e.g. the protocol presented in Sect. 3.1) d times (once for each of the d dimensions) on inputs \(P = \mathbf {S}^A_{j,l} +\mathbf {S}^B_{j,l}\) (the lth coordinate of \(\mathbf {S}_j\)) and divisor \(D = T^A_j + T^B_j\) (notice that necessarily \(D \in [1..n]\)).

  C.

    Checking the Stopping Criterion Alice and Bob run a secure Distance Protocol (see, e.g. [18]) k times; on the lth run it outputs shares of \(\Vert {\varvec{\mu }}_l - {\varvec{\nu }}_l \Vert ^2\). They can then add their shares together and run the FM2NP to compare these sums with \(\epsilon \), some agreed upon predetermined value. They can then open their outputs from the FM2NP to determine if the stopping criterion has been met.

  D.

    Reassigning New Cluster Centers The final step of our algorithm, replacing the old cluster centers with the new ones, is easily accomplished:

    $$\begin{aligned}&\text{ Alice } \text{ sets: } ({\varvec{\mu }}^A_1, \dots , {\varvec{\mu }}^A_k) = ({\varvec{\nu }}^A_1, \dots , {\varvec{\nu }}^A_k), \text{ and } \\&\text{ Bob } \text{ sets: } ({\varvec{\mu }}^B_1, \dots , {\varvec{\mu }}^B_k) = ({\varvec{\nu }}^B_1, \dots , {\varvec{\nu }}^B_k). \end{aligned}$$
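The homomorphic aggregation in Step B can be illustrated with a toy Paillier instance. The parameters below are tiny and completely insecure, and we simplify by letting Bob hold the indicator bits \(C_{i,j}\) in the clear; the actual protocol keeps them shared and relies on the scheme’s single-multiplication property:

```python
import math, random

# Toy Paillier keypair (illustration only; real parameters are ~2048-bit)
p, q = 17, 19
n, n2 = p * q, (p * q) ** 2
lam = math.lcm(p - 1, q - 1)
mu = pow(lam, -1, n)                       # valid since we take g = n + 1

def enc(m, rng=random):
    r = rng.randrange(1, n)
    while math.gcd(r, n) != 1:
        r = rng.randrange(1, n)
    return pow(n + 1, m, n2) * pow(r, n, n2) % n2

def dec(c):
    return (pow(c, lam, n2) - 1) // n * mu % n

# Bob holds Alice's encrypted attribute values and the cluster-j indicator bits;
# E(sum_i C_ij * D_i) = prod_i E(D_i)^{C_ij}  (ciphertext product = plaintext sum)
D_vals = [3, 5, 7, 11]                     # one attribute of the n data points (toy values)
C_j    = [1, 0, 1, 1]                      # C_ij: is D_i in cluster j?
agg = 1
for ct, bit in zip([enc(v) for v in D_vals], C_j):
    agg = agg * pow(ct, bit, n2) % n2
assert dec(agg) == 3 + 7 + 11              # S_j for this attribute (mod n)
```

Multiplying ciphertexts adds plaintexts, and exponentiating by a known scalar multiplies the plaintext, which is exactly the structure Bob exploits when summing Alice’s encrypted shares per cluster.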

5.4 Communication Analysis

Table 2 gives a succinct summary of the communication complexity of the two-party k-means clustering protocol presented in Sect. 5.3.

Thus, assuming \(n > \lambda d\), the overall communication complexity of the 2-Party k-means clustering protocol of Sect. 5.3 is:

$$\begin{aligned} O(\lambda nd) + O(\lambda ^2kmn). \end{aligned}$$
(16)

Recall that k is the number of clusters, \(\lambda \) is the security parameter, n is the number of data points, d is the number of attributes of each data point, and m is the number of iterations in the Lloyd Step. The communication cost of our protocol matches the communication complexity of [18] while simultaneously enjoying the extra guarantee of security against an honest-but-curious adversary.

As mentioned in the Introduction, k-means clustering can also be performed securely by applying generic tools from multi-party computation, e.g. via Yao’s garbled circuit (see [31]). Let \(\xi _{(k)}\) denote the communication cost of a non-secure protocol that finds the minimum of k numbers (each of at most \(\lambda \) bits) that are shared between two parties. Notice that a circuit representation of the single database k-means clustering protocol of [24] has size at least:

$$\begin{aligned} O(kmnd) + O(mn \xi _{(k)}) \end{aligned}$$
(17)

The first term is necessary, e.g. to add together all the data points in each cluster during each iteration of the Lloyd Step, and the second term is necessary to, e.g. find the minimum of k numbers for each data point (when deciding which cluster the data point belongs to). Notice that any implementation of a protocol that finds the minimum of k (\(\lambda \)-bit) numbers will cost at least \(O(\lambda k)\). Using these observations and the fact that applying Yao’s garbled circuit techniques to a circuit of size O(|C|) has communication complexity \(O(\lambda |C|)\), we have that the communication complexity of a generic solution is at least:

$$\begin{aligned} O(\lambda kmnd) + O(\lambda ^2 kmn). \end{aligned}$$
(18)

Notice that the second term of our protocol’s communication complexity in (16) matches that of the generic solution in (18), while our first term enjoys asymptotic advantage of a factor of mk over the first term of (18). Furthermore, if d is sufficiently large so that \(d \ge \lambda \), then the first term of Eq. (18) dominates, in which case our protocol has overall asymptotic advantage over a generic solution by a factor of \(\min (mk, d/\lambda )\).

Table 2 Communication complexity of the 2-party k-means clustering protocol of Sect. 5.3

Notice that our protocol consists entirely of the subprotocols listed in Sect. 2.2 together with secure Distance, Division, and Random Value protocols, and utilization of a homomorphic encryption scheme (e.g. Paillier). With the exception of the places in our k-means clustering protocol that rely on the homomorphic encryption scheme, all of the subprotocols could be invoked by applying Yao’s garbled circuit technique to the relevant circuit that represents the subprotocol’s functionality. Thus, in comparing our solution to a generic solution that applies Yao to the circuit that represents the overall (insecure) k-means clustering protocol, it will be useful to separate out the communication costs of transferring ciphertexts (of the homomorphic encryption scheme) versus the rest of the communication cost. As can be seen from Table 2, there are \((nd + n + kmn + 2kmd)\) ciphertexts exchanged in our k-means clustering protocol. Thus, with the assumption that \(d \ge \lambda \) so that the first terms of (16) and (18) dominate, our solution will outperform a generic solution so long as:

$$\begin{aligned} \text{ Communication }(nd + n + kmn + 2kmd\ \text{ ciphertexts }) < \text{ Communication }(\text{ Yao }(kmnd)) \end{aligned}$$
(19)

Letting \(l = \min (km, d)\), (19) reduces to comparing the communication cost of sending l ciphertexts versus performing Yao on a circuit of size \(l^2\). Asymptotically, since both Yao and homomorphic encryption (e.g. Paillier) add a factor of \(O(\lambda )\) to communication, our protocol enjoys a factor of \(l = \min (km, d)\) (asymptotic) communication advantage over Yao. An advantage will also be observed in practice (accounting for the fact that the constants ignored in the \(O(\lambda )\) asymptotic costs of Yao may be smaller than those for the homomorphic encryption) so long as the extra cost of the encryption scheme is less than l times the extra cost of employing Yao.

As a final point, we note that there are \(O(\lambda \log ^c \lambda )\)-sized circuits that can perform integer reciprocation (see [26]). Assuming these methods can be translated to perform division as defined in Sect. 3, we could apply Yao’s garbled circuit techniques locally (i.e. not for the entire k-means protocol, but only for division), in which case the second term in (16) dominates the cost of division as long as \(n \ge d \log ^c \lambda \) (instead of \(n \ge d\lambda \)).

6 Conclusion and Future Work

As mentioned in Sect. 2.3, the proof of security of the two-party k-means clustering protocol presented above follows from the fact that each of the subprotocols are secure. The only exception to this is in Step C of the Lloyd Step, where Alice and Bob must decide if their protocol has reached the termination condition. Although Alice and Bob remain oblivious to any actual values at this stage, they will gain the information of exactly how many iterations were required in the Lloyd Step. There are various ways of defining the model to handle this potential information leak and thus maintain perfect privacy protection (see “Appendix A”).

The focus of this paper was on performing k-means clustering when the underlying data is divided among two parties. An interesting direction for further research is the extension to generic multiparty computation for \(n > 2\) parties. There are a number of techniques that can be used to extend secure two-party protocols to the \(n>2\) setting; see, e.g. discussion and references in [8], which describes a multiparty coin-flipping protocol.

Another extension that would be interesting to consider is that of preserving privacy in a malicious adversary model. One approach would be to augment the 2-party protocols presented in this paper with standard techniques (e.g. [15, 17]) to boost security from the honest-but-curious to the malicious adversary setting.

Finally, the focus of the presented k-means protocol was to minimize the (asymptotic) communication cost. An interesting open problem is to consider other costs (e.g. round complexity, computation, etc.), as well as optimizing actual (not just asymptotic) communication. One aspect of this follow-up work would likely involve finding suitable instantiations of the subprotocols listed in Sect. 2.2.