1 Introduction

Linear regression is an important statistical tool that models the relationship between some explanatory values (features) and an outcome value using a linear function. Despite its simple definition, a linear regression model is very useful. Indeed, it can be used to quantitatively relate the features and the outcome (e.g., to identify which features influence the outcome most directly) and for future prediction (e.g., if a new vector of features with no known outcome is given, the model can be used to make a prediction about it). Ridge regression is one of the most widely-used forms of regression; see the survey in [21]. It lessens the over-fitting of ordinary least squares regression without adding computational cost. In practice, this is achieved by giving preference to models with small Euclidean norm. To enhance the efficacy of the learned model, prior experience in model training suggests using training data from a large and diverse set. Indeed, it is known that having more data (more relevant features and/or more data points) typically improves the ability to learn a reliable model. A simple way to obtain such a training dataset is to merge the data contained in “data silos” collected by different entities. However, in many applications (e.g., personalized medicine [28]) the data points encode sensitive information and are collected by possibly mutually distrustful entities. Often, these entities will not (or cannot) share the private data contained in their silos, making collaborative analysis on joint data impossible.

Consider the following example: We would like to use a given linear regression method in order to predict the weight of a baby at birth on the basis of some ultrasound measurements made during the last month of pregnancy (e.g., head circumference, femur length,...). On one hand, in order to avoid computing a biased model, we would like to run the selected learning algorithm on data points collected in different hospitals in various locations. On the other hand, each hospital legally cannot share (in the clear) patients’ sensitive data (the measurements) with other hospitals or with a third party (e.g., a cloud-computing server). This real-life case exemplifies the challenge on which we focus: training a linear regression model on joint data that must be kept confidential and/or are owned by multiple parties. Moreover, we want to run such collaborative analysis without exposing an entity’s sensitive data to any other party in the system (i.e., no entity in the system is trusted to handle the data in the clear).

Our paper takes up the above challenge and proposes an efficient solution in the two-server model [16], commonly used by previous works on privacy-preserving machine learning (e.g., see [12, 22, 23]), where no party needs to be trusted to handle the data in the clear. In this setting, the computation of the model from the merged data is outsourced to two non-colluding (but not necessarily trusted) third parties. After a first phase of collecting private data in encrypted form from possibly many data-owners, the two third parties then engage in a second phase for the computation of the model itself. The system is designed in such a way that no extra information (besides what is released by the model itself) is revealed to these two parties if they do not collude (a condition that can, for example, be enforced by law). Our solution is based only on a simple cryptographic primitive that can be implemented via efficient constructions. Indeed, our system is designed using just a linearly-homomorphic encryption (LHE) scheme, that is, an encryption scheme that enables computing the sum of encrypted messages. Previous solutions to the problem considered here are based on multi-party computation protocols (e.g., secret-sharing based protocols like BGW [6] or the 2-party protocol by Yao [29]) or on somewhat-homomorphic encryption (i.e., encryption schemes that support a limited number of arithmetic operations on encrypted messages). A hybrid approach that uses both homomorphic encryption and Yao’s scheme was presented in [23]. In this work, we present the first approach to privacy-preserving ridge regression that uses only linearly-homomorphic encryption. We believe that this result is interesting both from the theoretical and the practical points of view. Indeed, our system can be seen as a new black-box application of LHE and shows that this basic crypto-primitive can be used alone to handle an involved task such as ridge regression over distributed data.
Furthermore, our system achieves practical performance when implemented using a standard encryption scheme like Paillier’s cipher [24]. We show this via an evaluation of our system that uses synthetically-generated and real-world data. Overall, our experiments demonstrate that, for many real scenarios, LHE is all you need to privately yet efficiently train a ridge regression model on distributed data. As an illustrative example, consider the following existing medical scenario: the Warfarin dosing model. Warfarin is a popular anticoagulant for which the International Warfarin Pharmacogenetics Consortium proposed an accurate dosing model trained using linear regression on a medical database obtained by merging the data silos collected by 21 research groups. Using a commodity machine, our system can compute the same model in less than 3 min with the guarantee of confidentiality for the data silos of each research group involved.

Related Work. The question of privacy-preserving machine learning was introduced in 2000 by two pioneering works [1, 20]. Later on, privacy-preserving linear regression was considered in a number of different works (e.g., [2, 8, 10, 15, 17,18,19, 25]). In 2013, Nikolaenko et al. [23] introduced the scenario we consider in this paper: a privacy-preserving linear regression protocol in the two-server model. The solution in [23] considers ridge regression on a horizontally-partitioned dataset in which each party has some of the data points that form the training set (e.g., two or more hospitals, each of which collects the same medical data on different sets of patients). Their solution is based on LHE and Yao’s protocol. The latter is a two-party protocol that allows the evaluation of a circuit C on a pair of inputs (a, b) such that one party knows only a and the other party knows only b. At the end of the protocol, the value C(a, b) is revealed but no party learns extra information beyond what is revealed by this value. In [23], the ridge regression model is computed using Yao’s protocol to compute the solution of a linear system of the form \(A{\varvec{w}} = {\varvec{b}}\) where the entries of matrix A and vector \({\varvec{b}}\) are encrypted (and must be kept private). The solution \({\varvec{w}}^*\) is the model. The circuit C is the one that solves a linear system by computing the Cholesky decomposition of the coefficient matrix. Recently, in [12], the system presented in [23] was extended to vertically-partitioned datasets in which the features in the training dataset are distributed among different parties (e.g., two or more hospitals, each of which collects different medical data on the same set of patients). Gascón et al. [12] achieve this result using multiparty-computation techniques to allow the data-owners to distribute shares of the merged datasets to the two parties active in the second phase. Moreover, Gascón et al. 
also improve the running time of the second phase of the protocol presented in [23] by designing a new conjugate gradient descent algorithm that is used as the circuit C in place of the Cholesky decomposition. This approach was subsequently further improved by Mohassel and Zhang [22] using mini-batch stochastic gradient descent, and extended to logistic regression and neural networks on arbitrarily partitioned datasets.

Our Contribution. Our paper follows this line of work and presents a novel system for ridge regression in the two-server model. For the first phase, we extend the approach used by Nikolaenko et al. to datasets that are arbitrarily partitioned, using the techniques of labeled-homomorphic encryption [4] to support multiplications among pairs of ciphertexts encrypted via an LHE scheme. In this way we show that a solution based only on LHE can handle scenarios more complicated than the horizontally-partitioned case. For the second phase, we avoid Yao’s protocol by designing an ad-hoc two-party protocol that solves \(A{\varvec{w}} = {\varvec{b}}\) using only the linear homomorphic property of the underlying encryption scheme. This allows us to boost the overall performance and, in particular, to considerably reduce the communication overhead. As a highlight, if we horizontally partition (into ten equal-sized parts) a dataset of 10 million instances and 20 features, our privacy-preserving regression method runs in under 2 min and produces a communication overhead of 1.3 MB. The system presented in [23] needs more than 50 min and 270 MB of exchanged data to perform a similar computation. Finally, we notice that gradient-descent based solutions (e.g., [12, 22]) use iterative algorithms and face the problem of estimating the number of iterations t. Either t is fixed to a high value that ensures finding a good approximation of the model, which incurs a higher complexity for the protocol; or t is chosen adaptively based on the dataset, which can be infeasible in the privacy-preserving setting. Our solution for solving \(A{\varvec{w}} = {\varvec{b}}\) does not present this problem.

2 Background

Linear Regression. A linear regression learning algorithm is a procedure that on input n points \(\{({\varvec{x}}_1,y_1), \cdots , ({\varvec{x}}_n, y_n)\}\) (where \({\varvec{x}}_i\in \mathbb {R}^d\) and \(y_i\in \mathbb {R}\)) outputs a vector \({\varvec{w}}^*\in \mathbb {R}^d\) such that \({\varvec{w}}^{*\top }{\varvec{x}}_i \approx y_i\) for all \(i = 1, \cdots , n\). One common way to compute such a model \({\varvec{w}}^*\) is to use the squared-loss function and the associated empirical error function (mean squared error): \(f_{X,{\varvec{y}}}({\varvec{w}}) = \Vert X{\varvec{w}}-{\varvec{y}}\Vert _2^2\). Here \(X\in \mathbb {R}^{n\times d}\) is the matrix with the vector \({\varvec{x}}_i^\top \) as \(i^{\mathrm {th}}\) row and \({\varvec{y}}\in \mathbb {R}^n\) is the vector with the value \(y_i\) as \(i^{\mathrm {th}}\) component. We assume that X is always full-rank (i.e., \({\text {rk}}(X)=d\)). Specifically, \({\varvec{w}}^*\) is computed by minimizing a linear combination of the aforementioned error function and a regularization term, that is, \({\varvec{w}}^*\in \mathop {\mathrm{argmin}}\nolimits _{{\varvec{w}}\in \mathbb {R}^d} f_{X,{\varvec{y}}}({\varvec{w}})+\lambda R({\varvec{w}})\) where \(\lambda \ge 0\) is fixed. The regularization term is added to avoid over-fitting the training dataset and to bias toward simpler models. In practice, one of the most common regularization terms is the 2-norm (\(R({\varvec{w}})=\Vert {\varvec{w}}\Vert _2^2\)), which generates a model with overall smaller components. In this case (called ridge regression), the model \({\varvec{w}}^*\) is computed by minimizing the function \(F_{\mathrm {ridge}}({\varvec{w}}) = \Vert X{\varvec{w}}-{\varvec{y}}\Vert _2^2 + \lambda \Vert {\varvec{w}}\Vert _2^2\). Since \(\nabla F_{\mathrm {ridge}}({\varvec{w}}) = 2(X^\top X+\lambda I){\varvec{w}} - 2X^\top {\varvec{y}}\), we have that \({\varvec{w}}^*\) is computed by solving the linear system

$$\begin{aligned} A{\varvec{w}}={\varvec{b}} \end{aligned}$$
(1)

where \(A = X^\top X+\lambda I\) (a symmetric \(d\times d\) matrix) and \({\varvec{b}} = X^\top {\varvec{y}}\) (a vector of d components). Notice that since X is full-rank, A is positive definite and therefore \(\det (A)>0\) (in particular, A is invertible).
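As a concrete (non-private) baseline, the closed-form computation of the ridge model via the normal equations can be sketched in a few lines of NumPy; the synthetic data below are purely illustrative:

```python
import numpy as np

rng = np.random.default_rng(0)
n, d, lam = 100, 5, 0.1

X = rng.normal(size=(n, d))              # feature matrix (full rank w.h.p.)
w_true = rng.normal(size=d)
y = X @ w_true + 0.01 * rng.normal(size=n)

# Normal equations of ridge regression: A w = b
A = X.T @ X + lam * np.eye(d)            # symmetric, positive definite
b = X.T @ y
w_star = np.linalg.solve(A, b)           # the model w*

assert np.allclose(A @ w_star, b)        # w* solves the linear system (1)
```

In the privacy-preserving system described later, the same A and \({\varvec{b}}\) are assembled under encryption and the solve step is replaced by an interactive protocol.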

Cryptographic Tools. To design our privacy-preserving system, we utilize homomorphic encryption. Let \((\mathcal {M},+)\) be a finite group. A linearly-homomorphic encryption (LHE) scheme for messages in \(\mathcal {M}\) is defined by three algorithms:

  1. The key-generation algorithm \({{\mathsf {Gen}}}\) takes as input the security parameter \(\kappa \) and outputs a matching pair of secret and public keys, \(( sk , pk )\leftarrow {{\mathsf {Gen}}}(\kappa )\).

  2. The encryption algorithm \({\mathsf {Enc}}\) is a randomized algorithm that uses the public key \( pk \) to transform a message m from \(\mathcal {M}\) (plaintext space) into a ciphertext, \(c \leftarrow {\mathsf {Enc}}_ pk (m)\).

  3. The decryption algorithm \({\mathsf {Dec}}\) is a deterministic function that uses the secret key \( sk \) to recover the original plaintext from a ciphertext c.

The standard security property (semantic security) says that it is infeasible for any computationally bounded algorithm to gain extra information about a plaintext when given only its ciphertext and the public key \( pk \). Moreover, we have the homomorphic property: Let \(\mathcal {C}\) be the set of all possible ciphertexts; then there exists an operation \(\odot \) on \(\mathcal {C}\) such that for any a-tuple of ciphertexts \(c_1 \leftarrow {\mathsf {Enc}}_ pk (m_1), \cdots , c_a \leftarrow {\mathsf {Enc}}_ pk (m_a)\) (with a a positive integer), it holds that \(\Pr [{\mathsf {Dec}}_ sk (c_1\odot \cdots \odot c_a) = m_1 + \cdots + m_a] = 1\). This implies that, if \(c = {\mathsf {Enc}}_ pk (m)\), then \({\mathsf {Dec}}_ sk ({\mathsf {cMult}}(a,c))=am\), where \({\mathsf {cMult}}(a,c)=c\odot \cdots \odot c\) (a times).
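As a concrete instance of such a scheme, the two operations \(\odot \) and \({\mathsf {cMult}}\) can be illustrated with a toy sketch of Paillier’s cipher [24], where \(\odot \) is multiplication modulo \(N^2\) and \({\mathsf {cMult}}\) is exponentiation. The primes below are far too small to be secure and serve only to make the homomorphic property visible:

```python
import random
from math import gcd

# Toy Paillier cipher (tiny primes, illustration only -- NOT secure).
p, q = 1009, 1013
N = p * q
N2 = N * N
g = N + 1
lam = (p - 1) * (q - 1) // gcd(p - 1, q - 1)     # Carmichael function of N
mu = pow((pow(g, lam, N2) - 1) // N, -1, N)

def enc(m):
    r = random.randrange(2, N)
    while gcd(r, N) != 1:                        # r must be a unit mod N
        r = random.randrange(2, N)
    return pow(g, m % N, N2) * pow(r, N, N2) % N2

def dec(c):
    return (pow(c, lam, N2) - 1) // N * mu % N

# The operation ⊙ is multiplication mod N^2; cMult is exponentiation.
c1, c2 = enc(20), enc(22)
assert dec(c1 * c2 % N2) == 42                   # Dec(c1 ⊙ c2) = m1 + m2
assert dec(pow(c1, 5, N2)) == 100                # Dec(cMult(5, c1)) = 5 * m1
```

A production deployment would of course use primes of cryptographic size and a vetted library rather than this sketch.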

In some cases, being able to perform only linear operations on encrypted messages is not sufficient. For example, when considering arbitrarily partitioned datasets, we will need to compute the encryption of the product of two messages given the encryptions of the individual messages. An LHE scheme cannot directly handle such an operation. On the other hand, a general solution to the problem of computing on encrypted data can be obtained via fully-homomorphic encryption [13]. Since full-fledged constructions of fully-homomorphic encryption are still inefficient, more efficient solutions have been designed for functionalities that evaluate low-degree polynomials over encrypted data (somewhat-homomorphic encryption). In a recent work, Barbosa et al. [4] introduced the concept of labeled-homomorphic encryption (labHE); this new primitive significantly accelerates homomorphic computation over encrypted data when the function being computed is known to the party that decrypts the result. Since in this paper we assume that the machine-learning algorithm and the data distribution among the participants are publicly known, the previous condition is satisfied and we can make use of labHE. In particular, Barbosa et al. show how to design a homomorphic encryption scheme that supports the evaluation of degree-two polynomials using only an LHE scheme and a pseudo-random function. The new scheme is public-key and works in the multi-user setting: if two or more users encrypt different messages, an encryption of the evaluation of a degree-two polynomial on these messages can be constructed by any party having access to the public key and the ciphertexts. Then the party holding the secret key can decrypt and reveal the result of the evaluation (the polynomial is public and the user-ciphertext correspondence is known). We briefly recall here their construction [4, Sect. 5] in the case where the polynomial is evaluated on messages encrypted by only two different users.

Let \(({{\mathsf {Gen}}}, {\mathsf {Enc}}, {\mathsf {Dec}})\) be an LHE scheme with security parameter \(\kappa \) and message space \(\mathcal {M}\). Assume that a multiplication operation is given in \(\mathcal {M}\); i.e., \((\mathcal {M}, +, \cdot )\) is a finite ring. Let also \(F:\{ 0, 1 \}^s\times \mathcal {L}\rightarrow \mathcal {M}\) be a pseudo-random function with seed space \(\{ 0, 1 \}^s\) (\(s={\text {poly}}(\kappa ))\) and label space \(\mathcal {L}\). Define:

  • \({\mathsf {labGen}}(\kappa )\): On input \(\kappa \), it runs \({{\mathsf {Gen}}}(\kappa )\) and outputs \(( sk , pk )\).

  • \({\mathsf {localGen}}( pk )\): For each user i and with the public key as input, it samples a random seed \(\sigma _i\) in \(\{0,1\}^s \) and computes \( pk _i={\mathsf {Enc}}_ pk \bigl (\underline{{\sigma _i}}\bigr )\) where \(\underline{{\sigma _i}}\) is an encoding of \(\sigma _i\) as an element of \(\mathcal {M}\). It outputs \((\sigma _i, pk _i)\).

  • \({\mathsf {labEnc}}_{ pk }(\sigma _i, m, \tau )\): On input a message \(m\in \mathcal {M}\) with label \(\tau \in \mathcal {L}\) from the user i, it computes \(b=F(\sigma _i, \tau )\) and outputs the labeled ciphertext \({\varvec{c}}=(a,c)\in \mathcal {M}\times \mathcal {C}\) with \(a=m-b\) in \(\mathcal {M}\) and \(c= {\mathsf {Enc}}_ pk (b)\).

  • \({\mathsf {labMult}}({\varvec{c}},{\varvec{c}}')\): On input two labeled ciphertexts, \({\varvec{c}}=(a,c)\) and \({\varvec{c}}'=(a',c')\), it computes a “multiplication” ciphertext \(d = {\mathsf {labMult}}({\varvec{c}},{\varvec{c}}')\) as \( d={\mathsf {Enc}}_ pk (a\cdot a')\odot {\mathsf {cMult}}(a,c')\odot {\mathsf {cMult}}(a',c)\). Observe that \({\mathsf {Dec}}_ sk (d)=m\cdot m'-b\cdot b'\). Moreover, notice that given two or more multiplication ciphertexts \(d_1,\dots , d_n\), they can be “added” using the operation of the underlying LHE scheme: \(d_1\odot \cdots \odot d_n\). Assume that user i and user j have both encrypted n messages, \(m_1,\cdots ,m_n\) and \(m'_1,\cdots ,m'_n\), respectively. Let \(\tilde{c}\in \mathcal {C}\) be the ciphertext obtained as

    $$ \bigodot _{t=1}^{n}{\mathsf {labMult}}\bigl ({\mathsf {labEnc}}_ pk (\sigma _i,m_t,\tau _t), {\mathsf {labEnc}}_ pk (\sigma _j,m'_t,\tau '_t)\bigr ). $$
  • \({\mathsf {labDec}}_ sk ( pk _i, pk _j, \tilde{c})\): On input \( \tilde{c}\), it first recovers \(\sigma _i\) and \(\sigma _j\) from \({\mathsf {Dec}}_ sk ( pk _i)\) and \({\mathsf {Dec}}_ sk ( pk _j)\). Next, it computes \(b_t=F(\sigma _i,\tau _t)\) and \(b'_t=F(\sigma _j,\tau '_t)\) for all \(t = 1, \cdots , n\). Finally, it computes \(\tilde{b}=\sum _{t=1}^{n}b_t\cdot b'_t\) and \(\tilde{m}={\mathsf {Dec}}_ sk (\tilde{c})+\tilde{b}\). It is easy to verify that \(\tilde{m} = \sum _{t=1}^{n}m_t\cdot m'_t\), since each multiplication ciphertext decrypts to \(m_t\cdot m'_t-b_t\cdot b'_t\).
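The construction above can be sketched end-to-end in Python. The toy Paillier instance and the hash-based stand-in for the PRF F are illustrative assumptions, not secure parameters; decryption adds back \(\sum _t b_t b'_t\), since each multiplication ciphertext decrypts to \(m_t m'_t - b_t b'_t\):

```python
import hashlib
import random
from math import gcd

# Underlying LHE: toy Paillier (tiny primes, illustration only -- NOT secure).
p, q = 1009, 1013
N, N2, g = p * q, (p * q) ** 2, p * q + 1
lam = (p - 1) * (q - 1) // gcd(p - 1, q - 1)
mu = pow((pow(g, lam, N2) - 1) // N, -1, N)

def enc(m):
    r = random.randrange(2, N)
    while gcd(r, N) != 1:
        r = random.randrange(2, N)
    return pow(g, m % N, N2) * pow(r, N, N2) % N2

def dec(c):
    return (pow(c, lam, N2) - 1) // N * mu % N

def cmult(a, c):                      # Dec(cmult(a, c)) = a * Dec(c)
    return pow(c, a % N, N2)

def F(seed, tau):                     # hash-based stand-in for the PRF F
    h = hashlib.sha256(f"{seed}|{tau}".encode()).digest()
    return int.from_bytes(h, "big") % N

def lab_enc(seed, m, tau):            # labeled ciphertext (a, c), a = m - b
    b = F(seed, tau)
    return ((m - b) % N, enc(b))

def lab_mult(ct, ct2):                # decrypts to m*m' - b*b'
    (a, c), (a2, c2) = ct, ct2
    return enc(a * a2 % N) * cmult(a, c2) % N2 * cmult(a2, c) % N2

# Users i and j encrypt two messages each; anyone holding pk can combine
# the labeled ciphertexts into an encryption of the sum of products.
si, sj = 123, 456                     # the users' PRF seeds
ms_i, ms_j = [3, 5], [7, 11]
c_tilde = 1
for t in range(2):
    c_tilde = c_tilde * lab_mult(lab_enc(si, ms_i[t], t),
                                 lab_enc(sj, ms_j[t], t)) % N2

# labDec: the key holder recomputes the b-terms from the seeds and labels.
b_tilde = sum(F(si, t) * F(sj, t) for t in range(2)) % N
m_tilde = (dec(c_tilde) + b_tilde) % N
assert m_tilde == 3 * 7 + 5 * 11      # = 76, the inner product of the inputs
```

This is exactly the pattern used in Phase 1 for arbitrarily-partitioned data, where the inner products are the entries of A and \({\varvec{b}}\).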

Data Representation. In order to use the cryptographic tools described in the previous section, we need to represent the real values that form the input datasets as elements in the finite set \(\mathcal {M}\) (the message space). Without loss of generality, we assume that \(\mathcal {M}=\mathbb {Z}_N\) for some big integer N and that the entries of X and \({\varvec{y}}\) are numbers from the real interval \([-\delta , \delta ]\) (with \(\delta >0\)) with at most \(\ell \) digits in their fractional part. In this case, the conversion from real values to elements in \(\mathcal {M}\) can be easily done by rescaling all the entries of X and \({\varvec{y}}\) and then mapping the resulting integers into \(\mathbb {Z}_N\) via reduction modulo N. For this reason, from now on we consider that the entries of X and \({\varvec{y}}\) are integers from 0 to \(N-1\). This implies that we consider the matrix A and the vector \({\varvec{b}}\) as having positive integer entries and, finally, that we assume that the model \({\varvec{w}}^*\) is a vector in \(\mathbb {Q}^d\). Notice that for the integer representation of A and \({\varvec{b}}\) it holds that \(\Vert A\Vert _\infty , \Vert {\varvec{b}}\Vert _\infty \le 10^{2\ell }(n\delta ^2+\lambda )\). Therefore, if \(10^{2\ell }(n\delta ^2+\lambda )\le \frac{N-1}{2}\), then A and \({\varvec{b}}\) are embedded in \(\mathbb {Z}_N\) without overflow for their entries. However, if the linear system (1) is now solved over \(\mathbb {Z}_N\), then clearly the entries of the solution are given as modular residues of \(\mathbb {Z}_N\) and may be different from the entries of the desired model \({\varvec{w}}^*\) in \(\mathbb {Q}^d\). In order to solve this problem and recover the model in \(\mathbb {Q}^d\) from the model computed over \(\mathbb {Z}_N\), we can apply the rational reconstruction technique component-wise. 
With rational reconstruction [11, 27] we mean the application of the Lagrange-Gauss algorithm to recover a rational \(t=r/s\) from its representation in \(\mathbb {Z}_N\) as \(t' = r \, s^{-1} \bmod N\), for N big enough (see (4) in Sect. 4).
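A minimal sketch of this reconstruction step (the truncated extended Euclidean algorithm), assuming the unknown rational r/s satisfies \(|r|, s \le \sqrt{N/2}\) and \(\gcd (s, N)=1\):

```python
from math import isqrt
from fractions import Fraction

def rational_reconstruction(t, N):
    """Recover (r, s) with r = s * t mod N from t' = r * s^{-1} mod N,
    by running the extended Euclidean algorithm on (N, t) and stopping
    at the first remainder below sqrt(N/2)."""
    bound = isqrt(N // 2)
    r0, s0 = N, 0
    r1, s1 = t % N, 1
    while r1 > bound:
        q = r0 // r1
        r0, r1 = r1, r0 - q * r1
        s0, s1 = s1, s0 - q * s1
    if s1 < 0:                       # normalize to a positive denominator
        r1, s1 = -r1, -s1
    return r1, s1

N = 2**61 - 1                        # a prime modulus
t = 3 * pow(7, -1, N) % N            # the rational 3/7 encoded in Z_N
num, den = rational_reconstruction(t, N)
assert Fraction(num, den) == Fraction(3, 7)
```

In Phase 2 this procedure is applied independently to each of the d components of the masked model computed over \(\mathbb {Z}_N\).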

3 Threat Model and System Overview

We consider the setting where the training dataset is not available in the clear to the entity that wants to train the ridge regression model. Instead, the latter can access encrypted copies of the data and, for this reason, needs the help of the party handling the cryptographic keys in order to learn the desired model. More precisely, protocols in this paper are designed for the following parties:

  • The Data-Owners: There are m data-owners \(\mathrm {DO}_1, \cdots ,\mathrm {DO}_m\); each data-owner \(\mathrm {DO}_i\) has a private dataset \(\mathcal {D}_i\) and is willing to share it only if encrypted.

  • The Machine-Learning Engine (\(\mathrm {MLE}\)): This is the party that wants to run a linear regression algorithm on the dataset \(\mathcal {D}\) obtained by merging the local datasets \(\mathcal {D}_1,\cdots , \mathcal {D}_m\), but has access only to the encrypted copies of them. For this reason, \(\mathrm {MLE}\) needs the help of the Crypto Service Provider.

  • The Crypto Service Provider (\(\mathrm {CSP}\)) takes care of initializing the encryption scheme used in the system and interacts with \(\mathrm {MLE}\) to help it in achieving its task (computing the linear regression model). \(\mathrm {CSP}\) manages the cryptographic keys and is the only entity capable of decrypting.

We assume that \(\mathrm {MLE}\) and \(\mathrm {CSP}\) do not collude and that all the parties involved are honest-but-curious. That is, they always follow the instructions of the protocol but try to learn extra information about the dataset from the messages received during the execution of the protocol (i.e., passive security). Moreover, we assume that for each pair of parties involved in the protocol there exists a private and authenticated peer-to-peer channel. In particular, communications between any two players cannot be eavesdropped.

The goal is to ensure that \(\mathrm {MLE}\) obtains the model while both \(\mathrm {MLE}\) and \(\mathrm {CSP}\) do not learn any other information about the private datasets \(\mathcal {D}_i\) beyond what is revealed by the model itself. Even in the case that one of the two servers (\(\mathrm {MLE}\) or \(\mathrm {CSP}\)) colludes with some of the data-owners, they should learn no extra information about the data held by the honest data-owners. In order to achieve this goal we design a system that can be seen as a multi-party protocol run by the \(m+2\) parties mentioned before and specified by a sequence of steps. This system (described in Sect. 4) has the following two-phase architecture:

Phase 1 (merging the local datasets):

\(\mathrm {CSP}\) generates the key pair \(( sk , pk )\), stores \( sk \) and makes \( pk \) public; each \(\mathrm {DO}_i\) sends to \(\mathrm {MLE}\) specific ciphertexts computed using \( pk \) and the values in \(\mathcal {D}_i\). \(\mathrm {MLE}\) uses the ciphertexts received and the homomorphic property of the underlying encryption scheme in order to obtain encryptions of A and \({\varvec{b}}\) (coefficient matrix and vector in (1)).

Phase 2 (computing the model):

\(\mathrm {MLE}\) uses the ciphertexts \({\mathsf {Enc}}_ pk (A)\) and \({\mathsf {Enc}}_ pk ({\varvec{b}})\) and private random values in order to obtain encryptions of new values that we call “masked data”; these encryptions are sent to the \(\mathrm {CSP}\); the latter decrypts and runs a given algorithm on the masked data. The output of this computation (“masked model”) is a vector \(\tilde{{\varvec{w}}}\) that is sent back from the \(\mathrm {CSP}\) to the \(\mathrm {MLE}\). The latter computes the output \({\varvec{w}}^*\) from \(\tilde{{\varvec{w}}}\).

  Informally, we say that the system is correct if the model computed by the \(\mathrm {MLE}\) is equal to the model computed by the learning algorithm in the clear using \(\mathcal {D}\) as training data. And we say that the system is private if the distribution of the masked data sent by the \(\mathrm {MLE}\) to the \(\mathrm {CSP}\) is independent of the distribution of the local inputs. Thus, no information about \(\mathcal {D}_1,\cdots ,\mathcal {D}_m\) is revealed by the messages exchanged during Phase 2.

As we will see in Sect. 4, the specific design of the protocol realizing Phase 1 depends on the distributed setting: horizontally- or arbitrarily-partitioned datasets. However, in both cases, the data-owners input encryptions of local values and the \(\mathrm {MLE}\) gets the encryptions of A and \({\varvec{b}}\). The \(\mathrm {CSP}\) simply takes care of initializing the cryptographic primitive and generates the relative keys. Phase 2 is realized by an interactive protocol between the \(\mathrm {MLE}\) and the \(\mathrm {CSP}\). \(\mathrm {CSP}\) takes on input the encryptions of A and \({\varvec{b}}\) from the \(\mathrm {MLE}\) and returns the solution of the system \(A{\varvec{w}} = {\varvec{b}}\) following this pattern (we refer to this as the “masking trick”):

  • The \(\mathrm {MLE}\) samples a random invertible matrix \(R \in \mathrm {GL}(d,\mathcal {M})\) and a random vector \({\varvec{r}} \in \mathcal {M}^d\) and it uses the linear homomorphic property of the underlying encryption scheme to compute \(C' = {\mathsf {Enc}}_ pk (AR)\) and \({\varvec{d}}'={\mathsf {Enc}}_ pk ({\varvec{b}} + A{\varvec{r}})\). The values \(C = AR\) and \({\varvec{d}}={\varvec{b}}+A{\varvec{r}}\) are the “masked data.” We slightly abuse notation here; \({\mathsf {Enc}}_ pk (\cdot )\) is applied component-wise in the computation of \(C'\) and of \({\varvec{d}}'\).

  • The \(\mathrm {CSP}\) decrypts \(C'\) and \({\varvec{d}}'\) and computes \(\tilde{{\varvec{w}}}=C^{-1}{\varvec{d}}\). The vector \(\tilde{{\varvec{w}}}\) is the “masked model” sent back to the \(\mathrm {MLE}\).

  • The \(\mathrm {MLE}\) computes the desired model as \({\varvec{w}}^* = R\tilde{{\varvec{w}}}-{\varvec{r}}\). Indeed, it is easy to verify that \(R\tilde{{\varvec{w}}}-{\varvec{r}}= R(AR)^{-1}({\varvec{b}} + A{\varvec{r}})-{\varvec{r}} = A^{-1}{\varvec{b}}\).
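The three steps above can be checked in the clear with a few lines of NumPy (no encryption; the random data are illustrative and the arithmetic is over floats rather than \(\mathbb {Z}_N\)):

```python
import numpy as np

rng = np.random.default_rng(1)
d = 4

# A toy instance of the (secret) system A w = b.
A = rng.integers(1, 10, size=(d, d)).astype(float)
A = A @ A.T + d * np.eye(d)             # make A positive definite
b = rng.integers(1, 10, size=d).astype(float)

# MLE: sample the masks R (invertible) and r, and form the masked data.
R = rng.normal(size=(d, d))
while abs(np.linalg.det(R)) < 1e-6:     # resample until R is invertible
    R = rng.normal(size=(d, d))
r = rng.normal(size=d)
C = A @ R                               # masked matrix
d_vec = b + A @ r                       # masked vector

# CSP: solve the masked system and return the masked model.
w_masked = np.linalg.solve(C, d_vec)

# MLE: unmask. R w~ - r = R (AR)^{-1} (b + A r) - r = A^{-1} b.
w_star = R @ w_masked - r
assert np.allclose(w_star, np.linalg.solve(A, b))
```

In the actual protocol the same algebra is carried out over \(\mathbb {Z}_N\) and the masked data reach the \(\mathrm {CSP}\) only in encrypted form.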

Informally, the security of the encryption scheme assures privacy against an honest-but-curious \(\mathrm {MLE}\). On the other hand, if R and \({\varvec{r}}\) are sampled uniformly at random, then the distribution of the masked data is independent of A and \({\varvec{b}}\). This guarantees privacy against an honest-but-curious \(\mathrm {CSP}\). Similar masking tricks have been previously used in different settings. In [3], a similar method is used to design a secret-shared based MPC protocol for the evaluation of general functions. In this work, we tailor the masking trick for the goal of solving the linear system \(A{\varvec{w}}={\varvec{b}}\) gaining in efficiency. In [26], masking with random values is used to outsource a large-scale linear system to an untrusted “cloud server”. They assume that the coefficient matrix A and vector \({\varvec{b}}\) of the linear system are known to a “cloud customer” seeking the solution \({\varvec{w}}\). In this work, A and \({\varvec{b}}\) are encrypted and the masking is applied “inside the encryption”; to make the masking trick, which works in \(\mathbb {Q}\), compatible with the encryption and the modular arithmetic used for it, we make use of rational reconstruction.

Notice that the two-server model allows for different implementations in practice. If we consider applications in which the majority of data-owners are willing to help run a collaborative analysis but do not want to (or cannot) spend too many resources executing it, then the role of \(\mathrm {MLE}\) and \(\mathrm {CSP}\) can be taken by two semi-trusted third parties (e.g., two independent research institutions). This setting offers the practical advantage that the involvement of all data-owners is minimal. Otherwise, since \(\mathrm {CSP}\) and \(\mathrm {MLE}\) are only required to be non-colluding, their role can be taken by two disjoint subsets of data-owners (e.g., for \(m\ge 2\), we can have \(\mathrm {DO}_1\) and \(\mathrm {DO}_2\) playing the role of \(\mathrm {MLE}\) and \(\mathrm {CSP}\), respectively). In this case, no third parties are required to implement the system.

4 Protocols Description

In this section we describe how to implement Phase 1 and Phase 2. Let \(({{\mathsf {Gen}}},{\mathsf {Enc}},{\mathsf {Dec}})\) be an LHE scheme with security parameter \(\kappa \) and message space \(\mathcal {M}= \mathbb {Z}_N\).

4.1 Phase 1: Merging the Dataset

Horizontally-Partitioned Setting. Assume that the dataset represented by the matrix X and the vector \({\varvec{y}}\) is horizontally-partitioned in m datasets. That is, the data-owner \(\mathrm {DO}_k \) holds

$$\begin{aligned} \mathcal {D}_k = \bigl \{({\varvec{x}}_{n_{k-1}+1},y_{n_{k-1}+1}), \cdots , ({\varvec{x}}_{n_k},y_{n_k})\bigr \} , \end{aligned}$$
(2)

for \(k=1,\cdots ,m\) (\(0=n_0<n_1<\cdots <n_m=n\)). In this case, as already noticed in [23], defining \(A_k=\sum _{i=n_{k-1}+1}^{n_k} {\varvec{x}}_i{\varvec{x}}_i^\top \) and \({\varvec{b}}_k=\sum _{i=n_{k-1}+1}^{n_k} y_i{\varvec{x}}_i\), we have that \(A=\sum _{k=1}^{m} A_k +\lambda I\) and \({\varvec{b}}=\sum _{k=1}^{m}{\varvec{b}}_k\). In Protocol \(\varPi _{1,\mathrm {hor}}\), each data-owner \(\mathrm {DO}_k\) computes and sends to \(\mathrm {MLE}\) encryptions of the entries of \(A_k\) and \({\varvec{b}}_k\); then \(\mathrm {MLE}\) computes encryptions of the entries of A and \({\varvec{b}}\) using the above formulas and the operation \(\odot \) (details in Protocol 1).
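The aggregation formulas can be checked on a small synthetic instance (in the protocol the sums over k are computed under encryption via \(\odot \); here we verify only the plaintext algebra):

```python
import numpy as np

rng = np.random.default_rng(2)
n, d, lam, m = 90, 3, 0.5, 3

X = rng.normal(size=(n, d))
y = rng.normal(size=n)

# Each owner k holds a horizontal slice and contributes A_k and b_k.
parts = np.array_split(np.arange(n), m)
A_parts = [X[idx].T @ X[idx] for idx in parts]
b_parts = [X[idx].T @ y[idx] for idx in parts]

# MLE aggregates: A = sum_k A_k + lam * I, b = sum_k b_k.
A = sum(A_parts) + lam * np.eye(d)
b = sum(b_parts)

assert np.allclose(A, X.T @ X + lam * np.eye(d))
assert np.allclose(b, X.T @ y)
```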

[Protocol 1: \(\varPi _{1,\mathrm {hor}}\), merging horizontally-partitioned datasets]

Arbitrarily-Partitioned Setting. Assume that each \(\mathrm {DO}_k\) holds some elements of X and \({\varvec{y}}\). That is, \(\mathrm {DO}_k \) holds

$$\begin{aligned} \mathcal {D}_k = \bigl \{X[i,j]={\varvec{x}}_i[j] \mid (i,j)\in D_k\bigr \} \cup \bigl \{ {\varvec{y}}[i]=y_i \mid (i,0)\in D_k\bigr \} , \end{aligned}$$
(3)

where \(D_k \subseteq \{1,\dots ,n\}\times \{0,1,\dots ,d\}\). Assume that each data-owner sends encryptions of the elements it knows to \(\mathrm {MLE}\). Then, in order to compute encryptions of the entries of A and \({\varvec{b}}\), \(\mathrm {MLE}\) needs to multiply two ciphertexts. Indeed, we have \( {\varvec{b}}[i] = \sum _{t=1}^{n} {\varvec{x}}_t[i] {\varvec{y}}[t] \) and \( A[i,j] = \sum _{t=1}^{n} {\varvec{x}}_t[i] {\varvec{x}}_t[j]\) if \(j\ne i\), otherwise \(A[i,i] = \sum _{t=1}^{n} {\varvec{x}}_t[i] {\varvec{x}}_t[i] + \lambda \). To allow this, we use labeled-homomorphic encryption. As we recalled in Sect. 2, the latter can be constructed on top of any LHE scheme and it enhances the underlying scheme with the multiplication command \({\mathsf {labMult}}\). In particular, after having received labeled-encryptions of the input from the data-owners,Footnote 9 \(\mathrm {MLE}\) can compute the encryptions of the entries of A and \({\varvec{b}}\) using formulas of the form \(\bigodot _{t=1}^{n} {\mathsf {labMult}}\bigl ({\mathsf {labEnc}}({\varvec{x}}_t[i]), {\mathsf {labEnc}}({\varvec{x}}_t[j])\bigr )\). Remember that the output of the command \({\mathsf {labMult}}\) used to compute the encryption of the product of two messages, \(m_1\) and \(m_2\), is in fact an encryption of \(m_1m_2-b_1b_2\) where \(b_1,b_2\) are two random values used to compute the labeled-encryptions of the values \(m_1\) and \(m_2\). For this reason, at the end of the procedure described before, \(\mathrm {MLE}\) obtains encryptions of \(A-B\) and \({\varvec{b}}-{\varvec{c}}\), instead of encryption of A and \({\varvec{b}}\), where B and \({\varvec{c}}\) depend on the random values used to encrypt the entries of the local datasets using the labeled-homomorphic scheme. The matrix B and the vector \({\varvec{c}}\) can be reconstructed by the party handling the decryption key (i.e., \(\mathrm {CSP}\)). 
The decryption procedure of the labeled-homomorphic scheme, \({\mathsf {labDec}}\), accounts for this. However, in the application we consider here (training a ridge regression model) it is necessary that at the end of Phase 1 \(\mathrm {MLE}\) holds proper encryptions of A and \({\varvec{b}}\). Only in this case can we proceed to Phase 2 and use the masking trick (applying the masking trick to labeled-encryptions of A and \({\varvec{b}}\) does not work). For this reason, we add one round of communication in which \(\mathrm {CSP}\) sends to \(\mathrm {MLE}\) encryptions of the entries of B and \({\varvec{c}}\). This can be done before the beginning of the actual computation (Step 1 of Phase 1), since B and \({\varvec{c}}\) do not depend on the actual data used to train the regression model. In this way, \(\mathrm {MLE}\) finally obtains encryptions of A and \({\varvec{b}}\). Protocol \(\varPi _{1,\mathrm {arb}}\) in Protocol 2 describes this in detail.
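The bookkeeping above can be checked with a toy, plaintext-only simulation in which modular integer arithmetic stands in for the ciphertext operations (no actual encryption is performed; the modulus and all values are illustrative). It verifies that summing the \({\mathsf {labMult}}\) outputs yields \(A[i,j]-B[i,j]\), and that adding back \(B[i,j]\) recovers \(A[i,j]\):

```python
import random

# Toy simulation: integers mod N stand in for ciphertexts of the
# labeled-homomorphic scheme. All names and values are illustrative.
N = 2**61 - 1
random.seed(0)

def lab_enc(m):
    b = random.randrange(N)      # per-entry random mask
    return (m - b) % N, b        # (m - b, Enc(b)) in the real scheme

def lab_mult(e1, e2):
    a1, b1 = e1
    a2, b2 = e2
    # In the real scheme this is Enc(a1*a2) (+) a1*Enc(b2) (+) a2*Enc(b1),
    # which decrypts to m1*m2 - b1*b2.
    return (a1 * a2 + a1 * b2 + a2 * b1) % N

n, i, j = 4, 0, 1
x_i = [random.randrange(100) for _ in range(n)]   # i-th feature column
x_j = [random.randrange(100) for _ in range(n)]   # j-th feature column

enc_i = [lab_enc(v) for v in x_i]
enc_j = [lab_enc(v) for v in x_j]

# MLE's homomorphic sum yields (an encryption of) A[i,j] - B[i,j]
a_minus_b = sum(lab_mult(e1, e2) for e1, e2 in zip(enc_i, enc_j)) % N
# CSP can reconstruct B[i,j] from the masks and send its encryption to MLE
B_ij = sum(e1[1] * e2[1] for e1, e2 in zip(enc_i, enc_j)) % N
A_ij = (a_minus_b + B_ij) % N

assert A_ij == sum(u * v for u, v in zip(x_i, x_j)) % N
```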


4.2 Phase 2: Computing the Model

At the end of Phase 1, \(\mathrm {MLE}\) knows component-wise encryptions of the matrix A and the vector \({\varvec{b}}\) (both with entries represented in \(\mathbb {Z}_N\), the message space of the LHE scheme used in Phase 1). Recall that the final goal of our system is computing the solution \({\varvec{w}}^*\in \mathbb {Q}^d\) of (1). In order to do this in a privacy-preserving manner, in Phase 2 we implement the masking trick described in Sect. 3 and compute \(\tilde{{\varvec{w}}}^*\) that solves (1) in \(\mathbb {Z}_N\). Then we use rational reconstruction to find \({\varvec{w}}^*\). All the details are reported in Protocol \(\varPi _2\) (Protocol 3). Correctness is easy to verify: indeed, we have \(R\tilde{{\varvec{w}}}-{\varvec{r}}\equiv R(AR)^{-1}({\varvec{b}}+A{\varvec{r}})-{\varvec{r}}\equiv A^{-1}{\varvec{b}}\pmod N\). Security is also straightforward: Protocol \(\varPi _2\) is secure against an honest-but-curious \(\mathrm {CSP}\) because the values it sees (the masked data \(AR \mod N\) and \( {\varvec{b}}+A{\varvec{r}}\mod N\)) have a distribution that is independent of the input datasets. Moreover, Protocol \(\varPi _2\) is secure against an honest-but-curious \(\mathrm {MLE}\) because of the security of the underlying encryption scheme: \(\mathrm {MLE}\) sees only an encrypted version of A and \({\varvec{b}}\). See [14, Appendix A.6] for the formal security proof.
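The correctness chain \(R\tilde{{\varvec{w}}}-{\varvec{r}}\equiv A^{-1}{\varvec{b}}\pmod N\) can be checked numerically with a small plaintext sketch (a \(2\times 2\) toy case with a prime modulus and illustrative values; plain integers stand in for ciphertexts):

```python
# Toy check of the masking trick: MLE masks the encrypted system,
# CSP solves it in the clear, MLE unmasks the solution.
N = 10007  # prime toy modulus, so every nonzero determinant is invertible

def inv_mod(a, n):
    return pow(a, -1, n)  # modular inverse (Python 3.8+)

def mat_vec(M, v, n):
    return [sum(M[i][k] * v[k] for k in range(2)) % n for i in range(2)]

def mat_mat(M, X, n):
    return [[sum(M[i][k] * X[k][j] for k in range(2)) % n for j in range(2)]
            for i in range(2)]

def mat_inv(M, n):  # 2x2 inverse via the adjugate
    (a, b), (c, d) = M
    di = inv_mod((a * d - b * c) % n, n)
    return [[d * di % n, -b * di % n], [-c * di % n, a * di % n]]

A = [[5, 1], [1, 3]]   # encrypted entry-wise in the real protocol
b = [7, 2]
R = [[2, 1], [1, 1]]   # MLE's random invertible mask
r = [11, 13]

C = mat_mat(A, R, N)                                       # A R mod N
e = [(b[i] + mat_vec(A, r, N)[i]) % N for i in range(2)]   # b + A r mod N

w_csp = mat_vec(mat_inv(C, N), e, N)        # CSP solves C w~ = e
w_star_modN = [(sum(R[i][k] * w_csp[k] for k in range(2)) - r[i]) % N
               for i in range(2)]           # MLE unmasks: R w~ - r

assert w_star_modN == mat_vec(mat_inv(A, N), b, N)  # equals A^{-1} b mod N
```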

In some applications, a desirable property is that the model is delivered only to the data-owners. If the roles of \(\mathrm {MLE}\) and \(\mathrm {CSP}\) are taken by third parties, this can be achieved using a standard tool like threshold encryption [9]. In this case, the key generation step of Phase 1 is enhanced with the sharing of \( sk \) (i.e., \(\mathrm {CSP}\) knows \( sk \) and each \(\mathrm {DO}_i\) knows a share of \( sk \)). Then, Step 2 of Protocol \(\varPi _2\) is modified in such a way that \(\mathrm {CSP}\) sends to \(\mathrm {MLE}\) the value \({\mathsf {Enc}}_ pk (\tilde{{\varvec{w}}})\), instead of the vector \(\tilde{{\varvec{w}}}\) in the clear. \(\mathrm {MLE}\) computes \({\mathsf {Enc}}_ pk (\tilde{{\varvec{w}}}^*)\) and broadcasts it to all data-owners. Finally, the data-owners collaborate to jointly decrypt and compute \({\varvec{w}}^*\).


Choice of Parameters. In the last step of \(\varPi _2\) we use rational reconstruction to recover the components of \({\varvec{w}}^*\in \mathbb {Q}^d\) from the solution of \(A{\varvec{w}}={\varvec{b}}\) computed in \(\mathbb {Z}_N\). According to [11, 27], if a rational \(t=r/s\) with \(-R \le r\le R\), \(0<s\le S\) and \(\gcd (s,N)=1\) is represented as \(t'=r s^{-1} \mod N\) in \(\mathbb {Z}_N\), then the Lagrange-Gauss algorithm uniquely recovers r and s provided that \(2RS < N\). Since \({\varvec{w}}^* = A^{-1}\,{\varvec{b}} = \frac{1}{\det (A)} \,{\text {adj}}(A){\varvec{b}}\in \mathbb {Q}^d\), in order to choose an N that satisfies the condition stated above, we need to bound \(\det (A)\) and the entries of the vector \({\text {adj}}(A){\varvec{b}}\). Let \(\alpha =\max \{\Vert A\Vert _\infty ,\Vert {\varvec{b}}\Vert _\infty \}\); using Hadamard’s inequality, we have that \(0<\det (A)\le \alpha ^{d}\) (A is a positive definite matrix) and \(\Vert {\text {adj}}(A){\varvec{b}}\Vert _\infty \le d (d-1)^{\frac{d-1}{2}}\alpha ^{d}\). Using the same assumptions as in Sect. 2 on the entries of X and \({\varvec{y}}\) (that is, the entries of X and \({\varvec{y}}\) are real numbers in \([-\delta , \delta ]\) with at most \(\ell \) digits in the fractional part), we have that \(\alpha \le 10^{2\ell }(n\delta ^2+\lambda )\). It follows that the condition \(2RS<N\) is fulfilled when

$$\begin{aligned} 2d(d-1)^\frac{d-1}{2}\,10^{4\ell d}\,(n\delta ^2+\lambda )^{2d}< N. \end{aligned}$$
(4)
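As a rough aid for parameter selection, the bound (4) can be turned into a bit-length estimate for N by taking base-2 logarithms of the left-hand side (an illustrative helper, not the paper's code; the extra bit guarantees the strict inequality for any N of the returned length):

```python
from math import ceil, log2

def modulus_bits(n, d, ell, delta=1.0, lam=0.0):
    """Bit-length of N so that any N of this length satisfies (4):
    2 d (d-1)^((d-1)/2) 10^(4*ell*d) (n*delta^2 + lam)^(2d) < N."""
    beta = (1 + log2(d) + (d - 1) / 2 * log2(max(d - 1, 1))
            + 4 * ell * d * log2(10) + 2 * d * log2(n * delta**2 + lam))
    return ceil(beta) + 1
```

For example, with \(n=10^3\), \(d=10\), \(\ell =3\), \(\delta =1\) and \(\lambda =0\) this gives a bound of a few hundred bits, well below the modulus size needed for the security of Paillier's scheme, so security (not correctness) typically dictates the size of N.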
Fig. 1. Communication overhead in MB of \(\varPi _2\) (\(\delta =1\), 80-bit security, \(\ell =3\), Paillier’s scheme, \(\lambda =0\)).

Communication Complexity. The messages sent during Protocol \(\varPi _{1,\mathrm {hor}}\) and Protocol \(\varPi _2\) contain \(\varTheta (d^2)\) elements of \(\mathbb {Z}_N\), while those in Protocol \(\varPi _{1,\mathrm {arb}}\) contain \(\varTheta (dn)\) elements. This implies a communication cost of \(O(d^3\log (nd))\) bits for \(\varPi _{1,\mathrm {hor}}\) and \(\varPi _2\), and of \(O((nd^2+d^3)\log (nd))\) bits for \(\varPi _{1,\mathrm {arb}}\) (details in [14, Appendix A.3]). In particular, our approach significantly improves the communication complexity compared to previous solutions that use Yao’s scheme [12, 23]. Indeed, the latter require \(\mathrm {CSP}\) to send to \(\mathrm {MLE}\) the garbled representation of a boolean circuit with millions of gates (see [23, Fig. 5] and [12, Fig. 7]). In [23] the authors show that the garbled representation of one gate is a lookup table of around 30 bytes (80-bit security). This means that a privacy-preserving system based on Yao’s scheme needs at least hundreds of megabytes only for sending the garbled circuit, without considering the other steps. On the other hand, even for large values of n and d, the communication complexity of \(\varPi _2\) is much smaller than 100 MB (see Fig. 1). For example, in the horizontally-partitioned setting [23] uses the same techniques we deploy in \(\varPi _{1,\mathrm {hor}}\), combined with Yao’s protocol. In particular, [23] reports that the garbled representation of the circuit that solves (1) with \(d=20\) using Cholesky decomposition (24-bit integer representation) has size 270 MB. On the other hand, for a dataset with 10 million instances and \(d=20\), the overall overhead of \(\varPi _{1,\mathrm {hor}}+\varPi _2\) is less than 1.3 MB. In the arbitrarily-partitioned setting, the communication overhead of our system is dominated by the cost of Phase 1 (Protocol \(\varPi _{1,\mathrm {arb}}\)) because of its linear dependency on the number of instances n.
However, this seems to be the case also for other approaches. For example, in [12], a secure inner-product protocol based on additive secret-sharing and Beaver’s triples [5] is used to compute the inner products of the columns of the matrix X vertically-partitioned among two or more users. The complexity of this approach for Phase 1 is \(\varTheta (nd^2\log (n))\) bits (comparable with the complexity of \(\varPi _{1,\mathrm {arb}}\)). In Phase 2, [12] uses Yao’s protocol and the conjugate gradient descent (CGD) algorithm to solve (1). The concrete size of the circuit is not reported, but the number of gates is. For \(d=100\) and 5 iterations of CGD, more than \(10^8\) gates are used: this gives an overhead of at least 3 GB only for sending the garbled circuit during Phase 2 (assuming a garbled gate is 30 bytes). On the other hand, the overall overhead of \(\varPi _{1,\mathrm {arb}}+\varPi _2\) when \(d=100\) for a dataset of 5 thousand instances is less than 1.3 GB.

The SecureML paper [22] uses only additive secret-sharing and Beaver’s triples to design a system that assumes an arbitrary partitioning of the dataset. When the pre-processing needed for the triples is implemented via LHE, the linear regression training system proposed in [22] has complexity \(\varTheta (nd+n)\). Thus, in terms of communication complexity, [22] performs better than our solution in the arbitrarily-partitioned case. Our system, however, is preferable if the training dataset is horizontally-partitioned and \(n\gg d\) (e.g., \(n=\varTheta (d^{2.5})\)). For example, if \(d=100\) and \(n=10^5\), the system in [22] has an overhead of 200 MB for the pre-processing phase alone (see [22, Table II]), while the total cost of \(\varPi _{1,\mathrm {hor}}+\varPi _2\) is less than 120 MB.
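The back-of-envelope comparisons above can be reproduced with two small estimators (a sketch under stated assumptions: 30 bytes per garbled gate as in [23], and a ciphertext count of roughly \(d^2+2d\) Paillier ciphertexts of \(2\log _2 N\) bits each for \(\varPi _2\); the exact constants are illustrative, not taken from the protocols' specifications):

```python
def yao_gb(num_gates, bytes_per_gate=30):
    """Garbled-circuit size in GB at ~30 bytes per gate (80-bit security)."""
    return num_gates * bytes_per_gate / 10**9

def pi2_mb(d, modulus_bits=2048):
    """Rough Pi_2 traffic in MB: about d^2 + 2d Paillier ciphertexts
    (masked matrix, masked vector, returned solution), each ciphertext
    occupying 2 * modulus_bits bits. The constant is an assumption."""
    return (d * d + 2 * d) * 2 * modulus_bits / 8 / 10**6
```

For instance, `yao_gb(10**8)` gives the 3 GB figure quoted for the CGD circuit of [12], while `pi2_mb(20)` stays well below the 100 MB mark of Fig. 1.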

5 Implementation

In this section we describe our case-study implementation of the system described in Sect. 4. Our goal is to evaluate the effect of the public parameters on the system’s accuracy and efficiency, and to test our system on real-world datasets. In particular, the experiments we run are designed to answer the following questions:

  1.

    Evaluating accuracy: How does the system parameter \(\ell \) (number of digits in the fractional part of the input data) influence the accuracy of the output model \({\varvec{w}}^*\)? Recall that we assume that the values in X and \({\varvec{y}}\) are real numbers with at most \(\ell \) digits in the fractional part. In practice, this means that each user must truncate all the entries in the local dataset after the \(\ell ^{\mathrm {th}}\) digit of the fractional part, before inputting the values into the privacy-preserving system. In the standard machine-learning setting, on the other hand, this requirement is not necessary, and the model is computed using floating-point arithmetic on values with more than \(\ell \) digits in the fractional part. For this reason, the model \({\varvec{w}}^*\), which is trained using our privacy-preserving system, can differ from the model \(\bar{{\varvec{w}}}^*\) learned in the clear (with the same regularization parameter \(\lambda \)). To evaluate this difference we use

    $$ R_{MSE} = \left| \frac{MSE ({\varvec{w}}^*) - MSE (\bar{{\varvec{w}}}^*)}{MSE (\bar{{\varvec{w}}}^*)}\right| $$

    where MSE is the mean squared error of the model computed on a test dataset (a common measure of model accuracy in the machine-learning setting). The value \(R_{MSE}\) quantifies the loss in accuracy caused by using the vector \({\varvec{w}}^*\) instead of \(\bar{{\varvec{w}}}^*\) as the model.

  2.

    Evaluating running-time: How do the data parameters n and d influence the running time of each step of our privacy-preserving system in practice? In [14, Appendix A.3], we report the number of elementary operations of each kind (e.g., encryptions, modular additions, etc.) for each step of the system, while in this section we report the total running time of each step.

  3.

    Evaluating efficiency in practice: How does our system behave when it is run on real-world data? In particular, we run our system on datasets downloaded from the UCI repository, which is commonly used for evaluating new machine-learning algorithms in the standard setting (i.e., with no privacy guarantees).
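The truncation step and the error measure \(R_{MSE}\) from Question 1 can be sketched as follows (a minimal sketch with illustrative helper names, not the paper's code):

```python
import numpy as np

def truncate(values, ell):
    """Truncate after the ell-th fractional digit, as each data-owner
    does before feeding its local data to the system (illustrative)."""
    scale = 10 ** ell
    return np.trunc(np.asarray(values) * scale) / scale

def mse(w, X_test, y_test):
    """Mean squared error of a linear model on a test set."""
    return float(np.mean((X_test @ w - y_test) ** 2))

def r_mse(w_private, w_clear, X_test, y_test):
    """Relative MSE gap between the privacy-preserving model and the
    model learned in the clear."""
    m = mse(w_clear, X_test, y_test)
    return abs((mse(w_private, X_test, y_test) - m) / m)
```

Comparing `r_mse` for models trained on `truncate(X, ell)` against the in-the-clear model makes the accuracy/efficiency trade-off in \(\ell \) concrete.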

Fig. 2. Error rate \(R_{MSE}\) (log scale) as a function of \(\ell \) (\(n=10^3, d=10\)).

Setup. We implemented our system using Paillier’s scheme with message space \(\mathcal {M}=\mathbb {Z}_N\). In order to assure a security level of at least 100 bits, decrease the running time and the communication overhead, and satisfy (4), we choose the bit-length of N as the maximum between the value required for the desired security level and \(\beta \), where \(\beta \) is the logarithm in base 2 of the left-hand side of (4). We wrote our software in Python 3.5.2, using the phe 1.3 library for Paillier encryption/decryption and operations on ciphertexts, and the gmpy2 library for arithmetic operations with large integers. Gaussian elimination was used to compute determinants and to solve linear systems.
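The Gaussian elimination over \(\mathbb {Z}_N\) mentioned above can be sketched as follows (an illustrative helper, not the paper's code; the sketch assumes every pivot it meets is invertible modulo N, which holds with overwhelming probability when N is an RSA modulus, and always when N is prime):

```python
def solve_mod(A, b, N):
    """Solve A w = b (mod N) by Gauss-Jordan elimination.
    Returns (det(A) mod N, w). Illustrative helper; assumes each
    nonzero pivot is invertible mod N."""
    d = len(A)
    M = [row[:] + [bi] for row, bi in zip(A, b)]  # augmented matrix
    det = 1
    for col in range(d):
        piv = next(r for r in range(col, d) if M[r][col] % N != 0)
        if piv != col:
            M[col], M[piv] = M[piv], M[col]
            det = -det                       # row swap flips the sign
        det = det * M[col][col] % N          # det is the product of pivots
        inv = pow(M[col][col], -1, N)        # modular inverse (Python 3.8+)
        M[col] = [x * inv % N for x in M[col]]
        for r in range(d):
            if r != col and M[r][col]:
                f = M[r][col]
                M[r] = [(x - f * y) % N for x, y in zip(M[r], M[col])]
    return det % N, [M[r][d] for r in range(d)]
```

The determinant comes out of the same elimination at no extra cost, which is why a single pass suffices for both tasks.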

To test the system composed of \(\varPi _{1,\mathrm {hor}}\) and \(\varPi _2\), we run experiments in the horizontally-partitioned (HP) setting, splitting the n data points evenly among 10 data-owners. To test the system composed of \(\varPi _{1,\mathrm {arb}}\) and \(\varPi _2\), we run experiments in the vertically-partitioned (VP) setting, where we assume that the d features are evenly split among 3 data-owners and that \(\mathrm {DO}_3\) also holds \({\varvec{y}}\).

Numerical Results. All experiments were run on a machine with the following specifications. OS: Scientific Linux 7.4, CPU: 40 cores (Intel(R) Xeon(R) CPU E5-2660 v2 2.20 GHz), Memory: 500 GB. All timings are reported in seconds, and all values are averaged over 5 repetitions of the same experiment.

To answer Question 1, we measure \(R_{MSE}\) for different values of \(\ell \) on synthetically-generated data in both the HP and VP settings (see Fig. 2). As \(\ell \) increases, regardless of the values of n and d, the value of \(R_{MSE}\) decreases very rapidly, while the efficiency degrades. Indeed, because of (4), the value of \(\ell \) affects the bit-length of the plaintexts and ciphertexts. For this reason, we recommend choosing \(\ell \) equal to a small integer (e.g., \(\ell =3\)). This choice yields a negligible error rate (e.g., \(R_{MSE}\) of order \(10^{-4}\)) without degrading the system efficiency.

Table 1. Running times (secs) for synthetic data in the HP and VP settings (\(\ell =3\)).

To answer Question 2 and assess the effect of the parameters n and d on our system’s performance, we report in Table 1 the running time of each step of the system when it is run on synthetic data. The advantage of this approach is that we can run experiments for a wide range of parameter values. For Step 2 in Phase 1 (Protocol \(\varPi _{1,\mathrm {hor}}\) in the HP setting, Protocol \(\varPi _{1,\mathrm {arb}}\) in the VP setting) we report the average running time for one data-owner. In Protocol \(\varPi _{1,\mathrm {hor}}\), Step 2 is the most expensive one. Here, the data-owner \(\mathrm {DO}_k\) computes the \(d\times d\) matrix \(A_k\) and encrypts its entries. In our setting (n data points evenly split among the ten data-owners), this costs \(\varTheta (nd^2)\) arithmetic operations on plaintext values and \(\varTheta (d^2)\) encryptions for one data-owner. We verified that the cost of the encryptions is dominant for all values of n considered here. In Step 3 of \(\varPi _{1,\mathrm {hor}}\), \(\mathrm {MLE}\) computes the encryptions of A and \({\varvec{b}}\) using \(\varTheta (d^2)\) ciphertext additions (i.e., multiplications modulo N), which turns out to be fast. In \(\varPi _{1,\mathrm {arb}}\), Step 3 is the most expensive step: here \(\mathrm {MLE}\) performs \(\varTheta (nd^2)\) ciphertext operations to compute \({\mathsf {Enc}}_ pk (A)\) and \({\mathsf {Enc}}_ pk ({\varvec{b}})\). In particular, the running time of \(\varPi _{1,\mathrm {arb}}\) is more influenced by the value of n than those of \(\varPi _{1,\mathrm {hor}}\) and \(\varPi _2\). Finally, for \(\varPi _2\) the results in Table 1 show that Step 1 requires a longer time compared to the other two steps because of the \(\varTheta (d^3)\) operations done on ciphertexts.
Steps 2 and 3 require \(\varTheta (d^2)\) decryptions and \(\varTheta (d^2)\) operations on plaintexts and are therefore faster (e.g., less than 27 s for both steps on a dataset of one hundred thousand instances with 40 features).

To answer Question 3 and show the practicality of our system, we report in Table 2 the total running time and communication overhead for seven different UCI datasets (references in [14, Appendix A.4]). Some of these datasets were also used in [12, 23]. For example, [23] reports a running time of 45 s and a communication overhead of 83 MB (69 MB, resp.) for Phase 2 of their system run on the dataset “forest” (“wine”, resp.) ([23, Table I]). Our protocol \(\varPi _2\) on the same datasets takes about 3 s with less than 83 kB sent. Phase 2 of the system presented in [12] runs on the dataset “student” in 19 s ([12, Table 3]), and we estimate an overhead of 3 GB (20 CGD iterations). Protocol \(\varPi _2\) on the same dataset runs in about 40 s with 484 kB of overhead.

Table 2. Running times (secs) for UCI datasets in the HP and VP settings.