Secure multiparty learning from the aggregation of locally trained models

https://doi.org/10.1016/j.jnca.2020.102754

Abstract

In many applications, multiple parties would benefit from a precise learning model trained on the aggregated dataset. However, the trivial approach of aggregating all the data into one data center and processing it centrally is not appropriate when data privacy is a significant concern. In this paper, we propose a new framework for secure multi-party learning and construct a concrete scheme by incorporating aggregate signature and proxy re-encryption techniques. Unlike previous solutions for multi-party privacy-preserving machine learning, we do not encrypt the whole dataset or the intermediate values produced during the training process. In our scheme, secure verifiable computation delegation is utilized to privately label a public dataset from the aggregation of locally trained models. Using the newly labeled data items, the participants can update their learning models with a significant accuracy improvement. Furthermore, we prove that the proposed scheme satisfies the desired security properties, and the experimental analysis on MNIST and HAM10000 shows that it is highly efficient.

Introduction

Due to its superior performance, machine learning has become a core component of many data-driven applications, such as image classification (Krizhevsky et al., 2017), speech recognition (Graves et al., 2013), and self-driving vehicles (Ren et al., 2015). In these applications, multiple parties would benefit from a precise learning model trained on the aggregated dataset. For example, a hospital can train a neural network model on the patients' electronic medical records in its dataset. The model can later be used to translate prospective patients' queries into a probability of heart disease, which assists the doctor in making a better diagnosis. While a large dataset is favorable for decreasing the misdiagnosis rate, the large volume of data required to train a good learning model is not always available. Intuitively, a trivial method for multi-party learning is to collect the data from multiple parties into a single large dataset and process it centrally. However, this approach is not appropriate when data privacy is a significant concern (Alipanahi et al., 2015).

The increasing awareness of data security and privacy in the aforementioned application scenarios has motivated much work on privacy-preserving linear regression (Du et al., 2004), decision trees (Lindell and Pinkas, 2002), and logistic regression (Slavkovic et al., 2007). Generally, SMC (Secure Multi-party Computation) (Yao, 1982) or homomorphic encryption (Van Dijk et al., 2010) is utilized to process each data item of the whole original dataset. Considering the huge volume of the dataset in machine learning, these methods incur a high computation overhead. In Mohassel and Zhang (2017), Mohassel et al. proposed privacy-preserving machine learning in a two-server model based on OT (oblivious transfer) (Asharov et al., 2013) and LHE (leveled homomorphic encryption) (Brakerski et al., 2014). Although the two servers can do most of the online computation, the client's offline computation overhead remains high since it involves heavy cryptographic operations for each node of the neural network. Classifier aggregation (Pathak et al., 2010) provides another way to protect the privacy of the training data. Pathak et al. presented a protocol for generating a global classifier from multiple local classifiers using parameter averaging. However, parameter averaging (Pathak et al., 2010) is only applicable to a collection of classifiers of the same type. Moreover, there is information leakage from the averaged parameters. The differential privacy mechanism (Dwork, 2008) can be used to prevent information leakage from the averaged parameters and provides strong privacy guarantees by bounding the privacy loss on adjacent databases.
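To make the parameter-averaging and noise-addition ideas above concrete, the following Python sketch averages same-shaped parameter vectors from identically structured classifiers and optionally perturbs the average with Laplace noise; the sensitivity value, the privacy budget, and the toy weight vectors are illustrative assumptions, not values derived from a concrete classifier or a formal differential privacy analysis.

```python
# A minimal sketch of classifier aggregation by parameter averaging, with an
# optional Laplace mechanism on the averaged parameters. Illustrative only:
# the sensitivity and epsilon below are placeholder assumptions.
import numpy as np

def aggregate_parameters(local_params, epsilon=None, sensitivity=1.0):
    """Average same-shaped parameter vectors; optionally add Laplace noise."""
    avg = np.mean(np.stack(local_params), axis=0)
    if epsilon is not None:
        # Laplace mechanism: noise scale = sensitivity / epsilon.
        avg = avg + np.random.laplace(scale=sensitivity / epsilon, size=avg.shape)
    return avg

# Three parties with locally trained classifier weights (toy values).
parties = [np.array([0.8, -1.2, 0.3]),
           np.array([0.7, -1.0, 0.4]),
           np.array([0.9, -1.1, 0.2])]
global_w = aggregate_parameters(parties, epsilon=1.0)
```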

In Fredrikson et al. (2015), Fredrikson et al. proposed the model inversion attack, which shows that there is information leakage about the training data from the input and output of the learning models or from the intermediate values produced during the training process. To reduce this information leakage, Mohassel and Zhang (2017) proposed SecureML, a privacy-preserving machine learning framework based on homomorphic encryption, garbled circuits, and secret sharing. SecureML works with two non-colluding servers that privately train the neural network and share the final learned model. In Hamm et al. (2016), Hamm et al. proposed a new multi-party training method based on the aggregation of local classifiers. Compared with previous research that encrypts the raw dataset, this method is quite efficient. Nevertheless, it cannot protect the privacy of the local models since the local classifiers are first transferred to a central entity without any privacy protection.

The main contribution of this work is SML, a secure multi-party learning framework for privately improving personalized learning models based on the technique of knowledge transfer (Breiman, 1996). Specifically, our contributions can be summarized as follows:

  • SML enables mutually distrusting (honest-but-curious) parties to improve their local models from a secure aggregation of the locally trained models.

  • We use proxy re-encryption to realize secure aggregation of the label encryptions generated under multiple keys. Also, aggregate signatures and verifiable computation delegation are utilized in SML so that the verifiability of the aggregate result can be guaranteed even against a malicious cloud server.

  • We study the empirical performance of SML on MNIST and HAM10000. The efficiency analysis shows that the scheme is practically efficient. Meanwhile, the generalization ability of each updated model shows a great improvement compared with the original model.

This paper is the full version of the paper (Ma et al., 2019a) presented at ML4CS 2019. The main differences from the conference version are as follows: First, we present the related work on privacy-preserving machine learning to illustrate the research progress. We also introduce the preliminaries and present a formal definition of the security requirements in Section 2 and Section 3, and give the formal security proof in Section 5. Second, we simplify the system model of SML, in which a central entity is no longer needed, and we update the construction of SML in Section 4 accordingly. Finally, we provide the experimental evaluation of SML on HAM10000 and present a thorough efficiency analysis of the proposed scheme concerning the computation cost of all the participants.

As privacy-preserving multi-party learning and SMC are closely related, it is natural to utilize SMC methods to protect the private data. Orlandi et al. (2007) presented an oblivious neural network via homomorphic encryption, where each activation function of the neural network is evaluated interactively between the user and the model owner by introducing a multiplicative blinding step. Rouhani et al. (2018) proposed a scalable, provably secure deep learning method based on Yao's garbled circuit (Yao, 1982). However, it is hard to apply these methods in real applications due to their low efficiency. In Mohassel and Zhang (2017), Mohassel et al. proposed privacy-preserving logistic regression, linear regression, and neural network learning. The proposed scheme was constructed from secret sharing (Shamir, 1979), oblivious transfer (Asharov et al., 2013), and secure two-party computation under the two non-colluding servers security model. In Mohassel and Rindal (2018), Mohassel et al. proposed ABY3, a general framework for privacy-preserving machine learning in the three-server model. Moreover, data privacy is protected as long as at most one of the three servers is compromised. In ABY3, data privacy is realized through three-party sharing, and the learning process is completed by utilizing binary sharing, arithmetic sharing, and Yao sharing.

Li et al. (2017) proposed a privacy-preserving multi-party deep learning scheme based on multi-key FHE (fully homomorphic encryption (Van Dijk et al., 2010)), in which each participant encrypts its original dataset with its own public key and uploads it to the cloud server. Once the cloud server has received all the encrypted datasets, it can run the deep learning algorithm over them. However, Li et al. did not present experimental results for the proposed scheme, so it is hard to verify its practicality in real applications. Considering the large volume of the dataset, the complex structure of the deep neural network, and the heavy computation overhead of fully homomorphic encryption, privacy-preserving deep learning based on FHE or multi-key FHE has a long way to go before it can be used in practical applications.

Instead of processing the whole training dataset, Shokri and Shmatikov (2015) presented a novel method in which the distributed participants first train local learning models on their datasets independently. Later, a small subset of their models' key parameters is shared with the other participants during the training process. Compared to SMC, this method shows a great efficiency improvement. However, sharing key parameters during training causes information leakage. As shown by Szegedy et al. (2013), an adversary who has access to the model parameters may identify where a model is vulnerable. The differential privacy mechanism (Dwork, 2008) is a useful tool for resisting model inversion attacks but also introduces an accuracy loss.
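The following Python sketch illustrates the selective parameter-sharing idea in its simplest form: each participant uploads only the largest-magnitude gradient entries, and another participant folds them into its own parameters. The sharing fraction and update rule are illustrative assumptions and do not reproduce the exact protocol of Shokri and Shmatikov (2015).

```python
# Illustrative sketch of sharing only a small fraction of gradient entries.
import numpy as np

def select_gradients(grad, fraction=0.01):
    """Return the indices and values of the largest-magnitude gradient entries."""
    flat = grad.ravel()
    k = max(1, int(fraction * flat.size))
    idx = np.argsort(np.abs(flat))[-k:]
    return idx, flat[idx]

def apply_shared_gradients(params, idx, values, lr=0.01):
    """A receiving participant folds the shared entries into its own parameters."""
    flat = params.ravel().copy()
    flat[idx] -= lr * values
    return flat.reshape(params.shape)

rng = np.random.default_rng(0)
grad = rng.normal(size=(100, 10))          # a participant's local gradient
idx, vals = select_gradients(grad, 0.01)   # share only the top 1% of entries
params = apply_shared_gradients(np.zeros((100, 10)), idx, vals)
```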

The efficiency of multi-party machine learning can also be improved by leveraging outsourcing computation (Chen et al., 2014, 2016). In Zhang et al. (2018), a single-layer perceptron training protocol that supports verifiability of the results was proposed based on outsourcing computation. In Ma et al. (2018), Ma et al. proposed a secure summation protocol to privately exchange the parameter updates among all the data owners based on additively homomorphic encryption. Classifier aggregation (Pathak et al., 2010) provided a new framework to solve the privacy leakage problem without using cryptography-based privacy-preserving methods, in which the global classifier is generated by averaging all the local classifiers. Papernot et al. (2016) proposed Private Aggregation of Teacher Ensembles (PATE), which provides a reliable privacy guarantee for the training data. However, it is not trivial to apply PATE to multi-party settings.

To protect the data and the neural network model during the prediction phase, Barni et al. (2006) proposed an interactive protocol based on homomorphic encryption, in which the participants have to decrypt the ciphertexts for each layer of the neural network during the prediction process. The interactivity not only brings a high communication overhead but also results in significant knowledge leakage. In Ma et al. (2019b), Ma et al. proposed a new non-interactive framework for privacy-preserving neural network prediction under the two non-colluding servers security model. In Gilad-Bachrach et al. (2016), Dowlin et al. presented CryptoNets, which makes predictions on encrypted data. A major bottleneck is the computational complexity stemming from the encryption and decryption operations of fully homomorphic encryption. Also, the limited arithmetic set of homomorphic encryption prevents the use of common activation functions, such as sigmoid or tanh. In CryptoNets, the activation is substituted by the square function, and the experimental results demonstrated that such substitution brings an accuracy loss.
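To illustrate the activation substitution discussed above, the minimal Keras snippet below builds a plain (unencrypted) network in which the square function replaces a non-polynomial activation. It is only a sketch of the idea behind CryptoNets and does not implement homomorphic encryption or the original CryptoNets architecture.

```python
# Replacing a non-polynomial activation (sigmoid/tanh/ReLU) with x^2 so that the
# network uses only arithmetic supported by homomorphic encryption schemes.
# The model below runs in the clear; it is illustrative only.
import tensorflow as tf

square = tf.keras.layers.Lambda(lambda x: tf.square(x))  # polynomial activation

model = tf.keras.Sequential([
    tf.keras.layers.Dense(128, input_shape=(784,)),
    square,                                  # x^2 in place of sigmoid or tanh
    tf.keras.layers.Dense(10, activation="softmax"),
])
model.compile(optimizer="adam", loss="categorical_crossentropy",
              metrics=["accuracy"])
```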

Federated learning (Konecný et al., 2016) provides another method for privacy-preserving multi-party machine learning. Federated learning is a distributed machine learning setting in which the training data cannot be moved away from its source. It works as follows: the clients download the current learning model from the server, improve it on their local datasets, and then send the updated parameters back to the server, which averages all the parameters to update the shared model. Currently, federated learning has been applied in the Google keyboard and in TensorFlow Federated. Indeed, significant improvements have been achieved in the efficiency and accuracy of federated learning. However, several challenges remain in improving its security and privacy. First, the security model of federated learning is weak since the server is assumed to be honest. Even though the dataset remains at its source, the frequent parameter updates can be used to infer information about the training data; such attacks include the membership inference attack (Shokri et al., 2017) and the inversion attack (Szegedy et al., 2013). The second drawback of federated learning is that all the clients share one model in the end. As each client has its own dataset, it is more attractive for each of them to acquire a personalized model.
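For concreteness, the following Python sketch shows one federated averaging round under the workflow just described: each client updates the global weights on its local data, and the server computes a dataset-size weighted average. The linear model, client data, and hyperparameters are illustrative assumptions, and no secure aggregation or privacy mechanism is applied.

```python
# A toy federated averaging round: local updates followed by weighted averaging.
import numpy as np

def local_update(weights, x, y, lr=0.1, epochs=1):
    """One client's local training of a linear least-squares model (toy example)."""
    w = weights.copy()
    for _ in range(epochs):
        grad = x.T @ (x @ w - y) / len(y)
        w -= lr * grad
    return w

def federated_round(global_w, client_data):
    """Server averages the locally updated weights, weighted by dataset size."""
    updates, sizes = [], []
    for x, y in client_data:
        updates.append(local_update(global_w, x, y))
        sizes.append(len(y))
    return np.average(np.stack(updates), axis=0, weights=np.asarray(sizes, float))

rng = np.random.default_rng(0)
clients = [(rng.normal(size=(50, 3)), rng.normal(size=50)) for _ in range(4)]
w = np.zeros(3)
for _ in range(20):
    w = federated_round(w, clients)
```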

Organization. In Section 2, we briefly review the cryptographic primitives that we use in the proposed protocol. Section 3 formally describes the system model and security definitions. In Section 4, a concrete construction of our scheme is proposed. We present the security proof and efficiency analysis of our scheme in Section 5. In Section 6, we evaluate the empirical performance of our protocol, and finally, we conclude the paper in Section 7.

Section snippets

Preliminaries

In this section, we first present the formal definition of proxy re-encryption (Blaze et al., 1998). We then review the basic definition and properties of bilinear pairings and the bilinear aggregate signature.
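As a toy illustration of proxy re-encryption, the Python sketch below implements a BBS-style scheme (in the spirit of Blaze et al., 1998) over a small multiplicative group: a ciphertext under Alice's key is transformed by a re-encryption key so that Bob can decrypt it. The group parameters are far too small to be secure, and this is not the concrete instantiation used in SML.

```python
# Toy BBS-style proxy re-encryption over a small prime-order subgroup.
# NOT secure: the parameters below are for illustration only.
import secrets

q = 1019                # prime subgroup order
p = 2 * q + 1           # 2039, a safe prime
g = 4                   # quadratic residue mod p, hence of order q

def keygen():
    sk = secrets.randbelow(q - 1) + 1
    return sk, pow(g, sk, p)

def encrypt(pk_a, m):
    """First-level ciphertext under Alice's key: (m * g^k, pk_a^k)."""
    k = secrets.randbelow(q - 1) + 1
    return (m * pow(g, k, p) % p, pow(pk_a, k, p))

def rekey(sk_a, sk_b):
    """Re-encryption key rk_{a->b} = sk_b / sk_a mod q."""
    return sk_b * pow(sk_a, -1, q) % q

def reencrypt(rk, ct):
    c1, c2 = ct
    return (c1, pow(c2, rk, p))      # (g^a)^k  ->  (g^b)^k

def decrypt(sk, ct):
    c1, c2 = ct
    gk = pow(c2, pow(sk, -1, q), p)  # recover g^k
    return c1 * pow(gk, -1, p) % p   # m = c1 / g^k

sk_a, pk_a = keygen()
sk_b, pk_b = keygen()
m = pow(g, 42, p)                    # encode a label as a group element
ct_b = reencrypt(rekey(sk_a, sk_b), encrypt(pk_a, m))
assert decrypt(sk_b, ct_b) == m
```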

System model

SML consists of n + 1 parties: n data owners O1, O2, …, On and one cloud server S (as illustrated in Fig. 1). In this application scenario, we assume that each data owner Oi, who holds a private dataset Di (e.g., electronic medical records) and a learning model Mi generated from this dataset, is willing to improve the accuracy of the local learning model without leaking any sensitive information about the local dataset. S provides a verifiable computation delegation service to the data owners. In our

High-level description

Briefly, in the proposed system, each data owner Oi first trains a local learning model Mi on its own dataset Di. Given an unlabeled public dataset, it can be labeled by the model ensemble {M1, M2, …, Mn}. We emphasize that such unlabeled datasets are common in many applications, such as public text or image datasets or even medical datasets.

The labels generated from each local learning model are encrypted and sent to S. The aggregation of these encryptions is computed and returned by S
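For intuition, the Python sketch below performs this knowledge-transfer step in the clear: the unlabeled public dataset is labeled by majority vote over the local model ensemble, and each participant then fine-tunes its model on the newly labeled data. In SML the individual labels are encrypted and aggregated by S rather than exchanged in plaintext, and the Keras model interface assumed here (softmax outputs, a compiled classifier) is illustrative.

```python
# Illustrative, unencrypted version of labeling a public dataset with the model
# ensemble and updating a local model on the result.
import numpy as np

def ensemble_label(local_models, public_x):
    """Each local model votes a class label for every public sample; majority wins."""
    votes = np.stack([m.predict(public_x).argmax(axis=1) for m in local_models])
    return np.apply_along_axis(lambda v: np.bincount(v).argmax(), 0, votes)

def update_local_model(model, public_x, public_y, num_classes, epochs=3):
    """Fine-tune a participant's compiled Keras classifier on the labeled public data."""
    y_onehot = np.eye(num_classes)[public_y]          # assumes a categorical loss
    model.fit(public_x, y_onehot, epochs=epochs, verbose=0)
    return model
```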

Security analysis

Theorem 1

The proposed SML scheme satisfies the security requirement of privacy.

Proof

We prove the theorem by contradiction from the definition of privacy. Assume that there exists a PPT adversary A who has a non-negligible advantage ε in the experiment ExpASML[M,UD,κ]; we can then build an efficient algorithm B that takes A as a sub-algorithm to break PRE or the aggregate signature. In Blaze et al. (1998), Blaze et al. have proved that if there exists a PPT adversary A who is able to break PRE with a

Experimental evaluation

In this section, we present the effectiveness and efficiency of the proposed scheme. For the effectiveness evaluation of SML, we use MNIST and HAM10000. For the machine learning related sub-algorithms, the experiments are programmed with Keras 2.2.4 on top of TensorFlow 1.4.1; the backend is a minicomputer with 64-core CPUs, 256 GB of memory, and two Tesla P100 GPUs, running Ubuntu 16.04. For the efficiency evaluation analysis, we mainly

Conclusion

In this work, we present SML, a secure multi-party learning framework built from the aggregation of locally trained models. The proposed scheme introduces a new method for multi-party learning based on knowledge transfer. We prove that the proposed scheme satisfies the security requirements of privacy and accuracy. Moreover, the scheme remains secure even when the cloud server is malicious. We implement our scheme on MNIST and HAM10000, and the empirical performance shows that the proposed scheme is effective and efficient.

CRediT authorship contribution statement

Xu Ma: Conceptualization, Methodology, Formal analysis, Writing - original draft. Cunmei Ji: Software, Validation. Xiaoyu Zhang: Resources, Investigation. Jianfeng Wang: Project administration, Data curation. Jin Li: Methodology, Formal analysis. Kuan-Ching Li: Data curation. Xiaofeng Chen: Supervision, Writing - review & editing.

Declaration of competing interest

We declare that we have no financial or personal relationships with other people or organizations that could inappropriately influence our work, and that there is no professional or other personal interest of any nature or kind in any product, service and/or company that could be construed as influencing the position presented in, or the review of, the manuscript entitled “Secure Multiparty Learning from the Aggregation of Locally Trained Models”.

Acknowledgements

This work is supported by the National Cryptography Development Fund (No. MMJJ20180110) and the National Natural Science Foundation of China (No. 61802227).

References (41)

  • L. Breiman, Bagging predictors, Mach. Learn. (1996).
  • X. Chen et al., New algorithms for secure outsourcing of modular exponentiations, IEEE Trans. Parallel Distr. Syst. (2014).
  • X. Chen et al., Verifiable computation over large database with incremental updates, IEEE Trans. Comput. (2016).
  • N.C.F. Codella et al., Skin lesion analysis toward melanoma detection 2018: a challenge hosted by the International Skin Imaging Collaboration (ISIC) (2019).
  • W. Du et al., Privacy-preserving multivariate statistical analysis: linear regression and classification.
  • C. Dwork, Differential privacy: a survey of results.
  • M. Fredrikson et al., Model inversion attacks that exploit confidence information and basic countermeasures.
  • R. Gilad-Bachrach et al., CryptoNets: applying neural networks to encrypted data with high throughput and accuracy.
  • A. Graves et al., Speech recognition with deep recurrent neural networks.
  • J. Hamm et al., Learning privately from multi-party data.


Xu Ma received the bachelor's degree in computer science from Ludong University in 2008 and the master's and Ph.D. degrees in information security and cryptography from Sun Yat-sen University in 2010 and 2013, respectively. He is currently an associate professor in the School of Software, Qufu Normal University, China. He is also a postdoctoral researcher in the Department of Cyberspace Security, Xidian University. His research focuses on applied cryptography, outsourcing computation and privacy-preserving machine learning.

Cunmei Ji received his master's degree from the School of Computer Science, Zhejiang University, in 2010. He then joined Huawei as a software engineer, and from 2011 to 2015 he worked at Cisco Systems, Inc., where his work focused on distributed systems and multimedia communication. He is currently a lecturer in the School of Software, Qufu Normal University, China. His research interests include deep learning, computer vision and bioinformatics.

Xiaoyu Zhang received her B.S. degree in information and computing science from Shandong Jianzhu University, Jinan, China, in 2014. She received her Ph.D. degree in information security from Xidian University in 2019. Her research focuses on machine learning, privacy protection and data security.

Jianfeng Wang received his M.S. degree in mathematics from Xidian University, China, and his Ph.D. degree in cryptography from Xidian University in 2016. Currently, he works at Xidian University. He visited Swinburne University of Technology, Australia, from December 2017 to December 2018. His research interests include applied cryptography, cloud security and searchable encryption.

Jin Li received the B.S. degree in mathematics from Southwest University in 2002 and the Ph.D. degree in information security from Sun Yat-sen University in 2007. He is currently a Professor with Guangzhou University. He has been selected as one of the Science and Technology New Stars of Guangdong Province. He has published over 200 research papers in refereed international conferences and journals. His research interests include applied cryptography and security in cloud computing. He has served as the program chair or a program committee member in many international conferences.

Kuan-Ching Li is currently a Distinguished Professor at Providence University, Taiwan. He is a recipient of awards and funding support from several agencies and industrial companies, and has also received distinguished chair professorships from universities in China and other countries. Besides publications in refereed journals and top conference papers, he is co-author/co-editor of several technical professional books published by CRC Press/Taylor & Francis, Springer, and McGraw-Hill. Dr. Li's research interests include GPU/manycore computing, Big Data, and cloud computing. He is a senior member of the IEEE and a fellow of the IET.

Xiaofeng Chen received the B.S. and M.S. degrees in mathematics from Northwest University, China, in 1998 and 2000, respectively, and the Ph.D. degree in cryptography from Xidian University in 2003. He is currently a Professor with Xidian University. He has published over 100 research papers in refereed international conferences and journals. His research interests include applied cryptography and cloud computing security. He has served as program/general chair or a program committee member in over 30 international conferences. He is on the Editorial Board of Security and Communication Networks, Computing and Informatics, and the International Journal of Embedded Systems. His work has been cited over 9000 times in Google Scholar.
