Abstract
To enjoy the advantages of cloud services while preserving security and privacy, large volumes of data are increasingly outsourced to the cloud in encrypted form. Unfortunately, encryption may impede analysis and computation over the outsourced dataset. Naïve Bayesian classification is an effective algorithm for predicting the class label of unlabeled samples. In this paper, we investigate naïve Bayesian classification over an encrypted dataset in the cloud and propose a secure scheme for this challenging problem. In our scheme, all computation tasks of naïve Bayesian classification are completed by the cloud, which dramatically reduces the burden on the data owner and users. We prove that our scheme guarantees the security of both the input dataset and the output classification results, and that the cloud learns nothing useful about the data owner's training data or the users' test samples throughout the computation. Additionally, we evaluate the computation complexity and communication overhead of our scheme in detail.
References
Bellazzi, R., Zupan, B.: Predictive data mining in clinical medicine: current issues and guidelines. Int. J. Med. Inform. 77(2), 81–97 (2008)
Boneh, D., Goh, E.-J., Nissim, K.: Evaluating 2-DNF formulas on ciphertexts. In: Kilian, J. (ed.) TCC 2005. LNCS, vol. 3378, pp. 325–341. Springer, Heidelberg (2005). doi:10.1007/978-3-540-30576-7_18
Bost, R., Popa, R.A., Tu, S., Goldwasser, S.: Machine learning classification over encrypted data. In: The Network and Distributed System Security Symposium (NDSS), pp. 1–14 (2015)
Clifton, C., Kantarcioglu, M., Vaidya, J., Lin, X., Zhu, M.Y.: Tools for privacy preserving distributed data mining. ACM Sigkdd Explorations Newslett. 4(2), 28–34 (2002)
Clifton, C., Vaidya, J., Kantarcioglu, M.: Privacy-preserving naïve Bayes classification. VLDB J. 17(4), 879–898 (2008)
Dong, C., Chen, L., Camenisch, J., Russello, G.: Fair private set intersection with a semi-trusted arbiter. In: Wang, L., Shafiq, B. (eds.) DBSec 2013. LNCS, vol. 7964, pp. 128–144. Springer, Heidelberg (2013)
Elgamal, T.: A public key cryptosystem and a signature scheme based on discrete logarithms. IEEE Trans. Inf. Theory 31(4), 469–472 (1985)
Elmehdwi, Y., Samanthula, B.K., Jiang, W.: Secure k-nearest neighbor query over encrypted data in outsourced environments. In: IEEE 30th International Conference on Data Engineering (ICDE), pp. 664–675 (2014)
Goldreich, O.: Foundations of Cryptography: Volume II, Basic Applications. Cambridge University Press, Cambridge (2004)
Kantarcıoglu, M., Vaidya, J., Clifton, C.: Privacy preserving naive Bayes classifier for horizontally partitioned data. In: IEEE ICDM workshop on privacy preserving data mining, pp. 3–9 (2003)
Kim, H.J., Kim, J.U., Ra, Y.G.: Boosting naïve Bayes text classification using uncertainty-based selective sampling. Neurocomputing 67, 403–410 (2005)
Lindell, Y., Pinkas, B.: Privacy preserving data mining. J. Cryptology 15(3), 36–54 (2002)
Liu, A., Zheng, K., Li, L., Liu, G., Zhao, L., Zhou, X.: Efficient secure similarity computation on encrypted trajectory data. In: IEEE 31st International Conference on Data Engineering (ICDE), pp. 66–77 (2015)
Liu, X., Lu, R., Ma, J., Chen, L., Qin, B.: Privacy-preserving patient-centric clinical decision support system on naive Bayesian classification. IEEE J. Biomed. Health Inform. 20(2), 655–668 (2016)
Lops, P., Gemmis, M.D., Semeraro, G.: Content-based recommender systems: state of the art and trends. In: Recommender Systems Handbook, pp. 73–105 (2011)
Mitchell, T.: Machine Learning, 1st edn. McGraw-Hill Science/Engineering/Math, New York (1997)
Paillier, P.: Public-key cryptosystems based on composite degree residuosity classes. In: Stern, J. (ed.) EUROCRYPT 1999. LNCS, vol. 1592, pp. 223–238. Springer, Heidelberg (1999)
Samanthula, B.K., Elmehdwi, Y., Jiang, W.: k-nearest neighbor classification over semantically secure encrypted relational data. IEEE Trans. Knowl. Data Eng. 27(5), 1261–1273 (2015)
Samanthula, B.K., Jiang, W.: Efficient privacy-preserving range queries over encrypted data in cloud computing. In: IEEE Sixth International Conference on Cloud Computing, pp. 51–58 (2013)
Yang, Z., Zhong, S., Wright, R.N.: Privacy-preserving classification of customer data without loss of accuracy. In: Siam International Conference on Data Mining, pp. 92–102 (2005)
Yao, A.: How to generate and exchange secrets. In: 27th Annual Symposium on Foundations of Computer Science, pp. 162–167. IEEE (1986)
Yi, X., Zhang, Y.: Privacy-preserving naive Bayes classification on distributed data via semi-trusted mixers. Inform. Syst. 34(3), 371–380 (2009)
Yuan, J., Yu, S.: Privacy preserving back-propagation neural network learning made practical with cloud computing. IEEE Trans. Parallel Distrib. Syst. 25(1), 212–221 (2014)
Zhu, Y., Huang, Z., Takagi, T.: Secure and controllable k-nn query over encrypted cloud data with key confidentiality. J. Parallel Distrib. Comput. 89, 1–12 (2016)
Zhu, Y., Wang, Z., Zhang, Y.: Secure k-NN query on encrypted cloud data with limited key-disclosure and offline data owner. In: The 20th Pacific-Asia Conference on Knowledge Discovery and Data Mining, pp. 401–414 (2016)
Acknowledgements
We thank the anonymous reviewers and our shepherd, Prof. Xun Yi, for their valuable feedback. This work is partly supported by the Natural Science Foundation of Jiangsu Province of China (No. BK20150760), the Fundamental Research Funds for the Central Universities (No. NZ2015108, NS2016094), the China Postdoctoral Science Foundation funded project (No. 2015M571752), and the Natural Science Foundation of China (No. 61472470).
Appendix
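Paillier Cryptosystem. Paillier [17] proposed an efficient public key cryptosystem with semantic security (indistinguishability under chosen plaintext attack, IND-CPA). The encryption scheme is additively homomorphic, i.e.,
\[E_{pk}(m_1,r_1)\cdot E_{pk}(m_2,r_2)=E_{pk}(m_1+m_2,\ r_1r_2) \bmod N^{2},\]
\[E_{pk}(m_1,r_1)^{\kappa }=E_{pk}(\kappa m_1,\ r_1^{\kappa }) \bmod N^{2}.\]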
Here, E denotes the encryption function, \(m_1\) and \(m_2\) are arbitrary messages in the plaintext space, pk is the public key, \(r_1\) and \(r_2\) are random parameters for encryption, \(E_{pk}(m_1,r_1)\) denotes the encryption of \(m_1\) under the random parameter \(r_1\), and \(\kappa \) is a positive integer. In this paper, we also write \(E_{pk}(m_1)\) for the encrypted value of \(m_1\) when it is unnecessary to emphasize the random parameter. We briefly review the main steps of the encryption system as follows.
Key Generation. Select two sufficiently large primes p and q. The secret key sk is \(s =\text {lcm}(p-1, q-1)\), that is, the least common multiple of \(p-1\) and \(q-1\). The public key pk is (N, g), where \(N=pq \) and \(g\in \mathbb {Z}_{N^{2}}^{*}\) is chosen such that \(\gcd \left( L(g^{s} \bmod N^{2}), N\right) =1 \), that is, the greatest common divisor of \(L(g^{s} \bmod N^{2})\) and N equals 1. Here and below, \(L(x)={(x-1)}/N\).
Encryption. Let \(m_0\) be a message in the plaintext space \(\mathbb {Z}_{N}\). Select a random \(r\in \mathbb {Z}_{N}^{*}\) as the secret parameter; the ciphertext \(c_0\) of \(m_0\) is then \(c_0=g^{m_0}r^{N} \bmod N^{2}\).
Decryption. Let \(c_0\in \mathbb {Z}_{N^{2}}\) be a ciphertext. The plaintext hidden in \(c_0\) is
\[m_0=\frac{L(c_0^{s} \bmod N^{2})}{L(g^{s} \bmod N^{2})} \bmod N.\]
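The three steps above can be exercised with a toy implementation. The following is a minimal sketch, not a secure implementation: the primes are far too small for real use, the choice \(g=N+1\) is an assumption (one common value that satisfies the gcd condition), and the function names `keygen`, `encrypt`, and `decrypt` are hypothetical.

```python
import math
import secrets


def L(x, n):
    # L(x) = (x - 1) / N, defined on values congruent to 1 mod N
    return (x - 1) // n


def keygen(p, q):
    n = p * q
    s = (p - 1) * (q - 1) // math.gcd(p - 1, q - 1)  # s = lcm(p-1, q-1)
    g = n + 1  # assumption: a common choice of g
    assert math.gcd(L(pow(g, s, n * n), n), n) == 1
    return (n, g), s


def rand_unit(n):
    # Sample r uniformly from Z_N^*
    while True:
        r = secrets.randbelow(n)
        if r > 0 and math.gcd(r, n) == 1:
            return r


def encrypt(pk, m, r=None):
    n, g = pk
    r = rand_unit(n) if r is None else r
    return pow(g, m, n * n) * pow(r, n, n * n) % (n * n)


def decrypt(pk, sk, c):
    n, g = pk
    num = L(pow(c, sk, n * n), n)
    den = L(pow(g, sk, n * n), n)
    return num * pow(den, -1, n) % n  # modular inverse needs Python 3.8+


pk, sk = keygen(1009, 1013)  # toy primes, for illustration only
n2 = pk[0] ** 2
c1, c2 = encrypt(pk, 20), encrypt(pk, 22)
print(decrypt(pk, sk, c1 * c2 % n2))     # homomorphic addition of 20 and 22
print(decrypt(pk, sk, pow(c1, 3, n2)))   # scalar multiplication of 20 by 3
```

Multiplying ciphertexts adds the plaintexts, and raising a ciphertext to \(\kappa \) multiplies the plaintext by \(\kappa \), which matches the additive homomorphic property stated at the start of this appendix.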
The Paillier homomorphic encryption system is an important secure building block of our scheme. As a probabilistic encryption scheme, it also has the self-blinding property, that is, \(E_{pk}(m_0, r_1)\cdot r_2^{N}=E_{pk}(m_0, r_1r_2)\) and \(m_0=D_{sk}\left( E_{pk}(m_0, r_1)\right) =D_{sk}\left( E_{pk}(m_0, r_1)\cdot r_2^{N}\right) \) for any \(r_2\in \mathbb {Z}_{N}^{*}\).
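The self-blinding identity follows in one line from the encryption formula \(c_0=g^{m_0}r^{N} \bmod N^{2}\). A short derivation, using only the symbols defined above:

```latex
\begin{align*}
E_{pk}(m_0, r_1)\cdot r_2^{N}
  &\equiv g^{m_0}\, r_1^{N}\, r_2^{N} \pmod{N^{2}} \\
  &\equiv g^{m_0}\, (r_1 r_2)^{N} \pmod{N^{2}} \\
  &\equiv E_{pk}(m_0, r_1 r_2) \pmod{N^{2}} .
\end{align*}
```

Since \(r_1r_2\in \mathbb {Z}_{N}^{*}\) whenever \(r_1, r_2\in \mathbb {Z}_{N}^{*}\), the result is a valid encryption of the same plaintext \(m_0\) under fresh randomness.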
For simplicity, we also write \(E_{pk}(m_0)\) for the encrypted result of \(m_0\) when there is no need to emphasize the random parameter r.
Description of SCTC. In SCTC, CS\(_1\) holds the ciphertexts \(\{E_{pk}(X), E_{pk}(Y)\}\), and CS\(_2\) holds the secret key. SCTC enables CS\(_1\) to obtain an output \(\varGamma \) that satisfies \(\varGamma =E_{pk}(1)\) if \(X\geqslant Y\), and \(\varGamma =E_{pk}(0)\) otherwise. Neither CS\(_1\) nor CS\(_2\) learns any plaintext hidden in \(E_{pk}(X), E_{pk}(Y)\) or \(\varGamma \) throughout SCTC. The main steps of SCTC are as follows.
Assume \(0\leqslant X,~Y< 2^{\varOmega }\). Then \(0\leqslant (2X+1),~2Y< 2^{\varOmega +1}\), and \(X\geqslant Y\) if and only if \(2X+1>2Y\). In SCTC, CS\(_1\) selects a random positive number \(\delta <2^{\varOmega +1}\) and sends \(\varPhi \) to CS\(_2\), where \(\varPhi \) is either the ciphertext \(E_{pk}\left( \delta (2X+1-2Y)+2^{2\varOmega +2}\right) \) or the ciphertext \(E_{pk}\left( \delta (2Y-2X-1)+2^{2\varOmega +2}\right) \), each with probability \(50\,\%\). CS\(_1\) can compute \(\varPhi \) locally from the homomorphic property, for example as
\[\varPhi =\left( E_{pk}(X)^{2}\cdot E_{pk}(1)\cdot E_{pk}(Y)^{N-2}\right) ^{\delta }\cdot E_{pk}(2^{2\varOmega +2}) \bmod N^{2}\]
or
\[\varPhi =\left( E_{pk}(Y)^{2}\cdot E_{pk}(X)^{N-2}\cdot E_{pk}(N-1)\right) ^{\delta }\cdot E_{pk}(2^{2\varOmega +2}) \bmod N^{2}.\]
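The role of the offset \(2^{2\varOmega +2}\) in the threshold test can be made explicit with a short bound. The derivation below is a sketch for the first choice of \(\varPhi \); it additionally assumes \(2^{2\varOmega +3}\leqslant N\), so that no reduction modulo N occurs in the plaintext space.

```latex
% Assumptions: 0 <= X, Y < 2^Omega, 1 <= delta < 2^{Omega+1}, 2^{2 Omega + 3} <= N
\begin{align*}
|2X+1-2Y| &\leqslant 2^{\varOmega +1}-1, \qquad 2X+1-2Y \neq 0 \ \text{(it is odd)},\\
|\delta (2X+1-2Y)| &\leqslant (2^{\varOmega +1}-1)^{2} < 2^{2\varOmega +2},\\
0 &< \delta (2X+1-2Y)+2^{2\varOmega +2} < 2^{2\varOmega +3},\\
D_{sk}(\varPhi )>2^{2\varOmega +2} &\iff \delta (2X+1-2Y)>0 \iff X\geqslant Y .
\end{align*}
```

For the second choice of \(\varPhi \) the sign of the blinded difference is flipped, so the same test holds exactly when \(X<Y\).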
After receiving \(\varPhi \), CS\(_2\) decrypts it and learns whether \(D_{sk}(\varPhi )>2^{2\varOmega +2}\), which indicates the order relationship between \((2X+1)\) and 2Y. Nevertheless, CS\(_2\) cannot tell whether \(2X+1>2Y\) or \(2Y>2X+1\), since it does not know which of the two values CS\(_1\) chose as \(\varPhi \). CS\(_2\) tells CS\(_1\) whether \(D_{sk}(\varPhi )>2^{2\varOmega +2}\) in encrypted form, as \(\varPsi =E_{pk}(1)\) or \(\varPsi =E_{pk}(0)\). Finally, CS\(_1\) obtains the comparison result of \(2X+1>2Y\) (in encrypted form) by setting \(\varGamma =\varPsi \) if it chose the first value as \(\varPhi \), and \(\varGamma =E_{pk}\left( 1-D_{sk}(\varPsi )\right) =E_{pk}(1)\varPsi ^{N-1}\) otherwise; this is exactly the comparison result of \(X\geqslant Y\). During the execution of SCTC, CS\(_1\) can access only the encrypted values \(E_{pk}(X), E_{pk}(Y)\), \(\varPsi \) and \(\varGamma \). CS\(_2\) can decrypt \(\varPhi \) but cannot learn useful information about X and Y, owing to the randomness of \(\delta \) and CS\(_1\)’s coin toss \(\phi \). Thus, the plaintexts hidden in the input and output ciphertexts are protected from both CS\(_1\) and CS\(_2\).
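The message flow described above can be simulated end to end. This is a minimal sketch under several assumptions: both servers run inside one function, the Paillier parameters are toy-sized with \(g=N+1\), \(\varOmega =8\) is chosen so that \(2^{2\varOmega +3}\leqslant N\), and the helper names `enc`, `dec`, and `sctc` are hypothetical.

```python
import math
import secrets

# Toy Paillier parameters (illustration only, not secure)
P, Q = 1009, 1013
N = P * Q
N2 = N * N
G = N + 1  # assumption: a common choice of g
SK = (P - 1) * (Q - 1) // math.gcd(P - 1, Q - 1)


def enc(m):
    while True:
        r = secrets.randbelow(N)
        if r > 0 and math.gcd(r, N) == 1:
            break
    return pow(G, m % N, N2) * pow(r, N, N2) % N2


def dec(c):
    lfun = lambda x: (x - 1) // N
    return lfun(pow(c, SK, N2)) * pow(lfun(pow(G, SK, N2)), -1, N) % N


def sctc(ex, ey, omega):
    """Return an encryption of 1 if X >= Y, else an encryption of 0."""
    offset = 2 ** (2 * omega + 2)
    # --- CS_1: blind the difference and flip a private coin ---
    delta = 1 + secrets.randbelow(2 ** (omega + 1) - 1)  # 1 <= delta < 2^(omega+1)
    coin = secrets.randbelow(2)
    if coin == 0:   # E(2X + 1 - 2Y)
        diff = pow(ex, 2, N2) * enc(1) * pow(ey, N - 2, N2) % N2
    else:           # E(2Y - 2X - 1)
        diff = pow(ey, 2, N2) * pow(ex, N - 2, N2) * enc(N - 1) % N2
    phi = pow(diff, delta, N2) * enc(offset) % N2
    # --- CS_2: decrypt and answer the threshold test in encrypted form ---
    psi = enc(1 if dec(phi) > offset else 0)
    # --- CS_1: undo the coin flip homomorphically ---
    if coin == 0:
        return psi
    return enc(1) * pow(psi, N - 1, N2) % N2  # E(1 - D(psi))


for x, y in [(5, 3), (3, 5), (7, 7)]:
    print(x, y, dec(sctc(enc(x), enc(y), omega=8)))
```

The final decryption is only there to check the sketch; in the protocol the output \(\varGamma \) stays encrypted at CS\(_1\).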
Security Proof. The proof of Theorem 1.
Proof
In the following, we consider the view of CS\(_1\) and CS\(_2\), respectively.
CS\(_1\): In our first stage, shown in Protocol 1, CS\(_1\) can access nothing but the encrypted values \(E_{pk}(C_k)\), \(Y_{ki}\) and \(E_{pk}\left( (m_iP_i)^{d-1}\right) \) for each \(k=1,2,\cdots ,m\) and \(i=1,2,\cdots ,\lambda \). Thus, CS\(_1\) cannot learn any useful information about the plaintexts hidden in the encrypted values, owing to the semantic security (IND-CPA) of the Paillier homomorphic encryption system.
CS\(_2\): In Protocol 1, CS\(_2\) holds the secret key sk and also receives \(\varPi _k(\varvec{X}_k)\) (for each \(k=1,2,\cdots ,m\)) and \(E_{pk}(m_iP_i)\) (for each \(i=1,2,\cdots ,\lambda \)). By decrypting, CS\(_2\) obtains \(\varPi _k\left( R_{k1}(C_k-1), R_{k2}(C_k-2),\cdots , R_{k\lambda }(C_k-\lambda )\right) \) and \(m_iP_i\). Because \(P_i\) is randomly selected from \(\mathbb {Z}_N^*\), CS\(_2\) can learn nothing about \(m_i\) (in this paper, we have assumed \(m_i>0\)). Since \(1\leqslant C_k\leqslant \lambda \) for each \(C_k\), exactly one of the values \(R_{ki}(C_k-i)\), \(i\in \{1,2,\cdots ,\lambda \}\), equals 0. Due to the randomness of \(R_{ki}\), CS\(_2\) learns only that some \(R_{ki}(C_k-i)=0\), i.e., that some i satisfies \(C_k-i=0\). Nevertheless, CS\(_2\) does not know which i satisfies \(R_{ki}(C_k-i)=0\), on account of the random permutation \(\varPi _k\). Therefore, CS\(_2\) can learn nothing during our first stage.
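To make the argument about CS\(_2\)'s view concrete, the following plaintext-level sketch generates the vector that CS\(_2\) would see after decryption. It is an illustration only: no encryption is performed, the modulus is a toy value, and the function name `cs2_view` is hypothetical.

```python
import math
import secrets


def rand_unit(n):
    # Sample a random element of Z_N^*
    while True:
        r = secrets.randbelow(n)
        if r > 0 and math.gcd(r, n) == 1:
            return r


def cs2_view(c_k, lam, n):
    """Plaintexts CS_2 sees for one C_k: permuted values R_ki * (C_k - i) mod N."""
    masked = [rand_unit(n) * (c_k - i) % n for i in range(1, lam + 1)]
    secrets.SystemRandom().shuffle(masked)  # stands in for the permutation Pi_k
    return masked


N = 1009 * 1013  # toy modulus
view = cs2_view(c_k=3, lam=5, n=N)
print(view.count(0))  # exactly one entry is zero; its position is hidden by the shuffle
```

Every non-zero entry is a random multiple of a non-zero difference, and the zero entry lands at a random position, so the vector shows that some index matches \(C_k\) without showing which one.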
To sum up, no information about Alice’s dataset is disclosed to CS\(_1\) or CS\(_2\) throughout our first stage, so our first stage is secure. This completes the proof of Theorem 1.
The proof of Theorem 2.
Proof
We consider the view of CS\(_1\) and CS\(_2\) as follows.
CS\(_1\): In the SCTC protocol, i.e., Protocol 2, CS\(_1\) obtains only the values \(E_{pk}(X)\), \(E_{pk}(Y)\) and \(\varPsi \). All three values are ciphertexts of the Paillier homomorphic encryption system. Based on its semantic security (IND-CPA), CS\(_1\) cannot learn any useful information about the plaintexts hidden in the encrypted values \(E_{pk}(X)\), \(E_{pk}(Y)\) and \(\varPsi \).
CS\(_2\): In Protocol 2, CS\(_2\) receives \(\varPhi \) only. After decrypting \(\varPhi \), CS\(_2\) gets \(\left( \delta (2X+1-2Y)+2^{2\varOmega +2}\right) \) or \(\left( \delta (2Y-2X-1)+2^{2\varOmega +2}\right) \), each with probability \(50\,\%\). Nevertheless, \(\delta \) is a random number held by CS\(_1\). Thus, CS\(_2\) can learn no useful information about X or Y from the decrypted result of \(\varPhi \).
In all, our SCTC is secure, which completes the proof of Theorem 2.
The proof of Theorem 3.
Proof
Our SMRP protocol, namely Protocol 4, also involves two participants: CS\(_1\) and CS\(_2\). We consider their views during the protocol in turn.
CS\(_1\): In Protocol 4, in addition to its own input, CS\(_1\) obtains the set \(\{E_{pk}(X_\alpha Y_\beta \theta _1\theta _4),\) \(E_{pk}(X_\beta Y_\alpha {\theta _2}\theta _3)\}\) and \(\{\varPsi _1,\varPsi _2,\varPsi _3,\varPsi _4\}\). From the former encrypted data set, CS\(_1\) can infer nothing, because of the semantic security (IND-CPA) of the Paillier homomorphic encryption system.
Each \(\varPsi _i\) (\(i=1,2,3\)) is either \(E_{pk}(0)\) or a blinded \(\varPhi _i\). It is easy to see that CS\(_1\) cannot efficiently distinguish \(\varPsi _i\) from any other value in the ciphertext space, based on the semantic security.
Additionally, \(\varPsi _4=E_{pk}(1)\) or \(E_{pk}(0)\). Hence, CS\(_1\) cannot deduce the plaintext hidden in \(\varPsi _4\), owing to the security of the Paillier encryption system.
CS\(_2\): In the SMRP protocol, CS\(_2\) can access the secret key, the encrypted value set \(\{E_{pk}(X_\alpha \theta _1), E_{pk}(Y_\alpha \theta _2),E_{pk}(X_\beta \theta _3),E_{pk}(Y_\beta \theta _4)\}\), and \(\{\varPhi _1,\varPhi _2,\varPhi _3,\varPhi _4\}\). Because of the randomness of \(\{\theta _1,\theta _2,\theta _3,\theta _4\}\), the values \(\{X_\alpha , Y_\alpha ,X_\beta , Y_\beta \}\) are protected from CS\(_2\).
By decrypting \(\{\varPhi _1,\varPhi _2,\varPhi _3\}\), CS\(_2\) gets the set \(\{\beta -\alpha +\omega _1,\) \(X_\beta -X_\alpha +\omega _2,\) \(Y_\beta -Y_\alpha +\omega _3\}\) or \(\{\alpha -\beta +\omega _1,\) \(X_\alpha -X_\beta +\omega _2,\) \(Y_\alpha -Y_\beta +\omega _3\}\), each with probability \(50\,\%\). Since \(\omega _1, \omega _2, \omega _3\) are chosen at random from \(\mathbb {Z}_N\), CS\(_2\) can learn nothing about \(\{\alpha ,\beta , X_\alpha , X_\beta , Y_\alpha , Y_\beta \}\).
From \(\varPhi _4\), CS\(_2\) obtains \(\delta (2X_\beta Y_\alpha +1-2X_\alpha Y_\beta )+2^{4\varOmega }\) or \(\delta (2X_\alpha Y_\beta -2X_\beta Y_\alpha -1)+2^{4\varOmega }\), each with probability \(50\,\%\). Since the random \(\delta \) is held by CS\(_1\), CS\(_2\) cannot learn any useful information by decrypting \(\varPhi _4\).
Overall, neither CS\(_1\) nor CS\(_2\) can learn anything useful about \(\{\alpha ,\) \(\beta ,\) \( X_\alpha ,\) \( X_\beta ,\) \(Y_\alpha ,\) \(Y_\beta \}\). Thus the SMRP protocol is secure, which completes the proof of Theorem 3.
The proof of Theorem 4.
Proof
Our second stage involves three participants: CS\(_1\), CS\(_2\), and Bob. We consider the view of each in turn.
Bob: Apart from his unlabeled sample \(\varvec{S}\), Bob receives only \(\mu _1\) and \(\mu _2\) from the cloud. Here \(\mu _1\) is a random number selected by CS\(_1\), and \(\mu _2=D_{sk}(Ecl)+\mu _1\). Thus, Bob obtains nothing but \(D_{sk}(Ecl)\), which is exactly the class label of his sample \(\varvec{S}\) under naïve Bayesian classification. That is, Bob can learn nothing about the data of Alice.
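Bob's recovery of the label from \(\mu _1\) and \(\mu _2\) is a simple additive unmasking. The sketch below works at the plaintext level and assumes the sum is taken modulo N, which the text does not state explicitly; the function names and the toy modulus are hypothetical.

```python
import secrets


def mask_label(label, n):
    """Cloud side: mu_1 is random, mu_2 = D_sk(Ecl) + mu_1 (assumed mod N)."""
    mu1 = secrets.randbelow(n)
    mu2 = (label + mu1) % n
    return mu1, mu2


def bob_recover(mu1, mu2, n):
    """Bob's side: subtract the mask to obtain the class label."""
    return (mu2 - mu1) % n


N = 1009 * 1013  # toy modulus
mu1, mu2 = mask_label(label=3, n=N)
print(bob_recover(mu1, mu2, N))  # the class label
```

Taken alone, \(\mu _2\) is uniformly distributed, so whichever server computes it learns nothing about the label; only Bob, who holds both values, can unmask it.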
CS\(_1\): Apart from invoking SCTC and SMRP, our second stage, shown in Protocol 3, contains a for loop (lines 16 to 20). In this loop, CS\(_1\) can access nothing but \(E_{pk}(\prod _{t=1}^dm_{it}H_{it})\). Thus, CS\(_1\) cannot learn any useful information about the data of Alice and Bob, based on the security of SCTC, SMRP, and the Paillier encryption system.
CS\(_2\): Similarly, CS\(_2\) obtains only the secret key and \(E_{pk}(m_{it}H_{it})\) for each \(t=1,\) 2, \(\cdots ,\) d. Here, each \(H_{it}\) is randomly selected by CS\(_1\) from \(\mathbb {Z}^*_{N}\). Therefore, every \(m_{it}\) is protected from CS\(_2\).
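The blinding by \(H_{it}\) can be illustrated at the plaintext level. This sketch assumes that CS\(_2\) multiplies the blinded values it decrypts and that CS\(_1\) later removes the blinding with the inverse of \(\prod _t H_{it}\) modulo N; that unblinding step is an assumption, since Protocol 3 itself is not reproduced here, and the function names are hypothetical.

```python
import math
import secrets


def rand_unit(n):
    # Sample a random element of Z_N^*
    while True:
        r = secrets.randbelow(n)
        if r > 0 and math.gcd(r, n) == 1:
            return r


def blinded_product(ms, n):
    hs = [rand_unit(n) for _ in ms]                 # H_it chosen by CS_1
    blinded = [m * h % n for m, h in zip(ms, hs)]   # what CS_2 sees after decrypting
    prod_blinded = math.prod(blinded) % n           # computed by CS_2
    h_all = math.prod(hs) % n
    # Assumed unblinding by CS_1 (done homomorphically in the real protocol)
    return prod_blinded * pow(h_all, -1, n) % n


N = 1009 * 1013  # toy modulus
print(blinded_product([3, 5, 7], N))  # product of the m_it, as long as it is below N
```

Each blinded value \(m_{it}H_{it}\) is a uniformly random unit when \(m_{it}\) is itself a unit modulo N, which is why CS\(_2\) gains nothing from decrypting it.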
In all, our second stage can ensure that Bob can obtain only the class label of his sample, and the cloud can learn nothing useful about the data of Alice and Bob. Our second stage is thus secure, which completes the proof of Theorem 4.
The detailed analysis of the computation and communication complexity is as follows.
Computation Complexity. In Protocol 1 (namely our first stage), Eqs. (11) and (12) need to be evaluated \(m\lambda \) times, which leads to \(m\lambda \) encryptions and \(2m\lambda \) exponentiations. From lines 10 to 15, \(\lambda \) encryptions and \(4\lambda \) exponentiations are performed to calculate all \(E_{pk} (m_i^{d-1})\) (for \( i=1,2,\cdots ,\lambda \)). Based on the above analysis, the total computation complexity of Protocol 1 is bounded by \(O(m\lambda )\) encryptions and \(O(m\lambda )\) exponentiations.
In the SCTC protocol, two encryptions and exponentiations are performed to compute \(\varPhi \) (line 3 and line 5). In addition, since Eq. (13) requires one encryption and Eq. (14) needs one exponentiation, the total computation complexity of the SCTC protocol is bounded by O(1) encryptions and O(1) exponentiations.
In the SMRP protocol, one encryption and four exponentiations are performed to compute the ciphertext multiplication from lines 1 to 8. The remainder of the protocol needs five encryptions and fourteen exponentiations to obtain the maximum of two ciphertexts. Therefore, the total computation complexity of the SMRP protocol is bounded by O(1) encryptions and O(1) exponentiations as well.
In Protocol 3 (namely our second stage), the SCTC protocol (line 4 and line 7) is performed \(m(\lambda +2)d\) times to obtain the comparison results of multiple ciphertexts. From lines 16 to 20, it takes one encryption and \(d+1\) exponentiations to compute the product of multiple ciphertexts. From lines 22 to 24, the SMRP protocol (line 23) needs to be run \(\lambda -1\) times. As shown above, both the SCTC protocol and the SMRP protocol are bounded by O(1) encryptions and O(1) exponentiations. Therefore, the total complexity of Protocol 3 is bounded by \(O(m\lambda d)\) encryptions and \(O(m\lambda d)\) exponentiations.
Communication Complexity. In our first stage, i.e., Protocol 1, CS\(_1\) needs to send \(m+1\) \(\lambda \)-dimensional vectors to CS\(_2\), which returns \(m+1\) corresponding \(\lambda \)-dimensional vectors to CS\(_1\). Since all data transferred between the two clouds are in encrypted form, the total communication complexity of the first stage is bounded by \(O(m\lambda \mathcal {K})\) bits, where \(\mathcal {K}\) is the encryption key size.
In the SCTC protocol, only two ciphertexts are transferred between CS\(_1\) and CS\(_2\), so the communication complexity of the SCTC protocol is bounded by \(O(\mathcal {K})\) bits.
In the SMRP protocol, the number of ciphertexts exchanged between CS\(_1\) and CS\(_2\) is seventeen. Thus, the communication complexity of this protocol is bounded by \(O(\mathcal {K})\) bits.
In our second stage (namely Protocol 3), the SCTC protocol needs to be run \(m(\lambda +2)d\) times, which leads to \(2m(\lambda +2)d\) ciphertexts transferred between CS\(_1\) and CS\(_2\). In addition, while computing \(E_{pk}(\prod _{t=1}^dm_{it})\), CS\(_1\) sends \(\lambda d\) ciphertexts to CS\(_2\), which returns \(\lambda \) ciphertexts to CS\(_1\). Finally, \(17(\lambda -1)\) ciphertexts need to be transmitted between CS\(_1\) and CS\(_2\) while they run the SMRP protocol \(\lambda -1\) times. Therefore, the communication complexity of Protocol 3 is bounded by \( O(m\lambda d\mathcal {K})\) bits.
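The ciphertext counts for Protocol 3 can be tallied directly from the three terms above. The sketch below is only a bookkeeping aid; the function names are hypothetical, and the bit estimate assumes that each Paillier ciphertext, an element of \(\mathbb {Z}_{N^{2}}\), occupies \(2\mathcal {K}\) bits for a \(\mathcal {K}\)-bit modulus N.

```python
def protocol3_ciphertexts(m, lam, d):
    """Ciphertexts exchanged between CS_1 and CS_2 in the second stage."""
    sctc = 2 * m * (lam + 2) * d   # two ciphertexts per SCTC run
    product = lam * d + lam        # lam*d sent to CS_2, lam returned
    smrp = 17 * (lam - 1)          # seventeen ciphertexts per SMRP run
    return sctc + product + smrp


def protocol3_bits(m, lam, d, key_bits):
    # Assumption: one ciphertext takes 2 * key_bits bits
    return protocol3_ciphertexts(m, lam, d) * 2 * key_bits


print(protocol3_ciphertexts(m=10, lam=3, d=4))        # 400 + 15 + 34 = 449
print(protocol3_bits(m=10, lam=3, d=4, key_bits=1024))
```

The SCTC term dominates for realistic parameters, which is consistent with the \(O(m\lambda d\mathcal {K})\) bound.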
Copyright information
© 2016 Springer International Publishing AG
About this paper
Cite this paper
Li, X., Zhu, Y., Wang, J. (2016). Secure Naïve Bayesian Classification over Encrypted Data in Cloud. In: Chen, L., Han, J. (eds) Provable Security. ProvSec 2016. Lecture Notes in Computer Science(), vol 10005. Springer, Cham. https://doi.org/10.1007/978-3-319-47422-9_8
DOI: https://doi.org/10.1007/978-3-319-47422-9_8
Publisher Name: Springer, Cham
Print ISBN: 978-3-319-47421-2
Online ISBN: 978-3-319-47422-9
eBook Packages: Computer Science, Computer Science (R0)