Abstract
Most distributed data mining algorithms can efficiently manage and mine complete data from distributed resources. For incomplete data, however, modifications are required so that distributed data mining techniques can be applied while preserving the privacy of sensitive information and still producing good mining results. Classification is an important data mining task that aims to discover knowledge and classify new instances, and the support vector machine (SVM) is one of the most important algorithms for classification problems across many domains. In this paper, we propose a new distributed privacy-preserving protocol with multiple imputation of missing or incomplete data. Multiple imputation based on multivariate imputation by chained equations (MICE) handles the missing data, and the Paillier cryptosystem preserves the privacy of the participants. We then construct a global SVM model over vertically partitioned data, based on the Gram matrix and with the help of a semi-honest third party, without revealing the private data; the model is then used to classify new instances. The performance of the proposed protocol was evaluated using the accuracy metric on distributed and centralized data. Our experimental results show that the accuracy on distributed data is the same as on centralized data, and that imputed data yields better results than data with the missing values omitted. The protocol also achieves better processing time on distributed data than on centralized data.
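The role of the Gram matrix and of Paillier encryption can be illustrated with a minimal sketch. It is not the protocol of the paper: it only shows that, for a linear kernel, the Gram matrix of vertically partitioned data is the sum of the parties' local Gram matrices, and that this sum can be formed on Paillier ciphertexts. The toy primes, the function names (`keygen`, `encrypt`, `decrypt`, `add_encrypted`, `local_gram`), the two-party data, and the choice of who holds the private key are all assumptions made for illustration, and the key size is far too small to be secure.

```python
import math
import random

# Toy Paillier cryptosystem. The primes are tiny on purpose: this is an
# illustration of the additive homomorphism, NOT a secure implementation.
def keygen(p=1789, q=1861):
    n = p * q
    lam = (p - 1) * (q - 1) // math.gcd(p - 1, q - 1)  # lcm(p-1, q-1)
    g = n + 1                      # standard simple choice of generator
    mu = pow(lam, -1, n)           # valid inverse because g = n + 1
    return (n, g), (lam, mu)

def encrypt(pub, m):
    n, g = pub
    n2 = n * n
    while True:                    # pick random r coprime to n
        r = random.randrange(1, n)
        if math.gcd(r, n) == 1:
            break
    return (pow(g, m, n2) * pow(r, n, n2)) % n2

def decrypt(pub, priv, c):
    n, _ = pub
    lam, mu = priv
    x = pow(c, lam, n * n)
    return ((x - 1) // n * mu) % n

def add_encrypted(pub, c1, c2):
    # Multiplying ciphertexts adds the underlying plaintexts (mod n).
    n, _ = pub
    return (c1 * c2) % (n * n)

def local_gram(rows):
    # Linear-kernel Gram matrix: entry (i, j) is the dot product of rows i, j.
    return [[sum(a * b for a, b in zip(u, v)) for v in rows] for u in rows]

# Hypothetical vertically partitioned data: same three records, different
# attributes at each party (non-negative integers to avoid sign encoding).
party_a = [[1, 2], [3, 0], [2, 2]]
party_b = [[4], [1], [0]]
centralized = [a + b for a, b in zip(party_a, party_b)]

pub, priv = keygen()

# Each party encrypts its local Gram matrix entry by entry.
enc_a = [[encrypt(pub, v) for v in row] for row in local_gram(party_a)]
enc_b = [[encrypt(pub, v) for v in row] for row in local_gram(party_b)]

# An aggregator combines ciphertexts without seeing any plaintext entry.
enc_sum = [[add_encrypted(pub, x, y) for x, y in zip(ra, rb)]
           for ra, rb in zip(enc_a, enc_b)]

# Whoever holds the private key (an assumption here) recovers only the sum.
global_gram = [[decrypt(pub, priv, c) for c in row] for row in enc_sum]

print(global_gram == local_gram(centralized))
```

Because an SVM in its dual form needs only the Gram matrix and the labels, a model trained on `global_gram` would match one trained on the centralized records, which is consistent with the accuracy claim in the abstract. Nonlinear kernels and real-valued features would need additional encoding steps that this sketch leaves out.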
Cite this article
Omer, M.Z., Gao, H. & Mustafa, N. Privacy-preserving of SVM over vertically partitioned with imputing missing data. Distrib Parallel Databases 35, 363–382 (2017). https://doi.org/10.1007/s10619-017-7203-3
DOI: https://doi.org/10.1007/s10619-017-7203-3