Abstract
In this chapter we investigate practical technologies for security and privacy in large-scale data analysis. We motivate our approach by discussing the challenges and opportunities that current and emerging analysis paradigms present for large data sets. In particular, we present a framework for privacy-preserving distributed data analysis that is practical for many real-world applications. The framework, called Peers for Privacy (P4P), features a novel heterogeneous architecture and a number of efficient tools for performing private computation and offering security at large scale. It maintains three key properties that are essential for real-world applications: (i) provably strong privacy; (ii) adequate efficiency at reasonably large scale; and (iii) robustness against realistic adversaries. The framework gains its practicality by decomposing data mining algorithms into a sequence of vector addition steps, which can be privately evaluated using efficient cryptographic tools, namely verifiable secret sharing over a small field (e.g., 32 or 64 bits), at the same cost as regular, non-private arithmetic. This paradigm supports a large number of statistical learning algorithms, including SVD, PCA, k-means, ID3 and machine learning algorithms based on Expectation-Maximization, as well as all algorithms in the statistical query model (Kearns, Efficient noise-tolerant learning from statistical queries. In: STOC’93, San Diego, pp. 392–401, 1993). As a concrete example, we show how singular value decomposition, an extremely useful algorithm at the core of many data mining tasks, can be performed efficiently with privacy in P4P. Using real data, we demonstrate that P4P is orders of magnitude faster than other solutions.
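The vector addition step described in the abstract can be illustrated with a minimal sketch of additive secret sharing. It is not the chapter's implementation. The following are assumptions made for illustration: each user splits its vector into exactly two shares, one for a server and one for a privacy peer; the field is the integers modulo the 31-bit Mersenne prime 2^31 - 1, a stand-in for the 32- or 64-bit fields mentioned above; and the verification step that makes the sharing verifiable against cheating users is omitted. All function names are hypothetical.

```python
import secrets

# Assumption: a small prime field. 2**31 - 1 is a Mersenne prime, used here
# as a stand-in for the 32- or 64-bit fields mentioned in the abstract.
PRIME = 2**31 - 1


def share(vector):
    """Split an integer vector into two additive shares modulo PRIME.

    Each share on its own is uniformly random and reveals nothing
    about the input vector.
    """
    first = [secrets.randbelow(PRIME) for _ in vector]
    second = [(x - r) % PRIME for x, r in zip(vector, first)]
    return first, second


def add_vectors(vectors):
    """Component-wise sum of equal-length vectors modulo PRIME."""
    return [sum(column) % PRIME for column in zip(*vectors)]


def decode(value):
    """Map a field element back to a signed integer."""
    return value - PRIME if value > PRIME // 2 else value


def private_sum(user_vectors):
    """Sum the users' vectors without any party seeing a single input."""
    shares = [share(vec) for vec in user_vectors]
    # Hypothetical roles: the server only ever sees the first shares,
    # the privacy peer only ever sees the second shares.
    server_total = add_vectors([s[0] for s in shares])
    peer_total = add_vectors([s[1] for s in shares])
    # Only the two aggregates are combined, which reveals just the sum.
    combined = add_vectors([server_total, peer_total])
    return [decode(x) for x in combined]


users = [[3, -1, 4], [1, 5, -9], [2, 6, 5]]
print(private_sum(users))  # [6, 10, 0]
```

Each party performs only modular additions on machine-word-sized integers, which is why this style of aggregation costs about the same as summing the vectors in the clear. The result is correct as long as the true sum stays within half the field size in absolute value.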
Notes
- 1.
- 2.
- 3. Most statistical algorithms need to bound the amount of noise in the data to produce meaningful results. This means that the fraction of cheating users is usually kept below a much lower threshold (e.g., α < 20%).
- 4.
- 5.
- 6.
References
Alaggan, M., Gambs, S., Kermarrec, A.M.: Private similarity computation in distributed systems: from cryptography to differential privacy. In: Principles of Distributed Systems. Lecture Notes in Computer Science. Springer, Berlin/New York (2011)
Alderman, E., Kennedy, C.: The Right to Privacy. DIANE, Collingdale (1995)
Beaver, D., Goldwasser, S.: Multiparty computation with faulty majority. In: CRYPTO’89, Santa Barbara
Beerliová-Trubíniová, Z., Hirt, M.: Perfectly-secure MPC with linear communication complexity. In: TCC 2008, New York, pp. 213–230. Springer (2008)
Beimel, A., Nissim, K., Omri, E.: Distributed private data analysis: simultaneously solving how and what. In: CRYPTO 2008, Santa Barbara (2008)
Ben-David, A., Nisan, N., Pinkas, B.: FairplayMP: a system for secure multi-party computation. In: CCS’08, Alexandria, pp. 257–266 (2008)
Ben-Or, M., Goldwasser, S., Wigderson, A.: Completeness theorems for non-cryptographic fault-tolerant distributed computation. In: STOC’88, Chicago, IL, USA, pp. 1–10. ACM (1988)
Blum, A., Dwork, C., McSherry, F., Nissim, K.: Practical privacy: the SuLQ framework. In: PODS’05, Baltimore, Maryland, USA, pp. 128–138. ACM (2005)
Blum, A., Ligett, K., Roth, A.: A learning theory approach to non-interactive database privacy. In: STOC 08, Victoria, British Columbia, Canada (2008)
Barak, B., et al.: Privacy, accuracy, and consistency too: a holistic solution to contingency table release. In: PODS’07, Beijing (2007)
Canny, J.: Collaborative filtering with privacy. In: IEEE Symposium on Security and Privacy, Oakland, CA, USA, pp. 45–57 (2002)
Canny, J.: Collaborative filtering with privacy via factor analysis. In: SIGIR’02, Tampere, Finland, pp. 238–245. ACM (2002)
Chen, H., Cramer, R.: Algebraic geometric secret sharing schemes and secure multi-party computations over small fields. In: CRYPTO 2006, Santa Barbara (2006)
Chin, F., Ozsoyoglu, G.: Auditing for secure statistical databases. In: ACM 81: Proceedings of the ACM’81 Conference, Los Angeles, CA, USA, pp. 53–59 (1981)
Chu, C.T., Kim, S.K., Lin, Y.A., Yu, Y., Bradski, G., Ng, A.Y., Olukotun, K.: Map-reduce for machine learning on multicore. In: NIPS 2006, Vancouver, B.C., Canada (2006)
Cohen, W.W.: Enron email dataset. (2004) http://www-2.cs.cmu.edu/~enron/
Cohen Benaloh, J.: Secret sharing homomorphisms: keeping shares of a secret secret. In: CRYPTO’86, Santa Barbara, pp. 251–260 (1987)
Cormode, G.: Personal privacy vs population privacy: learning to attack anonymization. In: KDD’11, Chicago, pp. 1253–1261. ACM, New York (2011)
Cramer, R., Damgård, I.: Zero-knowledge proof for finite field arithmetic, or: can zero-knowledge be for free? In: CRYPTO’98, San Diego. Springer (1998)
Dalenius, T.: Towards a methodology for statistical disclosure control. Statistik Tidskrift 15, 429–444 (1977)
Damgård, I., Ishai, Y., Krøigaard, M., Nielsen, J.B., Smith, A.: Scalable multiparty computation with nearly optimal work and resilience. In: CRYPTO 2008, Santa Barbara, pp. 241–261 (2008)
Das, A.S., Datar, M., Garg, A., Rajaram, S.: Google news personalization: scalable online collaborative filtering. In: WWW’07, Banff, Alberta, Canada, pp. 271–280. ACM (2007)
Dhanjani, N.: Amazon’s Elastic Compute Cloud [EC2]: initial thoughts on security implications. http://www.dhanjani.com/archives/2008/04/
Dinur, I., Nissim, K.: Revealing information while preserving privacy. In: PODS’03, San Diego, California, pp. 202–210 (2003)
Du, W., Zhan, Z.: Using randomized response techniques for privacy-preserving data mining. In: KDD’03, Washington DC, pp. 505–510. ACM, New York (2003)
Du, W., Han, Y., Chen, S.: Privacy-preserving multivariate statistical analysis: linear regression and classification. In: SDM 04, Lake Buena Vista, Florida, USA, pp. 222–233 (2004)
Duan, Y.: Privacy without noise. In: CIKM’09, Hong Kong. ACM, New York (2009)
Duan, Y., Wang, J., Kam, M., Canny, J.: A secure online algorithm for link analysis on weighted graph. In: Proceedings of the Workshop on Link Analysis, Counterterrorism and Security, SDM 05, Newport Beach, pp. 71–81 (2005)
Duan, Y., Canny, J.: Zero-knowledge test of vector equivalence and granulation of user data with privacy. In: IEEE GrC 2006, Atlanta (2006)
Duan, Y., Canny, J.: Practical private computation and zero-knowledge tools for privacy-preserving distributed data mining. In: SDM’08, Atlanta (2008)
Duan, Y., Canny, J.: How to deal with malicious users in privacy-preserving distributed data mining. Stat. Anal. Data Min. 2(1), 18–33 (2009)
Duan, Y., Canny, J., Zhan, J.: P4P: Practical large-scale privacy-preserving distributed computation robust against malicious users. In: USENIX Security Symposium 2010, Washington, D.C., pp. 609–618 (2010)
Dwork, C.: An ad omnia approach to defining and achieving private data analysis. In: PinKDD, San Jose, pp. 1–13 (2007)
Dwork, C.: Ask a better question, get a better answer: a new approach to private data analysis. In: ICDT 2007, Barcelona, Spain, pp. 18–27. Springer (2007)
Dwork, C., Kenthapadi, K., McSherry, F., Mironov, I., Naor, M.: Our data, ourselves: privacy via distributed noise generation. In: EUROCRYPT 2006, Saint Petersburg. Springer (2006)
Dwork, C., McSherry, F., Nissim, K., Smith, A.: Calibrating noise to sensitivity in private data analysis. In: TCC 2006, New York, pp. 265–284. Springer (2006)
Evfimievski, A., Gehrke, J., Srikant, R.: Limiting privacy breaches in privacy preserving data mining. In: PODS’03, San Diego, pp. 211–222 (2003)
Feigenbaum, J., Nisan, N., Ramachandran, V., Sami, R., Shenker, S.: Agents’ privacy in distributed algorithmic mechanisms. In: Workshop on Economics and Information Security, Berkeley (2002)
Fiat, A., Shamir, A.: How to prove yourself: practical solutions to identification and signature problems. In: CRYPTO 86, Santa Barbara, California, USA (1987)
Fitzi, M., Hirt, M., Maurer, U.: General adversaries in unconditional multi-party computation. In: ASIACRYPT’99, Singapore (1999)
Ganta, S.R., Kasiviswanathan, S.P., Smith, A.: Composition attacks and auxiliary information in data privacy. In: KDD’08, Las Vegas, pp. 265–273. ACM, New York (2008)
Goldreich, O.: Foundations of Cryptography: Volume 2 – Basic Applications. Cambridge University Press, Cambridge (2004)
Goldreich, O., Micali, S., Wigderson, A.: How to play any mental game. In: STOC’87, New York, pp. 218–229 (1987)
Goldreich, O., Oren, Y.: Definitions and properties of zero-knowledge proof systems. J. Cryptol. 7(1), 1–32 (1994)
Goldwasser, S., Micali, S., Rackoff, C.: The knowledge complexity of interactive proof systems. SIAM J. Comput. 18(1), 186–208 (1989)
Goldwasser, S., Levin, L.: Fair computation of general functions in presence of immoral majority. In: CRYPTO’90, Santa Barbara, pp. 77–93. Springer (1991)
Hirt, M., Maurer, U.: Complete characterization of adversaries tolerable in secure multi-party computation (extended abstract). In: PODC’97, Santa Barbara (1997)
Hirt, M., Maurer, U.: Player simulation and general adversary structures in perfect multiparty computation. J. Cryptol. 13(1), 31–60 (2000)
Kargupta, H., Datta, S., Wang, Q., Sivakumar, K.: On the privacy preserving properties of random data perturbation techniques. In: ICDM’03, Melbourne, Florida, USA, p. 99. IEEE Computer Society, Washington (2003)
Kearns, M.: Efficient noise-tolerant learning from statistical queries. In: STOC’93, San Diego, pp. 392–401 (1993)
Kifer, D., Machanavajjhala, A.: No free lunch in data privacy. In: SIGMOD’11, Athens, Greece, pp. 193–204. ACM, New York (2011)
Kleinberg, J., Papadimitriou, C., Raghavan, P.: Auditing boolean attributes. In: PODS’00, Dallas, pp. 86–91. ACM, New York (2000). doi:10.1145/335168.335210
Lehoucq, R.B., Sorensen, D.C., Yang, C.: ARPACK users’ guide: solution of large-scale eigenvalue problems with implicitly restarted Arnoldi methods. SIAM, San Francisco (1998)
Li, N., Li, T., Venkatasubramanian, S.: t-closeness: privacy beyond k-anonymity and l-diversity. In: Proceedings of the IEEE 23rd International Conference on Data Engineering, Istanbul, pp. 106–115 (2007)
Lindell, Y., Pinkas, B.: Privacy preserving data mining. J. Cryptol. 15(3), 177–206 (2002)
Lindell, Y., Pinkas, B., Smart, N.P.: Implementing two-party computation efficiently with security against malicious adversaries. In: SCN’08, Amalfi, Italy (2008)
Liu, W.M., Wang, L.: Privacy streamliner: a two-stage approach to improving algorithm efficiency. In: CODASPY’12, San Antonio, pp. 193–204. ACM, New York (2012)
Machanavajjhala, A., Kifer, D., Gehrke, J., Venkitasubramaniam, M.: l-diversity: privacy beyond k-anonymity. In: Proceedings of the IEEE 22nd International Conference on Data Engineering, Atlanta (2006)
Malkhi, D., Nisan, N., Pinkas, B., Sella, Y.: Fairplay—a secure two-party computation system. In: SSYM’04: Proceedings of the 13th Conference on USENIX Security Symposium, San Diego, CA, pp. 20–20. USENIX Association, Berkeley (2004)
McSherry, F.: Privacy integrated queries: an extensible platform for privacy-preserving data analysis. Commun. ACM 53(9), 89–97 (2010)
McSherry, F., Mironov, I.: Differentially private recommender systems: building privacy into the Netflix prize contenders. In: KDD’09, Paris, pp. 627–636 (2009)
McSherry, F., Talwar, K.: Mechanism design via differential privacy. In: FOCS’07, Providence, Rhode Island (2007)
Nergiz, M.E., Atzori, M., Clifton, C.: Hiding the presence of individuals from shared databases. In: SIGMOD’07, Beijing, pp. 665–676. ACM, New York (2007)
Nissim, K., Raskhodnikova, S., Smith, A.: Smooth sensitivity and sampling in private data analysis. In: STOC’07, El Paso, Texas, USA, pp. 75–84. ACM (2007)
Paillier, P.: Trapdooring discrete logarithms on elliptic curves over rings. In: ASIACRYPT’00, Kyoto (2000)
Pedersen, T.: Non-interactive and information-theoretic secure verifiable secret sharing. In: CRYPTO’91, Santa Barbara (1992)
Pinkas, B., Schneider, T., Smart, N., Williams, S.: Secure two-party computation is practical. Cryptology ePrint Archive, Report 2009/314 (2009)
Samarati, P., Sweeney, L.: Generalizing data to provide anonymity when disclosing information (abstract). In: Proceedings of the Seventeenth ACM SIGACT-SIGMOD-SIGART Symposium on Principles of Database Systems, PODS’98, Seattle, p. 188. ACM, New York (1998). doi:10.1145/275487.275508
Samarati, P., Sweeney, L.: Protecting privacy when disclosing information: k-anonymity and its enforcement through generalization and suppression. Technical Report SRI-CSL-98-04, SRI International (1998)
Stewart, G.W., Sun, J.G.: Matrix Perturbation Theory. Academic, Boston/New York (1990)
Sweeney, L.: k-anonymity: a model for protecting privacy. Int. J. Uncertain. Fuzziness Knowl.-Based Syst. 10(5), 557–570 (2002)
Trefethen, L.N., Bau, D., III: Numerical Linear Algebra. SIAM, Philadelphia (1997)
Vaidya, J., Clifton, C.: Privacy-preserving k-means clustering over vertically partitioned data. In: KDD’03, Washington DC (2003)
Wright, R., Yang, Z.: Privacy-preserving Bayesian network structure computation on distributed heterogeneous data. In: KDD’04, New York, pp. 713–718 (2004)
Xiao, X., Tao, Y.: M-invariance: Towards privacy preserving re-publication of dynamic datasets. In: SIGMOD 2007, Beijing, pp. 689–700 (2007)
Yang, Z., Zhong, S., Wright, R.N.: Privacy-preserving classification of customer data without loss of accuracy. In: SDM 2005, Newport Beach (2005)
Yao, A.C.C.: Protocols for secure computations. In: FOCS’82, Chicago, pp. 160–164. IEEE (1982)
Zhang, L., Jajodia, S., Brodsky, A.: Information disclosure under realistic assumptions: privacy versus optimality. In: CCS’07, Alexandria, pp. 573–583 (2007)
Copyright information
© 2014 Springer Science+Business Media New York
About this chapter
Cite this chapter
Duan, Y., Canny, J. (2014). Practical Distributed Privacy-Preserving Data Analysis at Large Scale. In: Gkoulalas-Divanis, A., Labbi, A. (eds) Large-Scale Data Analytics. Springer, New York, NY. https://doi.org/10.1007/978-1-4614-9242-9_8
DOI: https://doi.org/10.1007/978-1-4614-9242-9_8
Publisher Name: Springer, New York, NY
Print ISBN: 978-1-4614-9241-2
Online ISBN: 978-1-4614-9242-9
eBook Packages: Computer Science (R0)