Practical Distributed Privacy-Preserving Data Analysis at Large Scale

Duan, Yitao; Canny, John

doi:10.1007/978-1-4614-9242-9_8

Abstract

In this chapter we investigate practical technologies for security and privacy in large-scale data analysis. We motivate our approach by discussing the challenges and opportunities that arise from current and emerging analysis paradigms on large data sets. In particular, we present a framework for privacy-preserving distributed data analysis that is practical for many real-world applications. The framework, called Peers for Privacy (P4P), features a novel heterogeneous architecture and a number of efficient tools for performing private computation and offering security at large scale. It maintains three key properties that are essential for real-world applications: (i) provably strong privacy; (ii) adequate efficiency at reasonably large scale; and (iii) robustness against realistic adversaries. The framework gains its practicality by decomposing data mining algorithms into a sequence of vector addition steps. These steps can be evaluated privately using efficient cryptographic tools, namely verifiable secret sharing over a small field (e.g., 32 or 64 bits), which has the same cost as regular, non-private arithmetic. This paradigm supports a large number of statistical learning algorithms, including SVD, PCA, k-means, ID3, and machine learning algorithms based on Expectation-Maximization, as well as all algorithms in the statistical query model (Kearns, Efficient noise-tolerant learning from statistical queries. In: STOC’93, San Diego, pp. 392–401, 1993). As a concrete example, we show how singular value decomposition, an extremely useful algorithm at the core of many data mining tasks, can be performed efficiently with privacy in P4P. Using real data, we demonstrate that P4P is orders of magnitude faster than other solutions.
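
The vector addition step described in the abstract can be illustrated with a minimal Python sketch of two-party additive secret sharing. This is not the chapter's implementation: the modulus Q (the prime 2**32 - 5), the function names, and the two-aggregator layout are assumptions made for illustration, and the verification (zero-knowledge) part of verifiable secret sharing is omitted entirely. Each user splits an integer vector into two random-looking shares, each aggregator sums only the shares it receives, and only the combined totals reveal the sum.

```python
import secrets

# Assumed modulus: the prime 2**32 - 5. The abstract only says
# "small field (e.g., 32 or 64 bits)"; this particular choice is illustrative.
Q = 2**32 - 5


def share(vec):
    """Split an integer vector into two additive shares modulo Q."""
    encoded = [x % Q for x in vec]  # negative entries wrap around into the field
    share_a = [secrets.randbelow(Q) for _ in encoded]  # uniformly random mask
    share_b = [(x - r) % Q for x, r in zip(encoded, share_a)]
    return share_a, share_b


def add_vectors(u, v):
    """Element-wise addition modulo Q, the only operation each aggregator performs."""
    return [(a + b) % Q for a, b in zip(u, v)]


def decode(vec):
    """Map field elements back to signed integers (values above Q // 2 are negative)."""
    return [x - Q if x > Q // 2 else x for x in vec]


def private_sum(user_vectors):
    """Sum the users' vectors without either aggregator seeing an individual vector."""
    dim = len(user_vectors[0])
    acc_a = [0] * dim  # running total held by the first (hypothetical) aggregator
    acc_b = [0] * dim  # running total held by the second (hypothetical) aggregator
    for vec in user_vectors:
        s_a, s_b = share(vec)
        acc_a = add_vectors(acc_a, s_a)
        acc_b = add_vectors(acc_b, s_b)
    # Only the two aggregate totals are combined, never per-user shares.
    return decode(add_vectors(acc_a, acc_b))
```

The per-element work in this sketch is one modular addition on a machine-word-sized value, which is the sense in which such sharing costs about as much as ordinary arithmetic. Real-valued data would first need a fixed-point integer encoding, and the true sum must stay within roughly ±Q/2 for decoding to be correct.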

Notes

1. http://www.netflixprize.com/
2. http://www.microsoft.com/azure/default.mspx
3. Most statistical algorithms need to bound the amount of noise in the data to produce meaningful results. This means that the fraction of cheating users is usually below a much lower threshold (e.g., α < 20%).
4. http://www.i2p2.de/
5. http://bid.berkeley.edu/projects/p4p/
6. http://www.teradata.com/business-needs/Big-Data-Analytics/

References

Alaggan, M., Gambs, S., Kermarrec, A.M.: Private similarity computation in distributed systems: from cryptography to differential privacy. In: Principles of Distributed Systems. Lecture Notes in Computer Science. Springer, Berlin/New York (2011)
Alderman, E., Kennedy, C.: The Right to Privacy. DIANE, Collingdale (1995)
Beaver, D., Goldwasser, S.: Multiparty computation with faulty majority. In: CRYPTO’89, Santa Barbara
Beerliová-Trubíniová, Z., Hirt, M.: Perfectly-secure MPC with linear communication complexity. In: TCC 2008, New York, pp. 213–230. Springer (2008)
Beimel, A., Nissim, K., Omri, E.: Distributed private data analysis: simultaneously solving how and what. In: CRYPTO 2008, Santa Barbara (2008)
Ben-David, A., Nisan, N., Pinkas, B.: FairplayMP: a system for secure multi-party computation. In: CCS’08, Alexandria, pp. 257–266 (2008)
Ben-Or, M., Goldwasser, S., Wigderson, A.: Completeness theorems for non-cryptographic fault-tolerant distributed computation. In: STOC’88, Chicago, IL, USA, pp. 1–10. ACM (1988)
Blum, A., Dwork, C., McSherry, F., Nissim, K.: Practical privacy: the SuLQ framework. In: PODS’05, Baltimore, Maryland, USA, pp. 128–138. ACM (2005)
Blum, A., Ligett, K., Roth, A.: A learning theory approach to non-interactive database privacy. In: STOC’08, Victoria, British Columbia, Canada (2008)
Barak, B., et al.: Privacy, accuracy, and consistency too: a holistic solution to contingency table release. In: PODS’07, Beijing (2007)
Canny, J.: Collaborative filtering with privacy. In: IEEE Symposium on Security and Privacy, Oakland, CA, USA, pp. 45–57 (2002)
Canny, J.: Collaborative filtering with privacy via factor analysis. In: SIGIR’02, Tampere, Finland, pp. 238–245. ACM (2002)
Chen, H., Cramer, R.: Algebraic geometric secret sharing schemes and secure multi-party computations over small fields. In: CRYPTO 2006, Santa Barbara (2006)
Chin, F., Ozsoyoglu, G.: Auditing for secure statistical databases. In: ACM’81: Proceedings of the ACM’81 Conference, Los Angeles, CA, USA, pp. 53–59 (1981)
Chu, C.T., Kim, S.K., Lin, Y.A., Yu, Y., Bradski, G., Ng, A.Y., Olukotun, K.: Map-reduce for machine learning on multicore. In: NIPS 2006, Vancouver, B.C., Canada (2006)
Cohen, W.W.: Enron email dataset. (2004) http://www-2.cs.cmu.edu/~enron/
Cohen Benaloh, J.: Secret sharing homomorphisms: keeping shares of a secret secret. In: CRYPTO’86, Santa Barbara, pp. 251–260 (1987)
Cormode, G.: Personal privacy vs population privacy: learning to attack anonymization. In: KDD’11, Chicago, pp. 1253–1261. ACM, New York (2011)
Cramer, R., Damgård, I.: Zero-knowledge proof for finite field arithmetic, or: can zero-knowledge be for free? In: CRYPTO’98, Santa Barbara. Springer (1998)
Dalenius, T.: Towards a methodology for statistical disclosure control. Statistik Tidskrift 15, 429–444 (1977)
Damgård, I., Ishai, Y., Krøigaard, M., Nielsen, J.B., Smith, A.: Scalable multiparty computation with nearly optimal work and resilience. In: CRYPTO 2008, Santa Barbara, pp. 241–261 (2008)
Das, A.S., Datar, M., Garg, A., Rajaram, S.: Google news personalization: scalable online collaborative filtering. In: WWW’07, Banff, Alberta, Canada, pp. 271–280. ACM (2007)
Dhanjani, N.: Amazon’s elastic compute cloud [ec2]: initial thoughts on security implications. http://www.dhanjani.com/archives/2008/04/
Dinur, I., Nissim, K.: Revealing information while preserving privacy. In: PODS’03, San Diego, California, pp. 202–210 (2003)
Du, W., Zhan, Z.: Using randomized response techniques for privacy-preserving data mining. In: KDD’03, Washington DC, pp. 505–510. ACM, New York (2003)
Du, W., Han, Y., Chen, S.: Privacy-preserving multivariate statistical analysis: linear regression and classification. In: SDM’04, Lake Buena Vista, Florida, USA, pp. 222–233 (2004)
Duan, Y.: Privacy without noise. In: CIKM’09, Hong Kong. ACM, New York (2009)
Duan, Y., Wang, J., Kam, M., Canny, J.: A secure online algorithm for link analysis on weighted graph. In: Proceedings of the Workshop on Link Analysis, Counterterrorism and Security, SDM 05, Newport Beach, pp. 71–81 (2005)
Duan, Y., Canny, J.: Zero-knowledge test of vector equivalence and granulation of user data with privacy. In: IEEE GrC 2006, Atlanta (2006)
Duan, Y., Canny, J.: Practical private computation and zero-knowledge tools for privacy-preserving distributed data mining. In: SDM’08, Atlanta (2008)
Duan, Y., Canny, J.: How to deal with malicious users in privacy-preserving distributed data mining. Stat. Anal. Data Min. 2(1), 18–33 (2009)
Duan, Y., Canny, J., Zhan, J.: P4P: Practical large-scale privacy-preserving distributed computation robust against malicious users. In: USENIX Security Symposium 2010, Washington, D.C., pp. 609–618 (2010)
Dwork, C.: An ad omnia approach to defining and achieving private data analysis. In: PinKDD, San Jose, pp. 1–13 (2007)
Dwork, C.: Ask a better question, get a better answer: a new approach to private data analysis. In: ICDT 2007, Barcelona, Spain, pp. 18–27. Springer (2007)
Dwork, C., Kenthapadi, K., McSherry, F., Mironov, I., Naor, M.: Our data, ourselves: privacy via distributed noise generation. In: EUROCRYPT 2006, Saint Petersburg. Springer (2006)
Dwork, C., McSherry, F., Nissim, K., Smith, A.: Calibrating noise to sensitivity in private data analysis. In: TCC 2006, New York, pp. 265–284. Springer (2006)
Evfimievski, A., Gehrke, J., Srikant, R.: Limiting privacy breaches in privacy preserving data mining. In: PODS’03, San Diego, pp. 211–222 (2003)
Feigenbaum, J., Nisan, N., Ramachandran, V., Sami, R., Shenker, S.: Agents’ privacy in distributed algorithmic mechanisms. In: Workshop on Economics and Information Security, Berkeley (2002)
Fiat, A., Shamir, A.: How to prove yourself: practical solutions to identification and signature problems. In: CRYPTO’86, Santa Barbara, California, USA (1987)
Fitzi, M., Hirt, M., Maurer, U.: General adversaries in unconditional multi-party computation. In: ASIACRYPT’99, Singapore (1999)
Ganta, S.R., Kasiviswanathan, S.P., Smith, A.: Composition attacks and auxiliary information in data privacy. In: KDD’08, Las Vegas, pp. 265–273. ACM, New York (2008)
Goldreich, O.: Foundations of Cryptography: Volume 2 – Basic Applications. Cambridge University Press, Cambridge (2004)
Goldreich, O., Micali, S., Wigderson, A.: How to play any mental game. In: STOC’87, New York, pp. 218–229 (1987)
Goldreich, O., Oren, Y.: Definitions and properties of zero-knowledge proof systems. J. Cryptol. 7(1), 1–32 (1994)
Goldwasser, S., Micali, S., Rackoff, C.: The knowledge complexity of interactive proof systems. SIAM J. Comput. 18(1), 186–208 (1989)
Goldwasser, S., Levin, L.: Fair computation of general functions in presence of immoral majority. In: CRYPTO’90, Santa Barbara, pp. 77–93. Springer (1991)
Hirt, M., Maurer, U.: Complete characterization of adversaries tolerable in secure multi-party computation (extended abstract). In: PODC’97, Santa Barbara (1997)
Hirt, M., Maurer, U.: Player simulation and general adversary structures in perfect multiparty computation. J. Cryptol. 13(1), 31–60 (2000)
Kargupta, H., Datta, S., Wang, Q., Sivakumar, K.: On the privacy preserving properties of random data perturbation techniques. In: ICDM’03, Melbourne, Florida, USA, p. 99. IEEE Computer Society, Washington (2003)
Kearns, M.: Efficient noise-tolerant learning from statistical queries. In: STOC’93, San Diego, pp. 392–401 (1993)
Kifer, D., Machanavajjhala, A.: No free lunch in data privacy. In: SIGMOD’11, Athens, Greece, pp. 193–204. ACM, New York (2011)
Kleinberg, J., Papadimitriou, C., Raghavan, P.: Auditing boolean attributes. In: PODS’00, Dallas, pp. 86–91. ACM, New York (2000). doi:10.1145/335168.335210
Lehoucq, R.B., Sorensen, D.C., Yang, C.: ARPACK users’ guide: solution of large-scale eigenvalue problems with implicitly restarted Arnoldi methods. SIAM, Philadelphia (1998)
Li, N., Li, T., Venkatasubramanian, S.: t-closeness: privacy beyond k-anonymity and l-diversity. In: Proceedings of the IEEE 23rd International Conference on Data Engineering, Istanbul, pp. 106–115 (2007)
Lindell, Y., Pinkas, B.: Privacy preserving data mining. J. Cryptol. 15(3), 177–206 (2002)
Lindell, Y., Pinkas, B., Smart, N.P.: Implementing two-party computation efficiently with security against malicious adversaries. In: SCN’08, Amalfi, Italy (2008)
Liu, W.M., Wang, L.: Privacy streamliner: a two-stage approach to improving algorithm efficiency. In: CODASPY’12, San Antonio, pp. 193–204. ACM, New York (2012)
Machanavajjhala, A., Kifer, D., Gehrke, J., Venkitasubramaniam, M.: l-diversity: privacy beyond k-anonymity. In: Proceedings of the IEEE 22nd International Conference on Data Engineering, Atlanta (2006)
Malkhi, D., Nisan, N., Pinkas, B., Sella, Y.: Fairplay—a secure two-party computation system. In: SSYM’04: Proceedings of the 13th Conference on USENIX Security Symposium, San Diego, CA, p. 20. USENIX Association, Berkeley (2004)
McSherry, F.: Privacy integrated queries: an extensible platform for privacy-preserving data analysis. Commun. ACM 53(9), 89–97 (2010)
McSherry, F., Mironov, I.: Differentially private recommender systems: building privacy into the Netflix Prize contenders. In: KDD’09, Paris, pp. 627–636 (2009)
McSherry, F., Talwar, K.: Mechanism design via differential privacy. In: FOCS’07, Rhode Island (2007)
Nergiz, M.E., Atzori, M., Clifton, C.: Hiding the presence of individuals from shared databases. In: SIGMOD’07, Beijing, pp. 665–676. ACM, New York (2007)
Nissim, K., Raskhodnikova, S., Smith, A.: Smooth sensitivity and sampling in private data analysis. In: STOC’07, El Paso, Texas, USA, pp. 75–84. ACM (2007)
Paillier, P.: Trapdooring discrete logarithms on elliptic curves over rings. In: ASIACRYPT’00, Kyoto (2000)
Pedersen, T.: Non-interactive and information-theoretic secure verifiable secret sharing. In: CRYPTO’91, Santa Barbara (1992)
Pinkas, B., Schneider, T., Smart, N., Williams, S.: Secure two-party computation is practical. Cryptology ePrint Archive, Report 2009/314 (2009)
Samarati, P., Sweeney, L.: Generalizing data to provide anonymity when disclosing information (abstract). In: Proceedings of the Seventeenth ACM SIGACT-SIGMOD-SIGART Symposium on Principles of Database Systems, PODS’98, Seattle, p. 188. ACM, New York (1998). doi:10.1145/275487.275508
Samarati, P., Sweeney, L.: Protecting privacy when disclosing information: k-anonymity and its enforcement through generalization and suppression. Technical Report SRI-CSL-98-04, SRI International (1998)
Stewart, G.W., Sun, J.G.: Matrix Perturbation Theory. Academic Press, Boston/New York (1990)
Sweeney, L.: k-anonymity: a model for protecting privacy. Int. J. Uncertain. Fuzziness Knowl.-Based Syst. 10(5), 557–570 (2002)
Trefethen, L.N., Bau, D., III: Numerical Linear Algebra. SIAM, Philadelphia (1997)
Vaidya, J., Clifton, C.: Privacy-preserving k-means clustering over vertically partitioned data. In: KDD’03, Washington DC (2003)
Wright, R., Yang, Z.: Privacy-preserving Bayesian network structure computation on distributed heterogeneous data. In: KDD’04, New York, pp. 713–718 (2004)
Xiao, X., Tao, Y.: M-invariance: Towards privacy preserving re-publication of dynamic datasets. In: SIGMOD 2007, Beijing, pp. 689–700 (2007)
Yang, Z., Zhong, S., Wright, R.N.: Privacy-preserving classification of customer data without loss of accuracy. In: SDM 2005, Newport Beach (2005)
Yao, A.C.C.: Protocols for secure computations. In: FOCS’82, Chicago, pp. 160–164. IEEE (1982)
Zhang, L., Jajodia, S., Brodsky, A.: Information disclosure under realistic assumptions: privacy versus optimality. In: CCS’07, Alexandria, pp. 573–583 (2007)

Author information

Authors and Affiliations

NetEase Youdao, Beijing, China
Yitao Duan
Computer Science Division, University of California, Berkeley, CA, 94720, USA
John Canny

Corresponding author

Correspondence to Yitao Duan.

Editor information

Editors and Affiliations

IBM Research - Ireland, Mulhuddart, Ireland
Aris Gkoulalas-Divanis
IBM Research - Zurich, Rüschlikon, Switzerland
Abderrahim Labbi

About this chapter

Cite this chapter

Duan, Y., Canny, J. (2014). Practical Distributed Privacy-Preserving Data Analysis at Large Scale. In: Gkoulalas-Divanis, A., Labbi, A. (eds) Large-Scale Data Analytics. Springer, New York, NY. https://doi.org/10.1007/978-1-4614-9242-9_8

DOI: https://doi.org/10.1007/978-1-4614-9242-9_8
Published: 28 November 2013
Publisher Name: Springer, New York, NY
Print ISBN: 978-1-4614-9241-2
Online ISBN: 978-1-4614-9242-9
eBook Packages: Computer Science; Computer Science (R0)