Practical Distributed Privacy-Preserving Data Analysis at Large Scale

Large-Scale Data Analytics

Abstract

In this chapter we investigate practical technologies for security and privacy in data analysis at large scale. We motivate our approach by discussing the challenges and opportunities that arise from current and emerging analysis paradigms on large data sets. In particular, we present a framework for privacy-preserving distributed data analysis that is practical for many real-world applications. The framework, called Peers for Privacy (P4P), features a novel heterogeneous architecture and a number of efficient tools for performing private computation and offering security at large scale. It maintains three key properties that are essential for real-world applications: (i) provably strong privacy; (ii) adequate efficiency at reasonably large scale; and (iii) robustness against realistic adversaries. The framework gains its practicality by decomposing data mining algorithms into a sequence of vector addition steps. These steps can be evaluated privately using an efficient cryptographic tool, namely verifiable secret sharing over a small field (e.g., 32 or 64 bits), which has the same cost as regular, non-private arithmetic. This paradigm supports a large number of statistical learning algorithms, including SVD, PCA, k-means, ID3 and machine learning algorithms based on Expectation-Maximization, as well as all algorithms in the statistical query model (Kearns, Efficient noise-tolerant learning from statistical queries. In: STOC’93, San Diego, pp. 392–401, 1993). As a concrete example, we show how singular value decomposition, an extremely useful algorithm at the core of many data mining tasks, can be performed efficiently with privacy in P4P. Using real data, we demonstrate that P4P is orders of magnitude faster than other solutions.
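
The vector-addition idea can be illustrated with a minimal Python sketch. It assumes a simple two-way additive sharing modulo the prime 2^31 − 1; the names `share_vector`, `private_sum`, and `MODULUS` are hypothetical and not taken from the P4P code, and the checks that make the sharing verifiable are omitted.

```python
import secrets

# Assumed field size: the Mersenne prime 2^31 - 1, which fits in 32 bits.
MODULUS = 2**31 - 1


def share_vector(vec, modulus=MODULUS):
    """Split an integer vector into two additive shares modulo `modulus`.

    Each share on its own is uniformly random and reveals nothing about `vec`.
    """
    share_a = [secrets.randbelow(modulus) for _ in vec]
    share_b = [(v - r) % modulus for v, r in zip(vec, share_a)]
    return share_a, share_b


def add_vectors(vectors, modulus=MODULUS):
    """Component-wise sum modulo `modulus` (ordinary integer arithmetic)."""
    return [sum(column) % modulus for column in zip(*vectors)]


def private_sum(user_vectors, modulus=MODULUS):
    """Sum the users' vectors while each party only ever sees one share per user."""
    shares = [share_vector(v, modulus) for v in user_vectors]
    sum_a = add_vectors([a for a, _ in shares], modulus)  # held by party A
    sum_b = add_vectors([b for _, b in shares], modulus)  # held by party B
    # Only the two aggregated share vectors are combined, never individual data.
    return add_vectors([sum_a, sum_b], modulus)


def decode_signed(vec, modulus=MODULUS):
    """Map field elements back to signed integers (assumes |sum| < modulus / 2)."""
    return [v - modulus if v > modulus // 2 else v for v in vec]


if __name__ == "__main__":
    users = [[1, 2, 3], [4, -5, 6], [7, 8, -9]]
    print(decode_signed(private_sum(users)))
```

In this sketch each party performs only modular additions on its shares, which is why the private computation costs about the same as the non-private sum.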

Notes

  1. http://www.netflixprize.com/.

  2. http://www.microsoft.com/azure/default.mspx.

  3. Most statistical algorithms need to bound the amount of noise in the data to produce meaningful results. This means that the fraction of cheating users is usually below a much lower threshold (e.g., α < 20%).

  4. http://www.i2p2.de/.

  5. http://bid.berkeley.edu/projects/p4p/.

  6. http://www.teradata.com/business-needs/Big-Data-Analytics/.

References

  1. Alaggan, M., Gambs, S., Kermarrec, A.M.: Private similarity computation in distributed systems: from cryptography to differential privacy. In: Principles of Distributed Systems. Lecture Notes in Computer Science. Springer, Berlin/New York (2011)

  2. Alderman, E., Kennedy, C.: The Right to Privacy. DIANE, Collingdale (1995)

  3. Beaver, D., Goldwasser, S.: Multiparty computation with faulty majority. In: CRYPTO’89, Santa Barbara

  4. Beerliová-Trubíniová, Z., Hirt, M.: Perfectly-secure MPC with linear communication complexity. In: TCC 2008, New York, pp. 213–230. Springer (2008)

  5. Beimel, A., Nissim, K., Omri, E.: Distributed private data analysis: simultaneously solving how and what. In: CRYPTO 2008, Santa Barbara (2008)

  6. Ben-David, A., Nisan, N., Pinkas, B.: FairplayMP: a system for secure multi-party computation. In: CCS’08, Alexandria, pp. 257–266 (2008)

  7. Ben-Or, M., Goldwasser, S., Wigderson, A.: Completeness theorems for non-cryptographic fault-tolerant distributed computation. In: STOC’88, Chicago, IL, USA, pp. 1–10. ACM (1988)

  8. Blum, A., Dwork, C., McSherry, F., Nissim, K.: Practical privacy: the SuLQ framework. In: PODS’05, Baltimore, Maryland, USA, pp. 128–138. ACM (2005)

  9. Blum, A., Ligett, K., Roth, A.: A learning theory approach to non-interactive database privacy. In: STOC 08, Victoria, British Columbia, Canada (2008)

  10. Barak, B., et al.: Privacy, accuracy, and consistency too: a holistic solution to contingency table release. In: PODS’07, Beijing (2007)

  11. Canny, J.: Collaborative filtering with privacy. In: IEEE Symposium on Security and Privacy, Oakland, CA, USA, pp. 45–57 (2002)

  12. Canny, J.: Collaborative filtering with privacy via factor analysis. In: SIGIR’02, Tampere, Finland, pp. 238–245. ACM (2002)

  13. Chen, H., Cramer, R.: Algebraic geometric secret sharing schemes and secure multi-party computations over small fields. In: CRYPTO 2006, Santa Barbara (2006)

  14. Chin, F., Ozsoyoglu, G.: Auditing for secure statistical databases. In: ACM’81: Proceedings of the ACM’81 Conference, Los Angeles, CA, USA, pp. 53–59 (1981)

  15. Chu, C.T., Kim, S.K., Lin, Y.A., Yu, Y., Bradski, G., Ng, A.Y., Olukotun, K.: Map-reduce for machine learning on multicore. In: NIPS 2006, Vancouver, B.C., Canada (2006)

  16. Cohen, W.W.: Enron email dataset. (2004) http://www-2.cs.cmu.edu/~enron/

  17. Cohen Benaloh, J.: Secret sharing homomorphisms: keeping shares of a secret secret. In: CRYPTO’86, Santa Barbara, pp. 251–260 (1987)

  18. Cormode, G.: Personal privacy vs population privacy: learning to attack anonymization. In: KDD’11, Chicago, pp. 1253–1261. ACM, New York (2011)

  19. Cramer, R., Damgård, I.: Zero-knowledge proof for finite field arithmetic, or: can zero-knowledge be for free? In: CRYPTO’98, San Diego. Springer (1998)

  20. Dalenius, T.: Towards a methodology for statistical disclosure control. Statistik Tidskrift 15, 429–444 (1977)

  21. Damgård, I., Ishai, Y., Krøigaard, M., Nielsen, J.B., Smith, A.: Scalable multiparty computation with nearly optimal work and resilience. In: CRYPTO 2008, Santa Barbara, pp. 241–261 (2008)

  22. Das, A.S., Datar, M., Garg, A., Rajaram, S.: Google news personalization: scalable online collaborative filtering. In: WWW’07, Banff, Alberta, Canada, pp. 271–280. ACM (2007)

  23. Dhanjani, N.: Amazon’s elastic compute cloud [ec2]: initial thoughts on security implications. http://www.dhanjani.com/archives/2008/04/

  24. Dinur, I., Nissim, K.: Revealing information while preserving privacy. In: PODS’03, San Diego, California, pp. 202–210 (2003)

  25. Du, W., Zhan, Z.: Using randomized response techniques for privacy-preserving data mining. In: KDD’03, Washington DC, pp. 505–510. ACM, New York (2003)

  26. Du, W., Han, Y., Chen, S.: Privacy-preserving multivariate statistical analysis: linear regression and classification. In: SDM 04, Lake Buena Vista, Florida, USA, pp. 222–233 (2004)

  27. Duan, Y.: Privacy without noise. In: CIKM’09, Hong Kong. ACM, New York (2009)

  28. Duan, Y., Wang, J., Kam, M., Canny, J.: A secure online algorithm for link analysis on weighted graph. In: Proceedings of the Workshop on Link Analysis, Counterterrorism and Security, SDM 05, Newport Beach, pp. 71–81 (2005)

  29. Duan, Y., Canny, J.: Zero-knowledge test of vector equivalence and granulation of user data with privacy. In: IEEE GrC 2006, Atlanta (2006)

  30. Duan, Y., Canny, J.: Practical private computation and zero-knowledge tools for privacy-preserving distributed data mining. In: SDM’08, Atlanta (2008)

  31. Duan, Y., Canny, J.: How to deal with malicious users in privacy-preserving distributed data mining. Stat. Anal. Data Min. 2(1), 18–33 (2009)

  32. Duan, Y., Canny, J., Zhan, J.: P4P: Practical large-scale privacy-preserving distributed computation robust against malicious users. In: USENIX Security Symposium 2010, Washington, D.C., pp. 609–618 (2010)

  33. Dwork, C.: An ad omnia approach to defining and achieving private data analysis. In: PinKDD, San Jose, pp. 1–13 (2007)

  34. Dwork, C.: Ask a better question, get a better answer: a new approach to private data analysis. In: ICDT 2007, Barcelona, Spain, pp. 18–27. Springer (2007)

  35. Dwork, C., Kenthapadi, K., McSherry, F., Mironov, I., Naor, M.: Our data, ourselves: privacy via distributed noise generation. In: EUROCRYPT 2006, Saint Petersburg. Springer (2006)

  36. Dwork, C., McSherry, F., Nissim, K., Smith, A.: Calibrating noise to sensitivity in private data analysis. In: TCC 2006, New York, pp. 265–284. Springer (2006)

  37. Evfimievski, A., Gehrke, J., Srikant, R.: Limiting privacy breaches in privacy preserving data mining. In: PODS’03, San Diego, pp. 211–222 (2003)

  38. Feigenbaum, J., Nisan, N., Ramachandran, V., Sami, R., Shenker, S.: Agents’ privacy in distributed algorithmic mechanisms. In: Workshop on Economics and Information Security, Berkeley (2002)

  39. Fiat, A., Shamir, A.: How to prove yourself: practical solutions to identification and signature problems. In: CRYPTO 86, Santa Barbara, California, USA (1987)

  40. Fitzi, M., Hirt, M., Maurer, U.: General adversaries in unconditional multi-party computation. In: ASIACRYPT’99, Singapore (1999)

  41. Ganta, S.R., Kasiviswanathan, S.P., Smith, A.: Composition attacks and auxiliary information in data privacy. In: KDD’08, Las Vegas, pp. 265–273. ACM, New York (2008)

  42. Goldreich, O.: Foundations of Cryptography: Volume 2 – Basic Applications. Cambridge University Press, Cambridge (2004)

  43. Goldreich, O., Micali, S., Wigderson, A.: How to play any mental game. In: STOC’87, New York, pp. 218–229 (1987)

  44. Goldreich, O., Oren, Y.: Definitions and properties of zero-knowledge proof systems. J. Cryptol. 7(1), 1–32 (1994)

  45. Goldwasser, S., Micali, S., Rackoff, C.: The knowledge complexity of interactive proof systems. SIAM J. Comput. 18(1), 186–208 (1989)

  46. Goldwasser, S., Levin, L.: Fair computation of general functions in presence of immoral majority. In: CRYPTO’90, Santa Barbara, pp. 77–93. Springer (1991)

  47. Hirt, M., Maurer, U.: Complete characterization of adversaries tolerable in secure multi-party computation (extended abstract). In: PODC’97, Santa Barbara (1997)

  48. Hirt, M., Maurer, U.: Player simulation and general adversary structures in perfect multiparty computation. J. Cryptol. 13(1), 31–60 (2000)

  49. Kargupta, H., Datta, S., Wang, Q., Sivakumar, K.: On the privacy preserving properties of random data perturbation techniques. In: ICDM’03, Melbourne, Florida, USA, p. 99. IEEE Computer Society, Washington (2003)

  50. Kearns, M.: Efficient noise-tolerant learning from statistical queries. In: STOC’93, San Diego, pp. 392–401 (1993)

  51. Kifer, D., Machanavajjhala, A.: No free lunch in data privacy. In: SIGMOD’11, Athens, Greece, pp. 193–204. ACM, New York (2011)

  52. Kleinberg, J., Papadimitriou, C., Raghavan, P.: Auditing boolean attributes. In: PODS’00, Dallas, pp. 86–91. ACM, New York (2000). doi:10.1145/335168.335210

  53. Lehoucq, R.B., Sorensen, D.C., Yang, C.: ARPACK users’ guide: solution of large-scale eigenvalue problems with implicitly restarted Arnoldi methods. SIAM, San Francisco (1998)

  54. Li, N., Li, T., Venkatasubramanian, S.: t-closeness: privacy beyond k-anonymity and l-diversity. In: Proceedings of the IEEE 23rd International Conference on Data Engineering, Istanbul, pp. 106–115 (2007)

  55. Lindell, Y., Pinkas, B.: Privacy preserving data mining. J. Cryptol. 15(3), 177–206 (2002)

  56. Lindell, Y., Pinkas, B., Smart, N.P.: Implementing two-party computation efficiently with security against malicious adversaries. In: SCN’08, Amalfi, Italy (2008)

  57. Liu, W.M., Wang, L.: Privacy streamliner: a two-stage approach to improving algorithm efficiency. In: CODASPY’12, San Antonio, pp. 193–204. ACM, New York (2012)

  58. Machanavajjhala, A., Kifer, D., Gehrke, J., Venkitasubramaniam, M.: l-diversity: privacy beyond k-anonymity. In: Proceedings of the IEEE 22nd International Conference on Data Engineering, Atlanta (2006)

  59. Malkhi, D., Nisan, N., Pinkas, B., Sella, Y.: Fairplay—a secure two-party computation system. In: SSYM’04: Proceedings of the 13th Conference on USENIX Security Symposium, San Diego, CA, pp. 20–20. USENIX Association, Berkeley (2004)

  60. McSherry, F.: Privacy integrated queries: an extensible platform for privacy-preserving data analysis. Commun. ACM 53(9), 89–97 (2010)

  61. McSherry, F., Mironov, I.: Differentially private recommender systems: building privacy into the Netflix Prize contenders. In: KDD’09, Paris, pp. 627–636 (2009)

  62. McSherry, F., Talwar, K.: Mechanism design via differential privacy. In: FOCS’07, Rhode Island (2007)

  63. Nergiz, M.E., Atzori, M., Clifton, C.: Hiding the presence of individuals from shared databases. In: SIGMOD’07, Beijing, pp. 665–676. ACM, New York (2007)

  64. Nissim, K., Raskhodnikova, S., Smith, A.: Smooth sensitivity and sampling in private data analysis. In: STOC’07, El Paso, Texas, USA, pp. 75–84. ACM (2007)

  65. Paillier, P.: Trapdooring discrete logarithms on elliptic curves over rings. In: ASIACRYPT’00, Kyoto (2000)

  66. Pedersen, T.: Non-interactive and information-theoretic secure verifiable secret sharing. In: CRYPTO’91, Santa Barbara (1992)

  67. Pinkas, B., Schneider, T., Smart, N., Williams, S.: Secure two-party computation is practical. Cryptology ePrint Archive, Report 2009/314 (2009)

  68. Samarati, P., Sweeney, L.: Generalizing data to provide anonymity when disclosing information (abstract). In: Proceedings of the Seventeenth ACM SIGACT-SIGMOD-SIGART Symposium on Principles of Database Systems, PODS’98, Seattle, p. 188. ACM, New York (1998). doi:10.1145/275487.275508

  69. Samarati, P., Sweeney, L.: Protecting privacy when disclosing information: k-anonymity and its enforcement through generalization and suppression. Technical Report SRI-CSL-98-04, SRI International (1998)

  70. Stewart, G.W., Sun, J.G.: Matrix Perturbation Theory. Academic, Boston/New York (1990)

  71. Sweeney, L.: k-anonymity: a model for protecting privacy. Int. J. Uncertain. Fuzziness Knowl.-Based Syst. 10(5), 557–570 (2002)

  72. Trefethen, L.N., Bau III, D.: Numerical Linear Algebra. SIAM, Philadelphia (1997)

  73. Vaidya, J., Clifton, C.: Privacy-preserving k-means clustering over vertically partitioned data. In: KDD’03, Washington DC (2003)

  74. Wright, R., Yang, Z.: Privacy-preserving Bayesian network structure computation on distributed heterogeneous data. In: KDD’04, New York, pp. 713–718 (2004)

  75. Xiao, X., Tao, Y.: M-invariance: Towards privacy preserving re-publication of dynamic datasets. In: SIGMOD 2007, Beijing, pp. 689–700 (2007)

  76. Yang, Z., Zhong, S., Wright, R.N.: Privacy-preserving classification of customer data without loss of accuracy. In: SDM 2005, Newport Beach (2005)

  77. Yao, A.C.C.: Protocols for secure computations. In: FOCS’82, Chicago, pp. 160–164. IEEE (1982)

  78. Zhang, L., Jajodia, S., Brodsky, A.: Information disclosure under realistic assumptions: privacy versus optimality. In: CCS’07, Alexandria, pp. 573–583 (2007)

Author information

Corresponding author

Correspondence to Yitao Duan.

Copyright information

© 2014 Springer Science+Business Media New York

About this chapter

Cite this chapter

Duan, Y., Canny, J. (2014). Practical Distributed Privacy-Preserving Data Analysis at Large Scale. In: Gkoulalas-Divanis, A., Labbi, A. (eds) Large-Scale Data Analytics. Springer, New York, NY. https://doi.org/10.1007/978-1-4614-9242-9_8

  • DOI: https://doi.org/10.1007/978-1-4614-9242-9_8

  • Publisher Name: Springer, New York, NY

  • Print ISBN: 978-1-4614-9241-2

  • Online ISBN: 978-1-4614-9242-9

  • eBook Packages: Computer Science, Computer Science (R0)
