ABSTRACT
The ubiquity of computers is driving a massive increase in the amount of data generated by humans and machines. Two natural consequences are increased efforts to (a) derive meaningful information from accumulated data and (b) ensure that data is not used for unintended purposes. For analyzing massive amounts of data (a), tools like MapReduce, Spark, and Dryad, and higher-level scripting languages like Pig Latin and DryadLINQ, have significantly simplified the corresponding tasks for software developers. The second, equally important aspect of ensuring confidentiality (b) has seen little support emerge for programmers: while advances in cryptographic techniques allow computation directly on encrypted data, programmer-friendly and efficient ways of writing such data analysis jobs are still missing. This paper presents novel data flow analyses and program transformations for Pig Latin that automatically enable the execution of corresponding scripts on encrypted data. We avoid fully homomorphic encryption because of its prohibitively high cost; instead, in some cases, we rely on a minimal set of operations performed by the client. We present the algorithms used for this translation and empirically demonstrate the practical performance of our approach, as well as the reduction in programmer effort required to preserve data confidentiality.
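To make "computing directly on encrypted data" concrete, the following is a minimal sketch of an additively homomorphic scheme in the style of Paillier. It is an illustration only, not the paper's implementation: the function names (`keygen`, `encrypt`, `decrypt`, `add_encrypted`) are hypothetical, the primes are toy-sized and insecure, and which scheme the system picks for a given Pig Latin operation is not stated in the abstract.

```python
import math
import secrets

# Toy Paillier-style scheme. All names and parameters here are illustrative
# assumptions; the tiny primes are insecure and chosen only for readability.

def keygen(p=1789, q=1861):
    """Return (public, private) keys built from two small primes."""
    n = p * q
    lam = (p - 1) * (q - 1) // math.gcd(p - 1, q - 1)  # lcm(p-1, q-1)
    g = n + 1                      # standard simple choice of generator
    mu = pow(lam, -1, n)           # modular inverse (Python 3.8+)
    return (n, g), (lam, mu)

def encrypt(pub, m):
    """Encrypt integer m (0 <= m < n) with fresh randomness r."""
    n, g = pub
    n2 = n * n
    while True:
        r = secrets.randbelow(n - 1) + 1   # r in [1, n-1]
        if math.gcd(r, n) == 1:
            break
    return (pow(g, m, n2) * pow(r, n, n2)) % n2

def decrypt(pub, priv, c):
    """Recover the plaintext from ciphertext c."""
    n, _ = pub
    lam, mu = priv
    x = pow(c, lam, n * n)
    return ((x - 1) // n * mu) % n

def add_encrypted(pub, c1, c2):
    """Multiply ciphertexts; the result decrypts to the sum of plaintexts."""
    n, _ = pub
    return (c1 * c2) % (n * n)

if __name__ == "__main__":
    pub, priv = keygen()
    # An untrusted party can combine the two ciphertexts without the key.
    total = add_encrypted(pub, encrypt(pub, 20), encrypt(pub, 22))
    print(decrypt(pub, priv, total))
```

A scheme like this supports only addition on ciphertexts, which is why it is far cheaper than fully homomorphic encryption; operations it cannot express are the kind of case where falling back to the client becomes necessary.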
Index Terms
- Program analysis for secure big data processing