DOI: 10.1145/3134600.3134607

Kakute: A Precise, Unified Information Flow Analysis System for Big-data Security

Published: 04 December 2017

Abstract

Big-data frameworks (e.g., Spark) enable computations on enormous numbers of data records generated by third parties, which causes various security and reliability problems such as information leakage and programming bugs. Existing systems for big-data security (e.g., Titian) track data transformations at the record level, so they are too coarse-grained and imprecise for these problems. For instance, when we ran Titian to drill down to the input records that produced a buggy output record, Titian reported 3 to 9 orders of magnitude more input records than the actual ones. Information Flow Tracking (IFT) is a conventional approach for precise information control. However, existing IFT systems are neither efficient nor complete for big-data frameworks, because these frameworks are data-intensive and IFT often ignores data flowing across hosts.
This paper presents Kakute, the first precise, fine-grained information flow analysis system for big-data frameworks. Our insight for making IFT efficient is that most fields in a data record often have the same IFT tags, and we present two new efficient techniques, Reference Propagation and Tag Sharing, that build on it. In addition, we design an efficient, complete cross-host information flow propagation approach. Evaluation on seven diverse big-data programs (e.g., WordCount) shows that Kakute incurred merely 32.3% overhead on average even when fine-grained information control was enabled. Compared with Titian, Kakute precisely drilled down to the actual bug-inducing input records, a huge reduction of 3 to 9 orders of magnitude. Kakute's performance overhead is comparable to Titian's. Furthermore, Kakute effectively detected 13 real-world security and reliability bugs in 4 diverse problems: information leakage, data provenance, programming bugs, and performance bugs. Kakute's source code and results are available at https://github.com/hku-systems/kakute.
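The abstract names Tag Sharing but does not describe its mechanics, so the following is only a minimal sketch of the stated insight that most fields in a record carry the same IFT tags. It assumes tags are immutable label sets and uses a hypothetical `TagPool` that stores each distinct tag once, so fields with equal tags hold a reference to the same object. None of these names come from Kakute's code base.

```python
from typing import Dict, FrozenSet, Iterable, List, Tuple

# All names here are hypothetical illustrations, not Kakute's API.
Tag = FrozenSet[str]


class TagPool:
    """Interns tag sets so identical tags are stored once and shared by reference."""

    def __init__(self) -> None:
        self._pool: Dict[Tag, Tag] = {}

    def intern(self, labels: Iterable[str]) -> Tag:
        tag = frozenset(labels)
        # setdefault returns the already-stored object when an equal tag exists.
        return self._pool.setdefault(tag, tag)

    def union(self, a: Tag, b: Tag) -> Tag:
        # Fast path: combining a tag with itself needs no new object.
        if a is b:
            return a
        return self.intern(a | b)

    def __len__(self) -> int:
        return len(self._pool)


def tag_record(pool: TagPool, values: List[str],
               labels_per_field: List[Iterable[str]]) -> List[Tuple[str, Tag]]:
    """Pair each field value with a (possibly shared) tag object."""
    return [(v, pool.intern(labels)) for v, labels in zip(values, labels_per_field)]


pool = TagPool()
record = tag_record(
    pool,
    ["alice", "hong kong", "4111-xxxx"],
    [{"user_input"}, {"user_input"}, {"user_input", "sensitive"}],
)

# Field-level propagation: a derived value carries the union of its sources' tags.
name_city = (record[0][0] + "@" + record[1][0],
             pool.union(record[0][1], record[1][1]))
```

In this sketch the first two fields share one tag object, so the pool holds two tags for three fields, and the derived `name_city` value reuses the shared tag without picking up the `sensitive` label of the third field. Tracking tags per field is what lets the derived value stay untainted by the unrelated field; a record-level tracker would mark the whole record as sensitive.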


Cited By

  • (2023) SparkAC: Fine-Grained Access Control in Spark for Secure Data Sharing and Analytics. IEEE Transactions on Dependable and Secure Computing 20:2 (1104-1123). DOI: 10.1109/TDSC.2022.3149544. Online publication date: 1-Mar-2023
  • (2022) DisTA: Generic Dynamic Taint Tracking for Java-Based Distributed Systems. 2022 52nd Annual IEEE/IFIP International Conference on Dependable Systems and Networks (DSN) (547-558). DOI: 10.1109/DSN53405.2022.00060. Online publication date: Jun-2022
  • (2021) A Practical Approach for Dynamic Taint Tracking with Control-flow Relationships. ACM Transactions on Software Engineering and Methodology 31:2 (1-43). DOI: 10.1145/3485464. Online publication date: 24-Dec-2021

Published In

ACSAC '17: Proceedings of the 33rd Annual Computer Security Applications Conference
December 2017
618 pages
ISBN:9781450353458
DOI:10.1145/3134600
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from ACM.

Publisher

Association for Computing Machinery

New York, NY, United States

Author Tags

  1. Big-data
  2. Data-intensive Scalable Computing System
  3. Information Flow Tracking

Qualifiers

  • Research-article
  • Research
  • Refereed limited

Conference

ACSAC 2017

Acceptance Rates

Overall Acceptance Rate 104 of 497 submissions, 21%
