DOI: 10.1145/3468264.3468532
research-article

TaintStream: fine-grained taint tracking for big data platforms through dynamic code translation

Published: 18 August 2021

Abstract

Big data has become a valuable asset for enterprises and has enabled a range of intelligent applications. Today, it is common to host data on big data platforms (e.g., Spark), where developers can submit scripts to process the original and intermediate data tables. Meanwhile, it is highly desirable to manage the data so that it complies with various privacy requirements. To enable flexible and automated privacy policy enforcement, we propose TaintStream, a fine-grained taint tracking framework for Spark-like big data platforms. TaintStream works by automatically injecting taint tracking logic into the data processing scripts; the injected scripts are dynamically translated to maintain a taint tag for each cell during execution. The dynamic translation rules are carefully designed to guarantee non-interference with the original data operations. By defining different semantics for taint tags, TaintStream can enable various data management applications such as access control, data retention, and user data erasure. Our experiments on a self-crafted benchmark suite show that TaintStream achieves accurate cell-level taint tracking with a precision of 93.0% and less than 15% overhead. We also demonstrate the usefulness of TaintStream through several real-world use cases of privacy policy enforcement.
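
To make the idea of a per-cell taint tag concrete, the following is a minimal sketch in plain Python. It is not TaintStream's implementation and does not use Spark: the names Cell, with_column, and erase_user are hypothetical, and the propagation rule (a derived cell carries the union of the tags of the cells it was computed from) is an assumed, simplified rule chosen for illustration. The abstract does not spell out the actual translation rules.

```python
from dataclasses import dataclass
from typing import Any, Callable, Dict, FrozenSet, List


@dataclass(frozen=True)
class Cell:
    """A table cell paired with its taint tags (hypothetical representation)."""
    value: Any
    tags: FrozenSet[str] = frozenset()


Row = Dict[str, Cell]


def with_column(rows: List[Row], name: str,
                fn: Callable[..., Any], inputs: List[str]) -> List[Row]:
    """Add a derived column to every row.

    The value is computed from the raw input values only, so the tags never
    influence the data result. The new cell's tags are the union of the tags
    of the input cells (assumed propagation rule).
    """
    out = []
    for row in rows:
        args = [row[col] for col in inputs]
        value = fn(*(a.value for a in args))
        tags = frozenset().union(*(a.tags for a in args))
        out.append({**row, name: Cell(value, tags)})
    return out


def erase_user(rows: List[Row], user_tag: str) -> List[Row]:
    """One possible tag semantics: user data erasure.

    Every cell whose tags mention the user is blanked; other cells are kept.
    """
    return [
        {col: (Cell(None, cell.tags) if user_tag in cell.tags else cell)
         for col, cell in row.items()}
        for row in rows
    ]


# Toy input table: name and age are tagged with their owner, city is untagged.
users = [
    {"name": Cell("alice", frozenset({"user:1"})),
     "city": Cell("Oslo"),
     "age": Cell(30, frozenset({"user:1"}))},
    {"name": Cell("bob", frozenset({"user:2"})),
     "city": Cell("Rome"),
     "age": Cell(41, frozenset({"user:2"}))},
]

# A derived column inherits the tags of the cells it was computed from.
derived = with_column(users, "label", lambda n, c: f"{n}@{c}", ["name", "city"])

# Erasing user:1 blanks the original and the derived cells for that user only.
erased = erase_user(derived, "user:1")
```

In this sketch the tags travel alongside the values but never feed into fn, which is the toy analogue of the non-interference property mentioned in the abstract. Swapping erase_user for a different check on the same tags (for example, a reader's permissions or an expiry date) would model access control or data retention instead.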


      Published In

      ESEC/FSE 2021: Proceedings of the 29th ACM Joint Meeting on European Software Engineering Conference and Symposium on the Foundations of Software Engineering
      August 2021
      1690 pages
      ISBN: 9781450385626
      DOI: 10.1145/3468264
      Publisher

      Association for Computing Machinery

      New York, NY, United States

      Author Tags

      1. GDPR
      2. Taint tracking
      3. big data platform
      4. privacy compliance

      Conference

      ESEC/FSE '21
      Acceptance Rates

      Overall Acceptance Rate 112 of 543 submissions, 21%
