An Abstract Interpretation-Based Data Leakage Static Analysis

  • Conference paper
Theoretical Aspects of Software Engineering (TASE 2024)

Part of the book series: Lecture Notes in Computer Science (LNCS, volume 14777)

Abstract

Data leakage is a well-known problem in machine learning that occurs when the training and testing datasets are not independent. It leads to overly optimistic accuracy estimates at training time, followed by a significant drop in performance when models are deployed in the real world. This can be dangerous, notably when models are used for risk prediction in high-stakes applications.
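
As a concrete illustration of the phenomenon, the following minimal sketch contrasts a leaky preprocessing step with a leak-free one. The toy data, the `standardize` helper, and the variable names are assumptions made for this example and do not come from the paper. When the scaling statistics are computed before the split, the training inputs depend on the test rows; when they are computed on the training rows only, they do not.

```python
from statistics import mean, pstdev

def standardize(values, mu, sigma):
    """Scale values using the given mean and standard deviation."""
    return [(v - mu) / sigma for v in values]

# Hypothetical toy feature column; the last two rows play the role of the test set.
data = [1.0, 2.0, 3.0, 4.0, 100.0, 200.0]
train_raw, test_raw = data[:4], data[4:]

# Leaky pipeline: statistics are computed on ALL rows before splitting,
# so the training inputs depend on the test rows.
leaky_train = standardize(train_raw, mean(data), pstdev(data))

# Leak-free pipeline: statistics come from the training rows only
# and are then reused to transform the test rows.
mu, sigma = mean(train_raw), pstdev(train_raw)
clean_train = standardize(train_raw, mu, sigma)
clean_test = standardize(test_raw, mu, sigma)

# Changing a single test row alters the leaky training data
# but leaves the leak-free training data untouched.
data2 = data[:4] + [100.0, 5000.0]
leaky_train2 = standardize(data2[:4], mean(data2), pstdev(data2))
clean_train2 = standardize(data2[:4], mean(data2[:4]), pstdev(data2[:4]))

print(leaky_train != leaky_train2)   # True
print(clean_train == clean_train2)   # True
```

The two final comparisons make the dependence visible: in the leaky variant, information about the test rows flows into the training inputs through the shared statistics.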

In this paper, we propose an abstract interpretation-based static analysis to prove the absence of data leakage. We implemented it in the NBLyzer framework and demonstrate its performance and precision on 2111 Jupyter notebooks from the Kaggle competition platform.
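
To give a rough intuition for what a static check of this kind might look like, here is a toy sketch that tracks, for each variable, which input rows it may depend on. It is a simplified illustration written for this page, not the abstract domain of the paper and not NBLyzer's implementation: the statement format, the operation names (`read`, `normalize`, `split`, `train`, `test`), the label encoding, and the `may_leak` function are all hypothetical.

```python
def split_label(label, side):
    """Refine a raw row label to one side of a split; "mix" labels are kept whole."""
    if label[0] == "rows":
        _, name, path = label
        return ("rows", name, path + (side,))
    return label  # a mixed value depends on every row it was computed from

def base_rows(label):
    """Expand a label into the raw row labels it may depend on."""
    if label[0] == "rows":
        return {label}
    return set().union(*(base_rows(inner) for inner in label[1]))

def rows_overlap(a, b):
    """Two row labels overlap if same source and one split path prefixes the other."""
    (_, name_a, path_a), (_, name_b, path_b) = a, b
    k = min(len(path_a), len(path_b))
    return name_a == name_b and path_a[:k] == path_b[:k]

def may_leak(program):
    """Return True if independence of train and test data cannot be shown."""
    env, train_deps, test_deps = {}, set(), set()
    for stmt in program:
        op, args = stmt[0], stmt[1:]
        if op == "read":
            var, name = args
            env[var] = frozenset({("rows", name, ())})
        elif op == "normalize":
            # Row-mixing operation: every output row depends on all input rows.
            dst, src = args
            env[dst] = frozenset({("mix", env[src])})
        elif op == "split":
            left, right, src = args
            env[left] = frozenset(split_label(l, 0) for l in env[src])
            env[right] = frozenset(split_label(l, 1) for l in env[src])
        elif op == "train":
            train_deps |= env[args[0]]
        elif op == "test":
            test_deps |= env[args[0]]
        else:
            raise ValueError(f"unknown operation: {op}")
    train_rows = set().union(*(base_rows(l) for l in train_deps))
    test_rows = set().union(*(base_rows(l) for l in test_deps))
    return any(rows_overlap(a, b) for a in train_rows for b in test_rows)

# Normalizing before the split makes both halves depend on all rows.
leaky = [("read", "df", "data.csv"), ("normalize", "norm", "df"),
         ("split", "tr", "te", "norm"), ("train", "tr"), ("test", "te")]

# Splitting first keeps the two halves on disjoint rows.
clean = [("read", "df", "data.csv"), ("split", "tr", "te", "df"),
         ("normalize", "ntr", "tr"), ("normalize", "nte", "te"),
         ("train", "ntr"), ("test", "nte")]

print(may_leak(leaky))  # True
print(may_leak(clean))  # False
```

The sketch over-approximates: a `True` result only means that independence could not be established, whereas a `False` result corresponds to the "absence of data leakage" verdict for this toy language.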

Notes

  1. Experiments were run on a Ryzen 9 6900HS with 24 GB of DDR5 memory, running Ubuntu 22.04.

  2. We found a number of soundness issues in [20] when working on our formalization.

References

  1. Chouldechova, A., Prado, D.B., Fialko, O., Vaithianathan, R.: A case study of algorithm-assisted decision making in child maltreatment hotline screening decisions. In: FAT, pp. 134–148 (2018)

  2. Cousot, P.: Constructive design of a hierarchy of semantics of a transition system by abstract interpretation. Theor. Comput. Sci. 277(1–2), 47–103 (2002)

  3. Cousot, P.: Abstract semantic dependency. In: Chang, B.-Y.E. (ed.) SAS 2019. LNCS, vol. 11822, pp. 389–410. Springer, Cham (2019). https://doi.org/10.1007/978-3-030-32304-2_19

  4. Cousot, P., Cousot, R.: Static determination of dynamic properties of programs. In: Second International Symposium on Programming, pp. 106–130 (1976)

  5. Cousot, P., Cousot, R.: Abstract interpretation: a unified lattice model for static analysis of programs by construction or approximation of fixpoints. In: POPL, pp. 238–252 (1977)

  6. Cousot, P., Cousot, R.: Systematic design of program analysis frameworks. In: POPL, pp. 269–282 (1979)

  7. Cousot, P., Cousot, R.: Higher order abstract interpretation (and application to comportment analysis generalizing strictness, termination, projection, and PER analysis). In: ICCL, pp. 95–112 (1994)

  8. Drobnjaković, F., Subotić, P., Urban, C.: An abstract interpretation-based data leakage static analysis. CoRR abs/2211.16073 (2022). https://arxiv.org/abs/2211.16073

  9. Guzharina, A.: We downloaded 10m Jupyter notebooks from GitHub - this is what we learned (2020). https://blog.jetbrains.com/datalore/2020/12/17/we-downloaded-10-000-000-jupyter-notebooks-from-github-this-is-what-we-learned/. Accessed 22 Jan 2022

  10. Kapoor, S., Narayanan, A.: Leakage and the reproducibility crisis in machine-learning-based science. Patterns 4(9), 100804 (2023)

  11. Kaufman, S., Rosset, S., Perlich, C., Stitelman, O.: Leakage in data mining: formulation, detection, and avoidance. ACM Trans. Knowl. Discov. Data 6(4) (2012)

  12. Kharkar, A., Moghaddam, R.Z., Jin, M., Liu, X., Shi, X., Clement, C., Sundaresan, N.: Learning to reduce false positives in analytic bug detectors. In: ICSE, pp. 1307–1316 (2022)

  13. Lagouvardos, S., Dolby, J., Grech, N., Antoniadis, A., Smaragdakis, Y.: Static analysis of shape in TensorFlow programs. In: ECOOP, pp. 15:1–15:29 (2020)

  14. Macke, S., Gong, H., Lee, D.J.L., Head, A., Xin, D., Parameswaran, A.G.: Fine-grained lineage for safer notebook interactions. CoRR abs/2012.06981 (2020). https://arxiv.org/abs/2012.06981

  15. Miné, A.: Weakly relational numerical abstract domains. Ph.D. thesis, École Polytechnique, Palaiseau, France (2004). https://tel.archives-ouvertes.fr/tel-00136630

  16. Namaki, M.H., et al.: Vamsa: automated provenance tracking in data science scripts. In: KDD, pp. 1542–1551 (2020)

  17. Nisbet, R., Miner, G., Yale, K.: Handbook of Statistical Analysis and Data Mining Applications, 2nd edn. Academic Press, Boston (2018). https://doi.org/10.1016/C2012-0-06451-4

  18. Papadimitriou, P., Garcia-Molina, H.: A model for data leakage detection. In: ICDE, pp. 1307–1310 (2009)

  19. Perkel, J.: Why Jupyter is data scientists’ computational notebook of choice. Nature 563, 145–146 (2018)

  20. Subotić, P., Bojanić, U., Stojić, M.: Statically detecting data leakages in data science code. In: SOAP, pp. 16–22 (2022)

  21. Subotić, P., Milikić, L., Stojić, M.: A static analysis framework for data science notebooks. In: ICSE, pp. 13–22 (2022)

  22. Urban, C.: Static analysis of data science software. In: SAS, pp. 17–23 (2019)

  23. Urban, C., Müller, P.: An abstract interpretation framework for input data usage. In: Ahmed, A. (ed.) ESOP 2018. LNCS, vol. 10801, pp. 683–710. Springer, Cham (2018). https://doi.org/10.1007/978-3-319-89884-1_24

  24. Wong, A., et al.: External validation of a widely implemented proprietary sepsis prediction model in hospitalized patients. JAMA Intern. Med. (2021)

Acknowledgements

We thank our colleagues at Microsoft Azure Data Labs and Microsoft Development Centre Serbia for all their feedback and support.

Author information

Corresponding author

Correspondence to Caterina Urban.

Copyright information

© 2024 The Author(s), under exclusive license to Springer Nature Switzerland AG

About this paper

Cite this paper

Drobnjaković, F., Subotić, P., Urban, C. (2024). An Abstract Interpretation-Based Data Leakage Static Analysis. In: Chin, W.-N., Xu, Z. (eds.) Theoretical Aspects of Software Engineering. TASE 2024. Lecture Notes in Computer Science, vol. 14777. Springer, Cham. https://doi.org/10.1007/978-3-031-64626-3_7

  • DOI: https://doi.org/10.1007/978-3-031-64626-3_7

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-031-64625-6

  • Online ISBN: 978-3-031-64626-3

  • eBook Packages: Computer Science, Computer Science (R0)
