Abstract
Data leakage is a well-known problem in machine learning that occurs when the training and testing datasets are not independent. It leads to overly optimistic accuracy estimates at training time, followed by a significant drop in performance when models are deployed in the real world. Such failures can be dangerous, notably when models are used for risk prediction in high-stakes applications.
In this paper, we propose an abstract interpretation-based static analysis to prove the absence of data leakage. We implemented it in the NBLyzer framework and we demonstrate its performance and precision on 2111 Jupyter notebooks from the Kaggle competition platform.
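To make the notion of dependent training and testing data concrete, the following minimal sketch shows one common way such a dependence can arise: computing normalization statistics over all rows before splitting them. It is only an illustration under stated assumptions, not the static analysis proposed in the paper. All names (`leaky_pipeline`, `clean_pipeline`, `train_depends_on_test`) and the sample data are hypothetical, and the check is a dynamic experiment on one input rather than a proof.

```python
import statistics


def standardize(values, mean, std):
    """Shift and scale values using the given mean and standard deviation."""
    return [(v - mean) / std for v in values]


def leaky_pipeline(data, n_train):
    """Hypothetical leaky pipeline: statistics are computed over ALL rows,
    including the rows that are later held out for testing."""
    mean = statistics.fmean(data)
    std = statistics.pstdev(data)
    scaled = standardize(data, mean, std)
    return scaled[:n_train], scaled[n_train:]


def clean_pipeline(data, n_train):
    """Hypothetical leak-free pipeline: split first, then compute the
    statistics on the training rows only and reuse them for the test rows."""
    train, test = data[:n_train], data[n_train:]
    mean = statistics.fmean(train)
    std = statistics.pstdev(train)
    return standardize(train, mean, std), standardize(test, mean, std)


def train_depends_on_test(pipeline, data, n_train):
    """Empirical check on one input: perturb only the held-out rows and
    report whether the training features change as a result."""
    perturbed = data[:n_train] + [v + 100.0 for v in data[n_train:]]
    train_original, _ = pipeline(data, n_train)
    train_perturbed, _ = pipeline(perturbed, n_train)
    return train_original != train_perturbed


# Made-up sample: the first four rows are used for training, the rest for testing.
data = [1.0, 2.0, 3.0, 4.0, 10.0, 20.0]

leaky = train_depends_on_test(leaky_pipeline, data, 4)
clean = train_depends_on_test(clean_pipeline, data, 4)
print("leaky pipeline: training features depend on test rows:", leaky)
print("clean pipeline: training features depend on test rows:", clean)
```

In the first pipeline the training features change when only the held-out rows change, so the two datasets are not independent. A dynamic check like this can only exhibit a dependence on a particular input; showing that no dependence exists for any input is what a static analysis aims to establish.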
Notes
- 1. Experiments were run on a Ryzen 9 6900HS with 24 GB of DDR5 memory running Ubuntu 22.04.
- 2. We found a number of soundness issues in [20] when working on our formalization.
References
Chouldechova, A., Prado, D.B., Fialko, O., Vaithianathan, R.: A case study of algorithm-assisted decision making in child maltreatment hotline screening decisions. In: FAT, pp. 134–148 (2018)
Cousot, P.: Constructive design of a hierarchy of semantics of a transition system by abstract interpretation. Theor. Comput. Sci. 277(1–2), 47–103 (2002)
Cousot, P.: Abstract semantic dependency. In: Chang, B.-Y.E. (ed.) SAS 2019. LNCS, vol. 11822, pp. 389–410. Springer, Cham (2019). https://doi.org/10.1007/978-3-030-32304-2_19
Cousot, P., Cousot, R.: Static determination of dynamic properties of programs. In: Second International Symposium on Programming, pp. 106–130 (1976)
Cousot, P., Cousot, R.: Abstract interpretation: a unified lattice model for static analysis of programs by construction or approximation of fixpoints. In: POPL, pp. 238–252 (1977)
Cousot, P., Cousot, R.: Systematic design of program analysis frameworks. In: POPL, pp. 269–282 (1979)
Cousot, P., Cousot, R.: Higher order abstract interpretation (and application to comportment analysis generalizing strictness, termination, projection, and PER analysis). In: ICCL, pp. 95–112 (1994)
Drobnjaković, F., Subotić, P., Urban, C.: An abstract interpretation-based data leakage static analysis. CoRR abs/2211.16073 (2022). https://arxiv.org/abs/2211.16073
Guzharina, A.: We downloaded 10m Jupyter notebooks from GitHub - this is what we learned (2020). https://blog.jetbrains.com/datalore/2020/12/17/we-downloaded-10-000-000-jupyter-notebooks-from-github-this-is-what-we-learned/. Accessed 22 Jan 2022
Kapoor, S., Narayanan, A.: Leakage and the reproducibility crisis in machine-learning-based science. Patterns 4(9), 100804 (2023)
Kaufman, S., Rosset, S., Perlich, C., Stitelman, O.: Leakage in data mining: formulation, detection, and avoidance. ACM Trans. Knowl. Discov. Data 6(4) (2012)
Kharkar, A., Moghaddam, R.Z., Jin, M., Liu, X., Shi, X., Clement, C., Sundaresan, N.: Learning to reduce false positives in analytic bug detectors. In: ICSE, pp. 1307–1316 (2022)
Lagouvardos, S., Dolby, J., Grech, N., Antoniadis, A., Smaragdakis, Y.: Static analysis of shape in TensorFlow programs. In: ECOOP, pp. 15:1–15:29 (2020)
Macke, S., Gong, H., Lee, D.J.L., Head, A., Xin, D., Parameswaran, A.G.: Fine-grained lineage for safer notebook interactions. CoRR abs/2012.06981 (2020). https://arxiv.org/abs/2012.06981
Miné, A.: Weakly relational numerical abstract domains. Ph.D. thesis, École Polytechnique, Palaiseau, France (2004). https://tel.archives-ouvertes.fr/tel-00136630
Namaki, M.H., et al.: Vamsa: automated provenance tracking in data science scripts. In: KDD, pp. 1542–1551 (2020)
Nisbet, R., Miner, G., Yale, K.: Handbook of Statistical Analysis and Data Mining Applications, 2nd edn. Academic Press, Boston (2018). https://doi.org/10.1016/C2012-0-06451-4
Papadimitriou, P., Garcia-Molina, H.: A model for data leakage detection. In: ICDE, pp. 1307–1310 (2009)
Perkel, J.: Why Jupyter is data scientists’ computational notebook of choice. Nature 563, 145–146 (2018)
Subotić, P., Bojanić, U., Stojić, M.: Statically detecting data leakages in data science code. In: SOAP, pp. 16–22 (2022)
Subotić, P., Milikić, L., Stojić, M.: A static analysis framework for data science notebooks. In: ICSE, pp. 13–22 (2022)
Urban, C.: Static analysis of data science software. In: SAS, pp. 17–23 (2019)
Urban, C., Müller, P.: An abstract interpretation framework for input data usage. In: Ahmed, A. (ed.) ESOP 2018. LNCS, vol. 10801, pp. 683–710. Springer, Cham (2018). https://doi.org/10.1007/978-3-319-89884-1_24
Wong, A., et al.: External validation of a widely implemented proprietary sepsis prediction model in hospitalized patients. JAMA Intern. Med. (2021)
Acknowledgements
We thank our colleagues at Microsoft Azure Data Labs and Microsoft Development Centre Serbia for all their feedback and support.
Copyright information
© 2024 The Author(s), under exclusive license to Springer Nature Switzerland AG
About this paper
Cite this paper
Drobnjaković, F., Subotić, P., Urban, C. (2024). An Abstract Interpretation-Based Data Leakage Static Analysis. In: Chin, WN., Xu, Z. (eds) Theoretical Aspects of Software Engineering. TASE 2024. Lecture Notes in Computer Science, vol 14777. Springer, Cham. https://doi.org/10.1007/978-3-031-64626-3_7
DOI: https://doi.org/10.1007/978-3-031-64626-3_7
Publisher Name: Springer, Cham
Print ISBN: 978-3-031-64625-6
Online ISBN: 978-3-031-64626-3
eBook Packages: Computer Science, Computer Science (R0)