An Abstract Interpretation-Based Data Leakage Static Analysis

Drobnjaković, Filip; Subotić, Pavle; Urban, Caterina

doi:10.1007/978-3-031-64626-3_7

Part of the book series: Lecture Notes in Computer Science (LNCS, volume 14777)

Included in the following conference series:

International Symposium on Theoretical Aspects of Software Engineering

Abstract

Data leakage is a well-known problem in machine learning which occurs when the training and testing datasets are not independent. This phenomenon leads to overly optimistic accuracy estimates at training time, followed by a significant drop in performance when models are deployed in the real world. This can be dangerous, notably when models are used for risk prediction in high-stakes applications.

In this paper, we propose an abstract interpretation-based static analysis to prove the absence of data leakage. We implemented it in the NBLyzer framework and we demonstrate its performance and precision on 2111 Jupyter notebooks from the Kaggle competition platform.
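To make the notion of non-independent training and testing data concrete, here is a minimal sketch of one common way leakage arises: computing normalization statistics over the whole dataset before splitting it. It is not taken from the paper and does not show the proposed analysis; the toy dataset, the `standardize` helper, and all variable names are hypothetical and chosen only for illustration.

```python
from statistics import mean, pstdev


def standardize(values, mu, sigma):
    """Rescale values using the given mean (mu) and standard deviation (sigma)."""
    return [(v - mu) / sigma for v in values]


# Hypothetical toy dataset: the first four rows are for training, the last for testing.
data = [1.0, 2.0, 3.0, 4.0, 100.0]
train, test = data[:4], data[4:]

# Leaky pipeline: statistics are computed on ALL rows before the split,
# so the test row influences every training value.
mu_all, sigma_all = mean(data), pstdev(data)
leaky_train = standardize(train, mu_all, sigma_all)

# Leak-free pipeline: statistics come from the training rows only and are
# then reused, unchanged, on the test rows.
mu_train, sigma_train = mean(train), pstdev(train)
clean_train = standardize(train, mu_train, sigma_train)
clean_test = standardize(test, mu_train, sigma_train)

# Changing only the test row changes the leaky training data: the training
# set is no longer independent of the test set.
data_alt = [1.0, 2.0, 3.0, 4.0, -50.0]
leaky_train_alt = standardize(data_alt[:4], mean(data_alt), pstdev(data_alt))
train_depends_on_test = leaky_train != leaky_train_alt
```

In the leak-free variant the training values depend only on the training rows, so swapping the test row leaves `clean_train` untouched.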

Notes

1. Experiments done on a Ryzen 9 6900HS with 24 GB DDR5 running Ubuntu 22.04.
2. We found a number of soundness issues in [20] when working on our formalization.

References

1. Chouldechova, A., Prado, D.B., Fialko, O., Vaithianathan, R.: A case study of algorithm-assisted decision making in child maltreatment hotline screening decisions. In: FAT, pp. 134–148 (2018)
2. Cousot, P.: Constructive design of a hierarchy of semantics of a transition system by abstract interpretation. Electron. Notes Theor. Comput. Sci. 277(1–2), 47–103 (2002)
3. Cousot, P.: Abstract semantic dependency. In: Chang, B.-Y.E. (ed.) SAS 2019. LNCS, vol. 11822, pp. 389–410. Springer, Cham (2019). https://doi.org/10.1007/978-3-030-32304-2_19
4. Cousot, P., Cousot, R.: Static determination of dynamic properties of programs. In: Second International Symposium on Programming, pp. 106–130 (1976)
5. Cousot, P., Cousot, R.: Abstract interpretation: a unified lattice model for static analysis of programs by construction or approximation of fixpoints. In: POPL, pp. 238–252 (1977)
6. Cousot, P., Cousot, R.: Systematic design of program analysis frameworks. In: POPL, pp. 269–282 (1979)
7. Cousot, P., Cousot, R.: Higher order abstract interpretation (and application to comportment analysis generalizing strictness, termination, projection, and PER analysis). In: ICCL, pp. 95–112 (1994)
8. Drobnjaković, F., Subotić, P., Urban, C.: An abstract interpretation-based data leakage static analysis. CoRR abs/2211.16073 (2022). https://arxiv.org/abs/2211.16073
9. Guzharina, A.: We downloaded 10m Jupyter notebooks from GitHub - this is what we learned (2020). https://blog.jetbrains.com/datalore/2020/12/17/we-downloaded-10-000-000-jupyter-notebooks-from-github-this-is-what-we-learned/. Accessed 22 Jan 2022
10. Kapoor, S., Narayanan, A.: Leakage and the reproducibility crisis in machine-learning-based science. Patterns 4(9), 100804 (2023)
11. Kaufman, S., Rosset, S., Perlich, C., Stitelman, O.: Leakage in data mining: formulation, detection, and avoidance. ACM Trans. Knowl. Discov. Data 6(4) (2012)
12. Kharkar, A., Moghaddam, R.Z., Jin, M., Liu, X., Shi, X., Clement, C., Sundaresan, N.: Learning to reduce false positives in analytic bug detectors. In: ICSE, pp. 1307–1316 (2022)
13. Lagouvardos, S., Dolby, J., Grech, N., Antoniadis, A., Smaragdakis, Y.: Static analysis of shape in TensorFlow programs. In: ECOOP, pp. 15:1–15:29 (2020)
14. Macke, S., Gong, H., Lee, D.J.L., Head, A., Xin, D., Parameswaran, A.G.: Fine-grained lineage for safer notebook interactions. CoRR abs/2012.06981 (2020). https://arxiv.org/abs/2012.06981
15. Miné, A.: Weakly relational numerical abstract domains. Ph.D. thesis, École Polytechnique, Palaiseau, France (2004). https://tel.archives-ouvertes.fr/tel-00136630
16. Namaki, M.H., et al.: Vamsa: automated provenance tracking in data science scripts. In: KDD, pp. 1542–1551 (2020)
17. Nisbet, R., Miner, G., Yale, K.: Handbook of Statistical Analysis and Data Mining Applications, 2nd edn. Academic Press, Boston (2018). https://doi.org/10.1016/C2012-0-06451-4
18. Papadimitriou, P., Garcia-Molina, H.: A model for data leakage detection. In: ICDE, pp. 1307–1310 (2009)
19. Perkel, J.: Why Jupyter is data scientists’ computational notebook of choice. Nature 563, 145–146 (2018)
20. Subotić, P., Bojanić, U., Stojić, M.: Statically detecting data leakages in data science code. In: SOAP, pp. 16–22 (2022)
21. Subotić, P., Milikić, L., Stojić, M.: A static analysis framework for data science notebooks. In: ICSE, pp. 13–22 (2022)
22. Urban, C.: Static analysis of data science software. In: SAS, pp. 17–23 (2019)
23. Urban, C., Müller, P.: An abstract interpretation framework for input data usage. In: Ahmed, A. (ed.) ESOP 2018. LNCS, vol. 10801, pp. 683–710. Springer, Cham (2018). https://doi.org/10.1007/978-3-319-89884-1_24
24. Wong, A., et al.: External validation of a widely implemented proprietary sepsis prediction model in hospitalized patients. JAMA Intern. Med. (2021)

Acknowledgements

We thank our colleagues at Microsoft Azure Data Labs and Microsoft Development Centre Serbia for all their feedback and support.

Author information

Authors and Affiliations

Microsoft, Beograd, Serbia
Filip Drobnjaković & Pavle Subotić
Inria & ENS | PSL, Paris, France
Caterina Urban

Corresponding author

Correspondence to Caterina Urban.

Editor information

Editors and Affiliations

National University of Singapore, Singapore, Singapore
Wei-Ngan Chin
Shenzhen University, Guangdong, China
Zhiwu Xu

About this paper

Cite this paper

Drobnjaković, F., Subotić, P., Urban, C. (2024). An Abstract Interpretation-Based Data Leakage Static Analysis. In: Chin, WN., Xu, Z. (eds) Theoretical Aspects of Software Engineering. TASE 2024. Lecture Notes in Computer Science, vol 14777. Springer, Cham. https://doi.org/10.1007/978-3-031-64626-3_7

DOI: https://doi.org/10.1007/978-3-031-64626-3_7
Published: 14 July 2024
Publisher Name: Springer, Cham
Print ISBN: 978-3-031-64625-6
Online ISBN: 978-3-031-64626-3
eBook Packages: Computer Science, Computer Science (R0)