Abstract
In recent years, the number of tools and approaches for defining data science pipelines has increased significantly. These tools support not only pipeline definition but also the generation of the code needed to execute the project, making projects easier to carry out even for non-expert users. However, they still fail to address several challenges, e.g. executing pipelines defined with different tools or in different environments (reproducibility and replicability), or validating and verifying models by identifying inconsistent operations (intentionality). To alleviate these problems, this paper presents a Model-Driven framework for defining data science pipelines independently of the particular execution platform and tools. The framework separates the pipeline definition into two modelling layers: a conceptual layer, where the data scientist specifies all the data and model operations the pipeline carries out, and an operational layer, where the data engineer describes the details of the execution environment in which those operations will be implemented. Based on this abstract definition and layer separation, the approach allows: the use of different tools, thus improving process replicability; the automation of process execution, enhancing process reproducibility; and the definition of model verification rules, enforcing intentionality restrictions.
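The two-layer separation described above can be illustrated with a minimal Python sketch. All class and attribute names here are hypothetical illustrations of the idea, not the paper's actual metamodel: conceptual steps are platform-independent, operational bindings attach tool and environment details, and joining the two layers makes unbound (unimplemented) steps easy to detect.

```python
from dataclasses import dataclass, field

@dataclass
class ConceptualStep:
    """A platform-independent data/model operation (conceptual layer)."""
    name: str
    operation: str            # e.g. "load", "clean", "train", "evaluate"
    params: dict = field(default_factory=dict)

@dataclass
class OperationalBinding:
    """Execution-environment details for one step (operational layer)."""
    step_name: str
    tool: str                 # e.g. "pandas", "scikit-learn", "spark"
    environment: str          # e.g. "local", "docker:python3.10"

def bind(pipeline, bindings):
    """Join conceptual steps with their operational bindings;
    a step paired with None has no execution target yet."""
    by_step = {b.step_name: b for b in bindings}
    return [(step, by_step.get(step.name)) for step in pipeline]

# A toy pipeline: only the ingestion step has been bound to a tool.
pipeline = [
    ConceptualStep("ingest", "load", {"source": "students.csv"}),
    ConceptualStep("train", "train", {"model": "decision_tree"}),
]
bindings = [OperationalBinding("ingest", "pandas", "local")]

unbound = [s.name for s, b in bind(pipeline, bindings) if b is None]
print(unbound)  # the "train" step still lacks an operational binding
```

Keeping the conceptual list free of tool names is what enables replicability: the same pipeline can be re-bound to a different toolset without touching the scientist's specification.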
This work has been partially funded by (i) the Spanish government (LOCOSS project, PID2020-114615RB-I00) and (ii) the European Regional Development Fund (ERDF) and Junta de Extremadura (projects IB18034 and GR18112).
© 2022 Springer Nature Switzerland AG
Cite this paper
Melchor, F., Rodriguez-Echeverria, R., Conejero, J.M., Prieto, Á.E., Gutiérrez, J.D. (2022). A Model-Driven Approach for Systematic Reproducibility and Replicability of Data Science Projects. In: Franch, X., Poels, G., Gailly, F., Snoeck, M. (eds) Advanced Information Systems Engineering. CAiSE 2022. Lecture Notes in Computer Science, vol 13295. Springer, Cham. https://doi.org/10.1007/978-3-031-07472-1_9
Publisher Name: Springer, Cham
Print ISBN: 978-3-031-07471-4
Online ISBN: 978-3-031-07472-1