Abstract
In data science, data pre-processing and data exploration require many convoluted steps, such as creating variables, merging data sets, filtering records, transforming values, replacing values, and normalization. By analyzing the source code behind analytic pipelines, it is possible to infer how data objects are used and how they relate to each other. To the best of our knowledge, there is little research on analyzing data science source code to provide a data-centric view. On the other hand, two diagrams have proven essential for managing database and software development projects: (1) Entity-Relationship (ER) diagrams, which describe data structure and data interrelationships, and (2) flow diagrams, which capture the main processing steps. These two diagrams have historically been used separately, complementing each other. In this work, we defend the idea that they should be combined into a unified view of data pre-processing and data exploration. With this motivation, we propose a hybrid diagram called FLOWER (FLOW+ER) that combines modern UML notation with data flow symbols in order to understand complex data pipelines embedded in source code (most commonly Python). The goal of FLOWER is to assist data scientists by providing a reverse-engineered, data-centric analytic view. We present a preliminary demonstration of the FLOWER concept, incorporated into a prototype that traces a representative data pipeline and automatically builds a diagram capturing data relationships and data flow.
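To make the idea of inferring data relationships from pipeline source code concrete, the following is a minimal sketch. It is not the FLOWER prototype's implementation. It only assumes that a pipeline is written as top-level Python assignments, and it uses the standard `ast` module to record which data objects feed into which. The sample pipeline, the file names in it, and the helpers `call_name` and `trace_flow` are hypothetical and introduced here for illustration only.

```python
import ast

# Hypothetical pipeline source; it is parsed, never executed,
# so pandas and the CSV files do not need to exist.
PIPELINE_SOURCE = '''
import pandas as pd
customers = pd.read_csv("customers.csv")
orders = pd.read_csv("orders.csv")
recent = orders[orders["year"] >= 2020]
joined = customers.merge(recent, on="customer_id")
summary = joined.groupby("region").sum()
'''


def call_name(node):
    """Label an expression by its outermost operation (a rough heuristic)."""
    if isinstance(node, ast.Call):
        func = node.func
        if isinstance(func, ast.Attribute):
            return func.attr      # e.g. customers.merge(...) -> "merge"
        if isinstance(func, ast.Name):
            return func.id        # e.g. normalize(df) -> "normalize"
    if isinstance(node, ast.Subscript):
        return "subscript"        # e.g. orders[mask]
    return "assign"


def trace_flow(source):
    """Return (input, operation, output) edges between assigned variables."""
    tree = ast.parse(source)
    known = set()   # variables assigned so far, treated as data objects
    edges = []
    for stmt in tree.body:
        # Assumption: only simple top-level assignments "name = expr" count.
        if not isinstance(stmt, ast.Assign):
            continue
        if len(stmt.targets) != 1 or not isinstance(stmt.targets[0], ast.Name):
            continue
        target = stmt.targets[0].id
        # Every previously assigned variable read by the right-hand side
        # is treated as an input of the new data object.
        inputs = sorted(
            {n.id for n in ast.walk(stmt.value)
             if isinstance(n, ast.Name) and n.id in known}
        )
        operation = call_name(stmt.value)
        for src in inputs:
            edges.append((src, operation, target))
        known.add(target)
    return edges


if __name__ == "__main__":
    for src, operation, dst in trace_flow(PIPELINE_SOURCE):
        print(f"{src} --{operation}--> {dst}")
```

Each edge could serve as one flow arrow between two entities in a combined diagram. A real tool would also need to handle functions, loops, in-place updates, and column-level detail, none of which this sketch attempts.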
Notes
1. Full code is available on GitHub at https://github.com/Big-Data-Systems/FLOWERPrototype.
Copyright information
© 2023 The Author(s), under exclusive license to Springer Nature Switzerland AG
About this paper
Cite this paper
Mitchell, E., Berkani, N., Bellatreche, L., Ordonez, C. (2023). FLOWER: Viewing Data Flow in ER Diagrams. In: Wrembel, R., Gamper, J., Kotsis, G., Tjoa, A.M., Khalil, I. (eds) Big Data Analytics and Knowledge Discovery. DaWaK 2023. Lecture Notes in Computer Science, vol 14148. Springer, Cham. https://doi.org/10.1007/978-3-031-39831-5_32
DOI: https://doi.org/10.1007/978-3-031-39831-5_32
Publisher Name: Springer, Cham
Print ISBN: 978-3-031-39830-8
Online ISBN: 978-3-031-39831-5
eBook Packages: Computer Science, Computer Science (R0)