FLOWER: Viewing Data Flow in ER Diagrams

  • Conference paper
Big Data Analytics and Knowledge Discovery (DaWaK 2023)

Abstract

In data science, data pre-processing and data exploration require various convoluted steps such as creating variables, merging data sets, filtering records, value transformation, value replacement and normalization. By analyzing the source code behind analytic pipelines, it is possible to infer the nature of how data objects are used and related to each other. To the best of our knowledge, there is scarce research on analyzing data science source code to provide a data-centric view. On the other hand, two important diagrams have proven to be essential to manage database and software development projects: (1) Entity-Relationship (ER) diagrams (to understand data structure and data interrelationships) and (2) flow diagrams (to capture main processing steps). These two diagrams have historically been used separately, complementing each other. In this work, we defend the idea that these two diagrams should be combined in a unified view of data pre-processing and data exploration. Heeding such motivation, we propose a hybrid diagram called FLOWER (FLOW+ER) that combines modern UML notation with data flow symbols, in order to understand complex data pipelines embedded in source code (most commonly Python). The goal of FLOWER is to assist data scientists by providing a reverse-engineered analytic view, with a data-centric angle. We present a preliminary demonstration of the concept of FLOWER, where it is incorporated into a prototype that traces a representative data pipeline and automatically builds a diagram capturing data relationships and data flow.
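
The idea of inferring data relationships from pipeline source code can be illustrated with a minimal sketch. It is not the FLOWER prototype: it assumes a tiny hypothetical pandas pipeline (the names `customers`, `orders`, `joined` and `recent` and the CSV file names are invented for illustration) and uses only Python's standard `ast` module to list which variables feed into which, labeled with the operation that produced them. The pipeline text is parsed, never executed.

```python
import ast

# Hypothetical pipeline source; it is only parsed, never run.
PIPELINE = '''
import pandas as pd
customers = pd.read_csv("customers.csv")
orders = pd.read_csv("orders.csv")
joined = customers.merge(orders, on="customer_id")
recent = joined[joined["year"] >= 2020]
'''

def operation_label(expr):
    """Give a short label for the expression that produced a variable."""
    if isinstance(expr, ast.Call) and isinstance(expr.func, ast.Attribute):
        return expr.func.attr              # e.g. "merge", "read_csv"
    if isinstance(expr, ast.Subscript):
        return "subscript"                 # e.g. row filtering with df[...]
    return type(expr).__name__.lower()

def trace_flow(source):
    """Return (input, output, operation) edges between assigned variables."""
    known = set()                          # variables assigned so far
    edges = []
    for node in ast.parse(source).body:    # top-level statements only
        if not isinstance(node, ast.Assign):
            continue
        targets = [t.id for t in node.targets if isinstance(t, ast.Name)]
        # Previously assigned variables that are read on the right-hand side.
        inputs = sorted({n.id for n in ast.walk(node.value)
                         if isinstance(n, ast.Name) and n.id in known})
        label = operation_label(node.value)
        for target in targets:
            for name in inputs:
                edges.append((name, target, label))
            known.add(target)
    return edges

for src, dst, op in trace_flow(PIPELINE):
    print(f"{src} --{op}--> {dst}")
```

Each edge can be read as one arrow of a flow diagram between data objects. A diagram in the spirit of the one described above would additionally need attribute-level and key information, which this sketch does not attempt to recover.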

Notes

  1. Full code is available on GitHub at https://github.com/Big-Data-Systems/FLOWERPrototype.

Author information

Corresponding author

Correspondence to Elijah Mitchell.

Copyright information

© 2023 The Author(s), under exclusive license to Springer Nature Switzerland AG

About this paper

Cite this paper

Mitchell, E., Berkani, N., Bellatreche, L., Ordonez, C. (2023). FLOWER: Viewing Data Flow in ER Diagrams. In: Wrembel, R., Gamper, J., Kotsis, G., Tjoa, A.M., Khalil, I. (eds) Big Data Analytics and Knowledge Discovery. DaWaK 2023. Lecture Notes in Computer Science, vol 14148. Springer, Cham. https://doi.org/10.1007/978-3-031-39831-5_32

  • DOI: https://doi.org/10.1007/978-3-031-39831-5_32

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-031-39830-8

  • Online ISBN: 978-3-031-39831-5

  • eBook Packages: Computer Science (R0)
