A Reference Data Model to Specify Event Logs for Big Data Pipeline Discovery

  • Conference paper
Business Process Management Forum (BPM 2023)

Abstract

State-of-the-art approaches for managing Big Data pipelines assume that the anatomy of a pipeline is known by design and expressed through ad-hoc Domain-Specific Languages (DSLs), with insufficient knowledge of the dark data involved in the pipeline execution. Dark data is data that organizations acquire during regular business activities but do not use to derive insights or for decision-making. The recent literature on Big Data processing agrees that a new breed of Big Data pipeline discovery (BDPD) solutions can mitigate this issue by solely analyzing the event log that keeps track of pipeline executions over time. Relying on well-established process mining techniques, BDPD can reveal fact-based insights into how data pipelines are actually executed and how they access dark data. However, to date, there is no standard format to specify the concept of Big Data pipeline execution in an event log, which makes it challenging to apply process mining to the BDPD task. To address this issue, in this paper we formalize a universally applicable reference data model to conceptualize the core properties and attributes of a data pipeline execution. We provide an implementation of the model as an extension to the XES interchange standard for event logs, demonstrate its practical applicability in a use case involving a data pipeline for managing digital marketing campaigns, and evaluate its effectiveness in uncovering dark data manipulated during several pipeline executions.
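To give a rough idea of what recording pipeline executions in an XES event log can look like, the following Python sketch serializes one pipeline run as an XES trace. It is only an illustration under stated assumptions, not the extension defined in the paper: `concept:name` and `time:timestamp` are standard XES attribute keys, whereas the `pipeline:dataset` key, the `build_log` helper, the step names, and the file names are hypothetical and chosen purely for the example.

```python
import xml.etree.ElementTree as ET


def build_log(executions):
    """Serialize pipeline executions as a minimal XES-style document.

    `executions` maps a run identifier to a list of
    (step_name, iso_timestamp, dataset) tuples.
    """
    log = ET.Element("log", {"xes.version": "1.0"})
    for run_id, events in executions.items():
        # One trace per pipeline execution.
        trace = ET.SubElement(log, "trace")
        ET.SubElement(trace, "string", {"key": "concept:name", "value": run_id})
        for step, timestamp, dataset in events:
            # One event per executed pipeline step.
            event = ET.SubElement(trace, "event")
            ET.SubElement(event, "string", {"key": "concept:name", "value": step})
            ET.SubElement(event, "date", {"key": "time:timestamp", "value": timestamp})
            # Hypothetical key: NOT taken from the paper's extension.
            ET.SubElement(event, "string", {"key": "pipeline:dataset", "value": dataset})
    return log


# Illustrative data only; step and file names are made up.
executions = {
    "run-1": [
        ("ingest", "2023-01-01T10:00:00+00:00", "raw_clicks.csv"),
        ("clean", "2023-01-01T10:05:00+00:00", "clean_clicks.csv"),
    ],
}

log = build_log(executions)
xes_text = ET.tostring(log, encoding="unicode")
print(xes_text)
```

Attaching the touched data set to each event is one possible way to make data that a pipeline reads or writes visible to process mining tools; the attributes actually proposed by the authors may differ.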

Notes

  1. https://www.gartner.com/en/information-technology/glossary/dark-data

  2. https://cordis.europa.eu/project/id/101016835

  3. https://xes-standard.org/

  4. https://www.simio.com/

  5. https://fluxicon.com/disco/

Acknowledgments

This work is supported by the H2020 project DataCloud (Grant number 101016835), and the Sapienza project DISPIPE.

Author information

Corresponding author

Correspondence to Andrea Marrella.

Copyright information

© 2023 The Author(s), under exclusive license to Springer Nature Switzerland AG

About this paper

Cite this paper

Benvenuti, D. et al. (2023). A Reference Data Model to Specify Event Logs for Big Data Pipeline Discovery. In: Di Francescomarino, C., Burattin, A., Janiesch, C., Sadiq, S. (eds) Business Process Management Forum. BPM 2023. Lecture Notes in Business Information Processing, vol 490. Springer, Cham. https://doi.org/10.1007/978-3-031-41623-1_3

  • DOI: https://doi.org/10.1007/978-3-031-41623-1_3

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-031-41622-4

  • Online ISBN: 978-3-031-41623-1

  • eBook Packages: Computer Science, Computer Science (R0)
