Abstract
State-of-the-art approaches for managing Big Data pipelines assume that the pipeline anatomy is known by design and expressed through ad-hoc Domain-Specific Languages (DSLs), with insufficient knowledge of the dark data involved in pipeline execution. Dark data is data that organizations acquire during regular business activities but do not use to derive insights or support decision-making. The recent literature on Big Data processing agrees that a new breed of Big Data pipeline discovery (BDPD) solutions can mitigate this issue by analyzing only the event log that keeps track of pipeline executions over time. Relying on well-established process mining techniques, BDPD can reveal fact-based insights into how data pipelines actually unfold and access dark data. However, to date there is no standard format for representing a Big Data pipeline execution in an event log, which makes it challenging to apply process mining to the BDPD task. To address this issue, in this paper we formalize a universally applicable reference data model that conceptualizes the core properties and attributes of a data pipeline execution. We provide an implementation of the model as an extension to the XES interchange standard for event logs, demonstrate its practical applicability in a use case involving a data pipeline for managing digital marketing campaigns, and evaluate its effectiveness in uncovering dark data manipulated during several pipeline executions.
Acknowledgments
This work is supported by the H2020 project DataCloud (Grant number 101016835), and the Sapienza project DISPIPE.
Copyright information
© 2023 The Author(s), under exclusive license to Springer Nature Switzerland AG
About this paper
Cite this paper
Benvenuti, D. et al. (2023). A Reference Data Model to Specify Event Logs for Big Data Pipeline Discovery. In: Di Francescomarino, C., Burattin, A., Janiesch, C., Sadiq, S. (eds) Business Process Management Forum. BPM 2023. Lecture Notes in Business Information Processing, vol 490. Springer, Cham. https://doi.org/10.1007/978-3-031-41623-1_3
DOI: https://doi.org/10.1007/978-3-031-41623-1_3
Publisher Name: Springer, Cham
Print ISBN: 978-3-031-41622-4
Online ISBN: 978-3-031-41623-1
eBook Packages: Computer Science (R0)