Scalable graph-based OLAP analytics over process execution data

Distributed and Parallel Databases

Abstract

In today’s knowledge-, service-, and cloud-based economy, businesses accumulate massive amounts of data from a variety of sources. Understanding a business may require considerable analytics over large hybrid collections of heterogeneous and partially unstructured data captured during process execution. This data, usually modeled as graphs, increasingly shows all the typical properties of big data: wide physical distribution, diversity of formats, non-standard data models, and independently managed, heterogeneous semantics. We use the term big process graph to refer to such large hybrid collections of heterogeneous and partially unstructured process-related execution data. Online analytical processing (OLAP) of big process graphs is challenging because extending existing OLAP techniques to the analysis of graphs is not straightforward. Moreover, process data analysis methods should process and query large amounts of data effectively and efficiently, and therefore must scale well with the size of the infrastructure. While traditional analytics solutions (relational DBs, data warehouses, and OLAP) do a great job of collecting data and answering known questions, key business insights remain hidden in the interactions among objects: it is hard to discover concept hierarchies for entities based on both data objects and their interactions in process graphs. In this paper, we introduce a framework and a set of methods to support scalable graph-based OLAP analytics over process execution data. The goal is to facilitate analytics over big process graphs by summarizing the process graph and providing multiple views at different granularities. To achieve this goal, we present a model for process OLAP (P-OLAP) and define OLAP-specific abstractions in the process context, such as process cubes, dimensions, and cells. We present a MapReduce-based graph processing engine to support big data analytics over process graphs. We have implemented the P-OLAP framework and integrated it into our existing process data analytics platform, ProcessAtlas, which introduces a scalable architecture for querying, exploring, and analyzing large process data. We report on experiments performed on both synthetic and real-world datasets that show the viability and efficiency of the approach.
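To make the idea of summarizing a process graph at a chosen granularity concrete, the following is a minimal sketch, not the P-OLAP implementation. It assumes a toy in-memory graph and imitates map and reduce phases in plain Python. The names `NODES`, `EDGES`, `map_edge`, `reduce_counts`, and `roll_up` are hypothetical and do not come from the paper or from ProcessAtlas.

```python
from collections import defaultdict

# Hypothetical toy process graph: nodes are entities with attributes,
# edges are labeled interactions (source, target, label).
NODES = {
    "alice":   {"type": "person",   "team": "sales"},
    "bob":     {"type": "person",   "team": "sales"},
    "carol":   {"type": "person",   "team": "support"},
    "order1":  {"type": "artifact", "team": "sales"},
    "ticket1": {"type": "artifact", "team": "support"},
}
EDGES = [
    ("alice", "order1", "created"),
    ("bob", "order1", "updated"),
    ("carol", "ticket1", "created"),
    ("bob", "ticket1", "created"),
]


def map_edge(edge, nodes, dimension):
    """Map phase: emit one (cell key, 1) pair per edge.

    The cell key replaces each endpoint by its value on the chosen dimension.
    """
    src, dst, label = edge
    yield (nodes[src][dimension], nodes[dst][dimension], label), 1


def reduce_counts(key, values):
    """Reduce phase: aggregate all values that share a cell key."""
    return key, sum(values)


def roll_up(nodes, edges, dimension):
    """Summarize the graph along one dimension (assumed measure: edge count)."""
    grouped = defaultdict(list)  # stands in for the shuffle step
    for edge in edges:
        for key, value in map_edge(edge, nodes, dimension):
            grouped[key].append(value)
    return dict(reduce_counts(key, values) for key, values in grouped.items())


if __name__ == "__main__":
    print(roll_up(NODES, EDGES, "type"))  # coarse view: by entity type
    print(roll_up(NODES, EDGES, "team"))  # finer view: by team
```

Calling `roll_up` with different dimensions yields views of the same graph at different granularities. In a real deployment the grouping step would be handled by the MapReduce runtime rather than by an in-memory dictionary.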

Notes

  1. http://hadoop.apache.org/.

  2. http://scienceblogs.com/.

  3. http://arxiv.org/.

  4. http://www.myexperiment.org/.

  5. https://www.box.com/.

  6. Organizations such as the International Federation of Library Associations and Institutions (IFLA) have initiated processes to collect and organize literature and information (http://www.ifla.org/).

  7. http://dblp.uni-trier.de/db/.

  8. http://snap.stanford.edu/data/amazon-meta.html.

  9. An Iceberg-Cube contains only those cells of the data cube that meet an aggregate condition. It is called an Iceberg-Cube because it contains only some of the cells of the full cube [23].

  10. Skyline [25] has been proposed as an important operator for multi-criteria decision making, data mining and visualization, and user preference queries.

  11. An information-network [46] is a network where each node represents an entity (which may have attributes, labels, and weights) and each link (which may have rich semantic information) represents a relationship between two entities.

  12. ProM is the world-leading process mining toolkit. It is an extensible framework that supports a wide variety of process mining techniques in the form of plug-ins.

  13. Linked Data is a method of publishing data on the web based on principles that significantly enhance the adaptability and usability of data, either by humans or machines [24].
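The definitions in notes 9 and 10 can be illustrated with a short sketch. It is a naive illustration under stated assumptions, not the algorithms of [23] or [25]: the iceberg cube is built by counting every cell and then filtering, the aggregate condition is assumed to be a minimum count, and the skyline assumes that smaller values are better in every coordinate. The function names `iceberg_cube` and `skyline` are hypothetical.

```python
from collections import Counter
from itertools import combinations


def iceberg_cube(rows, dims, min_count):
    """Return only the cube cells whose row count is at least min_count.

    A cell is a tuple of (dimension, value) pairs; the empty tuple is the
    apex cell that aggregates all rows.
    """
    cells = Counter()
    for row in rows:
        for size in range(len(dims) + 1):
            for subset in combinations(dims, size):
                cells[tuple((d, row[d]) for d in subset)] += 1
    return {cell: n for cell, n in cells.items() if n >= min_count}


def skyline(points):
    """Return the points that no other point dominates (smaller is better)."""
    def dominates(p, q):
        # p dominates q if it is no worse everywhere and strictly better somewhere.
        return (all(a <= b for a, b in zip(p, q))
                and any(a < b for a, b in zip(p, q)))

    return [p for p in points if not any(dominates(q, p) for q in points)]


if __name__ == "__main__":
    rows = [
        {"task": "review", "role": "manager"},
        {"task": "review", "role": "clerk"},
        {"task": "approve", "role": "manager"},
    ]
    print(iceberg_cube(rows, ("task", "role"), min_count=2))
    print(skyline([(1, 5), (2, 2), (3, 3), (5, 1)]))
```

The full cube over two dimensions of this toy input has eight cells, while the iceberg cube keeps only the three that satisfy the count condition.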

References

  1. Aalst, W.M.P.V.D., Dongen, B.F.V., Günther, C.W., Rozinat, A., Verbeek, E., Weijters, T.: ProM: the process mining toolkit. In: Proceedings of the BPM (2009)

  2. Aalst, W.M.P.V.D., Dongen, B.F.V., Herbst, J., Maruster, L., Schimm, G., Weijters, A.J.M.M.: Workflow mining: a survey of issues and approaches. Data Knowl. Eng. 47, 237–267 (2003)

  3. Aalst, W.M.P.V.D.: Process Mining: Discovery, Conformance and Enhancement of Business Processes. Springer, Berlin (2011)

  4. Aalst, W.M.P.V.D.: Service mining: using process mining to discover, check, and improve service behavior. IEEE Trans. Serv. Comput. 99(PrePrints), 1 (2012)

  5. Abadi, D., Marcus, A., Madden, S., Hollenbach, K.: Scalable semantic web data management using vertical partitioning. In: Proceedings of the 33rd International Conference on Very Large Data Bases, pp. 411–422. VLDB Endowment (2007)

  6. Abelló, A., Romero, O.: On-line analytical processing. In: Encyclopedia of Database Systems, pp. 1949–1954. Springer, New York (2009)

  7. Aggarwal, C.C., Wang, H.: Managing and Mining Graph Data. Springer, New York (2010)

  8. Akal, F., Böhm, K., Schek, H.J.: OLAP query evaluation in a database cluster: a performance study on intra-query parallelism. In: Proceedings of the ADBIS, pp. 218–231 (2002)

  9. Alkhateeb, F., Baget, J.F., Euzenat, J.: Extending SPARQL with regular expression patterns (for querying RDF). J. Web Sem. 7(2), 57–73 (2009)

  10. Allahbakhsh, M., Ignjatovic, A., Benatallah, B., Beheshti, S.M.R., Bertino, E., Foo, N.: Collusion detection in online rating systems. In: Proceedings of the Web Technologies and Applications—15th Asia-Pacific Web Conference, APWeb 2013, Sydney, April 4–6, 2013, pp. 196–207 (2013)

  11. Allahbakhsh, M., Ignjatovic, A., Benatallah, B., Beheshti, S.M.R., Foo, N., Bertino, E.: Representation and querying of unfair evaluations in social rating systems. Comput. Secur. 41, 68–88 (2014)

  12. Anyanwu, K., Maduko, A., Sheth, A.: SPARQ2L: towards support for subgraph extraction queries in RDF databases. WWW’07, pp. 797–806. ACM, New York (2007)

  13. Azvine, B., Nauck, D., Ho, C.: Intelligent business analytics: a tool to build decision-support systems for ebusinesses. BT Technol. J. 21(4), 65–71 (2003)

  14. Báez, M., Mussi, A., Casati, F., Birukou, A., Marchese, M.: Liquid journals: scientific journals in the Web 2.0 era. In: Proceedings of the JCDL, pp. 395–396 (2010)

  15. Balmin, A., Papadimitriou, T., Papakonstantinou, Y.: Hypothetical queries in an OLAP environment. In: Proceedings of the VLDB, pp. 220–231 (2000)

  16. Barbieri, D.F., Braga, D., Ceri, S., Valle, E.D., Grossniklaus, M.: C-SPARQL: SPARQL for continuous querying. In: Proceedings of the WWW, pp. 1061–1062 (2009)

  17. Beeri, C., Eyal, A., Milo, T., Pilberg, A.: Monitoring business processes with queries. In: Proceedings of the VLDB (2007)

  18. Begel, A., Phang Khoo, Y., Zimmermann, T.: Codebook: discovering and exploiting relationships in software repositories. In: Proceedings of the ICSE’10, pp. 125–134 (2010)

  19. Beheshti, S.M.R., Benatallah, B., Motahari Nezhad, H.R., Allahbakhsh, M.: A framework and a language for on-line analytical processing on graphs. In: Proceedings of the Web Information Systems Engineering—WISE 2012–13th International Conference, Paphos, Cyprus, November 28–30, pp. 213–227 (2012)

  20. Beheshti, S.M.R., Benatallah, B., Motahari Nezhad, H.R., Sakr, S.: A query language for analyzing business processes execution. In: Proceedings of the Business Process Management—9th International Conference, BPM 2011, Clermont-Ferrand, France, August 30—September 2, pp. 281–297 (2011)

  21. Beheshti, S.M.R., Benatallah, B., Motahari-Nezhad, H.R.: Enabling the analysis of cross-cutting aspects in ad-hoc processes. In: Proceedings of the Advanced Information Systems Engineering—25th International Conference, CAiSE 2013, Valencia, June 17–21, pp. 51–67 (2013)

  22. Beheshti, S.M.R.: Organizing, Querying, and Analyzing Ad-hoc Processes Data. PhD Thesis, University of New South Wales Sydney (2012)

  23. Beyer, K.S., Ramakrishnan, R.: Bottom-up computation of sparse and iceberg CUBEs. In: Proceedings of the SIGMOD 1999, ACM SIGMOD International Conference on Management of Data, June 1–3, 1999, Philadelphia, pp. 359–370. ACM Press, New York (1999)

  24. Bizer, C., Heath, T., Berners-Lee, T.: Linked data-the story so far. Int. J. Semant. Web Inf. Syst. 5(3), 1–22 (2009)

  25. Börzsönyi, S., Kossmann, D., Stocker, K.: The skyline operator. In: Proceedings of the ICDE, pp. 421–430 (2001)

  26. Brambilla, M., Fraternali, P., Vaca, C.: BPMN and design patterns for engineering social BPM solutions. In: Business Process Management Workshops. Lecture Notes in Business Information Processing, vol. 99, pp. 219–230. Springer, Berlin (2012)

  27. Buse, R.P.L., Zimmermann, T.: Information needs for software development analytics. In: Proceedings of the ICSE, pp. 987–996 (2012)

  28. Casati, F., Shan, M.C.: Semantic analysis of business process executions. In: Proceedings of the EDBT, pp. 287–296 (2002)

  29. Chaudhuri, S., Dayal, U.: An overview of data warehousing and OLAP technology. SIGMOD Rec. 26(1), 65–74 (1997)

  30. Chaudhuri, S., Dayal, U., Narasayya, V.: An overview of business intelligence technology. Commun. ACM 54(8), 88–98 (2011)

  31. Chebotko, A., Lu, S., Fotouhi, F.: Semantics preserving SPARQL-to-SQL translation. Data Knowl. Eng. 68(10), 973–1000 (2009)

  32. Chebotko, A., Lu, S., Fei, X., Fotouhi, F.: RDFProv: a relational RDF store for querying and managing scientific workflow provenance. Data Knowl. Eng. 69(8), 836–865 (2010)

  33. Chen, C., Yan, X., Zhu, F., Han, J., Yu, P.S.: Graph OLAP: Towards online analytical processing on graphs. In: Proceedings of the ICDM, pp. 103–112 (2008)

  34. Dean, J., Ghemawat, S.: MapReduce: simplified data processing on large clusters. Commun. ACM 51(1), 107–113 (2008)

  35. Doan, A., Ramakrishnan, R., Halevy, A.Y.: Crowdsourcing systems on the World-Wide Web. Commun. ACM 54(4), 86–96 (2011)

  36. Dries, A., Nijssen, S., De Raedt, L.: A query language for analyzing networks. In: Proceedings of the CIKM’09, pp. 485–494. ACM, New York (2009)

  37. Egghe, L.: Theory and practise of the g-index. Scientometrics 69(1), 131–152 (2006)

  38. Etcheverry, L., Vaisman, A.A.: Enhancing OLAP analysis with web cubes. In: Proceedings of the ESWC, pp. 469–483 (2012)

  39. Fritz, T., Murphy, G.C.: Using information fragments to answer the questions developers ask. In: Proceedings of the ICSE’10, pp. 175–184. ACM, New York (2010)

  40. Furtado, C., Lima, A.A.B., Pacitti, E., Valduriez, P., Mattoso, M.: Physical and virtual partitioning in OLAP database clusters. In: Proceedings of the SBAC-PAD, pp. 143–150 (2005)

  41. Golfarelli, M., Rizzi, S., Proli, A.: Designing what-if analysis: towards a methodology. In: Proceedings of the DOLAP, pp. 51–58 (2006)

  42. Gómez, L.I., Gómez, S.A., Vaisman, A.A.: A generic data model and query language for spatiotemporal OLAP cube analysis. In: Proceedings of the EDBT, pp. 300–311 (2012)

  43. Gottanka, R., Meyer, N.: ModelAsYouGo: (re-) design of S-BPM process models during execution time. In: S-BPM ONE Scientific Research. Lecture Notes in Business Information Processing, vol. 104, pp. 91–105. Springer, Berlin (2012)

  44. Gubichev, A., Bedathur, S.J., Seufert, S.: Fast and accurate estimation of shortest paths in large graphs. In: Proceedings of the CIKM’10, pp. 499–508 (2010)

  45. Han, J., Pei, J., Dong, G., Wang, K.: Efficient computation of iceberg cubes with complex measures. In: Proceedings of the SIGMOD Conference, pp. 1–12 (2001)

  46. Han, J., Sun, Y., Yan, X., Yu, P.S.: Mining knowledge from data: an information network analysis approach. In: Proceedings of the ICDE (2012)

  47. Han, J., Yan, X., Yu, P.S.: Scalable OLAP and mining of information networks. In: Proceedings of the EDBT (2009)

  48. Hassanzadeh, O., Duan, S., Fokoue, A., Kementsietsidis, A., Srinivas, K., Ward, M.J.: Helix: online enterprise data analytics. In: Proceedings of the WWW (Companion Volume), pp. 225–228 (2011)

  49. Hassanzadeh, O., Kementsietsidis, A., Lim, L., Miller, R.J., Wang, M.: A framework for semantic link discovery over relational data. In: Proceedings of the CIKM, pp. 1027–1036 (2009)

  50. Hirsch, J.E.: An index to quantify an individual’s scientific research output that takes into account the effect of multiple coauthorship. Scientometrics 85(3), 741–754 (2010)

  51. Husain, M.F., Doshi, P., Khan, L., Thuraisingham, B.M.: Storage and retrieval of large RDF graph using Hadoop and MapReduce. In: Proceedings of the CloudCom, pp. 680–686 (2009)

  52. Husain, M.F., Khan, L., Kantarcioglu, M., Thuraisingham, B.M.: Data intensive query processing for large RDF graphs using cloud computing tools. In: Proceedings of the IEEE CLOUD, pp. 1–10 (2010)

  53. Jagadeesh Chandra Bose, R.P., Verbeek, H.M.W., Aalst, W.M.P.V.D.: Discovering hierarchical process models using ProM. In: Proceedings of the CAiSE Forum, pp. 33–40 (2011)

  54. Ji, M., Sun, Y., Danilevsky, M., Han, J., Gao, J.: Graph regularized transductive classification on heterogeneous information networks. In: Proceedings of the ECML/PKDD (1), pp. 570–586 (2010)

  55. Kämpgen, B., Harth, A.: Transforming statistical linked data for use in OLAP systems. In: Proceedings of the I-SEMANTICS, pp. 33–40 (2011)

  56. Kim, H., Ravindra, P., Anyanwu, K.: From SPARQL to MapReduce: the journey using a nested triplegroup algebra. PVLDB 4(12), 1426–1429 (2011)

  57. Kämpgen, B., O’Riain, S., Harth, A.: Interacting with statistical linked data via OLAP operations. In: Proceedings of the ILD-ESWC (2012)

  58. Kochut, K.J., Janik, M.: SPARQLeR: Extended SPARQL for semantic association discovery. In: Proceedings of the ESWC’07, pp. 145–159. Springer, Berlin (2007)

  59. Kohavi, R., Rothleder, N.J., Simoudis, E.: Emerging trends in business analytics. Commun. ACM 45(8), 45–48 (2002)

  60. Koutsoukis, N.S., Mitra, G., Lucas, C.: Adapting on-line analytical processing for decision modelling: the interaction of information and decision technologies. Decis. Support Syst. 26(1), 1–30 (1999)

  61. Kurniawan, T.A., Ghose, A.K., Lê, L.S., Dam, H.K.: On formalizing inter-process relationships. In: Proceedings of the Business Process Management Workshops. Lecture Notes in Business Information Processing, vol. 100, pp. 75–86. Springer, Berlin (2012)

  62. Leskovec, J., Adamic, L.A., Huberman, B.A.: The dynamics of viral marketing. TWEB, 1(1), 5 (2007)

  63. Lima, A.A.B., Mattoso, M., Valduriez, P.: Adaptive virtual partitioning for OLAP query processing in a database cluster. JIDM 1(1), 75–88 (2010)

  64. Manola, F., Miller, E.: RDF Primer. W3C, http://www.w3.org/TR/rdf-primer/ (2004). Accessed 1 May 2014

  65. Mathiesen, P., Watson, J., Bandara, W., Rosemann, M.: Applying social technology to business process lifecycle management. In: Proceedings of the Business Process Management Workshops. Lecture Notes in Business Information Processing, vol. 99, pp. 231–241. Springer, Berlin (2012)

  66. Medeiros, A.K.A.D., Aalst, W.M.P.V.D., Pedrinaci, C.: Semantic process mining tools: core building blocks. In: Proceedings of the ECIS, pp. 1953–1964 (2008)

  67. Menzies, T., Zimmermann, T.: Goldfish bowl panel: software development analytics. In: Proceedings of the ICSE, pp. 1032–1033 (2012)

  68. Mühlen, M., Shapiro, R.: Business process analytics. In: Handbook on Business Process Management 2, International Handbooks on Information Systems, pp. 137–157. Springer, Berlin (2010)

  69. Molhanec, M.: Enterprise systems meet social BPM. In: Proceedings of the Advanced Information Systems Engineering Workshops. Lecture Notes in Business Information Processing, vol. 112, pp. 413–424. Springer, Berlin (2012)

  70. Momotko, M., Subieta, K.: Process query language: a way to make workflow processes more flexible. In: Proceedings of the ADBIS (2004)

  71. Motahari-Nezhad, H.R., Saint-Paul, R., Benatallah, B., Casati, F.: Deriving protocol models from imperfect service conversation logs. IEEE Trans. Knowl. Data Eng. 20, 1683–1698 (2008)

  72. Motahari-Nezhad, H.R., Saint-Paul, R., Casati, F., Benatallah, B.: Event correlation for process discovery from web service interaction logs. VLDB J. 20(3), 417–444 (2011)

  73. Olston, C., Reed, B., Srivastava, U., Kumar, R., Tomkins, A.: Pig latin: a not-so-foreign language for data processing. In: Proceedings of the 2008 ACM SIGMOD international conference on Management of data, pp. 1099–1110. ACM, New York (2008)

  74. Ooi, B.C., Yu, B., Li, G.: One table stores all: enabling painless free-and-easy data publishing and sharing. In: Proceedings of the CIDR’07, pp. 142–153 (2007)

  75. Papastefanatos, G., Anagnostou, F., Vassiliou, Y., Vassiliadis, P.: Hecataeus: a what-if analysis tool for database schema evolution. In: Proceedings of the CSMR, pp. 326–328 (2008)

  76. Papastefanatos, G., Vassiliadis, P., Simitsis, A., Vassiliou, Y.: What-if analysis for data warehouse evolution. In: Proceedings of the DaWaK, pp. 23–33 (2007)

  77. Pistore, M., Barbon, F., Bertoli, P., Shaparau, D., Traverso, P.: Planning and monitoring web service composition. In: Proceedings of the AIMSA (2004)

  78. Prud’hommeaux, E., Seaborne, A., et al.: SPARQL query language for RDF. W3C recommendation, http://www.w3.org/TR/rdf-sparql-query/ (2008)

  79. Qu, Q., Zhu, F., Yan, X., Han, J., Yu, P.S., Li, H.: Efficient topological OLAP on information networks. In: Proceedings of the DASFAA (2011)

  80. Ravindra, P., Kim, H., Anyanwu, K.: An intermediate algebra for optimizing RDF graph pattern matching on MapReduce. In: Proceedings of the ESWC (2), pp. 46–61 (2011)

  81. Romero, O., Abelló, A.: A survey of multidimensional modeling methodologies. IJDWM 5(2), 1–23 (2009)

  82. Rozsnyai, S., Slominski, A., Lakshmanan, G.T.: Automated correlation discovery for semi-structured business processes. In: Proceedings of the ICDE Workshops, pp. 261–266 (2011)

  83. Schätzle, A., Przyjaciel-Zablocki, M., Lausen, G.: PigSPARQL: mapping SPARQL to Pig Latin. In: Proceedings of the International Workshop on Semantic Web Information Management, SWIM ’11, pp. 4:1–4:8. ACM, New York (2011)

  84. Sun, Y., Aggarwal, C.C., Han, J.: Relation strength-aware clustering of heterogeneous information networks with incomplete attributes. PVLDB 5(5), 394–405 (2012)

  85. Sun, Y., Han, J., Zhao, P., Yin, Z., Cheng, H., Wu, T.: RankClus: integrating clustering with ranking for heterogeneous information network analysis. In: Proceedings of the EDBT, pp. 565–576 (2009)

  86. Sun, Y., Yu, Y., Han, J.: Ranking-based clustering of heterogeneous information networks with star network schema. In: Proceedings of the KDD, pp. 797–806 (2009)

  87. Thomsen, E.: OLAP Solutions: Building Multidimensional Information Systems, 2nd edn. John Wiley, New York (2002)

  88. Tian, Y., Hankins, R.A., Patel, J.M.: Efficient aggregation for graph summarization. In: Proceedings of the SIGMOD Conference, pp. 567–580 (2008)

  89. Vassiliadis, P.: A survey of extract-transform-load technology. IJDWM 5(3), 1–27 (2009)

  90. Wang, J., Jin, T., Wong, R. K., Wen, L.: Querying business process model repositories. World Wide Web 17(3), 427–454 (2014)

  91. White, T.: Hadoop: The Definitive Guide. O’Reilly Media, Sebastopol (2009). Original edition

  92. Witkowski, A., Bellamkonda, S., Bozkaya, T., Dorman, G., Folkert, N., Gupta, A., Sheng, L., Subramanian, S.: Spreadsheets in RDBMS for OLAP. In: Proceedings of the SIGMOD Conference, pp. 52–63 (2003)

  93. Wynn, M.T., Dumas, M., Fidge, C.J., Hofstede, A.H.M.T., Aalst, W.M.P.V.D.: Business process simulation for operational decision support. In: Proceedings of the Business Process Management Workshops, pp. 66–77 (2007)

  94. Xin, D., Shao, Z., Han, J., Liu, H.: C-Cubing: Efficient computation of closed cubes by aggregation-based checking. In: Proceedings of the ICDE (2006)

  95. Yan, X., Yu, P.S., Han, J.: Graph indexing: a frequent structure-based approach. In: Proceedings of the SIGMOD Conference, pp. 335–346 (2004)

  96. Yu, T.L., Goldberg, D.E.: Dependency structure matrix analysis: offline utility of the dependency structure matrix genetic algorithm. In: Proceedings of the GECCO (2), pp. 355–366 (2004)

  97. Yuan, Y., Lin, X., Liu, Q., Wang, W., Yu, J.X., Zhang, Q.: Efficient computation of the skyline cube. In: Proceedings of the VLDB, pp. 241–252 (2005)

  98. Zhao, P., Li, X., Xin, D., Han, J.: Graph cube: on warehousing and OLAP multidimensional networks. In: Proceedings of the SIGMOD’11, pp. 853–864 (2011)

  99. Zikopoulos, P., Eaton, C.: Understanding Big Data: Analytics for Enterprise Class Hadoop and Streaming Data. McGraw-Hill Osborne Media, New York (2011)

  100. Zou, L., Peng, P., Zhao, D.: Top-K possible shortest path query over a large uncertain graph. In: Proceedings of the WISE, pp. 72–86 (2011)

Author information

Corresponding author

Correspondence to Seyed-Mehdi-Reza Beheshti.

About this article

Cite this article

Beheshti, SMR., Benatallah, B. & Motahari-Nezhad, H.R. Scalable graph-based OLAP analytics over process execution data. Distrib Parallel Databases 34, 379–423 (2016). https://doi.org/10.1007/s10619-014-7171-9

  • DOI: https://doi.org/10.1007/s10619-014-7171-9
