Abstract
In today’s knowledge-, service-, and cloud-based economy, businesses accumulate massive amounts of data from a variety of sources. Understanding a business may require considerable analytics over large hybrid collections of heterogeneous and partially unstructured data captured during process execution. This data, usually modeled as graphs, increasingly shows all the typical properties of big data: wide physical distribution, diversity of formats, non-standard data models, and independently managed, heterogeneous semantics. We use the term big process graph to refer to such large hybrid collections of heterogeneous and partially unstructured process-related execution data. Online analytical processing (OLAP) of big process graphs is challenging because extending existing OLAP techniques to the analysis of graphs is not straightforward. Moreover, process data analysis methods should be capable of processing and querying large amounts of data effectively and efficiently, and therefore have to scale well with the underlying infrastructure. While traditional analytics solutions (relational DBs, data warehouses, and OLAP) do a great job of collecting data and answering known questions, key business insights remain hidden in the interactions among objects: with these solutions it is hard to discover concept hierarchies for entities based on both data objects and their interactions in process graphs. In this paper, we introduce a framework and a set of methods to support scalable graph-based OLAP analytics over process execution data. The goal is to facilitate analytics over big process graphs by summarizing the process graph and providing multiple views at different granularities. To achieve this goal, we present a model for process OLAP (P-OLAP) and define OLAP-specific abstractions in the process context, such as process cubes, dimensions, and cells.
We present a MapReduce-based graph processing engine to support big data analytics over process graphs. We have implemented the P-OLAP framework and integrated it into our existing process data analytics platform, ProcessAtlas, which introduces a scalable architecture for querying, exploring, and analyzing large process data. We report on experiments performed on both synthetic and real-world datasets that show the viability and efficiency of the approach.
Notes
Organizations such as the International Federation of Library Associations and Institutions (IFLA) initiated processes to collect and organize literature and information (http://www.ifla.org/).
An Iceberg-Cube contains only those cells of the data cube that meet an aggregate condition. It is called an Iceberg-Cube because it contains only some of the cells of the full cube [23].
Skyline [25] has been proposed as an important operator for multi-criteria decision making, data mining and visualization, and user preference queries.
An information-network [46] is a network where each node represents an entity (which may have attributes, labels, and weights) and each link (which may have rich semantic information) represents a relationship between two entities.
ProM is the world-leading process mining toolkit. It is an extensible framework that supports a wide variety of process mining techniques in the form of plug-ins.
Linked Data is a method of publishing data on the web based on principles that significantly enhance the adaptability and usability of data, either by humans or machines [24].
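The Iceberg-Cube note above can be made concrete with a small sketch. It is not the paper's implementation or any published iceberg-cubing algorithm: it naively counts every cell of the full cube and then filters, whereas practical methods prune while computing. The function name `iceberg_cube`, the `"*"` placeholder, the sample rows, and the choice of a count threshold as the aggregate condition are all assumptions made for illustration.

```python
from collections import Counter
from itertools import combinations

ALL = "*"  # hypothetical placeholder for a dimension that has been aggregated away


def iceberg_cube(rows, dims, min_count):
    """Return {cell: count} for every cube cell whose count is >= min_count.

    Naive approach: materialize counts for the full cube, then keep only
    the cells that satisfy the aggregate condition.
    """
    counts = Counter()
    for row in rows:
        # Each row contributes to one cell in every group-by (subset of dims).
        for k in range(len(dims) + 1):
            for kept in combinations(dims, k):
                cell = tuple(row[d] if d in kept else ALL for d in dims)
                counts[cell] += 1
    return {cell: n for cell, n in counts.items() if n >= min_count}


# Illustrative (made-up) process events with two dimensions.
rows = [
    {"activity": "approve", "role": "manager"},
    {"activity": "approve", "role": "manager"},
    {"activity": "approve", "role": "clerk"},
    {"activity": "reject", "role": "clerk"},
]
cube = iceberg_cube(rows, ("activity", "role"), min_count=2)
```

With these four rows the full cube has eight cells, and the threshold of 2 keeps five of them; for example, the cell `("approve", "*")` survives with a count of 3, while `("reject", "*")` is dropped.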
References
Aalst, W.M.P.V.D., Dongen, B.F.V., Günther, C.W., Rozinat, A., Verbeek, E., Weijters, T.: ProM: the process mining toolkit. In: Proceedings of the BPM (2009)
Aalst, W.M.P.V.D., Dongen, B.F.V., Herbst, J., Maruster, L., Schimm, G., Weijters, A.J.M.M.: Workflow mining: a survey of issues and approaches. Data Knowl. Eng. 47, 237–267 (2003)
Aalst, W.M.P.V.D.: Process Mining: Discovery, Conformance and Enhancement of Business Processes. Springer, Berlin (2011)
Aalst, W.M.P.V.D.: Service mining: using process mining to discover, check, and improve service behavior. IEEE Trans. Serv. Comput. 99(PrePrints), 1 (2012)
Abadi, D., Marcus, A., Madden, S., Hollenbach, K.: Scalable semantic web data management using vertical partitioning. In: Proceedings of the 33rd International Conference on Very Large Data Bases, pp. 411–422. VLDB Endowment (2007)
Abelló, A., Romero, O.: On-line analytical processing. In: Encyclopedia of Database Systems, pp. 1949–1954. Springer, New York (2009)
Aggarwal, C.C., Wang, H.: Managing and Mining Graph Data. Springer, New York (2010)
Akal, F., Böhm, K., Schek, H.J.: OLAP query evaluation in a database cluster: a performance study on intra-query parallelism. In: Proceedings of the ADBIS, pp. 218–231 (2002)
Alkhateeb, F., Baget, J.F., Euzenat, J.: Extending SPARQL with regular expression patterns (for querying RDF). J. Web Sem. 7(2), 57–73 (2009)
Allahbakhsh, M., Ignjatovic, A., Benatallah, B., Beheshti, S.M.R., Bertino, E., Foo, N.: Collusion detection in online rating systems. In: Proceedings of the Web Technologies and Applications—15th Asia-Pacific Web Conference, APWeb 2013, Sydney, April 4–6, 2013, pp. 196–207 (2013)
Allahbakhsh, M., Ignjatovic, A., Benatallah, B., Beheshti, S.M.R., Foo, N., Bertino, E.: Representation and querying of unfair evaluations in social rating systems. Comput. Secur. 41, 68–88 (2014)
Anyanwu, K., Maduko, A., Sheth, A.: SPARQ2L: towards support for subgraph extraction queries in RDF databases. WWW’07, pp. 797–806. ACM, New York (2007)
Azvine, B., Nauck, D., Ho, C.: Intelligent business analytics: a tool to build decision-support systems for ebusinesses. BT Technol. J. 21(4), 65–71 (2003)
Báez, M., Mussi, A., Casati, F., Birukou, A., Marchese, M.: Liquid journals: scientific journals in the Web 2.0 era. In: Proceedings of the JCDL, pp. 395–396 (2010)
Balmin, A., Papadimitriou, T., Papakonstantinou, Y.: Hypothetical queries in an OLAP environment. In: Proceedings of the VLDB, pp. 220–231 (2000)
Barbieri, D.F., Braga, D., Ceri, S., Valle, E.D., Grossniklaus, M.: C-SPARQL: SPARQL for continuous querying. In: Proceedings of the WWW, pp. 1061–1062 (2009)
Beeri, C., Eyal, A., Milo, T., Pilberg, A.: Monitoring business processes with queries. In: Proceedings of the VLDB (2007)
Begel, A., Phang Khoo, Y., Zimmermann, T.: Codebook: discovering and exploiting relationships in software repositories. In: Proceedings of the ICSE’10, pp. 125–134 (2010)
Beheshti, S.M.R., Benatallah, B., Motahari Nezhad, H.R., Allahbakhsh, M.: A framework and a language for on-line analytical processing on graphs. In: Proceedings of the Web Information Systems Engineering—WISE 2012–13th International Conference, Paphos, Cyprus, November 28–30, pp. 213–227 (2012)
Beheshti, S.M.R., Benatallah, B., Motahari Nezhad, H.R., Sakr, S.: A query language for analyzing business processes execution. In: Proceedings of the Business Process Management—9th International Conference, BPM 2011, Clermont-Ferrand, France, August 30—September 2, pp. 281–297 (2011)
Beheshti, S.M.R., Benatallah, B., Motahari-Nezhad, H.R.: Enabling the analysis of cross-cutting aspects in ad-hoc processes. In: Proceedings of the Advanced Information Systems Engineering—25th International Conference, CAiSE 2013, Valencia, June 17–21, pp. 51–67 (2013)
Beheshti, S.M.R.: Organizing, Querying, and Analyzing Ad-hoc Processes Data. PhD Thesis, University of New South Wales Sydney (2012)
Beyer, K.S., Ramakrishnan, R.: Bottom-up computation of sparse and iceberg CUBEs. In: Proceedings of the SIGMOD 1999, ACM SIGMOD International Conference on Management of Data, June 1–3, 1999, Philadelphia, pp. 359–370. ACM Press, New York (1999)
Bizer, C., Heath, T., Berners-Lee, T.: Linked data-the story so far. Int. J. Semant. Web Inf. Syst. 5(3), 1–22 (2009)
Börzsönyi, S., Kossmann, D., Stocker, K.: The skyline operator. In: Proceedings of the ICDE, pp. 421–430 (2001)
Brambilla, M., Fraternali, P., Vaca, C.: BPMN and design patterns for engineering social BPM solutions. In: Business Process Management Workshops. Lecture Notes in Business Information Processing, vol. 99, pp. 219–230. Springer, Berlin (2012)
Buse, R.P.L., Zimmermann, T.: Information needs for software development analytics. In: Proceedings of the ICSE, pp. 987–996 (2012)
Casati, F., Shan, M.C.: Semantic analysis of business process executions. In: Proceedings of the EDBT, pp. 287–296 (2002)
Chaudhuri, S., Dayal, U.: An overview of data warehousing and OLAP technology. SIGMOD Rec. 26(1), 65–74 (1997)
Chaudhuri, S., Dayal, U., Narasayya, V.: An overview of business intelligence technology. Commun. ACM 54(8), 88–98 (2011)
Chebotko, A., Lu, S., Fotouhi, F.: Semantics preserving SPARQL-to-SQL translation. Data Knowl. Eng. 68(10), 973–1000 (2009)
Chebotko, A., Lu, S., Fei, X., Fotouhi, F.: RDFProv: a relational RDF store for querying and managing scientific workflow provenance. Data Knowl. Eng. 69(8), 836–865 (2010)
Chen, C., Yan, X., Zhu, F., Han, J., Yu, P.S.: Graph OLAP: Towards online analytical processing on graphs. In: Proceedings of the ICDM, pp. 103–112 (2008)
Dean, J., Ghemawat, S.: MapReduce: simplified data processing on large clusters. Commun. ACM 51(1), 107–113 (2008)
Doan, A., Ramakrishnan, R., Halevy, A.Y.: Crowdsourcing systems on the World-Wide Web. Commun. ACM 54(4), 86–96 (2011)
Dries, A., Nijssen, S., De Raedt, L.: A query language for analyzing networks. In: Proceedings of the CIKM’09, pp. 485–494. ACM, New York (2009)
Egghe, L.: Theory and practise of the g-index. Scientometrics 69(1), 131–152 (2006)
Etcheverry, L., Vaisman, A.A.: Enhancing OLAP analysis with web cubes. In: Proceedings of the ESWC, pp. 469–483 (2012)
Fritz, T., Murphy, G.C.: Using information fragments to answer the questions developers ask. In: Proceedings of the ICSE’10, pp. 175–184. ACM, New York (2010)
Furtado, C., Lima, A.A.B., Pacitti, E., Valduriez, P., Mattoso, M.: Physical and virtual partitioning in OLAP database clusters. In: Proceedings of the SBAC-PAD, pp. 143–150 (2005)
Golfarelli, M., Rizzi, S., Proli, A.: Designing what-if analysis: towards a methodology. In: Proceedings of the DOLAP, pp. 51–58 (2006)
Gómez, L.I., Gómez, S.A., Vaisman, A.A.: A generic data model and query language for spatiotemporal OLAP cube analysis. In: Proceedings of the EDBT, pp. 300–311 (2012)
Gottanka, R., Meyer, N.: ModelAsYouGo: (re-) design of S-BPM process models during execution time. In: S-BPM ONE Scientific Research. Lecture Notes in Business Information Processing, vol. 104, pp. 91–105. Springer, Berlin (2012)
Gubichev, A., Bedathur, S.J., Seufert, S.: Fast and accurate estimation of shortest paths in large graphs. In: Proceedings of the CIKM’10, pp. 499–508 (2010)
Han, J., Pei, J., Dong, G., Wang, K.: Efficient computation of iceberg cubes with complex measures. In: Proceedings of the SIGMOD Conference, pp. 1–12 (2001)
Han, J., Sun, Y., Yan, X., Yu, P.S.: Mining knowledge from data: an information network analysis approach. In: Proceedings of the ICDE (2012)
Han, J., Yan, X., Yu, P.S.: Scalable OLAP and mining of information networks. In: Proceedings of the EDBT (2009)
Hassanzadeh, O., Duan, S., Fokoue, A., Kementsietsidis, A., Srinivas, K., Ward, M.J.: Helix: online enterprise data analytics. In: Proceedings of the WWW (Companion Volume), pp. 225–228 (2011)
Hassanzadeh, O., Kementsietsidis, A., Lim, L., Miller, R.J., Wang, M.: A framework for semantic link discovery over relational data. In: Proceedings of the CIKM, pp. 1027–1036 (2009)
Hirsch, J.E.: An index to quantify an individual’s scientific research output that takes into account the effect of multiple coauthorship. Scientometrics 85(3), 741–754 (2010)
Husain, M.F., Doshi, P., Khan, L., Thuraisingham, B.M.: Storage and retrieval of large RDF graph using Hadoop and MapReduce. In: Proceedings of the CloudCom, pp. 680–686 (2009)
Husain, M.F., Khan, L., Kantarcioglu, M., Thuraisingham, B.M.: Data intensive query processing for large RDF graphs using cloud computing tools. In: Proceedings of the IEEE CLOUD, pp. 1–10 (2010)
Jagadeesh Chandra Bose, R.P., Verbeek, H.M.W., Aalst, W.M.P.V.D.: Discovering hierarchical process models using ProM. In: Proceedings of the CAiSE Forum, pp. 33–40 (2011)
Ji, M., Sun, Y., Danilevsky, M., Han, J., Gao, J.: Graph regularized transductive classification on heterogeneous information networks. In: Proceedings of the ECML/PKDD (1), pp. 570–586 (2010)
Kämpgen, B., Harth, A.: Transforming statistical linked data for use in OLAP systems. In: Proceedings of the I-SEMANTICS, pp. 33–40 (2011)
Kim, H., Ravindra, P., Anyanwu, K.: From SPARQL to MapReduce: the journey using a nested triplegroup algebra. PVLDB 4(12), 1426–1429 (2011)
Kämpgen, B., O’Riain, S., Harth, A.: Interacting with statistical linked data via OLAP operations. In: Proceedings of the ILD-ESWC (2012)
Kochut, K.J., Janik, M.: SPARQLeR: Extended SPARQL for semantic association discovery. In: Proceedings of the ESWC’07, pp. 145–159. Springer, Berlin (2007)
Kohavi, R., Rothleder, N.J., Simoudis, E.: Emerging trends in business analytics. Commun. ACM 45(8), 45–48 (2002)
Koutsoukis, N.S., Mitra, G., Lucas, C.: Adapting on-line analytical processing for decision modelling: the interaction of information and decision technologies. Decis. Support Syst. 26(1), 1–30 (1999)
Kurniawan, T.A., Ghose, A.K., Lê, L.S., Dam, H.K.: On formalizing inter-process relationships. In: Proceedings of the Business Process Management Workshops. Lecture Notes in Business Information Processing, vol. 100, pp. 75–86. Springer, Berlin (2012)
Leskovec, J., Adamic, L.A., Huberman, B.A.: The dynamics of viral marketing. TWEB 1(1), 5 (2007)
Lima, A.A.B., Mattoso, M., Valduriez, P.: Adaptive virtual partitioning for OLAP query processing in a database cluster. JIDM 1(1), 75–88 (2010)
Manola, F., Miller, E.: RDF Primer. W3C, http://www.w3.org/TR/rdf-primer/ (2004). Accessed 1 May 2014
Mathiesen, P., Watson, J., Bandara, W., Rosemann, M.: Applying social technology to business process lifecycle management. In: Proceedings of the Business Process Management Workshops. Lecture Notes in Business Information Processing, vol. 99, pp. 231–241. Springer, Berlin (2012)
Medeiros, A.K.A.D., Aalst, W.M.P.V.D., Pedrinaci, C.: Semantic process mining tools: core building blocks. In: Proceedings of the ECIS, pp. 1953–1964 (2008)
Menzies, T., Zimmermann, T.: Goldfish bowl panel: software development analytics. In: Proceedings of the ICSE, pp. 1032–1033 (2012)
Mühlen, M., Shapiro, R.: Business process analytics. In: Handbook on Business Process Management 2, International Handbooks on Information Systems, pp. 137–157. Springer, Berlin (2010)
Molhanec, M.: Enterprise systems meet social BPM. In: Proceedings of the Advanced Information Systems Engineering Workshops. Lecture Notes in Business Information Processing, vol. 112, pp. 413–424. Springer, Berlin (2012)
Momotko, M., Subieta, K.: Process query language: a way to make workflow processes more flexible. In: Proceedings of the ADBIS (2004)
Motahari-Nezhad, H.R., Saint-Paul, R., Benatallah, B., Casati, F.: Deriving protocol models from imperfect service conversation logs. IEEE Trans. Knowl. Data Eng. 20, 1683–1698 (2008)
Motahari-Nezhad, H.R., Saint-Paul, R., Casati, F., Benatallah, B.: Event correlation for process discovery from web service interaction logs. VLDB J. 20(3), 417–444 (2011)
Olston, C., Reed, B., Srivastava, U., Kumar, R., Tomkins, A.: Pig latin: a not-so-foreign language for data processing. In: Proceedings of the 2008 ACM SIGMOD international conference on Management of data, pp. 1099–1110. ACM, New York (2008)
Ooi, B.C., Yu, B., Li, G.: One table stores all: enabling painless free-and-easy data publishing and sharing. In: Proceedings of the CIDR’07, pp. 142–153 (2007)
Papastefanatos, G., Anagnostou, F., Vassiliou, Y., Vassiliadis, P.: Hecataeus: a what-if analysis tool for database schema evolution. In: Proceedings of the CSMR, pp. 326–328 (2008)
Papastefanatos, G., Vassiliadis, P., Simitsis, A., Vassiliou, Y.: What-if analysis for data warehouse evolution. In: Proceedings of the DaWaK, pp. 23–33 (2007)
Pistore, M., Barbon, F., Bertoli, P., Shaparau, D., Traverso, P.: Planning and monitoring web service composition. In: Proceedings of the AIMSA (2004)
Prud’hommeaux, E., Seaborne, A., et al.: SPARQL query language for RDF. W3C Recommendation, http://www.w3.org/TR/rdf-sparql-query/ (2008)
Qu, Q., Zhu, F., Yan, X., Han, J., Yu, P.S., Li, H.: Efficient topological OLAP on information networks. In: Proceedings of the DASFAA (2011)
Ravindra, P., Kim, H., Anyanwu, K.: An intermediate algebra for optimizing RDF graph pattern matching on MapReduce. In: Proceedings of the ESWC (2), pp. 46–61 (2011)
Romero, O., Abelló, A.: A survey of multidimensional modeling methodologies. IJDWM 5(2), 1–23 (2009)
Rozsnyai, S., Slominski, A., Lakshmanan, G.T.: Automated correlation discovery for semi-structured business processes. In: Proceedings of the ICDE Workshops, pp. 261–266 (2011)
Schätzle, A., Przyjaciel-Zablocki, M., Lausen, G.: PigSPARQL: mapping SPARQL to Pig Latin. In: Proceedings of the International Workshop on Semantic Web Information Management, SWIM ’11, pp. 4:1–4:8. ACM, New York (2011)
Sun, Y., Aggarwal, C.C., Han, J.: Relation strength-aware clustering of heterogeneous information networks with incomplete attributes. PVLDB 5(5), 394–405 (2012)
Sun, Y., Han, J., Zhao, P., Yin, Z., Cheng, H., Wu, T.: RankClus: integrating clustering with ranking for heterogeneous information network analysis. In: Proceedings of the EDBT, pp. 565–576 (2009)
Sun, Y., Yu, Y., Han, J.: Ranking-based clustering of heterogeneous information networks with star network schema. In: Proceedings of the KDD, pp. 797–806 (2009)
Thomsen, E.: OLAP Solutions: Building Multidimensional Information Systems, 2nd edn. John Wiley, New York (2002)
Tian, Y., Hankins, R.A., Patel, J.M.: Efficient aggregation for graph summarization. In: Proceedings of the SIGMOD Conference, pp. 567–580 (2008)
Vassiliadis, P.: A survey of extract-transform-load technology. IJDWM 5(3), 1–27 (2009)
Wang, J., Jin, T., Wong, R.K., Wen, L.: Querying business process model repositories. World Wide Web 17(3), 427–454 (2014)
White, T.: Hadoop: The Definitive Guide. O’Reilly Media, Sebastopol (2009). Original edition
Witkowski, A., Bellamkonda, S., Bozkaya, T., Dorman, G., Folkert, N., Gupta, A., Sheng, L., Subramanian, S.: Spreadsheets in RDBMS for OLAP. In: Proceedings of the SIGMOD Conference, pp. 52–63 (2003)
Wynn, M.T., Dumas, M., Fidge, C.J., Hofstede, A.H.M.T., Aalst, W.M.P.V.D.: Business process simulation for operational decision support. In: Proceedings of the Business Process Management Workshops, pp. 66–77 (2007)
Xin, D., Shao, Z., Han, J., Liu, H.: C-Cubing: Efficient computation of closed cubes by aggregation-based checking. In: Proceedings of the ICDE (2006)
Yan, X., Yu, P.S., Han, J.: Graph indexing: a frequent structure-based approach. In: Proceedings of the SIGMOD Conference, pp. 335–346 (2004)
Yu, T.L., Goldberg, D.E.: Dependency structure matrix analysis: offline utility of the dependency structure matrix genetic algorithm. In: Proceedings of the GECCO (2), pp. 355–366 (2004)
Yuan, Y., Lin, X., Liu, Q., Wang, W., Yu, J.X., Zhang, Q.: Efficient computation of the skyline cube. In: Proceedings of the VLDB, pp. 241–252 (2005)
Zhao, P., Li, X., Xin, D., Han, J.: Graph cube: on warehousing and OLAP multidimensional networks. In: Proceedings of the SIGMOD’11, pp. 853–864 (2011)
Zikopoulos, P., Eaton, C.: Understanding Big Data: Analytics for Enterprise Class Hadoop and Streaming Data. McGraw-Hill Osborne Media, New York (2011)
Zou, L., Peng, P., Zhao, D.: Top-K possible shortest path query over a large uncertain graph. In: Proceedings of the WISE, pp. 72–86 (2011)
Cite this article
Beheshti, SMR., Benatallah, B. & Motahari-Nezhad, H.R. Scalable graph-based OLAP analytics over process execution data. Distrib Parallel Databases 34, 379–423 (2016). https://doi.org/10.1007/s10619-014-7171-9
DOI: https://doi.org/10.1007/s10619-014-7171-9