Evolution Analysis of Large Graphs with Gradoop

Rost, Christopher; Thor, Andreas; Fritzsche, Philip; Gomez, Kevin; Rahm, Erhard

doi:10.1007/978-3-030-43823-4_33


Conference paper
First Online: 28 March 2020


Part of the book series: Communications in Computer and Information Science (CCIS, volume 1167)

Abstract

The temporal analysis of evolving graphs is an important requirement in many domains. We are therefore extending the distributed graph analysis framework Gradoop and its graph data model to support temporal graph analysis. This paper contains an overview of our work in progress and an example use case from the financial domain demonstrating the flexibility of the temporal graph model and its operators.


1 Introduction

Temporal graphs represent the evolution of entities and the relationships among them over time. Many real-world networks change dynamically, e.g., friendships and likes in social networks, citations and authorship affiliations in literature, or transactions between accounts in the financial domain [8]. Instead of neglecting this time dimension by using a static graph model, it is better to represent the continuously changing network in a temporal graph data model, which makes it possible to study the effect of time on the graph [18]. Since many existing graph database systems [3, 9, 13], graph processing frameworks [4, 6, 7, 10] and graph query languages [2, 5, 14] concentrate on managing and querying static graphs, native support for the additional time dimension is lacking, e.g., to study how communities or paths change over time or to retrieve a snapshot of a past state of the graph.

Sahu et al. [17] show that graphs maintained and analyzed by companies of all scales share common characteristics. Besides containing a wide variety of entities, graphs in practice are very large (often containing over a billion edges), so scalable systems are needed to handle them. In addition, most of the graphs used by these companies change frequently (i.e., vertices and edges are added, deleted or updated over time), and all changes are stored permanently in the dataset.

To deal with these characteristics, we developed a temporal property graph model that supports bitemporal time semantics, together with a set of operators for building distributed analysis workflows that take the additional time dimensions of the graph into account. The model and its operators are implemented in Gradoop [10, 11], an open-source framework for distributed graph analysis based on Apache Flink [4]. After giving an overview of Gradoop’s temporal extension, we show its expressiveness by composing new and existing operators to answer an analytical question from a use case in the financial domain.

2 A Brief Overview of Gradoop’s Temporal Extension

Gradoop is an implementation of the Extended Property Graph Model (EPGM) and supports many generic graph operators (for pattern matching, grouping, etc.) that can be used within graph analysis workflows. Such workflows can be expressed in a declarative domain-specific language called GrALa for distributed execution. Since Gradoop’s EPGM implementation is built on top of Apache Flink’s DataSet API, each Gradoop operator is based on a subset of Flink’s transformations (map, flatMap, join, etc.) to achieve parallel execution and scalability to large graphs. Gradoop thus combines and extends features of graph analytical systems with the benefits of distributed graph processing.

Extension of Data Model: Many applications require time-dependent graph models. We therefore developed the Temporal Property Graph Model (TPGM) [15, 16], which extends Gradoop’s EPGM by adding the time attributes from and to, each for valid-time and transaction-time semantics, to the schema of vertices, edges and logical graphs. This approach offers a flexible representation of temporal graphs with bitemporal time semantics, where the time can be empty, a timestamp or a time interval. A graph of this model contains all historical and rollback information and therefore allows retrieving valid snapshots from the past, present or future in the valid-time (application-time) dimension, or past and present states in the transaction-time dimension. An important advantage of our extension is its backward compatibility with the original EPGM: every existing Gradoop operator (that builds upon the EPGM) can be applied to one or more temporal graphs by disregarding the temporal information of the graph elements. A more detailed description of the TPGM and its operators is given in [15].
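
As a rough illustration of the schema extension described above, the following Python sketch attaches a valid-time and a transaction-time interval to a graph element. It is not Gradoop’s implementation: the class TemporalElement, the helpers valid_at and strip_time, the use of None for an empty bound, integer timestamps and the half-open reading of intervals are all assumptions made for this example.

```python
from dataclasses import dataclass, field
from typing import Optional, Tuple

# (from, to) pair; None marks an empty bound (an assumption of this sketch).
Interval = Tuple[Optional[int], Optional[int]]


@dataclass
class TemporalElement:
    """A vertex or edge with bitemporal attributes (hypothetical class)."""
    id: str
    label: str
    properties: dict = field(default_factory=dict)
    valid_time: Interval = (None, None)  # when the fact holds in the application
    tx_time: Interval = (None, None)     # when the fact was stored in the system


def valid_at(element: TemporalElement, t: int) -> bool:
    """True if t lies in the valid-time interval, read as half-open [from, to)."""
    start, end = element.valid_time
    return (start is None or start <= t) and (end is None or t < end)


def strip_time(element: TemporalElement) -> tuple:
    """Static view of an element: the temporal attributes are disregarded."""
    return (element.id, element.label, dict(element.properties))


call = TemporalElement("e1", "calls", {"duration": 5}, valid_time=(10, 20))
print(valid_at(call, 10), valid_at(call, 20))
print(strip_time(call))
```

The strip_time helper mirrors the backward-compatibility argument: an operator that only needs identifier, label and properties can work on the static view and ignore both time dimensions.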

Extension of Existing Gradoop Operators: Operators such as transformation, aggregate, subgraph, grouping and pattern matching can benefit from the temporal extension of the EPGM. For example, the subgraph operator can identify all vertices and edges whose validity range exceeds a given limit. Similarly, the pattern matching operator can extract all subgraphs in which the query pattern is valid at a given point in time.

Introduction of New Temporal Operators: We introduce snapshot and difference as specific temporal operators of the TPGM. The snapshot operator retrieves either the state of the entire temporal graph that is valid at a specific point in time or the subgraph that is valid during a given time range; the selection is specified by a temporal predicate function. Such predicate functions are adopted from the SQL standard for temporal databases [12]. The difference operator computes the changes between two snapshots X and Y by determining the union of X and Y and annotating each vertex and edge with whether it appears in Y only (i.e., it has been added), in X only (deleted) or in both X and Y (persistent). Following the philosophy of Gradoop, both operators were implemented on top of Apache Flink: snapshot employs Flink’s filter transformation, while difference is based on the flatMap transformation. Implementation details of these operators and benchmark results showing good scalability can be found in [15].
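
The semantics of the two operators can be sketched in a few lines of plain Python. This is a single-machine illustration only, not the Flink-based implementation: the graph is reduced to a dictionary from element id to a valid-time interval, and the predicate names as_of and from_to are hypothetical helpers loosely modeled on the AS OF and FROM ... TO forms of SQL:2011, with intervals read as half-open.

```python
def as_of(t):
    """Predicate: the element is valid at time t (interval read as [from, to))."""
    return lambda start, end: start <= t < end


def from_to(lo, hi):
    """Predicate: the element's validity overlaps the range [lo, hi)."""
    return lambda start, end: start < hi and end > lo


def snapshot(elements, predicate):
    """Keep the elements whose valid-time interval satisfies the predicate."""
    return {eid: iv for eid, iv in elements.items() if predicate(*iv)}


def difference(x, y):
    """Annotate every element of the union of snapshots x and y."""
    result = {}
    for eid in x.keys() | y.keys():
        if eid in x and eid in y:
            result[eid] = "persistent"
        elif eid in y:
            result[eid] = "added"
        else:
            result[eid] = "deleted"
    return result


# Toy graph: element id -> (valid from, valid to), integer timestamps.
graph = {"a": (0, 10), "b": (5, 15), "c": (12, 20)}
x = snapshot(graph, as_of(6))
y = snapshot(graph, as_of(13))
changes = difference(x, y)
print(sorted(changes.items()))
```

Here snapshot is a pure filter and difference emits one annotated record per element of the union, which matches the roles the text assigns to the filter and flatMap transformations.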

Support of Time-Specific Grouping and Aggregation: The temporal extension of Gradoop’s grouping operator offers a flexible mechanism to group (summarize) vertices and edges that belong to a given time instance. Users can either define their own functions or use predefined ones to extract the keys of a vertex or edge on which to group. Any information of a graph element can be used, including all temporal information, such as the day of the week on which the validity of an edge begins or the rounded duration of a vertex’s validity. Additionally, multiple aggregate functions can be specified to compute aggregates within a vertex or edge group and store them as new properties on the super-vertex (the vertex representing the group) or super-edge, respectively. Not only properties can be aggregated, but also information from the additional time dimensions of the graph. For example, the earliest or latest beginning of an edge’s validity or the average, minimum or maximum vertex duration can be calculated. The resulting grouped graph is again temporal, i.e., the valid times of the super-vertices and super-edges are defined by the earliest beginning and latest ending of the elements that belong to the group.

Since timestamp values can be analyzed and grouped at different granularities (e.g., year, month, day, hour or minute), time properties inherently lead to hierarchically organized dimensions. Graph summaries determined by the grouping operator can thus additionally be “rolled up” along the time hierarchy to obtain aggregations at multiple levels of time granularity. A detailed description of graph grouping with Gradoop, including the roll-up feature and predefined aggregate functions, can be found in our GitHub wiki.
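
The grouping and roll-up idea can be illustrated with a small Python sketch. It only mimics the behavior described above on a list of edge dictionaries and is not Gradoop’s grouping operator. The functions month, day, group_edges and rollup, as well as the count and avg_duration result keys, are names invented for this example; only two levels of the time hierarchy are used, and the grand-total grouping that SQL’s ROLLUP would add is left out.

```python
from collections import defaultdict
from datetime import datetime


def month(t):
    """Key function: month granularity of a timestamp."""
    return (t.year, t.month)


def day(t):
    """Key function: day granularity of a timestamp."""
    return t.date()


def group_edges(edges, key_fns):
    """Group edges by keys derived from their valid-time start ("from")."""
    groups = defaultdict(list)
    for e in edges:
        groups[tuple(fn(e["from"]) for fn in key_fns)].append(e)
    summary = {}
    for key, members in groups.items():
        durations = [(m["to"] - m["from"]).total_seconds() for m in members]
        summary[key] = {
            "count": len(members),
            "avg_duration": sum(durations) / len(durations),
            # The super-edge spans earliest start to latest end of its members.
            "from": min(m["from"] for m in members),
            "to": max(m["to"] for m in members),
        }
    return summary


def rollup(edges, key_fns):
    """One grouping per non-empty prefix of the key hierarchy, finest first."""
    return [group_edges(edges, key_fns[:n]) for n in range(len(key_fns), 0, -1)]


calls = [
    {"from": datetime(2018, 3, 1, 9, 0), "to": datetime(2018, 3, 1, 9, 10)},
    {"from": datetime(2018, 3, 1, 14, 0), "to": datetime(2018, 3, 1, 14, 20)},
    {"from": datetime(2018, 3, 2, 9, 0), "to": datetime(2018, 3, 2, 9, 30)},
]
by_day, by_month = rollup(calls, [month, day])
print(by_month[((2018, 3),)])
```

Each grouping in the returned list is itself temporal in the sense of the text: the from and to entries of a group are the earliest beginning and latest ending of its members.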

3 Temporal Graph Analysis Using Gradoop: A Use Case

Supporting graph analysis at large scale is necessary in various domains, such as the Internet of Things (IoT), finance and the web, to perform risk analysis, customer profiling, etc. In addition, time plays an important role in such analyses, since analysts want to know, e.g., how a specific result of their query looked in the past or how it changes over time. As a result, a graph processing system has to offer a flexible and rich library of functionalities and algorithms to support a wide range of analyses that respect the additional time dimension.

To show the expressiveness and flexibility of Gradoop and its temporal model, together with its declarative operator concept, we choose a business case from the customer relationship management domain. Specifically, the scenario deals with interactions in a call center for 25 banks of the Banks Association of Turkey [1]. More than 7,500 agents are employed in about 16 service types (e.g., card, stock, ATM, online banking, etc.). Per month, agents answer about 46 million incoming calls, and 24 million calls are outgoing calls to customers. These entities and their relations form a huge heterogeneous network that evolves continuously. Figure 1 shows a simplified example of the resulting graph schema. It includes different types of vertices (entities), like Bank and Customer, as well as edges (relations), like calls, which represents a telephone call between a customer and a call center agent. Each element has a variety of properties that describe it, e.g., an Agent vertex has a staff number, a name and a city. All the collected data can be represented in our temporal property graph model. Properties containing temporal information (e.g., the started at and duration properties of the calls edge) can be mapped directly to the valid-time attributes of the model to enable various time-related analyses.

In the following, we study how an analytical question of this use case can be processed. We utilize the modularity of our temporal graph operators as well as operators from the reference EPGM implementation and compose them into a simple but powerful workflow that shows one way to answer it.

What is the average duration of calls per month, week and day between agents of different cities and customers of Istanbul, where both agents and customers joined the bank in 2018?

This question requires aggregations over time hierarchies in addition to filters for a subset of entities on an extracted graph snapshot. The following exemplary workflow definition shows the use of four operators that result in a collection of graphs, each of which describes one of the three time granularities month, week and day.

The initial subgraph operator (lines 2–5) filters the graph using the given vertex and edge predicates to obtain a subgraph that contains only Agent vertices and Customer vertices whose property city is equal to the string Istanbul. This operator is part of the EPGM. To obtain the customers that joined a bank in 2018, we apply the newly developed TPGM snapshot operator (line 6) with a predefined predicate. Since the result of the snapshot operator can contain dangling edges (i.e., edges whose source or target vertices are not contained in the result set), we apply the verify operator (line 7) to remove them from the graph. The final grouping operator (lines 8–12) summarizes the graph. The vertices are grouped by their label and the property city (line 9). A property holding the count is added to each grouped vertex as a result of the given Count() vertex aggregate function. The edges representing the calls are grouped by month, week and day of the call’s beginning timestamp (from) using time-specific value transformation functions of the same names (line 11). Since we want to know the average call duration, the predefined aggregate function AvgDuration() is specified in addition to the Count() aggregate function (line 12). As for the vertices, new properties storing the aggregates are added to each super-edge.
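
The interplay of the first three steps, and in particular why verify is needed after snapshot, can be traced on toy data. The following Python sketch is neither GrALa nor Gradoop’s API. The data, the field name valid_from and the reading of the snapshot predicate as “validity begins in 2018” are assumptions made for this illustration, and the grouping step is omitted.

```python
from datetime import date


def subgraph(vertices, edges, v_pred, e_pred):
    """Filter vertices and edges independently by the given predicates."""
    return ({k: v for k, v in vertices.items() if v_pred(v)},
            [e for e in edges if e_pred(e)])


def snapshot(vertices, edges, t_pred):
    """Keep elements whose valid-time start satisfies the temporal predicate."""
    return ({k: v for k, v in vertices.items() if t_pred(v["valid_from"])},
            [e for e in edges if t_pred(e["valid_from"])])


def verify(vertices, edges):
    """Drop dangling edges, i.e., edges with a missing source or target."""
    return vertices, [e for e in edges
                      if e["src"] in vertices and e["dst"] in vertices]


# Hypothetical toy data; valid_from of a vertex is the date it joined the bank.
vertices = {
    "a1": {"label": "Agent", "city": "Ankara", "valid_from": date(2018, 2, 1)},
    "a2": {"label": "Agent", "city": "Izmir", "valid_from": date(2016, 5, 1)},
    "c1": {"label": "Customer", "city": "Istanbul", "valid_from": date(2018, 6, 1)},
    "c2": {"label": "Customer", "city": "Bursa", "valid_from": date(2018, 7, 1)},
}
edges = [
    {"src": "a1", "dst": "c1", "label": "calls", "valid_from": date(2018, 8, 3)},
    {"src": "a2", "dst": "c1", "label": "calls", "valid_from": date(2018, 8, 4)},
    {"src": "a1", "dst": "c2", "label": "calls", "valid_from": date(2018, 9, 1)},
]

v, e = subgraph(
    vertices, edges,
    lambda x: x["label"] == "Agent"
    or (x["label"] == "Customer" and x["city"] == "Istanbul"),
    lambda x: x["label"] == "calls",
)
v, e = snapshot(v, e, lambda d: d.year == 2018)
dangling_before_verify = len(e)
v, e = verify(v, e)
print(sorted(v), [(x["src"], x["dst"]) for x in e])
```

In this toy run all three calls edges survive the snapshot because they start in 2018, although agent a2 and customer c2 have already been filtered out; verify then removes the two edges that point to missing vertices.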

The additional BY ROLLUP (line 11) leads to three different aggregations, comparable to a roll-up in SQL. The graph is grouped by day, by week and by the month of the call’s beginning. This leads to deeper insights into the evolution of the number and average duration of calls between agents of different cities and customers from the city Istanbul. The resulting three graphs are contained in a graph collection, which is the result of our workflow and is exemplified in Fig. 2. The collection can be stored or visualized using one of Gradoop’s data sinks. Further, an analyst may use the subgraph operator again to filter this result for periods with a very low or high average call duration.

4 Conclusions

We reported work in progress on temporal graph analysis with the distributed graph analytics framework Gradoop. We introduced the Temporal Property Graph Model (TPGM), which extends Gradoop’s graph data model. The new temporal operators and further extensions make it possible to answer time-oriented analytical questions on evolving graphs flexibly, e.g., by chaining several operators. We demonstrated the use of declarative workflows for a time-related use case from the financial domain. The described extensions are already implemented and available in Gradoop. In future work, we plan to add further temporal operators and algorithms to extend the functionality for temporal graph analytics.


References

[1] The Banks Association of Turkey: Statistical report. http://www.tbb.org.tr/en/banks-and-banking-sector-information/statistical-reports/20
[2] Angles, R., et al.: G-CORE: a core for future graph query languages. In: Proceedings of ACM SIGMOD, pp. 1421–1432 (2018)
[3] ArangoDB. https://www.arangodb.com/
[4] Apache Flink. https://flink.apache.org/
[5] Francis, N., et al.: Cypher: an evolving query language for property graphs. In: Proceedings of ACM SIGMOD, pp. 1433–1445 (2018)
[6] Apache Giraph. https://giraph.apache.org/
[7] Apache Spark GraphX. https://spark.apache.org/graphx
[8] Holme, P., Saramäki, J.: Temporal networks. CoRR abs/1108.1780 (2011). http://arxiv.org/abs/1108.1780
[9] JanusGraph. https://janusgraph.org/
[10] Junghanns, M., Kießling, M., Teichmann, N., Gómez, K., Petermann, A., Rahm, E.: Declarative and distributed graph analytics with GRADOOP. PVLDB 11(12), 2006–2009 (2018)
[11] Junghanns, M., Petermann, A., Neumann, M., Rahm, E.: Management and analysis of big graph data: current systems and open challenges. In: Zomaya, A.Y., Sakr, S. (eds.) Handbook of Big Data Technologies, pp. 457–505. Springer, Cham (2017). https://doi.org/10.1007/978-3-319-49340-4_14
[12] Kulkarni, K., Michels, J.: Temporal features in SQL: 2011. ACM SIGMOD Rec. 41(3), 34–43 (2012)
[13] Neo4j. https://neo4j.com/
[14] van Rest, O., Hong, S., Kim, J., Meng, X., Chafi, H.: PGQL: a property graph query language. In: Proceedings of the Fourth International Workshop on Graph Data Management Experiences and Systems, p. 7. ACM (2016)
[15] Rost, C., Thor, A., Rahm, E.: Analyzing temporal graphs with GRADOOP. Datenbank-Spektrum 19(3), 199–208 (2019). https://doi.org/10.1007/s13222-019-00325-8
[16] Rost, C., Thor, A., Rahm, E.: Temporal graph analysis using GRADOOP. In: Meyer, H., Ritter, N., Thor, A., Nicklas, D., Heuer, A., Klettke, M. (eds.) Proceedings of BTW Workshops, pp. 109–118 (2019). https://doi.org/10.18420/btw2019-ws-11
[17] Sahu, S., Mhedhbi, A., Salihoglu, S., Lin, J., Özsu, M.T.: The ubiquity of large graphs and surprising challenges of graph processing: extended survey. VLDB J. (2019). https://doi.org/10.1007/s00778-019-00548-x
[18] Wang, Y., Yuan, Y., Ma, Y., Wang, G.: Time-dependent graphs: definitions, applications, and algorithms. Data Sci. Eng. 4(4), 352–366 (2019). https://doi.org/10.1007/s41019-019-00105-0


Author information

Authors and Affiliations

University of Leipzig, Leipzig, Germany
Christopher Rost, Philip Fritzsche, Kevin Gomez & Erhard Rahm
Leipzig University of Applied Sciences, Leipzig, Germany
Andreas Thor


Corresponding authors

Correspondence to Christopher Rost, Andreas Thor, Philip Fritzsche, Kevin Gomez or Erhard Rahm.

Editor information

Editors and Affiliations

Institut National des Sciences Appliquées, Rennes, France
Peggy Cellier
Maastricht University, Maastricht, The Netherlands
Kurt Driessens


About this paper

Cite this paper

Rost, C., Thor, A., Fritzsche, P., Gomez, K., Rahm, E. (2020). Evolution Analysis of Large Graphs with Gradoop. In: Cellier, P., Driessens, K. (eds) Machine Learning and Knowledge Discovery in Databases. ECML PKDD 2019. Communications in Computer and Information Science, vol 1167. Springer, Cham. https://doi.org/10.1007/978-3-030-43823-4_33


DOI: https://doi.org/10.1007/978-3-030-43823-4_33
Published: 28 March 2020
Publisher Name: Springer, Cham
Print ISBN: 978-3-030-43822-7
Online ISBN: 978-3-030-43823-4
eBook Packages: Computer Science, Computer Science (R0)
