Using entropy metrics for pruning very large graph cubes

doi:10.1016/j.is.2018.11.007

Information Systems

Volume 81, March 2019, Pages 49-62

https://doi.org/10.1016/j.is.2018.11.007

Highlights

• Propose entropy metrics for evaluating interactions within large graph datasets.
• Propose techniques to weigh possible OLAP drill-down operations on graph cubes.
• Present an efficient algorithm for fast selection of aggregated sub-graphs.
• Evaluation of the presented techniques in large real and synthetic graph datasets.

Abstract

Emerging applications face the need to store and analyze interconnected data. Graph cubes permit multi-dimensional analysis of graph datasets based on attribute values available at the nodes and edges of these graphs. Like the data cube that contains an exponential number of aggregations, the graph cube results in an exponential number of aggregate graph cuboids. As a result, they are very hard to analyze. In this work, we first propose intuitive measures based on the information entropy in order to evaluate the rich information contained in the graph cube. We then introduce an efficient algorithm that suggests portions of a precomputed graph cube based on these measures. The proposed algorithm exploits novel entropy bounds that we derive between different levels of aggregation in the graph cube. Per these bounds we are able to prune large parts of the graph cube, saving costly entropy calculations that would be otherwise required. We experimentally validate our techniques on real and synthetic datasets and demonstrate the pruning power and efficiency of our proposed techniques.

Introduction

The data management community has long been interested in problems related to modeling, storing and querying graph databases [1], [2]. Recently, this interest has been renewed with the emergence of applications in social networking, location based services, biology and the semantic web, where data graphs of massive scale need to be analyzed. As a consequence, business intelligence techniques such as the data cube [3], which have been developed for flat, relational data, need to be revised in order to accommodate the needs of complex graph datasets.

Of particular interest in graph data are the relationships between nodes depicted via the edges of the graph. These relationships should be analyzed with respect to attribute values available at the nodes and edges. For example, a data scientist may want to investigate how users of a social network, depending on their gender, relate to other users based on their nationality. As we will see, this inquiry can be accommodated by aggregating existing relationships (edges) in the data graph based on the gender and nationality attribute values of their constituent nodes. This process forms a graph cuboid, as depicted in Fig. 1.
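The aggregation described above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the authors' implementation: the toy nodes and edges, the function name `build_cuboid`, and the choice of SUM as the aggregation function are all hypothetical.

```python
from collections import defaultdict

# Hypothetical toy data: node id -> attribute values.
nodes = {
    1: {"gender": "male", "nation": "Greece"},
    2: {"gender": "female", "nation": "Italy"},
    3: {"gender": "female", "nation": "Greece"},
    4: {"gender": "male", "nation": "USA"},
}
# Hypothetical directed edges: (source, target, number of interactions).
edges = [(1, 2, 3), (1, 3, 1), (2, 3, 4), (4, 2, 2), (3, 2, 5)]


def build_cuboid(nodes, edges, start_attrs, end_attrs):
    """Sum edge weights grouped by the chosen attributes of the start and end nodes."""
    cuboid = defaultdict(int)
    for src, dst, weight in edges:
        start_key = tuple(nodes[src][a] for a in start_attrs)
        end_key = tuple(nodes[dst][a] for a in end_attrs)
        cuboid[(start_key, end_key)] += weight
    return dict(cuboid)


# Cuboid for the inquiry "gender of the start node versus nation of the end node".
gender_to_nation = build_cuboid(nodes, edges, ["gender"], ["nation"])
for (start, end), total in sorted(gender_to_nation.items()):
    print(start, "->", end, ":", total)
```

Each key of the resulting dictionary is one aggregated edge of the cuboid; choosing different attribute lists for the start and end nodes yields the other cuboids.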

Graph cubes have been recently proposed [4], [5], [6], [7], [8] in order to describe all possible such cuboids. They provide a solid foundation that an analyst may build upon, in a manner similar to what the data cube is for OLAP analysis [9], [10], [11], [12]. Nevertheless, graph cubes contain an exponential collection of cuboids. Moreover, a decision maker, familiar with the simpler framework of data cubes, may be overwhelmed when she tries to navigate not flat records, but rather complex graph cuboids containing aggregated views of graph nodes and relationships.

In this work, we first revisit the graph cube framework highlighting the relationships among the cuboids contained in it. These relationships are modeled as a graph cube lattice produced by taking the Cartesian product of simpler data cubes on the attributes of the nodes and edges of the data graph. We utilize the graph cube lattice as a foundation, where interesting relationships can be revealed using entropy-based calculations. A benefit of the lattice representation and of the metrics we propose is that certain entropy bounds can be established between the graph cuboids. This realization permits us to design an efficient algorithm that prunes significant parts of the graph cube from consideration. Compared to a straightforward evaluation of the entropy metrics on the whole lattice, the proposed algorithm saves up to 90% of computation time.
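To make the Cartesian-product view of the lattice concrete, the following sketch enumerates the cuboids and drill-down edges for node attributes only. It is an illustrative assumption about the structure, not the paper's construction: the function names are hypothetical, edge attributes are omitted, and a drill-down is taken to mean adding exactly one attribute to either the start-node or the end-node grouping.

```python
from itertools import combinations, product


def data_cube_groupings(attrs):
    """All subsets of attrs, i.e. the group-bys of a classic data cube."""
    return [frozenset(c) for r in range(len(attrs) + 1) for c in combinations(attrs, r)]


def graph_cube_lattice(attrs):
    """Return (cuboids, drill_downs) for the product of two node data cubes."""
    groupings = data_cube_groupings(attrs)
    # A cuboid is a pair (grouping on start nodes, grouping on end nodes).
    cuboids = list(product(groupings, groupings))
    drill_downs = []
    for start, end in cuboids:
        for a in attrs:
            if a not in start:  # refine the start-node side by one attribute
                drill_downs.append(((start, end), (start | {a}, end)))
            if a not in end:  # refine the end-node side by one attribute
                drill_downs.append(((start, end), (start, end | {a})))
    return cuboids, drill_downs


cuboids, drill_downs = graph_cube_lattice(["gender", "nation", "profession"])
print(len(cuboids), len(drill_downs))  # prints: 64 192
```

Even with three node attributes the lattice already holds 64 cuboids and 192 drill-down edges, which illustrates why evaluating a metric on every edge becomes costly as the number of attributes grows.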

Our major contributions are summarized as follows:

• We utilize two novel entropy measures, in particular external and internal entropy, as the means to evaluate the content of the graph cube. External entropy weighs a drill-down edge in the graph cube lattice in order to determine whether the addition of a new attribute that is used to form the more detailed child cuboid results in non-uniform interactions. In such cases, the internal entropy is used to examine each cuboid that is constituent to such a drill-down edge and select subgraphs that exhibit significant skew.
• A straightforward approach that would evaluate the proposed entropy metrics over the whole graph cube would be very inefficient. In this work, we propose a bisection algorithm which, given a precomputed graph cube, prunes whole cuboids from consideration by exploiting certain bounds between their entropies. Our results in real and synthetic datasets demonstrate the effectiveness of our techniques in processing multi-terabyte graph cubes with tens of billions of records. These results further reveal that often, only small parts of the graph cube contain interesting aggregations, with respect to other aggregations available in the graph cube lattice.
• We compare our techniques against an alternative method that prunes parts of the graph cube based on a minimum support threshold. We observe that our framework maintains the most varied parts of the data distribution independently of their frequencies.

The work presented in this manuscript is a consolidation of prior work by the authors that has appeared in [13] and [14], extended further with new algorithmic results that significantly increase the efficiency of the proposed techniques. More specifically, the work of [13] first introduced the idea of using information entropy in order to prune graph cubes. However, this work offered a limited definition of entropy suitable only for the SUM aggregation function. This article models the graph data as a probability distribution and extends the definitions of the entropy metrics so as to be applicable to different functions and not only SUM. The work of [14] was based on [13] and proposed a graph cube analysis workflow for identifying and visualizing prominent trends in large graph cubes. However, neither [13] nor [14] provide an efficient algorithm for computing the proposed entropy metrics. As a result, their methods are inefficient when used in very large graph datasets.

In this work, as highlighted above, we first introduce novel theoretical bounds on the proposed entropy rates. Per these bounds, we are able to present a novel bisection algorithm that reduces the entropy computation times by up to 90%, compared to the straightforward evaluation discussed in our prior work. In our experimental results we evaluate the benefits of this new algorithm compared to the techniques discussed in [13], [14]. This experimental evaluation is performed using a new implementation over Spark that permits parallel execution of the entropy calculations in a distributed system. Moreover, we provide a more thorough evaluation of our methods that includes additional real and synthetic datasets. Finally, in this work we further discuss extensions to the entropy metrics and provide experimental results for datasets that contain attributes on both their nodes and edges.

The rest of this paper is organized as follows: Section 2 uses a motivational example to present our data model and discusses the construction of the graph cube lattice from the constituent data cubes on the graph nodes. We then formally introduce in Section 3 the concepts of external and internal entropy and explain their calculations. Section 4 presents our proposed algorithm for entropy-aware selection of graph cuboids and discusses extensions to the proposed graph cube framework. In Section 5 we provide qualitative and quantitative indicators on the effectiveness of our techniques when used on real and synthetic datasets. Finally, in Section 6 we discuss related work, while in Section 7 we provide concluding remarks.

Section snippets

Motivational example

We consider a social network which depicts relationships between different users. Each user profile can be represented as a graph node with three attributes: gender (male, female), nation (Greece, Italy, USA) and profession (doctor, professor, musician). Every edge in the data graph is associated with a numeric value that indicates the number of interactions between the respective users.

A possible inquiry on this network is to examine how users depending on their gender, relate to other users

Main concepts

Almost always, the analysts are attracted to data that are far away from uniformity; data from which they can discover patterns and rules; data that are hidden in peaks and valleys. In order to explore such cases of data skew, we revisit the idea of the information entropy (or Shannon entropy [17]), which is the expected value of the information contained in the data, and transform it in a manner that is suitable for processing graph cuboids. The entropy captures the amount of uncertainty; it
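As a reminder of how Shannon entropy separates uniform from skewed data, here is a minimal Python sketch. It computes the standard entropy of normalized aggregate weights; it is not the paper's external or internal entropy, whose definitions are not reproduced here, and the sample weights are made up for illustration.

```python
import math


def shannon_entropy(weights):
    """Shannon entropy, in bits, of the distribution obtained by normalizing the weights."""
    total = sum(weights)
    if total <= 0:
        raise ValueError("weights must have a positive sum")
    # Zero weights contribute nothing to the entropy, so they are skipped.
    probs = [w / total for w in weights if w > 0]
    return -sum(p * math.log2(p) for p in probs)


uniform = shannon_entropy([5, 5, 5, 5])   # equal interaction counts
skewed = shannon_entropy([97, 1, 1, 1])   # one dominant interaction
print(uniform, skewed)
```

The uniform case reaches the maximum of 2 bits for four outcomes, while the skewed case yields a much lower value; low entropy is therefore a signal of the peaks and valleys an analyst is looking for.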

Problem statement

In our framework, we seek to utilize the proposed entropy metrics in order to select parts of the graph cube that satisfy the following objectives:

External-entropy Objective:

Given a graph cube lattice $GCL$ and an external entropy rate threshold $eH_{r}$, return a set of drill-down navigations $navGCL = \{e = (C_{k}, C_{i}) \mid eH_{rate}(C_{k}, C_{i}) \leq eH_{r}\}$.
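Read literally, the external-entropy objective is a threshold filter over the drill-down edges of the lattice. The sketch below assumes the rates have already been computed and are available as a dictionary; the function name, the cuboid labels, and the rate values are hypothetical, and the pruning that avoids computing every rate is not shown.

```python
def select_drill_downs(external_rates, threshold):
    """Return the drill-down edges whose external entropy rate is at most the threshold."""
    return {edge for edge, rate in external_rates.items() if rate <= threshold}


# Hypothetical precomputed rates for drill-down edges (parent cuboid, child cuboid).
external_rates = {
    ("gender-*", "gender-nation"): 0.42,
    ("gender-*", "gender-profession"): 0.93,
    ("*-nation", "gender-nation"): 0.61,
}
selected = select_drill_downs(external_rates, threshold=0.65)
print(sorted(selected))
```

This naive version needs the rate of every edge, which is the straightforward evaluation that the entropy bounds are meant to avoid.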

Internal Entropy Objective:

Given a cuboid $C_{i}$ and an internal entropy rate threshold $iH_{r}$, return all edges in $C_{i}$ whose starting or ending internal entropy rates are less or equal

Experimental set up

In this section, we provide an experimental evaluation of the proposed framework. We first present results using four real datasets. We then discuss additional experiments using synthetic datasets in Section 5.4. The first real dataset consists of data sampled from Twitter. The second dataset is from VKontakte (VK), the largest European on-line

Related work

The data cube operator, introduced in [3] defines a foundational framework for declaring all possible aggregations along a list of selected domains, often referred to as “dimensions”. The cube was proposed for flat, basket-type datasets. However, it has been recently extended for the case of interconnected datasets. The work in [8] introduced the graph cube that takes into account both attribute aggregation and structure summarization of the underlying graphs. This work mainly focuses on

Conclusions

In this work we first revisited the framework of graph cubes and proposed an intuitive representation of it as the Cartesian product of independent data cubes on the starting and ending nodes of the graph and, as an extension, of available attributes on the edges. We then addressed the enormous size and complexity of the resulting graph cubes by proposing an efficient algorithm that selects interesting parts of the aggregate graphs using information entropy calculations. Key to our algorithm is

Acknowledgments

This research is financed by the Research Centre of Athens University of Economics and Business, Greece, in the framework of the project entitled “Original Scientific Publications”.

References (44)

Li X. et al., High-dimensional OLAP: A minimal cubing approach
Mazón J. et al., A survey on summarizability issues in multidimensional modeling, Data Knowl. Eng. (2009)
Amann B. et al., Gram: A graph data model and query languages
Güting R.H., Graphdb: Modeling and querying graphs in databases
Gray J. et al., Data cube: A relational aggregation operator generalizing group-by, cross-tab, and sub-total
Chen C. et al., Graph OLAP: Towards online analytical processing on graphs
Ghrab A. et al., A framework for building OLAP cubes on graphs
K. Khan, K. Najeebullah, W. Nawaz, Y. Lee, OLAP on structurally significant data in graphs, CoRR...
Zhao P. et al., Graph Cube: On warehousing and OLAP multidimensional networks
Inmon W.H., Building the Data Warehouse (1992)
Kimball R. et al., The Data Warehouse Toolkit: The Complete Guide to Dimensional Modeling (2002)
Roussopoulos N. et al., Cubetree: Organization of and bulk incremental updates on the data cube
Sismanis Y. et al., Dwarf: Shrinking the PetaCube
Bleco D. et al., Entropy-based selection of graph cuboids
Bleco D. et al., Finding the needle in a haystack: Entropy guided exploration of very large graph cubes
Lenz H. et al., Summarizability in OLAP and statistical data bases
Shannon C.E., A Mathematical Theory of Communication, SIGMOBILE Mob. Comput. Commun. Rev. (2001)
Palpanas T. et al., Entropy based approximate querying and exploration of datacubes
J. Leskovec, A. Krevl, SNAP Datasets: Stanford large network dataset collection, (Jun. 2014)....
B. Hall, A. Jaffe, M. Trajtenberg, The NBER Patent Citations Data File: Lessons, Insights and Methodological Tools,...
Zaharia M. et al., Spark: Cluster computing with working sets
Beyer K. et al., Bottom-up computation of sparse and iceberg cube

