Using entropy metrics for pruning very large graph cubes
Introduction
The data management community has long been interested in problems related to modeling, storing and querying graph databases [1], [2]. Recently, this interest has been renewed with the emergence of applications in social networking, location-based services, biology and the semantic web, where data graphs of massive scale need to be analyzed. As a consequence, business intelligence techniques such as the data cube [3], which have been developed for flat, relational data, need to be revised in order to accommodate the needs of complex graph datasets.
Of particular interest in graph data are the relationships between nodes, depicted via the edges of the graph. These relationships should be analyzed with respect to the attribute values available at the nodes and edges. For example, a data scientist may want to investigate how users of a social network, depending on their gender, relate to other users based on their nationality. As we will see, this inquiry can be accommodated by aggregating existing relationships (edges) in the data graph based on the gender and nationality attribute values of their constituent nodes. This process forms a graph cuboid, as depicted in Fig. 1.
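The aggregation described above can be sketched in a few lines of Python. This is a minimal illustration, not the authors' implementation: the node table, the edge list and the function name `graph_cuboid` are hypothetical, and SUM is assumed as the aggregation function.

```python
from collections import defaultdict

# Hypothetical toy data (not from the paper): node id -> attribute values.
nodes = {
    1: {"gender": "male", "nation": "Greece"},
    2: {"gender": "female", "nation": "Italy"},
    3: {"gender": "female", "nation": "USA"},
    4: {"gender": "male", "nation": "Italy"},
}
# Directed edges: (start node, end node, number of interactions).
edges = [(1, 2, 5), (1, 4, 2), (2, 3, 7), (4, 2, 1), (3, 1, 4)]


def graph_cuboid(nodes, edges, start_attrs, end_attrs):
    """Aggregate edge weights (SUM) by attribute values of the start and end nodes."""
    cuboid = defaultdict(int)
    for src, dst, weight in edges:
        start_key = tuple(nodes[src][a] for a in start_attrs)
        end_key = tuple(nodes[dst][a] for a in end_attrs)
        cuboid[(start_key, end_key)] += weight
    return dict(cuboid)


# Gender of the starting user versus nationality of the ending user.
cuboid = graph_cuboid(nodes, edges, ["gender"], ["nation"])
for (start_key, end_key), total in sorted(cuboid.items()):
    print(start_key, "->", end_key, total)
```

Each key of the resulting dictionary is one aggregate edge of the cuboid, and the total weight of the data graph is preserved by the aggregation.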
Graph cubes have been recently proposed [4], [5], [6], [7], [8] in order to describe all possible such cuboids. They provide a solid foundation that an analyst may build upon, in a manner similar to what the data cube is for OLAP analysis [9], [10], [11], [12]. Nevertheless, graph cubes contain an exponential collection of cuboids. Moreover, a decision maker, familiar with the simpler framework of data cubes, may be overwhelmed when she tries to navigate not flat records, but rather complex graph cuboids containing aggregated views of graph nodes and relationships.
In this work, we first revisit the graph cube framework highlighting the relationships among the cuboids contained in it. These relationships are modeled as a graph cube lattice produced by taking the Cartesian product of simpler data cubes on the attributes of the nodes and edges of the data graph. We utilize the graph cube lattice as a foundation, where interesting relationships can be revealed using entropy-based calculations. A benefit of the lattice representation and of the metrics we propose is that certain entropy bounds can be established between the graph cuboids. This realization permits us to design an efficient algorithm that prunes significant parts of the graph cube from consideration. Compared to a straightforward evaluation of the entropy metrics on the whole lattice, the proposed algorithm saves up to 90% of computation time.
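To make the lattice construction concrete, the following sketch enumerates a graph cube lattice as the Cartesian product of two data cubes, one on the starting nodes and one on the ending nodes. It assumes the three node attributes of the running example and ignores edge attributes; the function names are hypothetical, and a drill-down is modeled as adding a single attribute to either side.

```python
from itertools import combinations, product

# Node attributes of the running example; edge attributes are ignored here.
ATTRS = ("gender", "nation", "profession")


def data_cube(attrs):
    """All group-by subsets of attrs, i.e. the cuboids of a plain data cube."""
    return [frozenset(c) for r in range(len(attrs) + 1) for c in combinations(attrs, r)]


def graph_cube_lattice(attrs):
    """Cartesian product of the start-node and end-node data cubes."""
    cube = data_cube(attrs)
    return list(product(cube, cube))


def drill_downs(cuboid, attrs):
    """Child cuboids reached by adding one attribute to the start or the end side."""
    start, end = cuboid
    children = [(start | {a}, end) for a in attrs if a not in start]
    children += [(start, end | {a}) for a in attrs if a not in end]
    return children


lattice = graph_cube_lattice(ATTRS)
apex = (frozenset(), frozenset())  # the fully aggregated cuboid
print(len(lattice))
print(len(drill_downs(apex, ATTRS)))
```

With three attributes each data cube has 2^3 = 8 cuboids, so the product contains 64 graph cuboids, which illustrates why the number of cuboids grows exponentially with the number of attributes.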
Our major contributions are summarized as follows:
- We utilize two novel entropy measures, in particular external and internal entropy, as the means to evaluate the content of the graph cube. External entropy weighs a drill-down edge in the graph cube lattice in order to determine whether the addition of a new attribute that is used to form the more detailed child cuboid results in non-uniform interactions. In such cases, the internal entropy is used to examine each cuboid that is constituent to such a drill-down edge and select subgraphs that exhibit significant skew.
- A straightforward approach that would evaluate the proposed entropy metrics over the whole graph cube would be very inefficient. In this work, we propose a bisection algorithm which, given a precomputed graph cube, prunes whole cuboids from consideration by exploiting certain bounds between their entropies. Our results in real and synthetic datasets demonstrate the effectiveness of our techniques in processing multi-terabyte graph cubes with tens of billions of records. These results further reveal that often, only small parts of the graph cube contain interesting aggregations, with respect to other aggregations available in the graph cube lattice.
- We compare our techniques against an alternative method that prunes parts of the graph cube based on a minimum support threshold. We observe that our framework maintains the most varied parts of the data distribution independently of their frequencies.
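The contrast between support-based and entropy-based pruning in the last contribution can be illustrated with a small sketch. It is not the paper's method: a normalized Shannon entropy stands in for the proposed entropy rates, and the group names, weights and both thresholds are hypothetical.

```python
import math


def normalized_entropy(weights):
    """Shannon entropy of the weight distribution divided by log(n); 1.0 means uniform."""
    total = sum(weights)
    if len(weights) < 2 or total <= 0:
        return 0.0
    probs = [w / total for w in weights if w > 0]
    h = -sum(p * math.log(p) for p in probs)
    return h / math.log(len(weights))


# Hypothetical aggregates: outgoing interaction counts of two groups of users.
groups = {
    "frequent_uniform": [1000, 1000, 1000, 1000],
    "rare_skewed": [37, 1, 1, 1],
}
MIN_SUPPORT = 100  # hypothetical minimum support threshold
MAX_ENTROPY = 0.5  # hypothetical entropy threshold (stand-in for an entropy rate)

# Support-based pruning keeps groups with enough total weight.
kept_by_support = [g for g, w in groups.items() if sum(w) >= MIN_SUPPORT]
# Entropy-based pruning keeps groups whose distribution is far from uniform.
kept_by_entropy = [g for g, w in groups.items() if normalized_entropy(w) <= MAX_ENTROPY]

print(kept_by_support)
print(kept_by_entropy)
```

In this toy setting the support threshold retains only the frequent but uniform group, whereas the entropy threshold retains the infrequent but skewed one, which is the behavior the comparison refers to.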
The work presented in this manuscript is a consolidation of prior work by the authors that has appeared in [13] and [14], extended further with new algorithmic results that significantly increase the efficiency of the proposed techniques. More specifically, the work of [13] first introduced the idea of using information entropy in order to prune graph cubes. However, that work offered a limited definition of entropy, suitable only for the SUM aggregation function. This article models the graph data as a probability distribution and extends the definitions of the entropy metrics so that they are applicable to other aggregation functions and not only SUM. The work of [14] was based on [13] and proposed a graph cube analysis workflow for identifying and visualizing prominent trends in large graph cubes. However, neither [13] nor [14] provides an efficient algorithm for computing the proposed entropy metrics. As a result, their methods are inefficient when used on very large graph datasets.
In this work, as highlighted above, we first introduce novel theoretical bounds on the proposed entropy rates. Per these bounds, we are able to present a novel bisection algorithm that reduces the entropy computation times by up to 90%, compared to the straightforward evaluation discussed in our prior work. In our experimental results we evaluate the benefits of this new algorithm compared to the techniques discussed in [13], [14]. This experimental evaluation is performed using a new implementation over Spark that permits parallel execution of the entropy calculations in a distributed system. Moreover, we provide a more thorough evaluation of our methods that includes additional real and synthetic datasets. Finally, in this work we further discuss extensions to the entropy metrics and provide experimental results for datasets that contain attributes on both their nodes and edges.
The rest of this paper is organized as follows: Section 2 uses a motivational example to present our data model and discusses the construction of the graph cube lattice from the constituent data cubes on the graph nodes. We then formally introduce in Section 3 the concepts of external and internal entropy and explain their calculations. Section 4 presents our proposed algorithm for entropy-aware selection of graph cuboids and discusses extensions to the proposed graph cube framework. In Section 5 we provide qualitative and quantitative indicators on the effectiveness of our techniques when used on real and synthetic datasets. Finally, in Section 6 we discuss related work, while in Section 7 we provide concluding remarks.
Section snippets
Motivational example
We consider a social network which depicts relationships between different users. Each user profile can be represented as a graph node with three attributes: gender (male, female), nation (Greece, Italy, USA) and profession (doctor, professor, musician). Every edge in the data graph is associated with a numeric value that indicates the number of interactions between the respective users.
A possible inquiry on this network is to examine how users depending on their gender, relate to other users
Main concepts
Analysts are almost always drawn to data that are far from uniform; data from which they can discover patterns and rules; data that are hidden in peaks and valleys. In order to explore such cases of data skew, we revisit the idea of the information entropy (or Shannon entropy [17]), which is the expected value of the information contained in the data, and transform it in a manner that is suitable for processing graph cuboids. The entropy captures the amount of uncertainty; it
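For reference, the standard Shannon entropy that this section builds on can be written as follows. The mapping from aggregate edge weights $w_i$ to probabilities $p_i$ is an illustrative assumption about how a cuboid may be treated as a probability distribution, not the paper's exact definition.

```latex
H(X) = -\sum_{i=1}^{n} p_i \log p_i,
\qquad p_i = \frac{w_i}{\sum_{j=1}^{n} w_j},
\qquad 0 \le H(X) \le \log n
```

The upper bound $\log n$ is attained only by the uniform distribution, so an entropy value well below $\log n$ signals the kind of skew that analysts look for.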
Problem statement
In our framework, we seek to utilize the proposed entropy metrics in order to select parts of the graph cube that satisfy the following objectives:
- External-entropy Objective:
Given a graph cube lattice and an external entropy rate threshold return a set of drill-down navigations .
- Internal Entropy Objective:
Given a cuboid and an internal entropy rate threshold , return all edges in whose starting or ending internal entropy rates are less or equal
Experimental set up
In this section, we provide an experimental evaluation of the proposed framework. We first present results using four real datasets. We then discuss additional experiments using synthetic datasets in Section 5.4. The first real dataset consists of data sampled from Twitter. The second dataset is from VKontakte (VK), the largest European on-line
Related work
The data cube operator, introduced in [3], defines a foundational framework for declaring all possible aggregations along a list of selected domains, often referred to as “dimensions”. The cube was proposed for flat, basket-type datasets. However, it has recently been extended to the case of interconnected datasets. The work in [8] introduced the graph cube that takes into account both attribute aggregation and structure summarization of the underlying graphs. This work mainly focuses on
Conclusions
In this work we first revisited the framework of graph cubes and proposed an intuitive representation of it as the Cartesian product of independent data cubes on the starting and ending nodes of the graph and, as an extension, of available attributes on the edges. We then addressed the enormous size and complexity of the resulting graph cubes by proposing an efficient algorithm that selects interesting parts of the aggregate graphs using information entropy calculations. Key to our algorithm is
Acknowledgments
This research is financed by the Research Centre of Athens University of Economics and Business, Greece, in the framework of the project entitled “Original Scientific Publications”.
References (44)
- et al., High-dimensional OLAP: A minimal cubing approach
- et al., A survey on summarizability issues in multidimensional modeling, Data Knowl. Eng. (2009)
- et al., Gram: A graph data model and query languages
- Graphdb: Modeling and querying graphs in databases
- et al., Data cube: A relational aggregation operator generalizing group-by, cross-tab, and sub-total
- et al., Graph OLAP: Towards online analytical processing on graphs
- et al., A framework for building OLAP cubes on graphs
- K. Khan, K. Najeebullah, W. Nawaz, Y. Lee, OLAP on structurally significant data in graphs, CoRR...
- et al., Graph Cube: On warehousing and OLAP multidimensional networks
- Building the Data Warehouse (1992)
- The Data Warehouse Toolkit: The Complete Guide to Dimensional Modeling
- Cubetree: Organization of and bulk incremental updates on the data cube
- Dwarf: Shrinking the PetaCube
- Entropy-based selection of graph cuboids
- Finding the needle in a haystack: Entropy guided exploration of very large graph cubes
- Summarizability in OLAP and statistical data bases
- A Mathematical Theory of Communication, SIGMOBILE Mob. Comput. Commun. Rev.
- Entropy based approximate querying and exploration of datacubes
- Spark: Cluster computing with working sets
- Bottom-up computation of sparse and iceberg cube