Abstract
From sensor networks to transportation infrastructure to social networks, we are awash in data. Many of these real-world networks are large (“big data”) and dynamic, evolving over time, and their evolution can be modeled as a series of graphs. Traditional systems that store and analyze one graph at a time cannot effectively handle the complexity and subtlety inherent in dynamic graphs. Modern analytics require systems capable of storing and processing series of graphs. We present such a system, G*. G* compresses dynamic graph data based on commonalities among the graphs in the series for deduplicated storage on multiple servers. Beyond the obvious space savings, deduplication also improves performance: large-scale graph processing tends to be I/O bound, so faster reads from and writes to stable storage yield faster results. Unlike traditional database and graph processing systems, G* executes complex queries on large graphs using distributed operators that process graph data in parallel. It speeds up queries on multiple graphs by processing graph commonalities only once and sharing the results across relevant graphs. This architecture provides scalability and, because G* is not limited to processing only what fits in RAM, it can analyze graphs that exceed the capacity of systems restricted to in-memory data. This paper presents G*’s design and implementation principles along with evaluation results that document its unique benefits over traditional graph processing systems.
Notes
These systems cannot readily exploit commonalities among graphs and therefore incur high space overhead. For example, consider using a relation to store the edges of a series of graphs. An edge contained in 100 snapshots is then stored as 100 tuples, differentiated only by snapshot ID. Our system avoids this overhead through deduplicated storage, as described throughout this paper.
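The space overhead in this example can be illustrated with a minimal Python sketch. It is a toy comparison under stated assumptions, not G*'s actual storage layout: the function names, the sample snapshots, and the choice to record a set of snapshot IDs per distinct edge are all hypothetical.

```python
from collections import defaultdict


def naive_edge_relation(snapshots):
    """Relational style: one (snapshot_id, src, dst) tuple per edge per snapshot."""
    return [(sid, src, dst)
            for sid, edges in snapshots.items()
            for (src, dst) in sorted(edges)]


def deduplicated_edges(snapshots):
    """Store each distinct edge once, with the set of snapshots that contain it.

    This is only an illustration of deduplication, not G*'s on-disk format.
    """
    store = defaultdict(set)
    for sid, edges in snapshots.items():
        for edge in edges:
            store[edge].add(sid)
    return dict(store)


# Hypothetical data: 100 snapshots that all contain the edge ("a", "b");
# only the last snapshot also contains ("b", "c").
snapshots = {sid: {("a", "b")} for sid in range(100)}
snapshots[99] = {("a", "b"), ("b", "c")}

naive = naive_edge_relation(snapshots)
dedup = deduplicated_edges(snapshots)

print(len(naive))  # 101 tuples: 100 copies of ("a", "b") plus one ("b", "c")
print(len(dedup))  # 2 distinct edge records
```

The naive relation grows with the number of snapshots an edge appears in, whereas the deduplicated store grows with the number of distinct edges plus the snapshot membership information.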
In this paper, we focus on managing graphs that correspond to periodic snapshots of an evolving network. Logging the input data allows G* to reconstruct a graph as of any point in the past by combining periodic snapshots with log data. This feature is not further discussed in this paper.
The current G* implementation assigns each vertex to a server based on the hash value of the vertex ID. We are developing data distribution techniques that can reduce the number of edges whose endpoints are assigned to different servers.
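Hash-based vertex placement, and the cross-server edges it can produce, can be sketched as follows. This is a minimal illustration, not G*'s implementation: the use of CRC32 as the hash function, the helper names `server_for` and `count_cut_edges`, and the sample edges are assumptions made for the example.

```python
import zlib


def server_for(vertex_id, num_servers):
    """Map a vertex ID to a server index using a stable hash.

    CRC32 is an assumed stand-in; the source does not name the hash function.
    """
    return zlib.crc32(str(vertex_id).encode("utf-8")) % num_servers


def count_cut_edges(edges, num_servers):
    """Count edges whose two endpoints are assigned to different servers."""
    return sum(1 for src, dst in edges
               if server_for(src, num_servers) != server_for(dst, num_servers))


# Hypothetical edge list for a small graph.
edges = [("v1", "v2"), ("v2", "v3"), ("v3", "v1"), ("v4", "v1")]

# Placement of every vertex that appears in some edge, on 3 servers.
placement = {v: server_for(v, 3) for edge in edges for v in edge}
print(placement)
print(count_cut_edges(edges, 3))
```

Because the hash ignores graph structure, neighboring vertices often land on different servers. The count returned by `count_cut_edges` is the quantity that a structure-aware distribution technique would try to reduce.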
As mentioned in the cost analysis of the put(v, d, g) method, updating the CGI for a version of vertex v also requires a lookup via maps(v).
Acknowledgments
This research was supported by NSF CAREER award IIS-1149372 and by the MSIP (Ministry of Science, ICT and Future Planning), Korea, under the “IT Consilience Creative Program” (NIPA-2013-H0203-13-1001) supervised by the NIPA (National IT Industry Promotion Agency).
Additional information
Communicated by Haixun Wang and Jeffrey Xu Yu.
Cite this article
Labouseur, A.G., Birnbaum, J., Olsen, P.W. et al. The G* graph database: efficiently managing large distributed dynamic graphs. Distrib Parallel Databases 33, 479–514 (2015). https://doi.org/10.1007/s10619-014-7140-3
DOI: https://doi.org/10.1007/s10619-014-7140-3