The G* graph database: efficiently managing large distributed dynamic graphs

Distributed and Parallel Databases

Abstract

From sensor networks to transportation infrastructure to social networks, we are awash in data. Many of these real-world networks tend to be large (“big data”) and dynamic, evolving over time. Their evolution can be modeled as a series of graphs. Traditional systems that store and analyze one graph at a time cannot effectively handle the complexity and subtlety inherent in dynamic graphs. Modern analytics require systems capable of storing and processing series of graphs. We present such a system. G* compresses dynamic graph data based on commonalities among the graphs in the series for deduplicated storage on multiple servers. Besides the obvious space savings, this matters because large-scale graph processing tends to be I/O bound, so faster reads from and writes to stable storage yield faster results. Unlike traditional database and graph processing systems, G* executes complex queries on large graphs using distributed operators to process graph data in parallel. It speeds up queries on multiple graphs by processing graph commonalities only once and sharing the results across relevant graphs. This architecture provides scalability, and because G* is not limited to processing only what is available in RAM, its analysis capabilities are far greater than those of systems that are limited to what they can hold in memory. This paper presents G*’s design and implementation principles along with evaluation results that document its unique benefits over traditional graph processing systems.
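The two ideas in the abstract, deduplicated storage and processing commonalities only once, can be illustrated with a small sketch. This is not G*'s implementation: the class `DedupGraphStore`, its methods `add_version` and `out_degrees`, the helper `naive_tuple_count`, and the 100-snapshot example series are all hypothetical names and data chosen for illustration. The sketch assumes that a vertex version is identified by its ID together with its set of out-neighbors.

```python
from collections import defaultdict


class DedupGraphStore:
    """Toy store (hypothetical): each distinct vertex version is kept once."""

    def __init__(self):
        # (vertex_id, frozenset of out-neighbors) -> IDs of graphs sharing it
        self.versions = defaultdict(set)

    def add_version(self, vertex_id, neighbors, graph_id):
        # An unchanged vertex maps to the same key, so no new version is stored.
        self.versions[(vertex_id, frozenset(neighbors))].add(graph_id)

    def stored_versions(self):
        return len(self.versions)

    def out_degrees(self):
        """Compute each version's out-degree once and share it across graphs."""
        result = defaultdict(dict)  # graph_id -> {vertex_id: out-degree}
        for (vertex_id, neighbors), graph_ids in self.versions.items():
            degree = len(neighbors)  # computed once per distinct version
            for graph_id in graph_ids:
                result[graph_id][vertex_id] = degree
        return dict(result)


def naive_tuple_count(snapshots):
    """Tuples needed when every edge is stored once per snapshot."""
    return sum(len(nbrs) for graph in snapshots.values() for nbrs in graph.values())


# Hypothetical series of 100 snapshots: vertex "a" never changes,
# vertex "c" gains one edge halfway through the series.
snapshots = {
    g: {"a": {"b"}, "c": {"a"} if g < 50 else {"a", "b"}}
    for g in range(100)
}

store = DedupGraphStore()
for g, graph in snapshots.items():
    for v, nbrs in graph.items():
        store.add_version(v, nbrs, g)

naive = naive_tuple_count(snapshots)
stored = store.stored_versions()
degrees = store.out_degrees()
print(naive, stored, degrees[0], degrees[99])
```

In this toy series the per-snapshot representation grows with the number of snapshots, while the deduplicated store grows only when a vertex actually changes. The degree query likewise touches each distinct version once, no matter how many graphs share it.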

Notes

  1. These systems cannot readily take advantage of commonalities among graphs and thereby suffer high space overhead. For example, one may consider using a relation to store edges of a series of graphs. In this case, for an edge contained in 100 snapshots, there will be 100 tuples for that edge, each differentiated by snapshot ID. This incurs high space overhead compared to our system, which supports deduplicated storage as described throughout this paper.

  2. In this paper, we focus on managing graphs that correspond to periodic snapshots of an evolving network. Logging the input data allows G* to reconstruct graphs as of any point in the past by using periodic snapshots and log data. This feature is not further discussed in this paper.

  3. The current G* implementation assigns each vertex to a server based on the hash value of the vertex ID. We are developing data distribution techniques that can reduce the number of edges whose endpoints are assigned to different servers.

  4. As mentioned in the cost analysis of the put(v, d, g) method, updating the CGI for a version of vertex v also requires a lookup via maps(v).
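
Note 3 describes hash-based vertex placement and the resulting edges that span servers. A minimal sketch of that idea follows, under stated assumptions: the notes do not say which hash function G* uses, so CRC32 stands in here as a stable hash, and the names `server_for` and `cut_edges`, the vertex IDs, and the server count are hypothetical.

```python
import zlib


def server_for(vertex_id: str, num_servers: int) -> int:
    """Assign a vertex to a server by hashing its ID (CRC32 is an assumption)."""
    # zlib.crc32 is stable across runs, unlike Python's built-in hash() for str.
    return zlib.crc32(vertex_id.encode("utf-8")) % num_servers


def cut_edges(edges, num_servers):
    """Return the edges whose two endpoints land on different servers."""
    return [
        (u, v)
        for u, v in edges
        if server_for(u, num_servers) != server_for(v, num_servers)
    ]


# Hypothetical triangle graph placed on 4 servers.
edges = [("v1", "v2"), ("v2", "v3"), ("v3", "v1")]
placement = {v: server_for(v, 4) for v in ("v1", "v2", "v3")}
cut = cut_edges(edges, 4)
print(placement, len(cut))
```

Hashing ignores graph structure, so neighboring vertices are placed independently of one another. The number of cut edges returned by `cut_edges` is the quantity that the data distribution techniques mentioned in note 3 aim to reduce.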

Acknowledgments

This research was supported by NSF CAREER award IIS-1149372 and also supported by the MSIP (Ministry of Science, ICT and Future Planning), Korea, under the “IT Consilience Creative Program” (NIPA-2013-H0203-13-1001) supervised by the NIPA (National IT Industry Promotion Agency).

Author information

Corresponding author

Correspondence to Alan G. Labouseur.

Additional information

Communicated by Haixun Wang and Jeffrey Xu Yu.

About this article

Cite this article

Labouseur, A.G., Birnbaum, J., Olsen, P.W. et al. The G* graph database: efficiently managing large distributed dynamic graphs. Distrib Parallel Databases 33, 479–514 (2015). https://doi.org/10.1007/s10619-014-7140-3

  • DOI: https://doi.org/10.1007/s10619-014-7140-3
