Distributed block formation and layout for disk-based management of large-scale graphs

Yaşar, Abdurrahman; Gedik, Buğra; Ferhatosmanoğlu, Hakan

doi:10.1007/s10619-017-7191-3

Published: 27 January 2017

Volume 35, pages 23–53, (2017)

Distributed and Parallel Databases

Abdurrahman Yaşar¹,
Buğra Gedik² &
Hakan Ferhatosmanoğlu²

Abstract

We are witnessing enormous growth in social networks as well as in the volume of data generated by them. An important portion of this data is in the form of graphs. In recent years, several graph processing and management systems have emerged to handle large-scale graphs. The primary goal of these systems is to run graph algorithms and queries in an efficient and scalable manner. Unlike relational data, graphs are semi-structured in nature. Thus, storing and accessing graph data using secondary storage requires new solutions that can provide locality of access for graph processing workloads. In this work, we propose a scalable block formation and layout technique for graphs, which aims at reducing the I/O cost of disk-based graph processing algorithms. To achieve this, we designed a scalable MapReduce-style method called ICBL, which can divide the graph into a series of disk blocks that contain sub-graphs with high locality. Furthermore, ICBL can order the resulting blocks on disk to further reduce non-local accesses. We experimentally evaluated ICBL to showcase its scalability and layout quality, as well as the effectiveness of its automatic parameter tuning. We deployed the graph layouts generated by ICBL on the Neo4j open source graph database management system (http://www.neo4j.org/, 2015). Our results show that the layout generated by ICBL reduces query running times over Neo4j by more than 2× compared to the default layout.

Notes

The acronym ICBL is formed by the initial letters of the four solution stages.

References

Aggarwal, C., Wang, H.: Graph data management and mining. In: Aggarwal, C. (ed.) A Survey of Algorithms and Applications. Springer, Berlin (2010)
Akyurek, S., Salem, K.: Adaptive block rearrangement. ACM Trans. Comput. Syst. 13(2), 89–121 (1995). doi:10.1145/201045.201046
Bhadkamkar, M., Guerra, J., Useche, L., Burnett, S., Liptak, J., Rangaswami, R., Hristidis, V.: BORG: block-reorganization for self-optimizing storage systems. In: Proceedings of the 7th Conference on File and Storage Technologies, pp. 183–196 (2009)
Boldi, P., Vigna, S.: The WebGraph framework I: compression techniques. In: Proceedings of the Thirteenth International World Wide Web Conference (WWW 2004), pp. 595–601 (2004)
Boldi, P., Rosa, M., Santini, M., Vigna, S.: Layered label propagation: a multiresolution coordinate-free ordering for compressing social networks. In: Proceedings of the 20th International Conference on World Wide Web (2011)
Chakrabarti, D., Zhan, Y., Faloutsos, C.: R-MAT: a recursive model for graph mining. In: Fourth SIAM International Conference on Data Mining (2004)
Dean, J., Ghemawat, S.: MapReduce: simplified data processing on large clusters. In: Symposium on Operating System Design and Implementation (OSDI), pp. 137–150 (2004)
Dominguez-Sal, D., Martinez-Bazan, N., Muntes-Mulero, V., Baleta, P., Larriba-Pey, J.: A discussion on the design of graph database benchmarks. In: Nambiar, R., Poess, M. (eds.) Performance Evaluation, Measurement and Characterization of Complex Systems. Springer, Berlin (2011)
Fortunato, S.: Community detection in graphs. Phys. Rep. 483(3–5), 75–174 (2009)
Gedik, B., Bordawekar, R.: Disk-based management of interaction graphs. IEEE Trans. Knowl. Data Eng. 26(11), 2689–2702 (2014)
Giraph: Apache Giraph. http://www.giraph.apache.org/. Accessed June 2015
Gonzalez, J.E., Low, Y., Gu, H., Bickson, D., Guestrin, C.: PowerGraph: distributed graph-parallel computation on natural graphs. In: Symposium on Operating System Design and Implementation (OSDI), pp. 17–30 (2012)
Han, W.S., Lee, S., Park, K., Lee, J.H., Kim, M.S., Kim, J., Yu, H.: TurboGraph: a fast parallel graph engine handling billion-scale graphs in a single PC. In: Proceedings of the 19th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pp. 77–85 (2013)
Hoque, I., Gupta, I.: Disk layout techniques for online social network data. IEEE Comput. 16(3), 24–36 (2012)
Kang, U., Tong, H., Sun, J., Lin, C.Y., Faloutsos, C.: GBASE: a scalable and general graph management system. In: ACM International Conference on Knowledge Discovery and Data Mining (SIGKDD), pp. 1091–1099 (2011)
Karypis, G., Kumar, V.: Multilevel graph partitioning schemes. In: International Conference on Parallel Processing (ICPP), pp. 113–122 (1995)
Kwak, H., Lee, C., Park, H., Moon, S.: What is Twitter, a social network or a news media? In: WWW’10: Proceedings of the 19th International Conference on World Wide Web, pp. 591–600 (2010)
Kyrola, A., Blelloch, G., Guestrin, C.: GraphChi: large-scale graph computation on just a PC. In: Symposium on Operating System Design and Implementation (OSDI), pp. 31–46 (2012)
Lasalle, D., Karypis, G.: Multi-threaded graph partitioning. In: Proceedings of the IEEE International Symposium on Parallel and Distributed Processing (IPDPS), pp. 225–236 (2013)
Leskovec, J., Krevl, A.: SNAP datasets: Stanford large network dataset collection (2015). http://www.snap.stanford.edu/data
Low, Y., Bickson, D., Gonzalez, J., Guestrin, C., Kyrola, A., Hellerstein, J.M.: Distributed GraphLab: a framework for machine learning and data mining in the cloud. Proc. VLDB Endow. 5(8), 716–727 (2012). doi:10.14778/2212351.2212354
MacQueen, J.: Some methods for classification and analysis of multivariate observations. In: Proceedings of the Berkeley Symposium on Mathematical Statistics and Probability, Volume 1: Statistics, pp. 281–297 (1967)
Malewicz, G., Austern, M.H., Bik, A.J., Dehnert, J.C., Horn, I., Leiser, N., Czajkowski, G.: Pregel: a system for large-scale graph processing. In: ACM International Conference on Management of Data (SIGMOD), pp. 135–146 (2010)
Mondal, J., Deshpande, A.: Managing large dynamic graphs efficiently. In: ACM International Conference on Management of Data (SIGMOD), pp. 145–156 (2012)
Nanavati, A.A., Siva, G., Das, G., Chakraborty, D., Dasgupta, K., Mukherjea, S., Joshi, A.: On the structural properties of massive telecom call graphs: findings and implications. In: ACM International Conference on Information and Knowledge Management (CIKM), pp. 435–444 (2006)
Neo4j: Neo4j open source graph database (2015). http://www.neo4j.org/
Newman, M.: Power laws, Pareto distributions and Zipf’s law. Contemp. Phys. 46(5), 323–351 (2005). doi:10.1080/00107510500052444
Nodine, M.H., Goodrich, M.T., Vitter, J.S.: Blocking for external graph searching. Algorithmica 16(2), 181–214 (1996)
Prabhakaran, V., Wu, M., Weng, X., McSherry, F., Zhou, L., Haridasan, M.: Managing large graphs on multi-cores with graph awareness. In: Proceedings of the 2012 USENIX Conference on Annual Technical Conference, pp. 4–4 (2012)
Rajaraman, A., Ullman, J.D.: Data mining. In: Mining of Massive Datasets, pp. 1–17. Cambridge University Press, Cambridge (2011)
Shao, B., Wang, H., Li, Y.: Trinity: a distributed graph engine on a memory cloud. In: ACM International Conference on Management of Data (SIGMOD) (2013)
Siek, J.G., Lee, L.Q., Lumsdaine, A.: Boost Graph Library. The User Guide and Reference Manual. Addison-Wesley, Boston (2002)
Simmhan, Y., Kumbhare, A., Wickramaarachchi, C., et al.: Goffish: a sub-graph centric framework for large-scale graph analytics. In: European Conference on Parallel Processing (Euro-Par), pp. 451–462 (2015)
Steinhaus, R.: G-Store: a storage manager for graph data. Master’s Thesis, University of Oxford (2011)
Tian, Y., Balmin, A., Corsten, S.A., Tatikonda, S., McPherson, J.: From think like a vertex to think like a graph. Proc. Very Large Databases Conf. 7(3), 193–204 (2013)
Watts, D.J., Strogatz, S.H.: Collective dynamics of ‘small-world’ networks. Nature 393(6684), 440–442 (1998)
Xie, W., Wang, G., Bindel, D., Demers, A., Gehrke, J.: Fast iterative graph computation with block updates. Proc. Very Large Databases Conf. 6(14), 2014–2025 (2013). doi:10.14778/2556549.2556581
Xin, R.S., Gonzalez, J.E., Franklin, M.J., Stoica, I.: Graphx: a resilient distributed graph system on spark. In: First International Workshop on Graph Data Management Experiences and Systems, pp. 2:1–2:6 (2013)
Yan, D., Cheng, J., Lu, Y., Ng, W.: Blogel: a block-centric framework for distributed computation on real-world graphs. Proc. Very Large Databases Conf. 7(14), 1981–1992 (2014)

Author information

Authors and Affiliations

College of Computing, Georgia Institute of Technology, Atlanta, GA, 30332, USA
Abdurrahman Yaşar
Department of Computer Engineering, Bilkent University, Bilkent, 06800, Ankara, Turkey
Buğra Gedik & Hakan Ferhatosmanoğlu

Corresponding author

Correspondence to Abdurrahman Yaşar.

About this article

Cite this article

Yaşar, A., Gedik, B. & Ferhatosmanoğlu, H. Distributed block formation and layout for disk-based management of large-scale graphs. Distrib Parallel Databases 35, 23–53 (2017). https://doi.org/10.1007/s10619-017-7191-3

Published: 27 January 2017
Issue Date: March 2017
DOI: https://doi.org/10.1007/s10619-017-7191-3
