Distributed graph cube generation using Spark framework

Kang, Seok; Lee, Suan; Kim, Jinho

doi:10.1007/s11227-019-02746-4

Published: 10 January 2019

Volume 76, pages 8118–8139, (2020)

The Journal of Supercomputing

Abstract

Graph OLAP is a technology that generates aggregates or summaries of a large-scale graph based on the properties (or dimensions) associated with its nodes and edges, and in turn enables interactive analysis of the statistical information contained in the graph. To support these OLAP functions efficiently, a graph cube is widely used; it maintains aggregate graphs for all dimensions of the source graph. However, computing the graph cube for a large graph requires an enormous amount of time. While previous approaches have used the MapReduce framework to reduce this computation time, the more recently developed Spark environment offers superior computational performance. To leverage the advantages of Spark, we propose the GraphNaïve and GraphTDC algorithms. GraphNaïve sequentially computes graph cuboids for all dimensions in a graph, while GraphTDC computes them after first creating an execution plan. We also propose the Generate Multi-Dimension Table method, which efficiently creates a multidimensional graph table to express the graph. Evaluation experiments demonstrated that the GraphTDC algorithm significantly outperformed DataFrame, the built-in library of Spark SQL, as the size of the graphs increased.
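
To make the notion of a graph cuboid concrete, the following is a minimal single-machine sketch in plain Python. It is not the authors' GraphNaïve or GraphTDC implementation and does not use Spark; the toy graph, the dimension names (`gender`, `city`), and the function names are all hypothetical. It only illustrates the idea stated above: for each subset of dimensions, nodes with equal values collapse into one aggregate node, edges are counted between aggregate nodes, and every cuboid is computed in turn from the source graph.

```python
from collections import Counter
from itertools import combinations

# Hypothetical toy graph (not from the paper): node id -> dimension values.
NODES = {
    1: {"gender": "F", "city": "Seoul"},
    2: {"gender": "M", "city": "Seoul"},
    3: {"gender": "F", "city": "Busan"},
    4: {"gender": "M", "city": "Busan"},
}
EDGES = [(1, 2), (1, 3), (2, 4), (3, 4), (1, 4)]
DIMENSIONS = ("gender", "city")


def cuboid(nodes, edges, dims):
    """Aggregate the graph on the chosen dimensions.

    Nodes that share the same values on `dims` collapse into one aggregate
    node; edges are counted between aggregate nodes.
    Returns (node_counts, edge_counts).
    """
    def key(n):
        return tuple(nodes[n][d] for d in dims)

    node_counts = Counter(key(n) for n in nodes)
    # Assumption: edges are undirected, so endpoint keys are sorted.
    edge_counts = Counter(tuple(sorted((key(u), key(v)))) for u, v in edges)
    return node_counts, edge_counts


def naive_graph_cube(nodes, edges, dimensions):
    """Compute one cuboid per subset of dimensions, each from the source graph."""
    cube = {}
    for k in range(len(dimensions) + 1):
        for dims in combinations(dimensions, k):
            cube[dims] = cuboid(nodes, edges, dims)
    return cube


CUBE = naive_graph_cube(NODES, EDGES, DIMENSIONS)
```

With two dimensions, `CUBE` holds four cuboids, keyed by `()`, `("gender",)`, `("city",)`, and `("gender", "city")`. The number of cuboids doubles with each added dimension, which is why the cost of cube computation grows so quickly on large graphs.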

Acknowledgements

This research was supported by Korea Electric Power Corporation (Grant Number: R18XA05) and by the Industrial Technology Innovation Program (Project#: 10052797), through the Korea Evaluation Institute of Industrial Technology (KEIT), funded by the Ministry of Trade, Industry and Energy.

Author information

Authors and Affiliations

Department of Computer Science, Kangwon National University, Chuncheon, Kangwon, Korea
Seok Kang, Suan Lee & Jinho Kim

Corresponding author

Correspondence to Suan Lee.

About this article

Cite this article

Kang, S., Lee, S. & Kim, J. Distributed graph cube generation using Spark framework. J Supercomput 76, 8118–8139 (2020). https://doi.org/10.1007/s11227-019-02746-4

Issue Date: October 2020
DOI: https://doi.org/10.1007/s11227-019-02746-4
