Performance Analysis of Spark/GraphX on POWER8 Cluster

Que, Xinyu; Schneidenbach, Lars; Checconi, Fabio; Costa, Carlos H. Á.; Buono, Daniele

doi:10.1007/978-3-319-46079-6_19

Part of the book series: Lecture Notes in Computer Science (LNTCS, volume 9945)

Included in the following conference series:

International Conference on High Performance Computing

Abstract

POWER8, the latest RISC (Reduced Instruction Set Computer) microprocessor of the IBM Power architecture family, was designed to significantly benefit emerging workloads, including Business Analytics, Cloud Computing and High Performance Computing. In this paper, we provide a thorough performance evaluation of a widely used large-scale graph processing framework, Spark/GraphX, on a POWER8 cluster. Note that we use the Spark and Java versions out of the box, without any optimization. We examine performance with several important graph kernels, such as Breadth-First Search, Connected Components, and PageRank, using both large real-world social graphs and synthetic graphs with billions of edges. We study Spark/GraphX performance with respect to several architectural aspects and perform the first Spark/GraphX scalability test with up to 16 POWER8 nodes.

Notes

1. All transformations in Spark are lazy, in that they do not compute their results right away. Instead, they just remember the transformations applied to some base dataset (e.g. a file). The transformations are only computed when an action requires a result to be returned to the driver program.
2. A system monitor command used to report on various system loads.
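
The lazy evaluation described in note 1 can be illustrated with a small toy model. This is only a sketch in plain Python, not Spark's API: the class `LazyDataset` and its methods are hypothetical names chosen for illustration. Transformations (`map`, `filter`) only record what should be done; the recorded steps run when an action (`collect`) asks for a result.

```python
# Toy model of lazy transformations. LazyDataset is a hypothetical
# illustration and is NOT part of Spark or GraphX.
class LazyDataset:
    def __init__(self, source, ops=()):
        self._source = source    # base dataset (any re-iterable collection)
        self._ops = tuple(ops)   # recorded transformations, not yet applied

    def map(self, fn):
        # Transformation: remember the step and return a new dataset.
        return LazyDataset(self._source, self._ops + (("map", fn),))

    def filter(self, pred):
        # Transformation: remember the step and return a new dataset.
        return LazyDataset(self._source, self._ops + (("filter", pred),))

    def collect(self):
        # Action: replay the recorded steps over the base dataset.
        items = iter(self._source)
        for kind, fn in self._ops:
            items = map(fn, items) if kind == "map" else filter(fn, items)
        return list(items)


calls = []  # records every element that square() actually processes


def square(x):
    calls.append(x)
    return x * x


even_squares = LazyDataset(range(5)).map(square).filter(lambda v: v % 2 == 0)
calls_before_action = len(calls)  # nothing has been computed yet
result = even_squares.collect()   # the action triggers the computation
```

In this sketch `square` is not called while the pipeline is being built; it runs once per element only when `collect()` is invoked, which mirrors the behavior the note describes for Spark transformations and actions.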

Author information

Authors and Affiliations

IBM TJ Watson, Yorktown Heights, NY, 10598, USA
Xinyu Que, Lars Schneidenbach, Fabio Checconi, Carlos H. Á. Costa & Daniele Buono

Corresponding authors

Correspondence to Xinyu Que, Lars Schneidenbach or Fabio Checconi.

Editor information

Editors and Affiliations

University of Delaware, Newark, Delaware, USA
Michela Taufer
Forschungszentrum Jülich, Jülich, Germany
Bernd Mohr
DKRZ, Hamburg, Germany
Julian M. Kunkel

About this paper

Cite this paper

Que, X., Schneidenbach, L., Checconi, F., Costa, C.H.Á., Buono, D. (2016). Performance Analysis of Spark/GraphX on POWER8 Cluster. In: Taufer, M., Mohr, B., Kunkel, J. (eds) High Performance Computing. ISC High Performance 2016. Lecture Notes in Computer Science, vol 9945. Springer, Cham. https://doi.org/10.1007/978-3-319-46079-6_19

DOI: https://doi.org/10.1007/978-3-319-46079-6_19
Published: 06 October 2016
Publisher Name: Springer, Cham
Print ISBN: 978-3-319-46078-9
Online ISBN: 978-3-319-46079-6
eBook Packages: Computer Science, Computer Science (R0)