Understanding parallelism in graph traversal on multi-core clusters

Lv, Huiwei; Tan, Guangming; Chen, Mingyu; Sun, Ninghui

doi:10.1007/s00450-012-0207-3

Special Issue Paper
Published: 23 May 2012

Volume 28, pages 193–201 (2013)

Computer Science - Research and Development

Huiwei Lv¹,², Guangming Tan¹, Mingyu Chen¹ & Ninghui Sun¹

Abstract

There is an ever-increasing need for exploring large-scale graph data sets in computational sciences, social networks, and business analytics. However, due to their irregular and memory-intensive nature, graph applications are notorious for their poor performance on parallel computer systems. In this paper we propose a new hybrid MPI/Pthreads breadth-first search (BFS) algorithm featuring (i) overlapping computation and communication by separating them into multiple threads, (ii) maximizing multi-threading parallelism on multi-cores with massive threads to improve throughput, and (iii) exploiting pipeline parallelism using lock-free queues for asynchronous communication. By comparing it with the traditional MPI-only BFS algorithm, we learned several valuable lessons that help to understand and exploit parallelism in graph traversal applications. Experiments show our algorithm is 1.9× faster than the MPI-only version, capable of processing 1.45 billion edges per second on a 32-node SMP cluster. At a large scale, our algorithm is 1.49× faster than the MPI-only BFS algorithm in the Combinatorial BLAS Library on 6,144 cores.
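The idea of separating computation and communication into different threads joined by a queue can be illustrated with a minimal single-process sketch. It is not the authors' MPI/Pthreads implementation: the function name `pipelined_bfs`, the adjacency-dictionary input, and the use of Python's lock-based `queue.Queue` in place of a lock-free queue are all assumptions made for illustration, and no MPI communication takes place.

```python
import queue
import threading

_DONE = object()  # sentinel marking the end of one BFS level


def pipelined_bfs(adj, source):
    """Level-synchronous BFS with edge scanning and vertex delivery in separate threads.

    adj maps each vertex to a list of neighbours (hypothetical input format).
    Returns a dict mapping every reachable vertex to its BFS level.
    """
    level = {source: 0}
    frontier = [source]
    depth = 0
    while frontier:
        # Stand-in for the lock-free queue; queue.Queue itself uses locks.
        channel = queue.Queue()
        next_frontier = []

        def compute():
            # "Computation" stage: scan adjacency lists of the current frontier.
            for u in frontier:
                for v in adj.get(u, ()):
                    channel.put(v)
            channel.put(_DONE)

        def communicate():
            # "Communication" stage: in a distributed run this would ship each
            # vertex to its owner; here it just records first-time visits.
            while True:
                v = channel.get()
                if v is _DONE:
                    break
                if v not in level:
                    level[v] = depth + 1
                    next_frontier.append(v)

        stages = [threading.Thread(target=compute),
                  threading.Thread(target=communicate)]
        for t in stages:
            t.start()
        for t in stages:
            t.join()  # level barrier: both stages finish before the next level

        frontier = next_frontier
        depth += 1
    return level


example_graph = {0: [1, 2], 1: [3], 2: [3], 3: [4], 4: []}
levels = pipelined_bfs(example_graph, 0)
```

Only the second stage writes to `level`, so the two threads share no mutable state other than the queue. That property is what would let a single-producer, single-consumer lock-free queue replace `queue.Queue` in a lower-level language.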

Notes

For fairness, we optimize the original program with the bitmap technique, as the MPI program does.
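The note does not spell out the bitmap technique. One common reading, assumed here, is a visited set stored as one bit per vertex, so that membership checks touch far less memory than a per-vertex array of integers. The sketch below shows that reading only; the class name `VisitedBitmap` and the method `test_and_set` are hypothetical and do not come from the paper.

```python
class VisitedBitmap:
    """One bit per vertex, packed eight vertices to a byte (illustrative only)."""

    def __init__(self, num_vertices):
        # Round up so the last partial byte is still allocated.
        self.bits = bytearray((num_vertices + 7) // 8)

    def test_and_set(self, v):
        """Mark vertex v as visited; return True if it was already marked."""
        byte_index = v >> 3       # v // 8
        mask = 1 << (v & 7)       # bit position within the byte
        seen = bool(self.bits[byte_index] & mask)
        self.bits[byte_index] |= mask
        return seen


bitmap = VisitedBitmap(20)
first_visit = bitmap.test_and_set(9)    # vertex 9 not seen before
second_visit = bitmap.test_and_set(9)   # now already marked
```

This version is not thread-safe. A multi-threaded BFS would need an atomic operation or a tolerated benign race on the bitmap, and the source does not say which is used.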

Acknowledgements

The authors gratefully acknowledge Erlin Yao and the anonymous reviewers for their helpful comments on previous drafts of this work.

This work is supported by the National 863 Program (2009AA01A129), the National Natural Science Foundation of China (61003062, 60925009, 60921002, 60803030, and 61033009), and the 973 Program (2011CB302502 and 2011CB302500).

Author information

Authors and Affiliations

State Key Laboratory of Computer Architecture, Institute of Computing Technology, Chinese Academy of Sciences, Beijing, China
Huiwei Lv, Guangming Tan, Mingyu Chen & Ninghui Sun
Graduate School of Chinese Academy of Sciences, Beijing, China
Huiwei Lv

Corresponding author

Correspondence to Huiwei Lv.

About this article

Cite this article

Lv, H., Tan, G., Chen, M. et al. Understanding parallelism in graph traversal on multi-core clusters. Comput Sci Res Dev 28, 193–201 (2013). https://doi.org/10.1007/s00450-012-0207-3

Published: 23 May 2012
Issue Date: May 2013
DOI: https://doi.org/10.1007/s00450-012-0207-3
