DOI: 10.1145/2612262.2612271

Overhead of a decentralized gossip algorithm on the performance of HPC applications

Published: 10 June 2014

Abstract

Gossip algorithms can provide online information about the availability and state of resources in supercomputers. These algorithms require minimal computing and storage capabilities at each node and, when properly tuned, are not expected to overload the nodes or the network that connects them. These properties make gossip interesting for future exascale systems. This paper examines the overhead of a decentralized gossip algorithm on the performance of parallel MPI applications running on up to 8192 nodes of an IBM BlueGene/Q supercomputer. The applications used in the experiments include PTRANS and MPI-FFT from the HPCC benchmark suite as well as the coupled weather and cloud simulation model COSMO-SPECS+FD4. In most cases, no gossip overhead was observed when gossip messages were sent at intervals of 256 ms or more. As expected, the overhead observed at higher rates is sensitive to the communication pattern of the application and the amount of gossip information being circulated.
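
To make the mechanism concrete, the following is a minimal sketch of a synchronous push-style gossip simulation in Python. It is not the algorithm evaluated in the paper: the function names (`gossip_round`, `simulate`), the uniform random peer selection, and the age-based merge rule are all assumptions chosen for illustration. Each node keeps a small table of what it knows about other nodes and, once per round, pushes that table to one randomly chosen peer.

```python
import random


def gossip_round(views, rng):
    """One synchronous round: every node pushes its view to one random peer."""
    n = len(views)
    # Snapshot the views so all messages carry start-of-round state.
    snapshot = [dict(v) for v in views]
    for sender in range(n):
        # Pick a peer uniformly at random, excluding the sender itself.
        receiver = rng.randrange(n - 1)
        if receiver >= sender:
            receiver += 1
        for node, age in snapshot[sender].items():
            known = views[receiver].get(node)
            # Keep the freshest (lowest-age) entry for each node.
            if known is None or age < known:
                views[receiver][node] = age


def simulate(n_nodes, seed=0):
    """Return the rounds needed until every node has an entry for every node."""
    rng = random.Random(seed)
    # views[i] maps node id -> age (in rounds) of the newest information held.
    views = [{i: 0} for i in range(n_nodes)]
    rounds = 0
    while any(len(v) < n_nodes for v in views):
        for i, view in enumerate(views):
            for node in view:
                view[node] += 1  # all held information grows one round older
            view[i] = 0          # a node always has fresh data about itself
        gossip_round(views, rng)
        rounds += 1
    return rounds


if __name__ == "__main__":
    for n in (64, 512):
        print(n, "nodes:", simulate(n, seed=1), "rounds")
```

Because every node sends exactly one message per round, the number of nodes holding information about a given node can at most double each round, so full dissemination needs at least log2(n) rounds in this sketch. The per-node cost per round stays at one message regardless of system size, which is the property the abstract points to when it describes gossip as lightweight.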

Cited By

  • (2020) FFMK: A Fast and Fault-Tolerant Microkernel-Based System for Exascale Computing. Software for Exascale Computing - SPPEXA 2016-2019, pages 483-516. DOI: 10.1007/978-3-030-47956-5_16. Online publication date: 31-Jul-2020
  • (2016) FFMK: A Fast and Fault-Tolerant Microkernel-Based System for Exascale Computing. Software for Exascale Computing - SPPEXA 2013-2015, pages 405-426. DOI: 10.1007/978-3-319-40528-5_18. Online publication date: 15-Sep-2016
  • (2015) Energy-efficient Algorithms for Ultrascale Systems. Supercomputing Frontiers and Innovations: an International Journal, 2:2, pages 77-104. DOI: 10.14529/jsfi150205. Online publication date: 6-Apr-2015

Published In

ROSS '14: Proceedings of the 4th International Workshop on Runtime and Operating Systems for Supercomputers
June 2014
76 pages
ISBN: 9781450329507
DOI: 10.1145/2612262

Sponsors

  • SPCL: Scalable Parallel Computing Laboratory

Publisher

Association for Computing Machinery

New York, NY, United States

Author Tags

  1. benchmarking
  2. cluster management
  3. gossip algorithm
  4. high performance computing

Qualifiers

  • Research-article

Conference

ROSS '14
Sponsor:
  • SPCL

Acceptance Rates

ROSS '14 Paper Acceptance Rate 9 of 16 submissions, 56%;
Overall Acceptance Rate 58 of 169 submissions, 34%

Citations

  • (2015) Resilient gossip algorithms for collecting online management information in exascale clusters. Concurrency and Computation: Practice & Experience, 27:17, pages 4797-4818. DOI: 10.1002/cpe.3465. Online publication date: 10-Dec-2015
