Performance drop at executing communication-intensive parallel algorithms

The Journal of Supercomputing

Abstract

This work summarizes the results of a set of executions carried out on three fat-tree network supercomputers: Stampede at TACC (USA), Helios at IFERC (Japan) and Eagle at PSNC (Poland). Three MPI-based, communication-intensive scientific applications compiled for CPUs have been executed under weak-scaling tests: the molecular dynamics solver LAMMPS; the finite element-based mini-kernel miniFE of NERSC (USA); and the three-dimensional fast Fourier transform mini-kernel bigFFT of LLNL (USA). The experiments are designed to probe the sensitivity of the applications to markedly different patterns of task location and to assess the impact on cluster performance. The weak-scaling tests highlight the effect of the MPI-based application mappings (concentrated vs. distributed placement of MPI tasks over the nodes) on the cluster. Results reveal that highly distributed task patterns may lead to much longer execution times at scale, when several hundred or thousands of MPI tasks are involved in the experiments. Such a characterization helps users carry out further, more efficient executions. Researchers may also use these experiments to improve their scalability simulators. In addition, these results are useful from the cluster administration standpoint, since task mapping has an impact on cluster throughput.
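
As an illustration of how such task mappings can be inspected in practice (a minimal sketch, not part of the paper's benchmark suite), the short C/MPI program below reports the node hosting each rank. Launching it with different placement options of the resource manager, for instance by varying how many tasks are packed per node, makes a concentrated versus distributed mapping visible before a weak-scaling run.

    /* rankmap.c -- minimal sketch: print which node hosts each MPI rank.
       Compile with: mpicc -o rankmap rankmap.c */
    #include <mpi.h>
    #include <stdio.h>

    int main(int argc, char **argv)
    {
        int rank, size, name_len;
        char node[MPI_MAX_PROCESSOR_NAME];

        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        MPI_Comm_size(MPI_COMM_WORLD, &size);
        MPI_Get_processor_name(node, &name_len);

        /* One line per rank: rank id, total number of ranks, hosting node. */
        printf("rank %d of %d on node %s\n", rank, size, node);

        MPI_Finalize();
        return 0;
    }

Running the same task count packed onto few nodes or spread over many (for example, via Slurm's --ntasks-per-node option) corresponds, respectively, to the concentrated and distributed patterns studied in the paper.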

Acknowledgements

This work was partially funded by the Spanish Ministry of Economy and Competitiveness through the CODEC-OSE project (RTI2018-096006-B-I00) and by the Comunidad de Madrid through the CABAHLA project (S2018/TCS-4423), both with European Regional Development Funds (ERDF). It also benefited from the H2020 co-funded projects Energy oriented Centre of Excellence for Computing Applications II (EoCoE-II, No. 824158) and Supercomputing and Energy in Mexico (Enerxico, No. 828947). Access to the resources of the CYTED Network RICAP (517RT0529) and the Poznan Supercomputing and Networking Center is acknowledged, as is the support of Marcin Pospieszny, system administrator at PSNC.

Author information

Corresponding author

Correspondence to José A. Moríñigo.

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Cite this article

Moríñigo, J.A., García-Muller, P., Rubio-Montero, A.J. et al. Performance drop at executing communication-intensive parallel algorithms. J Supercomput 76, 6834–6859 (2020). https://doi.org/10.1007/s11227-019-03142-8
