ABSTRACT
A majority of parallel applications executed on HPC clusters use MPI for communication between processes. Most users treat MPI as a black box, executing their programs with the cluster's default settings. While the default settings perform adequately in many cases, it is well known that tuning the MPI environment can significantly improve application performance. Although existing optimization tools are effective when used by performance experts, they require deep knowledge of MPI library behavior and of the underlying hardware architecture on which the application will execute. Therefore, an easy-to-use tool that provides recommendations for configuring the MPI environment to optimize application performance is highly desirable. This paper addresses this need by presenting an easy-to-use methodology and tool, named MPI Advisor, that requires just a single execution of the input application to characterize its predominant communication behavior and determine the MPI configuration that may enhance its performance on the target combination of MPI library and hardware architecture. Currently, MPI Advisor provides recommendations that address the four most commonly occurring MPI-related performance bottlenecks, which concern the choice of: 1) point-to-point protocol (eager vs. rendezvous), 2) collective communication algorithm, 3) MPI tasks-to-cores mapping, and 4) InfiniBand transport protocol. The performance gains obtained by implementing the recommended optimizations in the case studies presented in this paper range from a few percent to more than 40%. Specifically, using this tool, we were able to improve the performance of HPCG with MVAPICH2 on four nodes of the Stampede cluster from 6.9 GFLOP/s to 10.1 GFLOP/s. Since the tool provides application-specific recommendations, it also informs the user about correct usage of MPI.
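To make the first of the four tuning targets concrete, the sketch below is a minimal MPI ping-pong micro-benchmark in C, written as a hypothetical illustration rather than as part of MPI Advisor itself. Its message size sits near a typical eager/rendezvous switch point, so its round-trip latency is sensitive to the point-to-point protocol threshold. The launch command in the comment assumes MVAPICH2 environment variables such as MV2_IBA_EAGER_THRESHOLD and MV2_CPU_BINDING_POLICY; the exact variable names, values, and defaults depend on the MPI library and version installed on the target system.

```c
/* Hypothetical illustration (not part of MPI Advisor): a ping-pong whose
 * message size lies near common eager/rendezvous cut-offs, so its latency
 * depends on the point-to-point protocol threshold chosen by the library.
 *
 * With MVAPICH2 it might be launched along the lines of
 *   MV2_IBA_EAGER_THRESHOLD=65536 MV2_CPU_BINDING_POLICY=scatter \
 *     mpirun -np 2 ./pingpong
 * (variable names and defaults depend on the MPI library and version).
 */
#include <mpi.h>
#include <stdio.h>
#include <stdlib.h>

#define MSG_BYTES (32 * 1024)   /* near typical eager/rendezvous thresholds */
#define ITERS     1000

int main(int argc, char **argv)
{
    int rank, size;
    char *buf;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);
    if (size < 2) {
        if (rank == 0) fprintf(stderr, "run with at least 2 ranks\n");
        MPI_Abort(MPI_COMM_WORLD, 1);
    }

    buf = malloc(MSG_BYTES);
    MPI_Barrier(MPI_COMM_WORLD);
    double t0 = MPI_Wtime();

    /* ranks 0 and 1 exchange a fixed-size message ITERS times */
    for (int i = 0; i < ITERS; i++) {
        if (rank == 0) {
            MPI_Send(buf, MSG_BYTES, MPI_CHAR, 1, 0, MPI_COMM_WORLD);
            MPI_Recv(buf, MSG_BYTES, MPI_CHAR, 1, 0, MPI_COMM_WORLD,
                     MPI_STATUS_IGNORE);
        } else if (rank == 1) {
            MPI_Recv(buf, MSG_BYTES, MPI_CHAR, 0, 0, MPI_COMM_WORLD,
                     MPI_STATUS_IGNORE);
            MPI_Send(buf, MSG_BYTES, MPI_CHAR, 0, 0, MPI_COMM_WORLD);
        }
    }

    double t1 = MPI_Wtime();
    if (rank == 0)
        printf("avg round-trip time: %.2f us\n", 1e6 * (t1 - t0) / ITERS);

    free(buf);
    MPI_Finalize();
    return 0;
}
```

Running such a micro-benchmark once with the library's default threshold and once with an adjusted one is a simple way to check, on a given cluster, whether a recommendation of type 1) actually pays off before applying it to a full application run.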