Energy saving strategies for parallel applications with point-to-point communication phases

https://doi.org/10.1016/j.jpdc.2013.03.006

Highlights

  • We present a runtime procedure to detect communication phases.

  • We propose novel frequency scaling strategies to save energy.

  • We explore energy-saving opportunities in the quantum chemistry software GAMESS.

  • We propose different process-to-core binding strategies for GAMESS.

Abstract

Although high-performance computing traditionally focuses on the efficient execution of large-scale applications, both energy and power have become critical concerns when approaching exascale. Drastic increases in the power consumption of supercomputers significantly affect their operating costs and failure rates. In modern microprocessor architectures, equipped with dynamic voltage and frequency scaling (DVFS) and CPU clock modulation (throttling), the power consumption may be controlled in software. Additionally, network interconnects, such as Infiniband, may be exploited to maximize energy savings, although the application performance loss and frequency-switching overheads must be carefully balanced. This paper advocates a runtime assessment of such overheads by grouping point-to-point communications into phases and then analyzing the time gaps between the communication calls. Certain communication and architectural parameters are taken into consideration in the three proposed frequency scaling strategies, which differ with respect to their treatment of the time gaps. The experimental results are presented for NAS parallel benchmark problems as well as for realistic parallel electronic structure calculations performed by the widely used quantum chemistry package GAMESS. For the latter, three different process-to-core mappings were studied with respect to their energy savings under the proposed frequency scaling strategies and under existing state-of-the-art techniques. Close to the maximum energy savings were obtained with a low performance loss of 2% on the given platform.

Introduction

Power consumption has become a major concern for modern and future-generation supercomputers consuming tens of megawatts of power. For example, the petascale K supercomputer, which was at the top of the TOP500 list in 2011, consumes around 12 MW of power for its 705,024 cores. At the exascale, supercomputers having 100 million cores are expected to operate on “only” 100 MW of power [3]. Their tremendous power demands drastically increase operating costs and limit scalability and sustainability. Therefore, energy conservation must be employed at all levels of high-performance computing (HPC): applications, system software, and hardware.

The current generation of Intel processors provides various P-states for dynamic voltage and frequency scaling (DVFS) and T-states for introducing processor idle cycles (throttling). For example, the Intel “Penryn” microarchitecture provides four P-states (f1, …, f4, where fi > fj for i < j) and eight T-states Tj, where Tj inserts j (j = 0, …, 7) idle cycles per eight cycles of CPU execution. Table 1 depicts the P-states and the associated core voltages offered by the Intel Xeon E5450 processor used in this work. The delay of switching from one state to another depends on the relative ordering of the current and desired states, as discussed, e.g., in [24]. The user may write a specific value to model-specific registers (MSRs) to change the P- or T-state of the system. Note that throttling can be viewed as equivalent to dynamic frequency scaling (DFS) [1]: by inserting idle cycles into the CPU execution, a lower effective frequency is obtained without changing the operating voltage of the cores. Therefore, it is often less effective than DVFS.
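As a concrete illustration of the MSR mechanism mentioned above, the following minimal C sketch (not taken from the paper) requests a new P-state for one core by writing the IA32_PERF_CTL register (MSR 0x199) through the Linux msr driver. The register address and the value encoding are assumptions that must be checked against the processor documentation; root privileges and the msr kernel module are required.

```c
/* Minimal sketch, assuming the Linux msr driver is available:
 * change the P-state of one core by writing IA32_PERF_CTL (MSR 0x199).
 * The perf_ctl value encoding is processor-specific and illustrative here. */
#include <fcntl.h>
#include <stdint.h>
#include <stdio.h>
#include <unistd.h>

#define IA32_PERF_CTL 0x199  /* P-state (DVFS) request register */

int set_pstate(int core, uint64_t perf_ctl_value)
{
    char path[64];
    snprintf(path, sizeof(path), "/dev/cpu/%d/msr", core);
    int fd = open(path, O_WRONLY);
    if (fd < 0)
        return -1;
    /* For /dev/cpu/N/msr, the file offset selects the MSR address. */
    ssize_t n = pwrite(fd, &perf_ctl_value, sizeof(perf_ctl_value), IA32_PERF_CTL);
    close(fd);
    return (n == (ssize_t)sizeof(perf_ctl_value)) ? 0 : -1;
}
```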

Various methods exist to employ DVFS. The more sophisticated techniques scale the processor frequency over different intervals of the application runtime while attempting to accurately predict the performance effects of the DVFS. Such techniques may be broadly classified into two types: those that first divide the application into execution intervals of predefined duration and then use performance counters to determine a suitable frequency for each interval [10], [14], [15], and those that first determine communication intervals in parallel applications using either explicit message passing [9], [20] or global address-space primitives [36] and then scale the frequency for those intervals, usually based on the MIPS metric.
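To make the MIPS-based interval idea concrete, the sketch below is a hypothetical illustration only: the thresholds and frequency table are assumptions, and systems such as CPU MISER use far more elaborate performance models. The intent is simply that intervals with a low instruction rate (typically communication- or memory-bound) can run at a low frequency with little performance penalty.

```c
/* Hypothetical MIPS-based frequency choice (assumed thresholds and P-states). */
#include <stdint.h>

static const double freq_ghz[] = { 2.00, 2.33, 2.66, 3.00 };  /* available P-states */

double select_frequency(uint64_t retired_instr, double interval_sec)
{
    double mips = (double)retired_instr / interval_sec / 1.0e6;

    if (mips < 200.0)        /* mostly waiting on communication or memory */
        return freq_ghz[0];  /* lowest frequency saves energy cheaply     */
    else if (mips < 1000.0)  /* partially stalled                         */
        return freq_ghz[1];
    else                     /* compute-bound: keep the highest frequency */
        return freq_ghz[3];
}
```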

Modern interconnects, such as Infiniband,1 provide operating-system kernel bypass mechanisms for a full CPU offload during communications. Note that the term “CPU offload” means “offloading from the CPU” and is used here consistently with Infiniband vendor presentations (see footnote 1). With the CPU offload, more cycles become available to the CPU for computational tasks, or the CPU may be put into a low-power mode while waiting for data, as in blocking communications. In this work, the CPU offloading during the communications and the performance-counter information on the computations are combined to extract the maximum energy savings without requiring any changes to the user application or to the communication library.
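One way to exploit this offload around a blocking receive can be sketched with the PMPI profiling interface: the frequency is dropped while the CPU only waits on the network adapter and restored afterwards. This is a hedged illustration, not the paper's mechanism; the P-state helpers are empty placeholders standing in for, e.g., the MSR write sketched earlier, and (as discussed below) switching on every call like this may be too costly.

```c
/* Hypothetical PMPI wrapper: lower the frequency while a blocking receive
 * waits on the interconnect, then restore it. Illustration only. */
#include <mpi.h>

/* Placeholder helpers (assumed): a real version would write the MSR or cpufreq. */
static void set_low_pstate(void)  { /* request the lowest P-state  */ }
static void set_high_pstate(void) { /* request the highest P-state */ }

int MPI_Recv(void *buf, int count, MPI_Datatype type, int src,
             int tag, MPI_Comm comm, MPI_Status *status)
{
    set_low_pstate();                 /* CPU mostly idle during the wait */
    int rc = PMPI_Recv(buf, count, type, src, tag, comm, status);
    set_high_pstate();                /* back to full speed for compute  */
    return rc;
}
```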

A wide range of HPC applications are developed using communication libraries implementing the Message Passing Interface2 (MPI), which has become a de facto standard for the design of parallel applications. It defines point-to-point and collective communication primitives for sending and receiving data explicitly within a parallel application. In the authors’ previous work [32], energy efficiency in the MPI collective operations, such as the all-to-all exchange, was addressed. Specifically, a collective operation was considered as a multitude of point-to-point communications grouped together, essentially presenting a single communication stream. Since the rank3 sequences of the point-to-point transfers within a collective communication are typically known during the collective algorithm execution, the places to apply DVFS and throttling may be determined in advance for a given message size. For example, an a priori algorithmic analysis of the MPI_Alltoall algorithms reveals that, for a few initial iterations, every core undergoes intranode communications, after which the communication becomes purely internode. Thus, different throttling states may be preselected, while the frequency is lowered via DVFS to its minimum at the start of MPI_Alltoall and raised back to the highest level at the end.

A single point-to-point communication is, on the other hand, just one call per given processor rank. Therefore, scaling its frequency on a per-call basis, as was done for collectives, may easily result in a significant parallel performance degradation due to the switching overhead. Nevertheless, it is desirable to decrease the energy consumption during the point-to-point communications, in addition to collectives, because they may constitute a significant portion of the execution (often more than 10%), and applications communicating heavily in the point-to-point fashion are abundant. Therefore, this work focuses on the class of point-to-point communications as provided by the MPI standard.
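In contrast to wrapping each point-to-point call individually (as in the MPI_Recv sketch above), the collective-level approach of the authors' earlier work pays the switching cost only once per operation. The following minimal sketch conveys that idea only; the helper functions are placeholders, the MPI-3 const signature is assumed, and the real approach also preselects throttling states per algorithm step and message size.

```c
/* Sketch: one frequency switch down at entry and one up at exit of a
 * collective, so the overhead is amortized over its many internal
 * point-to-point transfers. Helper functions are assumed placeholders. */
#include <mpi.h>

static void set_low_pstate(void)  { /* request the lowest P-state  */ }
static void set_high_pstate(void) { /* request the highest P-state */ }

int MPI_Alltoall(const void *sendbuf, int sendcount, MPI_Datatype sendtype,
                 void *recvbuf, int recvcount, MPI_Datatype recvtype,
                 MPI_Comm comm)
{
    set_low_pstate();
    int rc = PMPI_Alltoall(sendbuf, sendcount, sendtype,
                           recvbuf, recvcount, recvtype, comm);
    set_high_pstate();
    return rc;
}
```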

By considering test cases from the NAS benchmark suite [2], this work validates the proposed runtime procedure that groups several point-to-point communications, aiming to reduce the overhead from the DVFS and throttling. Next, it applies this procedure to realistic electronic structure calculations performed by the widely used GAMESS quantum chemistry package [12], [29], which is capable of performing molecular structure and property calculations with a rich variety of ab initio methods. GAMESS has an estimated user base of 150,000 in more than 100 countries. Its custom-built communication library is based on partitioned global address space (PGAS) concepts.

The rest of the paper is organized as follows. Section 2 describes the design of the runtime procedure for analyzing the point-to-point communications during the application execution. Section 3 discusses the communication model of GAMESS and techniques to apply the runtime communication analysis to GAMESS efficiently. Section 4 presents the experimental results, while Sections 5 and 6 provide the related work and the conclusions, respectively.

Section snippets

Runtime communication analysis

To apply frequency scaling in point-to-point communications, it is helpful to first categorize them according to the reappearance of rank sequences and message sizes, that is, to determine the recurring point-to-point patterns. Then, by analyzing the obtained recurring patterns, it may be decided whether or not the frequency-switching overhead is amortized and, thus, whether or not the CPU frequency scaling is warranted. In this paper, the ranks recurring during a certain time period are
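A minimal sketch of the kind of bookkeeping such a runtime analysis might perform is given below. The data layout and the recurrence test are assumptions, not the paper's exact algorithm: each point-to-point call is logged with its peer rank, message size, and timestamp, and a recurrence is flagged when the same (rank, size) pair reappears within a time window.

```c
/* Hypothetical call log for detecting recurring point-to-point patterns.
 * A real implementation would bound the scan cost and compare rank/size
 * sequences (phases) rather than single entries. */
#include <stddef.h>

#define LOG_CAPACITY 4096

typedef struct {
    int    peer;       /* destination or source rank      */
    size_t bytes;      /* message size                    */
    double t;          /* timestamp of the call (seconds) */
} CallRecord;

static CallRecord log_buf[LOG_CAPACITY];
static int n_calls = 0;

/* Record one call and report whether the same (peer, size) pair was seen
 * within the last `window` seconds, i.e., whether a pattern is recurring. */
int record_call(int peer, size_t bytes, double now, double window)
{
    int recurring = 0;
    for (int i = n_calls - 1; i >= 0 && now - log_buf[i].t <= window; --i)
        if (log_buf[i].peer == peer && log_buf[i].bytes == bytes) {
            recurring = 1;
            break;
        }
    if (n_calls < LOG_CAPACITY)
        log_buf[n_calls++] = (CallRecord){ peer, bytes, now };
    return recurring;
}
```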

Overview of GAMESS

GAMESS is one of the most representative freely available quantum chemistry applications used worldwide for ab initio electronic structure calculations. A wide range of quantum chemistry computations may be accomplished using GAMESS, ranging from basic Hartree–Fock and density functional theory calculations to high-accuracy multi-reference and coupled-cluster methods.

The central task of quantum chemistry is to find an (approximate) solution of the Schrödinger equation for a given molecular system. An approximate

Experimental results

The experiments were performed on four nodes of the computing platform Dynamo,4 which comprises 35 Infiniband DDR-connected compute nodes, each with 16 GB of main memory and two quad-core Intel Xeon E5450 processors (one per socket) supporting four P-states ranging from 2 to 3 GHz in steps of 0.33 GHz and eight throttling levels from T0 to T7. For measuring the node power and energy consumption, a

Related work

There are two general approaches for obtaining energy savings during parallel application execution. The first approach focuses on identifying stalls during the execution by measuring architectural parameters from performance counters, as proposed in [10], [14], [15]. Rountree et al. [27], apart from using performance counters, perform a critical-path analysis to determine which tasks may be slowed down to minimize the performance loss in the parallel execution. However, for applications

Conclusions and future work

In this paper, a runtime procedure has been considered to detect communication phases in blocking point-to-point communications. Then, different strategies have been proposed to select a suitable frequency for the communication phases as well as for the time gaps between the communication calls. For the maximum energy savings, the time gaps are recorded and classified into intra-phase and inter-phase, and the strategies differ as to their treatment of these time gaps in point-to-point communications.
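A small sketch of the gap-handling decision described above is given below. The overhead constant and margins are illustrative assumptions rather than the paper's measured values: a gap is worth scaling down only if it is long enough to amortize the two frequency switches it requires, and intra-phase gaps can be treated more aggressively than inter-phase ones because the next communication call is predictable.

```c
/* Hypothetical decision rule: scale down for a time gap only when the gap
 * dwarfs the DVFS switching overhead; intra-phase gaps use a smaller safety
 * margin because the following call is known to recur. Values are assumed. */
typedef enum { GAP_INTRA_PHASE, GAP_INTER_PHASE } GapKind;

int should_scale_down(double gap_sec, GapKind kind)
{
    const double switch_overhead_sec = 50e-6;   /* assumed cost of two switches */
    const double margin = (kind == GAP_INTRA_PHASE) ? 4.0 : 20.0;
    return gap_sec > margin * switch_overhead_sec;
}
```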

Acknowledgments

The authors would like to thank Dr. Rong Ge for her valuable feedback and for providing the CPU MISER software, and the anonymous referees for their many helpful comments and suggestions to improve the paper.
References (38)

  • G.D. Fletcher et al., The distributed data interface in GAMESS, Computer Physics Communications (2000)
  • S. Sok et al., A dash of protons: a theoretical study on the hydrolysis mechanism of 1-substituted silatranes and their protonated analogs, Computational and Theoretical Chemistry (2012)
  • M. Annavaram et al., Mitigating Amdahl’s Law through EPI throttling
  • D.H. Bailey, E. Barszcz, J.T. Barton, D.S. Browning, R.L. Carter, L. Dagum, R.A. Fatoohi, P.O. Frederickson, T.A. ...
  • K. Bergman, S. Borkar, D. Campbell, W. Carlson, W. Dally, M. Denneau, P. Franzon, W. Harrod, J. Hiller, S. Karp, S. ...
  • M. Curtis-Maury et al., Prediction models for multi-dimensional power-performance optimization on many cores
  • H. David et al., RAPL: memory power estimation and capping
  • D.G. Fedorov et al., A new hierarchical parallelization scheme: generalized distributed data interface (GDDI), and an application to the fragment molecular orbital method (FMO), Journal of Computational Chemistry (2004)
  • V.W. Freeh et al., Using multiple energy gears in MPI programs on a power-scalable cluster
  • V.W. Freeh, D.K. Lowenthal, Using multiple energy gears in MPI programs on a power-scalable cluster, in: Proceedings of ...
  • R. Ge, X. Feng, W. Feng, K.W. Cameron, CPU MISER: a performance-directed, run-time system for power-aware clusters, in: ...
  • R. Ge et al., PowerPack: energy profiling and analysis of high-performance systems and applications, IEEE Transactions on Parallel and Distributed Systems (2010)
  • M.S. Gordon et al., Advances in electronic structure theory: GAMESS a decade later
  • D. Gusfield, Algorithms on Strings, Trees, and Sequences: Computer Science and Computational Biology (1997)
  • C.H. Hsu, W. Feng, A power-aware run-time system for high-performance computing, in: Proceedings of the ACM/IEEE SC ...
  • S. Huang, W. Feng, Energy-efficient cluster computing via accurate workload characterization, in: 9th IEEE/ACM ...
  • N. Ioannou, M. Kauschke, M. Gries, M. Cintra, Phase-based application-driven hierarchical power management on the ...
  • K. Kandalla, E.P. Mancini, S. Sur, D.K. Panda, Designing power-aware collective communication algorithms for infiniband ...
  • W. Kim, M.S. Gupta, G. Wei, D. Brooks, System level analysis of fast, per-core DVFS using on-chip switching regulators, ...
Vaibhav Sundriyal received his B.Tech in Electronics and Communication from the Vellore Institute of Technology in India. He joined Iowa State University in Fall 2009 and, since 2010, has been working towards his Ph.D. in the Scalable Computing Laboratory at the U.S. Department of Energy Ames Laboratory. His research interests include high-performance computing and the design of energy- and power-aware runtime systems.

Masha Sosonkina received her B.S. and M.S. degrees in Applied Mathematics from Kiev National University in Ukraine. She graduated from Virginia Tech in 1997 with a Ph.D. degree in Computer Science and Applications. Since 2003, Dr. Sosonkina has been a scientist at the US Department of Energy Ames Laboratory and an adjunct faculty member at Iowa State University.

Her research interests include high-performance computing, large-scale simulations, parallel numerical algorithms, performance analysis, and adaptive algorithms.

Alexander Gaenko received his B.S. and M.S. degrees in Chemistry (with specialization in Quantum Chemistry and Chemistry Software) from St. Petersburg State University in Russia. He received his Ph.D. degree in Chemistry from St. Petersburg State Technological Institute (Russia) in 2008. Since 2010, Dr. Gaenko has been an assistant scientist at the US Department of Energy Ames Laboratory. His research interests include high-performance computing, computational science on large-scale architectures, component-oriented software development, performance analysis, and partitioned global address space approaches.

Zhao Zhang received the B.S. and M.S. degrees in computer science from Huazhong University of Science and Technology, China, in 1991 and 1994, respectively, and the Ph.D. degree in computer science from the College of William and Mary in 2002. He is an associate professor of computer engineering at Iowa State University. His research interests include computer architecture, parallel and distributed systems, and architectural support for system security. He is a member of the IEEE and a senior member of the ACM.

This work was supported in part by Ames Laboratory and Iowa State University under the contract DE-AC02-07CH11358 with the U.S. Department of Energy, by the Air Force Office of Scientific Research under the AFOSR award FA9550-12-1-0476, and by the National Science Foundation grants NSF/OCI-0941434, 0904782, and 1047772.