Modeling and simulation of hierarchical task allocation system for energy-aware HPC clouds
Introduction
High-Performance Computing (HPC) and cloud systems have become part of everyday life, playing a vital role in providing many widely adopted services, from social media and entertainment to e-commerce, and serving as one of the main tools for large-scale scientific experiments. Apart from their apparent benefits, they present several challenges to engineers and society, one of the most significant being excessive energy consumption. Although the energy consumed by HPC systems is rising more slowly now than ten years ago [1], it still amounts to approximately 1.5% of total electrical energy usage. Such immense power demand raises operational costs (OPEX) and is accompanied by environmentally harmful CO2 emissions [2], [3]. While the adoption of green power sources may help protect the environment (see, e.g., Netflix's efforts to eliminate traditional energy sources [4]), improving power efficiency remains the primary means of reducing OPEX. The power efficiency of HPC systems is typically measured in GFlops per watt; the state-of-the-art values found in the Green500 list [5] reach as high as 16.9 GFlops/W for the Japanese A64FX prototype. However, it must be noted that the most powerful clusters provide hundreds of PFlops and consume correspondingly large amounts of energy [6]. Therefore, limiting energy consumption and the related thermal emission has become a key part of European energy policies [7], [8]. The efficiency of service provisioning by distributed computing clouds and data centers is significantly influenced by data processing speed and network bandwidth. As the predicted traffic grows exponentially, high-level energy-aware optimization of network usage becomes an important tool for fully exploiting network power-saving features.
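To give a sense of the scale involved, the 16.9 GFlops/W efficiency figure quoted above can be turned into an absolute power draw. The sketch below is a back-of-the-envelope illustration; the 500 PFlops machine size is an assumed, illustrative value, not a figure from the paper:

```python
def power_draw_mw(perf_pflops: float, efficiency_gflops_per_w: float) -> float:
    """Return electrical power in megawatts for a given sustained performance."""
    perf_gflops = perf_pflops * 1e6           # 1 PFlops = 1e6 GFlops
    watts = perf_gflops / efficiency_gflops_per_w
    return watts / 1e6                        # W -> MW

# A hypothetical 500 PFlops system at the Green500-leading 16.9 GFlops/W:
print(round(power_draw_mw(500, 16.9), 1))    # -> 29.6 (MW)
```

Even at the best reported efficiency, a machine of that class would draw tens of megawatts, which motivates system-level, not only device-level, power management.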
On the other hand, the utilization of computing servers in data centers rarely approaches 100% [9]. Typical usage profiles reach no higher than 10%–50% of server capacity for most of the operation time, due to the over-subscription strategy applied to computing resources in order to guarantee the expected quality of service and allow services to adapt to sudden changes in the workload. The resulting increased electricity consumption requires advanced resource allocation methodologies to reduce the gap between the capacity provided by data centers and the requirements of users, especially during low-workload periods. As low-level technologies and mechanisms (e.g., the ACPI-enabled [10] power-saving functions of storage devices and CPUs) are widely adopted, the main challenge today is to arrange and adapt all of them into high-level, energy-efficient, and flexible power control systems encompassing all elements of HPC systems. The rapid growth of energy demand by computing and networking infrastructures must be mitigated at both the software and hardware levels.
This paper proposes a coordination framework for power-saving task allocation in a distributed cloud system consisting of several computing clusters interconnected with data centers via a backbone network. The analyzed system elements may be owned by various entities; they all incorporate power-saving mechanisms, possibly controlled locally. The proposed solution allows for energy-efficient management of the available power budget while assuring the required quality of service. In the next section, the motivation for this research is presented briefly; Section 3 provides a short survey of approaches to energy-aware computation, and Section 4 discusses related work. The optimization problem of power-saving task allocation is described in Section 5, first in centralized form (Section 6), and then as a coordination scheme together with the definitions of its sub-problems (Section 7). Section 8 covers the numerical simulation verification and performance evaluation of the proposed approach. Finally, Section 9 concludes and proposes some areas for improving the discussed solution.
Section snippets
Motivation
Based on previous research related to power saving in backbone networks [11], [12], [13] and HPC systems [14], [15], [16], [17], as well as CPU-level energy consumption control [18], [19], [20], it may be envisaged that low-level power-saving control techniques, although very effective, cannot limit the energy consumed by ICT equipment adequately to the needs resulting from dynamic usage growth. The way to circumvent these problems may be to incorporate the available mechanisms and systems into a
Local control mechanism
Modern devices use two basic mechanisms [22]: (1) smart standby, where a device can sleep for some time, and (2) dynamic voltage and frequency scaling (DVFS), where the supply voltage and the clock frequency are adjusted. The former method disables some of the device's components for a period when no workload is present, retaining its ability to start processing on workload arrival. The latter allows lowering the device's service rate (i.e., the processor frequency or network interface bandwidth) when the
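The benefit of DVFS can be sketched with the standard CMOS dynamic-power model P ≈ C·V²·f. The capacitance value and the voltage/frequency operating points below are purely illustrative assumptions, not figures from the paper:

```python
def dynamic_power(c_eff: float, voltage: float, freq_hz: float) -> float:
    """Classic CMOS dynamic-power model: P = C_eff * V^2 * f."""
    return c_eff * voltage**2 * freq_hz

# Illustrative scenario: scaling from 3.0 GHz at 1.2 V down to 1.5 GHz at 0.9 V.
C_EFF = 1e-9  # effective switched capacitance in farads (assumed)
p_high = dynamic_power(C_EFF, 1.2, 3.0e9)   # 4.32 W
p_low = dynamic_power(C_EFF, 0.9, 1.5e9)    # 1.215 W
print(f"high: {p_high:.2f} W, low: {p_low:.3f} W")
```

Note that halving the frequency alone would halve dynamic power, but because a lower frequency also permits a lower supply voltage, the combined saving is super-linear, which is what makes DVFS effective.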
Related work on power-saving resource allocation
While the power-saving mechanisms described in Section 3 allow limiting the energy consumption of a single server, further reduction may be attained by integrating them with HPC system management software. The main advantage of applying power-saving control to a cluster is the possibility of coordinating task execution in a manner that maximizes the effectiveness of the equipment used, by adjusting its energy states and task allocation. The typical result is a consolidation of tasks by moving them to a set
Description of the problem
The subject of this work is energy-saving control of a distributed HPC system consisting of a number of computing clusters connected to data centers that provide the information needed for task processing. Such spatial decomposition is used when data must be stored in distinct locations, either because their excessive size requires initial preprocessing close to the source, or for reliability considerations resulting in various replication schemes. Typical examples may be large-scale scientific experiments
Centralized cluster workload allocation
Before developing a decomposition scheme, the overall centralized energy-aware HPC system control problem will be defined to describe all of its components and control targets. Various approaches to the definition and solution of energy-aware control of HPC systems were discussed in Section 4. The one presented here involves control of network equipment and several computing clusters. The resulting control framework provides the top level of coordination and control and cooperates with cluster
Two level cluster workload allocation
The solution proposed in the previous section models all aspects of the controlled HPC system in a single set of formulas (1)–(14). The allocations found this way minimize the power demanded by the whole system; however, their implementation requires deep and direct interaction with the management systems of all clusters. When different institutions run the clusters, such a scheme may be difficult to accept. The reasons for this may be the incompatibility of management systems but also the
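A two-level scheme of this kind typically splits the monolithic problem into a top-level coordinator and independent per-cluster sub-problems. The sketch below illustrates one generic way to do this, a price-based decomposition where the coordinator adjusts a single "power price" until the clusters' autonomous responses meet the global budget; the paper's actual coordination scheme and sub-problem definitions may differ, and the utility function used here is an assumption for illustration only:

```python
def cluster_response(a: float, price: float) -> float:
    """Local sub-problem solved by one cluster: maximize a*sqrt(p) - price*p
    over power use p >= 0. Closed-form optimum: p* = (a / (2*price))**2."""
    return (a / (2.0 * price)) ** 2

def coordinate(a_values, power_budget, iters=100):
    """Top-level coordinator: bisect on the power price until the sum of the
    clusters' independent responses meets the global power budget."""
    lo, hi = 1e-6, 1e6
    for _ in range(iters):
        price = (lo + hi) / 2
        total = sum(cluster_response(a, price) for a in a_values)
        if total > power_budget:
            lo = price   # aggregate demand too high -> raise the price
        else:
            hi = price   # budget not fully used -> lower the price
    return price, total

price, total = coordinate([2.0, 3.0, 4.0], power_budget=10.0)
print(round(total, 3))   # converges to the 10.0 MW budget
```

The appeal of such a decomposition is that the coordinator never needs access to a cluster's internal management system, only to its aggregate response, which matches the multi-institution setting described above.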
Simulation experiments
Several groups of simulation experiments were carried out to evaluate the proposed solution. Initially, the basic properties of the algorithms were analyzed, and exhaustive tests were performed to assess algorithm sensitivity to parameter values. Then the calculation time and power effectiveness were evaluated. The simulations were run for several scenarios involving varying system resources, workloads, and network topologies. Two variants of backbone networks were considered to check the influence
Conclusions and future work
The problem of energy-saving allocation of tasks to a distributed computing system is relatively complex. A mathematical programming model of the problem has been developed, and its solution is discussed in Section 6. The analysis provided in Sections 6 and 7 demonstrates that solving this task by the global controller with an appropriate method (e.g., a specialized MIP solver) is able to deliver an optimal
References (68)
- et al., Dynamic power management in energy-aware computer networks and data intensive systems, Future Gener. Comput. Syst. (2014)
- et al., Design and implementation of energy-aware application-specific CPU frequency governors for the heterogeneous distributed computing systems, Future Gener. Comput. Syst. (2018)
- et al., Dynamic energy-aware scheduling for parallel task-based application in cloud computing, Future Gener. Comput. Syst. (2018)
- et al., Semi-online task assignment policies for workload consolidation in cloud computing systems, Future Gener. Comput. Syst. (2018)
- et al., Task allocation, migration and scheduling for energy-efficient real-time multiprocessor architectures, J. Syst. Archit. (2019)
- et al., Recalibrating global data center energy-use estimates, Science (2020)
- Hypotheses for primary energy use, electricity use and CO2 emissions of global computing and its shares of the total between 2020 and 2030, WSEAS Trans. Power Syst. (2020)
- et al., Trends in data centre energy consumption under the European code of conduct for data centre energy efficiency, Energies (2017)
- Netflix 2019 sustainability accounting standards board (SASB) report (2020)
- The green500 (2019)
- The top500
- Energy efficiency
- Code of conduct for energy efficiency in data centres
- The case for energy-proportional computing, Computer
- Advanced Configuration and Power Interface Specification, Revision 5.0
- Energy-aware multilevel control system for a network of Linux software routers: Design and implementation, IEEE Syst. J.
- Shortest path green routing and the importance of traffic matrix knowledge
- Energy and power efficiency in cloud
- Resource management system for HPC computing
- Energy aware data centers and networks: a survey, J. Telecommun. Inf. Technol.
- Server power consumption: Measurements and modeling with MSRs
- Energy-efficient CPU frequency control for the Linux system, Concurr. Comput.: Pract. Exper.
- User's manual for CPLEX
- Large-scale Distributed Systems and Energy Efficiency: A Holistic View
- Theoretical and technological limitations of power scaling in network devices
- 64 and IA-32 architectures software developer's manual
- Intelligent platform management interface specification, second generation
- The green abstraction layer: A standard power-management interface for next-generation network devices, IEEE Internet Comput.
- Large-scale validation and benchmarking of a network of power-conservative systems using ETSI's Green Abstraction Layer, Trans. Emerg. Telecommun. Technol.
- ETSI ES 203 237 V1.1.1 (2014-03) standard
Cited by (5)
- Dependency Prediction of Long-Time Resource Uses in HPC Environment, 2023, IEEE Access
- Reliable task allocation for soil moisture wireless sensor networks using differential evolution adaptive elite butterfly optimization algorithm, 2023, Mathematical Biosciences and Engineering
- Learning causal theories with non-reversible MCMC methods, 2021, Control and Cybernetics
- Application of the evolutionary algorithms for task allocation in uncertain environments with stochastic tuning, 2021, AIIPCC 2021 - 2nd International Conference on Artificial Intelligence, Information Processing and Cloud Computing