Optimizing the configuration of a heterogeneous cluster with multiprocessing and execution-time estimation

doi:10.1016/j.parco.2005.04.004

Parallel Computing

Volume 31, Issue 7, July 2005, Pages 691-710

Abstract

Although heterogeneous clusters are flexible and cost-effective, they entail intrinsic difficulties in optimization. Whereas it is simple to invoke multiple processes on fast processing elements (PEs) to alleviate load imbalance, the optimal process allocation is not obvious. Communication time is another problem. Though it is sometimes better to exclude slow PEs to avoid performance degradation, it is generally difficult to find the optimal PE configuration. In this study, the execution time is first modeled from the measurement results of various configurations. The derived models are then used to estimate the optimal PE configuration and process allocation. We implemented various models for HPL (High Performance Linpack benchmark) on a heterogeneous cluster, and estimated the optimal configurations for various problem sizes. In the case of a heterogeneous cluster of Athlon and Pentium-II, the execution time of the estimated optimal configuration was 0–7.4% longer than that of the actual optimal configuration. In a heterogeneous cluster of three kinds of processors that includes dual-processors, the excess time was 13.6–31.5%.

Introduction

It is reasonable to enhance the performance of an existing PC cluster by adding the latest high-performance processors. The resulting cluster becomes heterogeneous, consisting of a wide range of processing elements (PEs) from fast to slow. However, heterogeneous clusters inherently entail difficulties in optimization and suffer from load imbalance.

Although it is simple to invoke multiple processes on fast PEs to alleviate load imbalance, this approach (multiprocessing) has some drawbacks. The first is the overhead of executing multiple processes on the same processor. The second is that the ratio of PE performance is not always an integer, whereas the number of processes invariably is. Thus, the best process allocation among PEs is far from obvious.
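The integer-ratio problem can be illustrated with a minimal sketch. It assumes that work is split equally among processes (as in conventional parallel applications) and it ignores multiprocessing overhead and communication. The function name `relative_time` and the speed values (a fast PE 2.5 times faster than a slow PE) are hypothetical and do not come from the measurements in this study.

```python
from fractions import Fraction

def relative_time(speeds, procs):
    """Relative execution time when work is split equally among processes.

    speeds[i] is the (assumed) relative speed of PE i and procs[i] the number
    of processes placed on it. Each process gets 1/total of the work, so PE i
    needs procs[i] / (total * speeds[i]); the slowest PE sets the finish time.
    """
    total = sum(procs)
    return max(Fraction(p, total) / s for p, s in zip(procs, speeds) if p > 0)

# Hypothetical speeds: one fast PE (2.5x) and one slow PE (1x).
speeds = [Fraction(5, 2), Fraction(1)]
ideal = 1 / sum(speeds)  # perfectly balanced, fractional split of the work

for procs in [(1, 1), (2, 1), (3, 1), (5, 2)]:
    t = relative_time(speeds, procs)
    print(procs, float(t), f"{float(t / ideal - 1):.1%} above ideal")
```

Under these assumptions, only the 5:2 allocation reaches the ideal time, while 3:1 is 5% above it. Seven processes would incur more overhead than four, which this sketch deliberately leaves out; that trade-off is why the best allocation is not obvious.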

Communication time is also very important. It is not always preferable to use all available PEs, since superfluous communications can prolong the total execution time. In particular, a slow PE can create a performance bottleneck in computation and communication. The total performance can be improved by excluding slow PEs, and instead using the best subset of PEs. However, it is generally difficult to find the best subset of available PEs, i.e., the best PE configuration for a heterogeneous cluster.
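The effect of excluding a slow PE can be sketched with a toy model. The model is an assumption made for illustration only, not the execution-time model of this study: work is split equally with one process per PE, computation is bounded by the slowest PE, and communication cost grows linearly with the number of PEs. The name `toy_time` and all parameter values are hypothetical.

```python
def toy_time(speeds, work=100.0, comm_per_pe=2.0):
    """Toy execution time for an equal split of `work` over the given PEs.

    Assumed model (not from the paper): computation is bounded by the slowest
    PE, and communication grows linearly with the number of PEs.
    """
    n = len(speeds)
    compute = (work / n) / min(speeds)  # every PE waits for the slowest one
    comm = comm_per_pe * n
    return compute + comm

speeds = [4.0, 4.0, 4.0, 1.0]     # hypothetical: three fast PEs, one slow PE
all_pes = toy_time(speeds)        # use every available PE
fast_only = toy_time(speeds[:3])  # exclude the slow PE
print(all_pes, fast_only)
```

In this toy setting the three fast PEs alone finish sooner than all four PEs together, because the slow PE dominates the computation term and adds communication as well.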

Many applications for parallel computers or homogeneous clusters are written to distribute workloads equally among PEs. Although it is desirable to rewrite such applications for heterogeneous clusters, adapting them to a heterogeneous environment requires much time and effort. Moreover, the effort must be repeated for each application.

The purpose of this study is to execute conventional parallel applications efficiently on heterogeneous clusters without rewriting them. Our study adopts a multiprocessing approach, providing an effective way to estimate the best PE configuration and process allocation based on an execution-time model of the application. Our method does not aim to extract the maximum performance from a heterogeneous cluster, but rather to offer an easy and simple way to accelerate a wide range of conventional parallel applications in heterogeneous clusters. Although we examine HPL (High Performance Linpack benchmark) [1] as a sample application in this study, our approach is not limited to HPL alone but is expected to be widely applicable to many other applications.

Section 2 introduces some related studies, and then briefly summarizes the background of this study. In Section 3, the execution time is modeled from the measurement results of various configurations. The derived models are used to estimate the optimal PE configuration and process allocation. The evaluation results are found in Section 4. Section 5 concludes the study.

Background and related works

Though a block cyclic distribution is very popular in balancing the load of matrix-matrix multiplication and LU decomposition, such a distribution in its original form is not suited to a heterogeneous environment. Therefore, many researchers have studied alternative load-balancing schemes. For example, Kalinov and Lastovetsky [2] presented a “heterogeneous block cyclic distribution” for the Cholesky factorization of square dense matrices. Beaumont et al. [3] reported a “2D heterogeneous grid

Assumptions

To optimize the multiprocessing approach for heterogeneous clusters, it is necessary (1) to select the optimal subset of PEs and (2) to determine the optimal number of processes on each PE. This task is modeled as a combinatorial optimization problem to minimize the total execution time, where one must construct an objective function that estimates the total execution time from the given PE set and the given number of processes.
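The combinatorial problem described above can be sketched as an exhaustive search. In this sketch, `estimate_time` is a placeholder objective with an assumed formula (equal work per process, a linear multiprocessing overhead, and a communication term proportional to the number of processes); it stands in for an execution-time model derived from measurements and is not the model of this study. The names `best_configuration` and `max_procs` and all numeric values are hypothetical. A process count of zero means that the PE is excluded from the configuration.

```python
from itertools import product

def estimate_time(procs, speeds, work=1000.0, overhead=0.05, comm=1.5):
    """Placeholder objective: estimated time for a process allocation.

    procs[i] processes run on PE i (0 means the PE is excluded). The formula
    is an assumed stand-in for a measured execution-time model.
    """
    total = sum(procs)
    if total == 0:
        return float("inf")  # no PE selected: not a valid configuration
    share = work / total  # equal work per process
    compute = max(
        p * share / s * (1.0 + overhead * (p - 1))  # multiprocessing overhead
        for p, s in zip(procs, speeds) if p > 0
    )
    return compute + comm * total  # communication grows with process count

def best_configuration(speeds, max_procs=4):
    """Exhaustively search all allocations with 0..max_procs processes per PE."""
    candidates = product(range(max_procs + 1), repeat=len(speeds))
    return min(candidates, key=lambda procs: estimate_time(procs, speeds))

speeds = [3.0, 3.0, 1.0]  # hypothetical relative PE speeds
best = best_configuration(speeds)
print(best, estimate_time(best, speeds))
```

The search visits (max_procs + 1)^n candidates for n PEs, so a sketch of this kind is practical only for small clusters or when identical PEs are grouped to shrink the search space.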

In this section, we construct the estimation model based on some

Evaluation

In this section, the estimation models are built and evaluated for a heterogeneous cluster. The specifications of the evaluation platform are listed in Table 1. As each Pentium-II node includes two processors, a total of eight Pentium-II processors are available in four nodes (Node 4–Node 7). Likewise, four Pentium-III processors are available in Node 2 and Node 3. All nodes have both 1000base-SX and 100base-TX interfaces, but only the 100base-TX is used in the following measurements to clearly

Conclusion

The results of this study are still preliminary, and many improvements are anticipated. Moreover, more extensive studies are required for various cluster configurations. Our aim, however, remains (1) to make the estimation model more elegant and unified, (2) to reduce the model construction time, and (3) to reduce the errors in estimation.

One of the major concerns of this approach might be the scalability. The estimated best configuration was not very satisfactory in Section 4.2, because of the

Acknowledgements

This work was partially supported by a Grant-in-Aid for Scientific Research from the Japan Society for the Promotion of Science (JSPS), as well as a grant from the Hori Information Science Promotion Foundation. Further support was also provided by the 21st Century COE Program “Intelligent Human Sensing” from the Ministry of Education, Culture, Sports, Science and Technology of Japan.

References (8)

A. Petitet, R.C. Whaley, J. Dongarra, A. Cleary, HPL—a portable implementation of the high-performance Linpack...
A. Kalinov et al., Heterogeneous distribution of computations while solving linear algebra problems on networks of heterogeneous computers
O. Beaumont et al., A proposal for a heterogeneous cluster ScaLAPACK (dense linear solvers), IEEE Transactions on Computers (2001)
L.S. Blackford et al., ScaLAPACK Users’ Guide (1997)

There are more references available in the full text version of this article.

^☆: A preliminary version of this work was presented at the 13th Heterogeneous Computing Workshop (HCW 2004), Santa Fe, New Mexico, USA, April 2004.

¹: The author is presently affiliated with the Asahi Kasei Information Systems Co., Ltd.
