Parallel Computing

Volume 31, Issue 7, July 2005, Pages 691-710
Optimizing the configuration of a heterogeneous cluster with multiprocessing and execution-time estimation

https://doi.org/10.1016/j.parco.2005.04.004

Abstract

Although heterogeneous clusters are flexible and cost-effective, they entail intrinsic difficulties in optimization. Whereas it is simple to invoke multiple processes on fast processing elements (PEs) to alleviate load imbalance, the optimal process allocation is not obvious. Communication time is another problem. Though it is sometimes better to exclude slow PEs to avoid performance degradation, it is generally difficult to find the optimal PE configuration. In this study, the execution time is first modeled from the measurement results of various configurations. The derived models are then used to estimate the optimal PE configuration and process allocation. We implemented various models for HPL (High Performance Linpack benchmark) on a heterogeneous cluster, and estimated the optimal configurations for various problem sizes. In the case of a heterogeneous cluster of Athlon and Pentium-II, the execution time of the estimated optimal configuration was 0–7.4% longer than that of the actual optimal configuration. In a heterogeneous cluster of three kinds of processors that includes dual-processor nodes, the excess time was 13.6–31.5%.

Introduction

It is reasonable to enhance the performance of an existing PC cluster by adding the latest high-performance processors. The resulting cluster becomes heterogeneous, consisting of a wide range of processing elements (PEs) from fast to slow. However, heterogeneous clusters inherently entail difficulties in optimization and suffer from load imbalance.

Although it is simple to invoke multiple processes on fast PEs to alleviate load imbalance, this approach (multiprocessing) has some drawbacks. The first is the overhead of executing multiple processes on the same processor. The second is that the performance ratio between PEs is not always an integer, whereas the number of processes always is. Thus, the best process allocation among PEs is far from obvious.
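The integer constraint can be illustrated with a minimal sketch. It assumes that the application splits work equally among processes, that processes on one PE share it evenly, and that each extra process on a PE adds a fixed fractional overhead. The function names, the relative speeds, and the overhead parameter are hypothetical and are not taken from the paper.

```python
from itertools import product


def makespan(speeds, procs, overhead=0.0):
    """Relative finish time when work is split equally among all processes.

    speeds[i]: relative speed of PE i (hypothetical units)
    procs[i]:  number of processes placed on PE i (at least 1)
    overhead:  assumed fractional slowdown per extra process on a PE
    """
    total = sum(procs)
    worst = 0.0
    for s, m in zip(speeds, procs):
        # PE i holds m/total of the work and runs it at speed s,
        # slowed by the assumed multiprocessing overhead.
        t = (m / total) / s * (1.0 + overhead * (m - 1))
        worst = max(worst, t)
    return worst


def best_allocation(speeds, max_procs=4, overhead=0.0):
    """Try every integer allocation and return (time, allocation)."""
    best = None
    for procs in product(range(1, max_procs + 1), repeat=len(speeds)):
        t = makespan(speeds, procs, overhead)
        if best is None or t < best[0]:
            best = (t, procs)
    return best


if __name__ == "__main__":
    # One PE assumed 2.5 times faster than the other: no integer
    # allocation matches the ratio exactly.
    print(best_allocation([2.5, 1.0], max_procs=4, overhead=0.0))
    print(best_allocation([2.5, 1.0], max_procs=4, overhead=0.1))
```

With these assumed numbers, the allocation chosen without overhead is three processes on the fast PE and one on the slow PE, while an assumed 10% overhead per extra process shifts the choice to two and one. This shows why the allocation cannot simply be read off the speed ratio.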

Communication time is also very important. It is not always preferable to use all available PEs, since superfluous communications can prolong the total execution time. In particular, a slow PE can create a performance bottleneck in computation and communication. The total performance can be improved by excluding slow PEs, and instead using the best subset of PEs. However, it is generally difficult to find the best subset of available PEs, i.e., the best PE configuration for a heterogeneous cluster.

Many applications for parallel computers or homogeneous clusters are written to distribute workloads equally among PEs. Although it is desirable to rewrite such applications for heterogeneous clusters, adapting them requires much time and effort. Moreover, the effort must be repeated for each application.

The purpose of this study is to execute conventional parallel applications efficiently on heterogeneous clusters without rewriting them. Our study adopts a multiprocessing approach, providing an effective way to estimate the best PE configuration and process allocation based on an execution-time model of the application. Our method does not aim to extract the maximum performance from a heterogeneous cluster, but rather to offer an easy and simple way to accelerate a wide range of conventional parallel applications in heterogeneous clusters. Although we examine HPL (High Performance Linpack benchmark) [1] as a sample application in this study, our approach is not limited to HPL alone but is expected to be widely applicable to many other applications.

Section 2 introduces some related studies, and then briefly summarizes the background of this study. In Section 3, the execution time is modeled from the measurement results of various configurations. The derived models are used to estimate the optimal PE configuration and process allocation. The evaluation results are found in Section 4. Section 5 concludes the study.
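As a rough illustration of modeling execution time from measurements, the sketch below fits a two-term model T(N) = a·N^3 + b·N^2 by least squares. The model form is an assumption chosen because LU factorization performs on the order of N^3 arithmetic operations; it is not the model constructed in the paper, and the function names and data are hypothetical.

```python
def fit_time_model(sizes, times):
    """Least-squares fit of T(N) = a*N**3 + b*N**2 (assumed model form).

    sizes: problem sizes N, ideally scaled (e.g. in thousands) for
           better numerical conditioning
    times: measured execution times for those sizes
    """
    # Normal equations for the two unknowns a and b:
    #   [s66 s55] [a]   [r3]
    #   [s55 s44] [b] = [r2]
    s66 = sum(n ** 6 for n in sizes)
    s55 = sum(n ** 5 for n in sizes)
    s44 = sum(n ** 4 for n in sizes)
    r3 = sum(t * n ** 3 for n, t in zip(sizes, times))
    r2 = sum(t * n ** 2 for n, t in zip(sizes, times))
    det = s66 * s44 - s55 * s55
    a = (r3 * s44 - s55 * r2) / det
    b = (s66 * r2 - s55 * r3) / det
    return a, b


def predict_time(a, b, n):
    """Evaluate the fitted model at problem size n."""
    return a * n ** 3 + b * n ** 2


if __name__ == "__main__":
    # Synthetic data, not measurements from the paper.
    sizes = [1.0, 2.0, 3.0, 4.0]
    times = [2.0 * n ** 3 + 0.5 * n ** 2 for n in sizes]
    a, b = fit_time_model(sizes, times)
    print(a, b, predict_time(a, b, 5.0))
```

In practice one such fit would be needed per PE configuration, or the coefficients would themselves have to depend on the configuration; how the paper handles this is not shown in the excerpt above.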

Section snippets

Background and related works

Though a block cyclic distribution is very popular in balancing the load of matrix-matrix multiplication and LU decomposition, such a distribution in its original form is not suited to a heterogeneous environment. Therefore, many researchers have studied alternative load-balancing schemes. For example, Kalinov and Lastovetsky [2] presented a “heterogeneous block cyclic distribution” for the Cholesky factorization of square dense matrices. Beaumont et al. [3] reported a “2D heterogeneous grid

Assumptions

To optimize the multiprocessing approach for heterogeneous clusters, it is necessary (1) to select the optimal subset of PEs and (2) to determine the optimal number of processes on each PE. This task is modeled as a combinatorial optimization problem to minimize the total execution time, where one must construct an objective function that estimates the total execution time from the given PE set and the given number of processes.
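The combinatorial formulation can be sketched as an exhaustive search in which a process count of zero means that a PE is excluded. The objective used here, `toy_estimate`, is a hypothetical stand-in (equal work per process plus an assumed fixed communication cost per additional process); it is not the estimation model of the paper, and all names and numbers are illustrative.

```python
from itertools import product


def toy_estimate(speeds, procs, comm_cost=0.02):
    """Hypothetical objective function, NOT the paper's model.

    Compute time is set by the slowest participating PE; each process
    beyond the first adds an assumed fixed communication cost.
    """
    total = sum(procs)
    compute = max((m / total) / s for s, m in zip(speeds, procs) if m > 0)
    return compute + comm_cost * (total - 1)


def search_best_configuration(speeds, estimate, max_procs=3):
    """Enumerate PE subsets and process counts; minimize estimate()."""
    best_time, best_procs = float("inf"), None
    for procs in product(range(max_procs + 1), repeat=len(speeds)):
        if sum(procs) == 0:
            continue  # at least one process must run somewhere
        t = estimate(speeds, procs)
        if t < best_time:
            best_time, best_procs = t, procs
    return best_time, best_procs


if __name__ == "__main__":
    # Two fast PEs and one slow PE (assumed relative speeds).
    print(search_best_configuration([2.0, 2.0, 0.5], toy_estimate))
```

Under these assumed values the search selects one process on each fast PE and leaves the slow PE unused. The search space grows as (max_procs + 1) raised to the number of PEs, so plain enumeration is only practical for small clusters.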

In this section, we construct the estimation model based on some

Evaluation

In this section, the estimation models are built and evaluated for a heterogeneous cluster. The specifications of the evaluation platform are listed in Table 1. As each Pentium-II node includes two processors, a total of eight Pentium-II processors are available in four nodes (Node 4–Node 7). Likewise, four Pentium-III processors are available in Node 2 and Node 3. All nodes have both 1000BASE-SX and 100BASE-TX interfaces, but only the 100BASE-TX is used in the following measurements to clearly

Conclusion

The results of this study are still preliminary, and many improvements are anticipated. Moreover, more extensive studies are required for various cluster configurations. Our aim, however, remains (1) to make the estimation model more elegant and unified, (2) to reduce the model construction time, and (3) to reduce the errors in estimation.

One of the major concerns of this approach might be the scalability. The estimated best configuration was not very satisfactory in Section 4.2, because of the

Acknowledgements

This work was partially supported by a Grant-in-Aid for Scientific Research from the Japan Society for the Promotion of Science (JSPS), as well as a grant from the Hori Information Science Promotion Foundation. Further support was also provided by the 21st Century COE Program “Intelligent Human Sensing” from the Ministry of Education, Culture, Sports, Science and Technology of Japan.

References (8)

  • A. Petitet, R.C. Whaley, J. Dongarra, A. Cleary, HPL—a portable implementation of the high-performance Linpack...
  • A. Kalinov et al., Heterogeneous distribution of computations while solving linear algebra problems on networks of heterogeneous computers
  • O. Beaumont et al., A proposal for a heterogeneous cluster ScaLAPACK (dense linear solvers), IEEE Transactions on Computers (2001)
  • L.S. Blackford et al., ScaLAPACK Users’ Guide (1997)
There are more references available in the full text version of this article.

A preliminary version of this work was presented at the 13th Heterogeneous Computing Workshop (HCW 2004), Santa Fe, New Mexico, USA, April 2004.

1 The author is presently affiliated with Asahi Kasei Information Systems Co., Ltd.