Exploiting idle cycles to execute data mining applications on clusters of PCs

https://doi.org/10.1016/j.jss.2006.05.035

Abstract

In this paper we present and evaluate Inhambu, a distributed object-oriented system that supports the execution of data mining applications on clusters of PCs and workstations. This system provides a resource management layer, built on top of Java/RMI, that supports the execution of the data mining tool Weka. We evaluate the performance of Inhambu by means of several experiments on homogeneous, heterogeneous, and non-dedicated clusters. The results are compared with those achieved by a similar system named Weka-Parallel. Inhambu outperforms its counterpart for coarse-grained applications, especially on heterogeneous and non-dedicated clusters. In addition, our system provides advantages such as application checkpointing, support for dynamic aggregation of hosts to the cluster, automatic restarting of failed tasks, and more effective usage of the cluster. Therefore, Inhambu is a promising tool for efficiently executing real-world data mining applications. The software is available at the project’s web site: http://incubadora.fapesp.br/projects/inhambu/.

Introduction

As proposed by Fayyad et al. (1996), Knowledge Discovery in Databases (KDD) is the non-trivial process of identifying valid, novel, potentially useful, and ultimately understandable patterns in data. Although the terms KDD and Data Mining (DM) are sometimes employed interchangeably, DM is often considered a step of the KDD process that centers on the automated discovery of patterns and relationships in data (Hand et al., 2001, Witten and Frank, 2000). From this perspective, our work is primarily focused on DM, which usually involves computationally expensive tasks. For this reason, several techniques have been proposed to improve the performance of DM algorithms, such as parallel processing (Freitas and Lavington, 1998), implementations based on clusters of workstations (Baraglia et al., 2000), and computational grids (Canataro and Talia, 2003). These techniques can help scale DM applications up to production workloads.

The potential benefits of applying parallel processing to DM applications are not limited to reducing execution times; qualitative improvements can also be achieved. As DM methods are often based on machine learning techniques (Witten and Frank, 2000), additional computational power can potentially be turned into accuracy gains, for instance by executing approximation-based algorithms with a greater number of steps, training algorithms on larger datasets, and running more experiments when non-deterministic algorithms are employed.

In recent years, computer clusters have become increasingly popular for high performance computing. Assembling clusters of commodity PCs and workstations is also an easy way to provide computational power at low cost, since companies and universities often have hundreds or thousands of commodity PCs. In such environments, individual computers often present low levels of utilization, offering great amounts of computational power that can be used to perform intensive computations, such as those required by DM algorithms.

While clusters composed of commodity PCs can be used to carry out the execution of large applications, some challenges arise due to their heterogeneous and non-dedicated nature. For scalability and protection of investment, companies and universities often add new computers to the existing pool without discarding the old ones. This practice usually leads to heterogeneity. Moreover, non-dedicated environments often suffer from fluctuating levels of performance, failures, and unavailability of resources. All of these factors make it difficult to support the execution of DM applications on such a computing infrastructure. From this standpoint, the development of tools that make efficient use of the existing resources for DM is important. Currently, DM practitioners and researchers usually lack tools that support the execution of DM applications on commodity clusters and that meet the following requirements:

  • Implement a good collection of algorithms. Researchers and practitioners usually execute several DM algorithms in order to assess their performance on particular applications.

  • Ease of use. DM tools should be easy to install and easy to use, and DM users should not be concerned with operating systems and networking configuration. Ideally, new tools should employ known interfaces and interaction patterns.

  • Resource management. In order to extend the benefits of high performance computing to DM users, DM tools should provide support for dynamic discovery, allocation and management of existing resources in clusters of PCs and/or workstations.

  • Heterogeneity. Commodity clusters may be composed of heterogeneous computers with different capacities and resources, which should be taken into account when scheduling DM tasks (a simple capacity-proportional split is sketched after this list).

  • Fault tolerance. The execution of computing-intensive tasks which may take several hours (or a few days) and involve several machines requires DM tools to survive failures. After a crash, DM tools should be able to resume their execution.
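
To illustrate the heterogeneity requirement, the following sketch splits a batch of independent DM tasks among hosts in proportion to their relative capacities. It is a generic heuristic written for illustration only, not Inhambu's actual scheduling policy; the host names and capacity values are hypothetical.

    import java.util.LinkedHashMap;
    import java.util.Map;

    // Hypothetical helper: assigns each host a share of independent tasks
    // proportional to its measured capacity (e.g., a benchmark score).
    public class ProportionalShare {

        public static Map<String, Integer> split(Map<String, Double> capacity, int totalTasks) {
            double sum = capacity.values().stream().mapToDouble(Double::doubleValue).sum();
            Map<String, Integer> shares = new LinkedHashMap<>();
            int assigned = 0;
            String fastest = null;
            for (Map.Entry<String, Double> host : capacity.entrySet()) {
                int share = (int) Math.floor(totalTasks * host.getValue() / sum);
                shares.put(host.getKey(), share);
                assigned += share;
                if (fastest == null || host.getValue() > capacity.get(fastest)) {
                    fastest = host.getKey();
                }
            }
            // Hand any remainder left by rounding to the fastest host.
            shares.merge(fastest, totalTasks - assigned, Integer::sum);
            return shares;
        }

        public static void main(String[] args) {
            Map<String, Double> capacity = new LinkedHashMap<>();
            capacity.put("node01", 2.0);  // e.g., a 2 GHz host
            capacity.put("node02", 1.0);  // e.g., a 1 GHz host
            System.out.println(split(capacity, 30));  // prints {node01=20, node02=10}
        }
    }

Such a static split ignores runtime load; a non-dedicated cluster additionally requires the dynamic load information and overload avoidance discussed later in the paper.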

Inhambu is a system that exploits idle resources in clusters of commodity PCs, supporting the execution of DM applications based on the Weka (Witten and Frank, 2000) tool. A brief introduction to Inhambu was previously presented in Senger et al. (2004). In the current work, we further elaborate on the design and implementation of Inhambu, proposing policies for scheduling, load sharing, heterogeneity, overload avoidance, and fault tolerance. In addition, we compare its performance with that of a similar system named Weka-Parallel (Celis and Musicant, 2002). We build on our previous experience in designing load sharing policies and scheduling tasks of generic applications on heterogeneous, non-dedicated clusters (Senger and Sato, 2003), showing how a DM framework (Weka) focused on classification algorithms can be adapted to harvest the computing power of commodity clusters. In brief, classification (Duda et al., 2001) involves assigning the instances of a dataset to one of a finite number of categories (classes). Classification algorithms are used in several application domains, such as bioinformatics, business, text mining, and web mining.
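
To give a concrete idea of the workload involved, the fragment below shows a plain (sequential) Weka cross-validation written against Weka's public Java API. It is our own minimal sketch, not Inhambu code; the dataset file name and the choice of the J48 decision-tree learner are illustrative. Runs of this kind, repeated over many algorithms and datasets, are what Inhambu spreads over idle cluster nodes.

    import java.io.BufferedReader;
    import java.io.FileReader;
    import java.util.Random;

    import weka.classifiers.Classifier;
    import weka.classifiers.Evaluation;
    import weka.classifiers.trees.J48;
    import weka.core.Instances;

    public class SequentialCrossValidation {
        public static void main(String[] args) throws Exception {
            // Load an ARFF dataset and take the last attribute as the class.
            Instances data = new Instances(new BufferedReader(new FileReader("dataset.arff")));
            data.setClassIndex(data.numAttributes() - 1);

            // Any Weka classifier could be plugged in here; J48 is just an example.
            Classifier classifier = new J48();

            // Stratified 10-fold cross-validation: the expensive step that dominates
            // the running time of a typical classification experiment.
            Evaluation evaluation = new Evaluation(data);
            evaluation.crossValidateModel(classifier, data, 10, new Random(1));
            System.out.println(evaluation.toSummaryString());
        }
    }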

The remainder of this paper is organized as follows. Section 2 provides an overview of related work. Section 3 describes the architecture of Inhambu, as well as its main components, functionalities, and policies. Section 4 details how the Weka tool is adapted to be executed with the support of Inhambu. Afterwards, we empirically evaluate the performance of Inhambu in several experiments, which demanded more than 30 thousand hours of CPU time, comparing it against Weka-Parallel. The results are reported in Section 5. Finally, Section 6 concludes the paper.

Related work

The last decade has witnessed a considerable number of research publications on Parallel and Distributed Knowledge Discovery. For instance, more than three hundred bibliographical references are given by Liu and Kargupta. Most of them address the parallelization of particular DM algorithms or of specific classes of algorithms. Our work differs from such approaches, since we focus on the parallel execution of Weka, which is a system that implements a wide range of classification algorithms, on

The architecture of Inhambu

This section presents an overview of Inhambu’s architecture, as well as its main components and strategies. As depicted in Fig. 1, the architecture of Inhambu comprises an application layer, which consists of a modified implementation of Weka, and a resource management layer, which provides the necessary functionality for executing Weka in a distributed environment. In the application layer, specific components are implemented and deployed at the client and server sides. The client
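
As an illustration of how such a layering can be exposed over Java/RMI, the fragment below sketches a hypothetical remote interface between the application layer and the resource management layer. The interface and method names are our own and do not correspond to Inhambu's published API.

    import java.io.Serializable;
    import java.rmi.Remote;
    import java.rmi.RemoteException;

    // Hypothetical remote interface: the application layer (the modified Weka)
    // submits self-contained tasks, and the resource management layer runs them
    // on hosts it has detected as idle.
    interface TaskServer extends Remote {
        // Runs a task on the remote host and returns its serializable result.
        // (Throws Exception, which covers both RemoteException and task failures.)
        Serializable execute(Task task) throws Exception;

        // Reports whether the host is currently idle, so the scheduler can
        // avoid overloading machines that are in use by their owners.
        boolean isIdle() throws RemoteException;
    }

    // A task is a serializable unit of work, e.g., the evaluation of one
    // cross-validation fold for a given classifier.
    interface Task extends Serializable {
        Serializable run() throws Exception;
    }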

Parallel and distributed cross-validation

At the time of this writing, the latest version of Weka is 3.4.4, which contains implementations of 68 classification algorithms, 5 clustering algorithms, 3 algorithms for finding association rules, and 12 algorithms for attribute selection. The tool is continuously growing, incorporating more and more data mining algorithms. All these algorithms can be applied directly to a dataset by means of a GUI or a command-line interface, or called from Java programs.
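
Cross-validation lends itself to distribution because each fold can be trained and tested independently of the others. The fragment below, again a sketch of ours built on Weka's public Instances API rather than Inhambu's internal code, shows the per-fold unit of work that can be shipped to a remote host; the caller merges the returned per-fold statistics.

    import java.util.Random;

    import weka.classifiers.Classifier;
    import weka.classifiers.Evaluation;
    import weka.classifiers.trees.J48;
    import weka.core.Instances;

    public class FoldTask {
        // Trains and tests a single cross-validation fold. Calls like this one
        // are natural independent tasks for remote execution; the dataset is
        // assumed to have been randomized (and stratified) beforehand.
        static Evaluation evaluateFold(Instances data, int numFolds, int fold) throws Exception {
            Instances train = data.trainCV(numFolds, fold, new Random(1));
            Instances test = data.testCV(numFolds, fold);

            Classifier classifier = new J48();   // illustrative choice of algorithm
            classifier.buildClassifier(train);

            Evaluation evaluation = new Evaluation(train);
            evaluation.evaluateModel(classifier, test);
            return evaluation;                   // per-fold statistics
        }
    }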

Evaluating the performance of Inhambu

In this section, we evaluate both the performance and the main functionalities of Inhambu, comparing it with the Weka-Parallel system (Celis and Musicant, 2002). Our experiments consumed more than 30 thousand hours of computing time, equivalent to that of a 2 GHz IA-32 CPU.

Conclusion

We have presented and assessed Inhambu, a distributed object-oriented system that supports the execution of data mining applications on clusters of PCs and workstations. The main features and capabilities of Inhambu include: (i) the support for dynamic detection and management of idle cycles to execute data mining tasks, (ii) the management of differentiated capacities of heterogeneous computers, (iii) the dynamic aggregation of computers to the cluster, and (iv) the automatic restarting of

References (39)

  • Giersch, A., et al., 2006. Scheduling tasks sharing files on heterogeneous master-slave platforms. J. Syst. Architect.
  • Baraglia, R., Laforenza, D., Orlando, S., Palmerini, P., Perego, R., 2000. Implementation issues in the design of I/O...
  • Bouchenak, S., et al., 2004. Experiences implementing efficient Java thread serialization, mobility and persistence. Software – Practice & Experience.
  • Brunner, R.K., et al. Adapting to load on workstation clusters.
  • Canataro, M., et al., 2003. The knowledge grid. Commun. ACM.
  • Casavant, T.L., et al., 1988. A taxonomy of scheduling in general-purpose distributed computing systems. IEEE Trans. Software Eng.
  • Cassandras, C., 1993. Discrete Event Systems: Modeling and Performance Analysis.
  • Celis, S., Musicant, D.R., 2002. Weka-parallel: machine learning in parallel. Technical Report, Carleton College...
  • Czajkowski, K., Foster, I., Karonis, N., Kesselman, C., Martin, S., Smith, W., Tuecke, S., 1998. A resource management...
  • Czajkowski, K., Foster, I., Kesselman, C., Sander, V., Tuecke, S., 2002. SNAP: a protocol for negotiating service level...
  • Devarakonda, M.V., et al., 1989. Predictability of process resource usage: a measurement-based study on Unix. IEEE Trans. Software Eng.
  • Dinda, P., O’Halloran, D., 1998. The statistical properties of host load. In: Proceedings of 4th Workshop on Languages,...
  • Duda, R.O., et al., 2001. Pattern Classification.
  • Elnozahy, E.N., Johnson, D.B., Zwaenpoel, W., 1992. The performance of consistent checkpointing. In: Proceedings of the...
  • Fayyad, U.M., et al. From data mining to knowledge discovery: an overview.
  • Ferrari, D., et al. An empirical investigation of load indices for load balancing applications.
  • Frank, E., Witten, I.H., 1998. Generating accurate rule sets without global optimization. In: Proceedings of 15th...
  • Freitas, A.A., et al., 1998. Mining Very Large Databases with Parallel Processing.
  • Garbacki, P., Biskupski, B., Bal, H., 2005. Transparent fault tolerance for grid applications. In: Proceedings of the...

This project is supported by the Brazilian research funding agency CNPq (Conselho Nacional de Desenvolvimento Científico e Tecnológico, Brazil), under contract number 401439/2003-8.
