Exploiting idle cycles to execute data mining applications on clusters of PCs☆
Introduction
As proposed by Fayyad et al. (1996), Knowledge Discovery in Databases (KDD) is the non-trivial process of identifying valid, novel, potentially useful, and ultimately understandable patterns in data. Although the terms KDD and Data Mining (DM) are sometimes used interchangeably, DM is often regarded as the step of the KDD process that centers on the automated discovery of patterns and relationships in data (Hand et al., 2001, Witten and Frank, 2000). From this perspective, our work focuses primarily on DM, which usually involves computationally expensive tasks. For this reason, several techniques have been proposed to improve the performance of DM algorithms, such as parallel processing (Freitas and Lavington, 1998), implementations based on clusters of workstations (Baraglia et al., 2000), and computational grids (Cannataro and Talia, 2003). These techniques can help bring DM applications to production scales.
The potential benefits of applying parallel processing to DM applications are not limited to reducing execution times; qualitative improvements can also be achieved. As DM methods are often based on machine learning techniques (Witten and Frank, 2000), computational power can potentially be turned into accuracy gains, for instance by executing approximation-based algorithms with a greater number of steps, training algorithms on larger datasets, and running more experiments when non-deterministic algorithms are employed.
In recent years, computer clusters have become increasingly used for high performance computing. Moreover, assembling clusters of commodity PCs and workstations is an easy way to provide computational power at low cost, since companies and universities often have hundreds or thousands of commodity PCs. In such environments, individual computers often present low levels of utilization, offering great amounts of computational power that can be used to perform intensive computations, such as those required by DM algorithms.
While clusters composed of commodity PCs can be used to carry out the execution of large applications, some challenges arise from their heterogeneous and non-dedicated nature. For scalability and protection of investment, companies and universities often add new computers to the existing pool without discarding the old ones, a practice that usually leads to heterogeneity. In addition, non-dedicated environments often suffer from fluctuating levels of performance, failures, and unavailability of resources. All of these factors may render it difficult to support the execution of DM applications on such a computing infrastructure. From this standpoint, the development of tools capable of supporting the efficient usage of existing resources for DM is important. Currently, DM practitioners and researchers usually lack tools that support the execution of DM applications on commodity clusters and meet the following requirements:
- Implement a good collection of algorithms. Researchers and practitioners usually run several DM algorithms in order to assess their performance on particular applications.
- Ease of use. DM tools should be easy to install and easy to use, and DM users should not need to be concerned with operating system and networking configuration. Ideally, new tools should employ familiar interfaces and interaction patterns.
- Resource management. To extend the benefits of high performance computing to DM users, DM tools should provide support for dynamic discovery, allocation, and management of the existing resources in clusters of PCs and/or workstations.
- Heterogeneity. Commodity clusters may be composed of heterogeneous computers with different capacities and resources, which should be taken into account when scheduling DM tasks.
- Fault tolerance. The execution of computing-intensive tasks, which may take several hours (or a few days) and involve several machines, requires DM tools to survive failures. After a crash, DM tools should be able to resume their execution.
Inhambu is a system that exploits idle resources in a cluster of commodity PCs to support the execution of DM applications based on the Weka tool (Witten and Frank, 2000). A brief introduction to Inhambu was previously presented in Senger et al. (2004). In our current work, we further elaborate on the design and implementation of Inhambu, proposing policies for scheduling, load sharing, heterogeneity, overload avoidance, and fault tolerance. In addition, we compare its performance with that of a similar system named Weka-Parallel (Celis and Musicant, 2002). We build on our previous experience in designing load sharing policies and scheduling tasks of generic applications on heterogeneous, non-dedicated clusters (Senger and Sato, 2003), showing how a DM framework (Weka) focused on classification algorithms can be adapted to harvest the computing power of commodity clusters. In brief, classification (Duda et al., 2001) involves the assignment of instances of a dataset to one of a finite number of categories (classes). Classification algorithms are used in several application areas, such as bioinformatics, business, text mining, and web mining.
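As a concrete illustration of the classification task just described, the following sketch assigns an unseen instance to the class of its nearest training instance (a 1-nearest-neighbour rule). It is purely illustrative; the class and method names are our own and do not correspond to any Weka or Inhambu API.

```java
// Minimal sketch of classification with a 1-nearest-neighbour rule.
// Illustrative only; not part of Weka or Inhambu.
public class NearestNeighbour {

    // Return the label of the training instance closest to the query,
    // using squared Euclidean distance over the feature vectors.
    static int classify(double[][] train, int[] labels, double[] query) {
        int best = -1;
        double bestDist = Double.MAX_VALUE;
        for (int i = 0; i < train.length; i++) {
            double d = 0.0;
            for (int j = 0; j < query.length; j++) {
                double diff = train[i][j] - query[j];
                d += diff * diff;
            }
            if (d < bestDist) {
                bestDist = d;
                best = i;
            }
        }
        return labels[best];
    }

    public static void main(String[] args) {
        // Four labelled instances forming two classes (0 and 1).
        double[][] train = {{0, 0}, {1, 1}, {5, 5}, {6, 5}};
        int[] labels = {0, 0, 1, 1};
        System.out.println(classify(train, labels, new double[]{0.5, 0.2})); // prints 0
        System.out.println(classify(train, labels, new double[]{5.5, 5.0})); // prints 1
    }
}
```

A real classifier would, of course, learn a model from the training data rather than memorize it, but the input/output contract is the same: instances in, class labels out.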
The remainder of this paper is organized as follows. Section 2 provides an overview of related work. Section 3 describes the architecture of Inhambu, as well as its main components, functionalities, and policies. Section 4 details how the Weka tool is adapted to be executed with the support of Inhambu. Afterwards, we empirically evaluate the performance of Inhambu in several experiments, which demanded more than 30 thousand hours of CPU time, comparing it against Weka-Parallel. The results are reported in Section 5. Finally, Section 6 concludes the paper.
Section snippets
Related work
The last decade has witnessed a considerable amount of research publications on Parallel and Distributed Knowledge Discovery. For instance, more than three hundred bibliographical references are given by Liu and Kargupta. Most of them address the parallelization of particular DM algorithms or specific classes of algorithms. Our work differs from such approaches, since we focus on the parallel execution of Weka, which is a system that implements a wide range of classification algorithms, on
The architecture of Inhambu
This section presents an overview of Inhambu's architecture, as well as its main components and strategies. As depicted in Fig. 1, the architecture of Inhambu comprises an application layer, which consists of a modified implementation of Weka, and a resource management layer, which provides the functionality necessary to execute Weka in a distributed environment. In the application layer, specific components are implemented and deployed at the client and server sides. The client
Parallel and distributed cross-validation
At the time of this writing, the latest version of Weka is 3.4.4, which contains implementations of 68 classification algorithms, 5 clustering algorithms, 3 algorithms for finding association rules, and 12 algorithms for attribute selection. The collection is continuously growing, incorporating more and more data mining algorithms. All of these algorithms can be applied directly to a dataset by means of a GUI or a command-line interface, or called from Java programs.
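The k folds of a cross-validation run are mutually independent, which is what makes the procedure amenable to parallel and distributed execution. The sketch below illustrates the idea using local threads as stand-ins for cluster nodes; the class and method names, the dummy per-fold accuracy, and the thread count are our own illustrative choices, not Inhambu's actual interfaces.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

// Minimal sketch: each of the k folds of a cross-validation run is an
// independent task, so the folds can be evaluated concurrently. Here a
// thread pool stands in for the cluster nodes that Inhambu would use.
public class ParallelCrossValidation {

    // Stand-in for training a classifier on k-1 folds and testing it on
    // the held-out fold; a real implementation would invoke Weka here.
    static double evaluateFold(int fold, int k) {
        return 0.9; // dummy per-fold accuracy
    }

    public static void main(String[] args) throws Exception {
        int k = 10;
        ExecutorService pool = Executors.newFixedThreadPool(4);
        List<Future<Double>> results = new ArrayList<>();
        for (int fold = 0; fold < k; fold++) {
            final int f = fold;
            results.add(pool.submit(() -> evaluateFold(f, k)));
        }
        double sum = 0.0;
        for (Future<Double> r : results) {
            sum += r.get(); // blocks until the fold finishes
        }
        pool.shutdown();
        // Average the k per-fold accuracies, as standard cross-validation does.
        System.out.println("mean accuracy over " + k + " folds: " + (sum / k));
    }
}
```

Because the folds share the same training data but no intermediate state, the only communication required is shipping the dataset to each worker and collecting one accuracy figure per fold, which keeps the scheme coarse-grained and well suited to a non-dedicated cluster.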
Evaluating the performance of Inhambu
In this section, we evaluate both the performance and the main functionalities of Inhambu, comparing it with the Weka-Parallel system (Celis and Musicant, 2002). Our experiments consumed more than 30 thousand hours of computing time, normalized to a 2 GHz IA-32 CPU.
Conclusion
We have presented and assessed Inhambu, a distributed object-oriented system that supports the execution of data mining applications on clusters of PCs and workstations. The main features and capabilities of Inhambu include: (i) the support for dynamic detection and management of idle cycles to execute data mining tasks, (ii) the management of differentiated capacities of heterogeneous computers, (iii) the dynamic aggregation of computers to the cluster, and (iv) the automatic restarting of
References (39)
- et al., Scheduling tasks sharing files on heterogeneous master-slave platforms, J. Syst. Architect. (2006)
- Baraglia, R., Laforenza, D., Orlando, S., Palmerini, P., Perego, R., 2000. Implementation issues in the design of I/O...
- et al., Experiences implementing efficient Java thread serialization, mobility and persistence, Software – Practice & Experience (2004)
- et al., Adapting to load on workstation clusters
- et al., The knowledge grid, Commun. ACM (2003)
- et al., A taxonomy of scheduling in general-purpose distributed computing systems, IEEE Trans. Software Eng. (1988)
- Discrete Event Systems: Modeling and Performance Analysis (1993)
- Celis, S., Musicant, D.R., 2002. Weka-parallel: machine learning in parallel. Technical Report, Carleton College...
- Czajkowski, K., Foster, I., Karonis, N., Kesselman, C., Martin, S., Smith, W., Tuecke, S., 1998. A resource management...
- Czajkowski, K., Foster, I., Kesselman, C., Sander, V., Tuecke, S., 2002. SNAP: a protocol for negotiating service level...
- Predictability of process resource usage: a measurement-based study on Unix, IEEE Trans. Software Eng.
- Pattern Classification
- From data mining to knowledge discovery: an overview
- An empirical investigation of load indices for load balancing applications
- Mining Very Large Databases with Parallel Processing
☆ This project is supported by the Brazilian research funding agency CNPq (Conselho Nacional de Desenvolvimento Científico e Tecnológico, Brazil), under contract number 401439/2003-8.