Abstract
The goal of clustering is to identify subsets, called clusters, whose objects are usually more similar to each other than they are to objects from other clusters. We have proposed the MACLAW method, a cooperative coevolution algorithm for data clustering, which has shown good results (Blansché and Gançarski, Pattern Recognit. Lett. 27(11), 1299–1306, 2006). However, the complexity of the algorithm increases rapidly with the number of clusters to find. In this article we propose a parallelization of MACLAW based on a message-passing paradigm, together with an analysis of the application's performance supported by experimental results. We show that we reach near-optimal speedups when searching for 16 clusters, a typical problem instance for which the duration of the sequential execution is an obstacle to using the MACLAW method. Further, our approach is original because we use the P2P-MPI grid middleware (Genaud and Rattanapoka, Lecture Notes in Comput. Sci., vol. 3666, pp. 276–284, 2005), which provides both the message-passing library and the infrastructure services to discover computing resources. We also put forward that the application can be tightly coupled with the middleware to make the parallel execution nearly transparent for the user.
References
Berkhin P (2002) Survey of clustering data mining techniques. Technical report, Accrue Software, San Jose, CA
Blansché A, Gançarski P (2006) MACLAW: a modular approach for clustering with local attribute weighting. Pattern Recognit Lett 27(11):1299–1306
Cappello F et al (2005) Grid’5000: a large scale, reconfigurable, controlable and monitorable grid platform. In: Proceedings of the 6th IEEE/ACM international workshop on grid computing Grid’2005, November 2005. http://www.grid5000.org
Carpenter B, Getov V, Judd G, Skjellum T, Fox G (2000) MPJ: MPI-like message passing for Java. Concurr Pract Experience 12(11), September
Chan EY, Ching WK, Ng MK, Huang JZ (2004) An optimization algorithm for clustering using weighted dissimilarity measures. Pattern Recognit 37:943–952
Dhillon IS, Modha DS (2000) A data-clustering algorithm on distributed memory multiprocessors. In: Revised papers from large-scale parallel data mining, workshop on large-scale parallel KDD systems, SIGKDD. Springer, New York, pp 245–260
Domeniconi C, Gunopulos D, Ma S, Yan B, Al-Razgan M, Papadopoulos D (2007) Locally adaptive metrics for clustering high dimensional data. Data Min Knowl Discov 14(1):63–97
Forman G, Zhang B (2000) Linear speedup for a parallel non-approximate recasting of center-based clustering algorithms, including k-means, k-harmonic means, and EM. In: ACM SIGKDD workshop on distributed and parallel knowledge discovery, KDD-2000
Friedman JH, Meulman JJ (2004) Clustering objects on subsets of attributes. J Roy Stat Soc 66(4):815–849
Frigui H, Nasraoui O (2004) Unsupervised learning of prototypes and attribute weights. Pattern Recognit 34:567–581
Gabriel E, Resch M, Beisel T, Keller R (1998) Distributed computing in an heterogeneous computing environment. In: EuroPVM/MPI. Lecture notes in comput sci, vol 1497. Springer, New York, pp 180–187
Genaud S, Rattanapoka C (2005) A peer-to-peer framework for robust execution of message passing parallel programs. In: Di Martino B et al (eds) EuroPVM/MPI 2005. Lecture notes in comput sci, vol 3666. Springer, New York, pp 276–284, September
Genaud S, Rattanapoka C (2007) Fault management in P2P-MPI. In: Proceedings of international conference on grid and pervasive computing, GPC’07. Lecture notes in comput sci. Springer, May
Genaud S, Rattanapoka C (2007) P2P-MPI: a peer-to-peer framework for robust execution of message passing parallel programs. J Grid Comput 5:27–42
Gnanadesikan R, Kettenring JR, Tsao SL (1995) Weighting and selection of variables for cluster analysis. J Classif 12(1):113–136
Howe N, Cardie C (1997) Examining locally varying weights for nearest neighbor algorithms. In: ICCBR, pp 455–466
Huang JZ, Ng MK, Rong H, Li Z (2005) Automated variable weighting in k-means type clustering. IEEE Trans Pattern Anal Mach Intell 27(2):657–668
JXTA http://www.jxta.org
Karonis NT, Toonen BT, Foster I (2003) MPICH-G2: a grid-enabled implementation of the message passing interface. J Parallel Distributed Comput special issue on Comput Grids 63(5):551–563, May
Kielmann T, Hofman RFH, Bal HE, Plaat A, Bhoedjang RAF (1999) MagPIe: MPI’s collective communication operations for clustered wide area systems. ACM SIGPLAN Notices 34(8):131–140, August
Kruengkrai C, Jaruskulchai C (2002) A parallel learning algorithm for text classification. In: Eighth ACM SIGKDD international conference on knowledge discovery and data mining, July
MacQueen JB (1967) Some methods for classification and analysis of multivariate observations. In: Proceedings of the 5th Berkeley symposium on mathematical statistics and probability, Berkeley, CA, 1967. University of California Press, pp 281–297
MPI (1995) A message passing interface standard, version 1.1. Technical report, University of Tennessee, Knoxville, TN, USA, Jun
Parsons L, Haque E, Liu H (2004) Subspace clustering for high dimensional data: a review. SIGKDD explorations, newsletter of the ACM special interest group on knowledge discovery and data mining 6(1):90–106
Shudo K, Tanaka Y, Sekiguchi S (2005) P3: P2P-based middleware enabling transfer and aggregation of computational resource. In: 5th intl workshop on global and peer-to-peer computing, in conjunction with CCGrid05. IEEE, May
Xu R, Wunsch D (2005) Survey of clustering algorithms. IEEE Trans Neural Netw 16(3):645–678
Cite this article
Genaud, S., Gançarski, P., Latu, G. et al. Exploitation of a parallel clustering algorithm on commodity hardware with P2P-MPI. J Supercomput 43, 21–41 (2008). https://doi.org/10.1007/s11227-007-0136-2
DOI: https://doi.org/10.1007/s11227-007-0136-2