Exploitation of a parallel clustering algorithm on commodity hardware with P2P-MPI

The Journal of Supercomputing

Abstract

The goal of clustering is to identify subsets, called clusters, which usually correspond to objects that are more similar to each other than they are to objects from other clusters. We have proposed the MACLAW method, a cooperative coevolution algorithm for data clustering, which has shown good results (Blansché and Gançarski, Pattern Recognit. Lett. 27(11), 1299–1306, 2006). However, the complexity of the algorithm increases rapidly with the number of clusters to find. In this article, we propose a parallelization of MACLAW based on the message-passing paradigm, and we analyze the application's performance with experimental results. We show that we reach near-optimal speedups when searching for 16 clusters, a typical problem instance for which the sequential execution time is an obstacle to using the MACLAW method. Further, our approach is original because we use the P2P-MPI grid middleware (Genaud and Rattanapoka, Lecture Notes in Comput. Sci., vol. 3666, pp. 276–284, 2005), which provides both the message-passing library and the infrastructure services to discover computing resources. We also argue that the application can be tightly coupled with the middleware to make parallel execution nearly transparent to the user.

References

  1. Berkhin P (2002) Survey of clustering data mining techniques. Technical report, Accrue Software, San Jose, CA

  2. Blansché A, Gançarski P (2006) MACLAW: a modular approach for clustering with local attribute weighting. Pattern Recognit Lett 27(11):1299–1306

  3. Cappello F et al (2005) Grid’5000: a large scale, reconfigurable, controlable and monitorable grid platform. In: Proceedings of the 6th IEEE/ACM international workshop on grid computing Grid’2005, November 2005. http://www.grid5000.org

  4. Carpenter B, Getov V, Judd G, Skjellum T, Fox G (2000) MPJ: MPI-like message passing for Java. Concurr Pract Experience 12(11), September

  5. Chan EY, Ching WK, Ng MK, Huang JZ (2004) An optimization algorithm for clustering using weighted dissimilarity measures. Pattern Recognit 37:943–952

  6. Dhillon IS, Modha DS (2000) A data-clustering algorithm on distributed memory multiprocessors. In: Revised papers from large-scale parallel data mining, workshop on large-scale parallel KDD systems, SIGKDD Springer, New York, pp 245–260

  7. Domeniconi C, Gunopulos D, Ma S, Yan B, Al-Razgan M, Papadopoulos D (2007) Locally adaptive metrics for clustering high dimensional data. Data Min Knowl Discov 14(1):63–97

  8. Forman G, Zhang B (2000) Linear speedup for a parallel non-approximate recasting of center-based clustering algorithms, including k-means, k-harmonic means, and EM. In: ACM SIGKDD workshop on distributed and parallel knowledge discovery, KDD-2000

  9. Friedman JH, Meulman JJ (2004) Clustering objects on subsets of attributes. J Roy Stat Soc 66(4):815–849

  10. Frigui H, Nasraoui O (2004) Unsupervised learning of prototypes and attribute weights. Pattern Recognit 37:567–581

  11. Gabriel E, Resch M, Beisel T, Keller R (1998) Distributed computing in an heterogeneous computing environment. In: EuroPVM/MPI. Lecture notes in comput sci, vol 1497. Springer, New York, pp 180–187

  12. Genaud S, Rattanapoka C (2005) A peer-to-peer framework for robust execution of message passing parallel programs. In: Di Martino B et al (eds) EuroPVM/MPI 2005. Lecture notes in comput sci, vol 3666. Springer, New York, pp 276–284, September

  13. Genaud S, Rattanapoka C (2007) Fault management in P2P-MPI. In: Proceedings of international conference on grid and pervasive computing, GPC’07. Lecture notes in comput sci. Springer, May

  14. Genaud S, Rattanapoka C (2007) P2P-MPI: a peer-to-peer framework for robust execution of message passing parallel programs. J Grid Comput 5:27–42

  15. Gnanadesikan R, Kettenring JR, Tsao SL (1995) Weighting and selection of variables for cluster analysis. J Classif 12(1):113–136

  16. Howe N, Cardie C (1997) Examining locally varying weights for nearest neighbor algorithms. In: ICCBR, pp 455–466

  17. Huang JZ, Ng MK, Rong H, Li Z (2005) Automated variable weighting in k-means type clustering. IEEE Trans Pattern Anal Mach Intell 27(2):657–668

  18. JXTA http://www.jxta.org

  19. Karonis NT, Toonen BT, Foster I (2003) MPICH-G2: a grid-enabled implementation of the message passing interface. J Parallel Distributed Comput special issue on Comput Grids 63(5):551–563, May

  20. Kielmann T, Hofman RFH, Bal HE, Plaat A, Bhoedjang RAF (1999) MagPIe: MPI’s collective communication operations for clustered wide area systems. ACM SIGPLAN Notices 34(8):131–140, August

  21. Kruengkrai C, Jaruskulchai C (2002) A parallel learning algorithm for text classification. In: Eighth ACM SIGKDD international conference on knowledge discovery and data mining, July

  22. MacQueen JB (1967) Some methods for classification and analysis of multivariate observations. In: Proceedings of the 5th Berkeley symposium on mathematical statistics and probability, Berkeley, CA, 1967. University of California Press, pp 281–297

  23. MPI (1995) A message passing interface standard, version 1.1. Technical report, University of Tennessee, Knoxville, TN, USA, Jun

  24. Parsons L, Haque E, Liu H (2004) Subspace clustering for high dimensional data: a review. SIGKDD explorations, newsletter of the ACM special interest group on knowledge discovery and data mining 6(1):90–106

  25. Shudo K, Tanaka Y, Sekiguchi S (2005) P3: P2P-based middleware enabling transfer and aggregation of computational resource. In: 5th intl workshop on global and peer-to-peer computing, in conjunc with CCGrid05. IEEE, May

  26. Xu R, Wunsch D (2005) Survey of clustering algorithms. IEEE Trans Neural Netw 16(3):645–678

Author information

Corresponding author

Correspondence to Pierre Gançarski.

About this article

Cite this article

Genaud, S., Gançarski, P., Latu, G. et al. Exploitation of a parallel clustering algorithm on commodity hardware with P2P-MPI. J Supercomput 43, 21–41 (2008). https://doi.org/10.1007/s11227-007-0136-2

  • DOI: https://doi.org/10.1007/s11227-007-0136-2
