Performance-based data distribution for data mining applications on grid computing environments

The Journal of Supercomputing

Abstract

Effective data distribution techniques can significantly reduce the total execution time of a program in grid computing environments, especially for data mining applications. In this paper, we describe a linear programming formulation of the data distribution problem on grids. We also propose a heuristic method, the Heuristic Data Distribution Scheme (HDDS), to solve this problem. We implement two types of data mining applications, Association Rule Mining and Decision Tree Construction, and conduct experiments on grid testbeds. Experimental results show that data mining programs that use HDDS to distribute data execute more efficiently than those that use traditional schemes.



Author information

Corresponding author

Correspondence to Chao-Tung Yang.

About this article

Cite this article

Shih, WC., Yang, CT. & Tseng, SS. Performance-based data distribution for data mining applications on grid computing environments. J Supercomput 52, 171–198 (2010). https://doi.org/10.1007/s11227-009-0286-5

  • DOI: https://doi.org/10.1007/s11227-009-0286-5
