Abstract
Advances in computer technology, together with the rapid emergence of multicore processors, have made many-core personal computers widely available and more affordable. Networks of workstations and clusters of many-core SMPs have become an attractive solution for high-performance computing, providing computational power equal or superior to that of supercomputers or mainframes at an affordable cost using commodity components. Finding alternative ways to extract unused and idle computing power from these resources in order to improve overall performance, and to fully utilize the underlying new hardware platforms, are major topics in this field of research. This paper introduces the design rationale and implementation of an effective toolkit for performance measurement and analysis of parallel applications in cluster environments. The toolkit not only generates timing graph representations of parallel applications but also provides performance data charts of their execution. The goal in developing this toolkit is to give application developers a better understanding of an application’s behavior across the computing nodes selected for a particular execution. Additionally, multiple execution results of a given application under development can be combined and overlaid, allowing application developers to perform “what-if” analysis, i.e., to gain a deeper understanding of how the allocated computational resources are utilized. Experiments with this toolkit have shown its effectiveness in the development and performance tuning of parallel applications, and its use extends to teaching courses on message-passing and shared-memory parallel programming.
Cite this article
Li, KC., Weng, TH. Performance-based parallel application toolkit for high-performance clusters. J Supercomput 48, 43–65 (2009). https://doi.org/10.1007/s11227-008-0204-2
DOI: https://doi.org/10.1007/s11227-008-0204-2