Abstract
As parallel architectures grow more diverse and parallel applications take longer to develop, performance portability has become a major consideration in designing the next generation of parallel program execution models, APIs, and runtime system software. This paper analyzes both the code portability and the performance portability of parallel programs for fine-grained multi-threaded execution and architecture models. We concentrate on one particular event-driven, fine-grained multi-threaded execution model, EARTH, and discuss several design considerations of the EARTH model and runtime system that contribute to the performance portability of parallel applications. We believe that these are important issues for the design of future high-end computing system software. Four representative benchmarks were run on several different parallel architectures, including two clusters listed in the 23rd TOP500 supercomputer list. The results demonstrate that EARTH-based programs can achieve robust performance portability across the selected hardware platforms without any code modification or tuning.
Cite this article
Zhu, W., Niu, Y. & Gao, G. Performance portability on EARTH: a case study across several parallel architectures. Cluster Comput 10, 115–126 (2007). https://doi.org/10.1007/s10586-007-0011-1
DOI: https://doi.org/10.1007/s10586-007-0011-1