Abstract
Current parallelization strategies for MIMD machines, whether adopted by programmers or by parallelizing compilers, partition the computation so that communication is minimized. Optimization of single-node performance is then performed as a second step. This two-step procedure may not deliver optimal parallel performance. Good performance on tightly coupled parallel machines relies increasingly on good utilization of single-node resources, such as cache memories and vector units. In this paper we present evidence of the importance of efficient utilization of single-node resources when deciding how to parallelize a program.
Copyright information
© 1996 Springer-Verlag Berlin Heidelberg
About this paper
Cite this paper
Hernández, E. (1996). Parallelizing for a good node performance. In: Liddell, H., Colbrook, A., Hertzberger, B., Sloot, P. (eds) High-Performance Computing and Networking. HPCN-Europe 1996. Lecture Notes in Computer Science, vol 1067. Springer, Berlin, Heidelberg. https://doi.org/10.1007/3-540-61142-8_593
DOI: https://doi.org/10.1007/3-540-61142-8_593
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-540-61142-4
Online ISBN: 978-3-540-49955-8
eBook Packages: Springer Book Archive