Abstract
To fully exploit the power of a parallel computer, an application must be distributed onto processors so that, as much as possible, each has an equal-sized, independent portion of the work. There is a tension between balancing processor loads and maximizing locality, as the dynamic re-assignment of work necessitates access to remote data. Fractiling is a dynamic scheduling scheme that simultaneously balances processor loads and maintains locality by exploiting the self-similarity properties of fractals.
Fractiling accommodates load imbalances caused by predictable phenomena, such as irregular data, and unpredictable phenomena, such as data-access latencies. Probabilistic analysis gives evidence that it should achieve close-to-optimal load balance. We have applied fractiling to two applications, an N-body problem and dense matrix multiplication, running on shared-address-space and private-address-space parallel machines, namely the Kendall Square KSR1 and the IBM SP1. Although the applications contained little or no algorithmic variance, fractiling improved performance over static scheduling due to systemic variance; however, artifacts of the memory subsystems of the two architectures impeded the scalability of the fractiled code.
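The chunk-size schedule underlying fractiling descends from the factoring scheme: each scheduling round hands out a fixed fraction (here, half) of the remaining iterations, split evenly across processors, so chunks shrink geometrically and late-round imbalance stays small. The sketch below is illustrative only, not the authors' implementation; the function name and the halving ratio are assumptions, and in fractiling proper each processor would draw its chunks from its own self-similar subtile first, touching remote tiles only after its local work is exhausted.

```python
def factoring_chunks(n, p):
    """Yield chunk sizes for n loop iterations on p processors.

    Each round distributes roughly half of the remaining iterations,
    split p ways, so chunk sizes decrease geometrically. (Illustrative
    sketch of a factoring-style schedule, not the paper's code.)
    """
    remaining = n
    while remaining > 0:
        # Half the remaining work, divided among p processors; at least 1.
        chunk = max(1, remaining // (2 * p))
        for _ in range(p):
            if remaining == 0:
                break
            take = min(chunk, remaining)
            yield take
            remaining -= take

# Example: 100 iterations on 4 processors yields large chunks first,
# then progressively smaller ones for end-of-loop balancing.
print(list(factoring_chunks(100, 4)))
```

In a fractiled execution, each of these chunks would additionally be mapped onto a self-similar subdivision of a processor's local tile, so that dynamic re-assignment of the small trailing chunks disturbs locality as little as possible.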
Research supported by ARPA/USAF under Grant no. F30602-95-1-0008 and the New York State Science and Technology Foundation. The research was conducted using the resources of the Cornell Theory Center, which receives major funding from the National Science Foundation and New York State; additional funding comes from the Advanced Research Projects Agency, the National Institutes of Health, IBM Corporation, and other members of the center's Corporate Research Institute. Susan Hummel was also supported in part by NSF Grant CCR-9321424; Joel Wein by NSF Grant CCR-9211494. We thank Bob Walkup for his assistance in programming the SP1.
© 1996 Springer Science+Business Media New York
Hummel, S.F., Banicescu, I., Wang, CT., Wein, J. (1996). Load Balancing and Data Locality Via Fractiling: An Experimental Study. In: Szymanski, B.K., Sinharoy, B. (eds) Languages, Compilers and Run-Time Systems for Scalable Computers. Springer, Boston, MA. https://doi.org/10.1007/978-1-4615-2315-4_7
Print ISBN: 978-1-4613-5979-1
Online ISBN: 978-1-4615-2315-4