Load Balancing and Data Locality Via Fractiling: An Experimental Study

A chapter in Languages, Compilers and Run-Time Systems for Scalable Computers

Abstract

To fully exploit the power of a parallel computer, an application must be distributed across processors so that each has, as nearly as possible, an equal-sized, independent portion of the work. There is a tension between balancing processor loads and maximizing locality, since dynamically reassigning work necessitates access to remote data. Fractiling is a dynamic scheduling scheme that simultaneously balances processor loads and maintains locality by exploiting the self-similarity properties of fractals.

Fractiling accommodates load imbalances caused both by predictable phenomena, such as irregular data, and by unpredictable phenomena, such as data-access latencies. Probabilistic analysis gives evidence that it should achieve close to optimal load balance. We applied fractiling to two applications, an N-body problem and dense matrix multiplication, running on a shared-address-space and a private-address-space parallel machine, namely the Kendall Square KSR1 and the IBM SP1. Although the applications contained little or no algorithmic variance, fractiling improved performance over static scheduling because of systemic variance; however, artifacts of the memory subsystems of the two architectures impeded the scalability of the fractiled code.
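
To make the flavor of the scheme concrete, below is a minimal Python sketch of a fractiling-style loop scheduler. It assumes a factoring-style rule in which chunk sizes halve from wave to wave, and a simple ownership model in which each worker draws chunks from its own contiguous block of the iteration space first (preserving locality) and takes work from the most loaded block only when its own is exhausted (balancing load). The constants, the halving rule, and the stealing policy are illustrative assumptions, not the authors' exact algorithm.

    import random
    import threading
    import time

    N = 1 << 12      # total loop iterations
    P = 4            # number of workers
    MIN_CHUNK = 8    # smallest chunk handed out

    # Each worker "owns" one contiguous block of the iteration space, so
    # the chunks it executes first are local. remaining[p] = [next, end).
    block = N // P
    remaining = [[p * block, (p + 1) * block] for p in range(P)]
    lock = threading.Lock()

    def grab_chunk(p, wave):
        # Chunk sizes halve each wave (factoring-style), so the allocation
        # pattern repeats at successively smaller scales -- the
        # self-similar, "fractal" flavor of the scheme.
        size = max(block >> (wave + 1), MIN_CHUNK)
        with lock:
            # Prefer p's own block (locality); otherwise take from the
            # most loaded block (load balancing).
            owner = p
            if remaining[p][0] >= remaining[p][1]:
                owner = max(range(P),
                            key=lambda q: remaining[q][1] - remaining[q][0])
            lo, hi = remaining[owner]
            if lo >= hi:
                return None
            take = min(size, hi - lo)
            remaining[owner][0] = lo + take
            return lo, lo + take

    def worker(p):
        wave = 0
        while True:
            chunk = grab_chunk(p, wave)
            if chunk is None:
                return
            for i in range(*chunk):
                # Stand-in for one loop iteration; the jitter mimics the
                # systemic variance (e.g. data-access latency) that makes
                # dynamic scheduling pay off.
                time.sleep(random.uniform(0.0, 1e-5))
            wave += 1

    threads = [threading.Thread(target=worker, args=(p,)) for p in range(P)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    print("scheduled all", N, "iterations across", P, "workers")

In this sketch each worker executes roughly half of its remaining share per wave, so early chunks are large and local while late chunks are small; faster workers therefore absorb stragglers' leftover iterations at low remote-access cost.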

Research supported by ARPA/USAF under Grant No. F30602-95-1-0008 and the New York State Science and Technology Foundation. The research was conducted using the resources of the Cornell Theory Center, which receives major funding from the National Science Foundation and New York State; additional funding comes from the Advanced Research Projects Agency, the National Institutes of Health, IBM Corporation, and other members of the center's Corporate Research Institute. Susan Hummel was also supported in part by NSF Grant CCR-9321424; Joel Wein by NSF Grant CCR-9211494. We thank Bob Walkup for his assistance in programming the SP1.

Copyright information

© 1996 Springer Science+Business Media New York

About this chapter

Cite this chapter

Hummel, S.F., Banicescu, I., Wang, C.T., Wein, J. (1996). Load Balancing and Data Locality Via Fractiling: An Experimental Study. In: Szymanski, B.K., Sinharoy, B. (eds) Languages, Compilers and Run-Time Systems for Scalable Computers. Springer, Boston, MA. https://doi.org/10.1007/978-1-4615-2315-4_7

  • DOI: https://doi.org/10.1007/978-1-4615-2315-4_7

  • Publisher Name: Springer, Boston, MA

  • Print ISBN: 978-1-4613-5979-1

  • Online ISBN: 978-1-4615-2315-4
