Juggle: addressing extrinsic load imbalances in SPMD applications on multicore computers

Hofmeyr, Steven; Colmenares, Juan A.; Iancu, Costin; Kubiatowicz, John

doi:10.1007/s10586-012-0204-0

Published: 14 April 2012

Volume 16, pages 299–319, (2013)

Cluster Computing

Steven Hofmeyr¹,
Juan A. Colmenares²,
Costin Iancu¹ &
John Kubiatowicz²

Abstract

We investigate proactive dynamic load balancing on multicore systems, in which threads are continually migrated to reduce the impact of processor/thread mismatches. Our goal is to enhance the flexibility of the SPMD-style programming model and enable SPMD applications to run efficiently in multiprogrammed environments. We present Juggle, a practical decentralized, user-space implementation of a proactive load balancer that emphasizes portability and usability. In this paper we assume perfect intrinsic load balance and focus on extrinsic imbalances caused by OS noise, multiprogramming and mismatches of threads to hardware parallelism. Juggle shows performance improvements of up to 80 % over static load balancing for oversubscribed UPC, OpenMP, and pthreads benchmarks. We also show that Juggle is effective in unpredictable, multiprogrammed environments, with up to a 50 % performance improvement over the Linux load balancer and a 25 % reduction in performance variation. We analyze the impact of Juggle on parallel applications and derive lower bounds and approximations for thread completion times. We show that results from Juggle closely match theoretical predictions across a variety of architectures, including NUMA and hyper-threaded systems.
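The mismatch between thread count and hardware parallelism described above can be illustrated with a small back-of-the-envelope model. This is a minimal sketch under stated assumptions, not the lower bound derived in the paper: it assumes equal work per thread (perfect intrinsic balance), fair time-sharing on each processor, a single barrier at the end of the phase, and zero migration cost. The function names `static_time`, `ideal_balanced_time`, and `ideal_speedup` are hypothetical and chosen only for this illustration.

```python
import math


def static_time(num_threads: int, num_procs: int, work: float = 1.0) -> float:
    """Barrier-to-barrier time when threads are pinned evenly to processors.

    Assumption: the most loaded processor time-shares ceil(N/M) threads
    fairly, and the phase ends only when those threads reach the barrier.
    """
    return math.ceil(num_threads / num_procs) * work


def ideal_balanced_time(num_threads: int, num_procs: int, work: float = 1.0) -> float:
    """Idealized time if every thread received an equal share of the machine.

    Assumption: migration is free. A single thread cannot use more than
    one processor at a time, hence the max with 1.
    """
    return max(1.0, num_threads / num_procs) * work


def ideal_speedup(num_threads: int, num_procs: int) -> float:
    """Ratio of pinned time to idealized balanced time (1.0 means no gain)."""
    return static_time(num_threads, num_procs) / ideal_balanced_time(
        num_threads, num_procs
    )


if __name__ == "__main__":
    # Hypothetical thread/processor counts, chosen only for illustration.
    for n, m in [(3, 2), (9, 8), (16, 8)]:
        print(n, m, static_time(n, m), ideal_balanced_time(n, m), ideal_speedup(n, m))
```

With 3 threads on 2 processors, the pinned case takes 2.0 time units in this model, because one processor hosts two threads, while the idealized balanced case takes 1.5. When the thread count is a multiple of the processor count (for example 16 on 8), the two times coincide and continual migration has nothing to gain under these assumptions.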

Notes

We refer to a single processing element, whether it is a core or a hardware-thread, as a processor.
http://newmexicoconsortium.org/usrc/usrc-publication-pdfs/REDfish.pdf.
This is a simplification. Load balancing in Linux is also dependent on the memory hierarchy, through scheduling domains.
UPC 2.9.3 with NAS 2.4, OMP Intel 11.0 Fortran with NAS 3.3, available at http://www.nas.nasa.gov/Resources/Software/npb.html.
We observed similar results on Tigerton and Nehalem.
We have obtained good results with proactive load balancing on the MPI versions of the benchmark in previous work [9], although we do not include the results in this paper.

References

1. Blumofe, R.D., Papadopoulos, D.: The performance of work stealing in multiprogrammed environments. ACM SIGMETRICS Perform. Eval. Rev. 26(1), 266–267 (1998)
2. Boneti, C., Gioiosa, R., Cazorla, F.J., Corbalán, J., Labarta, J., Valero, M.: Balancing HPC applications through smart allocation of resources in MT processors. In: Proc. 22nd IEEE Int’l Symposium on Parallel and Distributed Processing, pp. 1–12 (2008)
3. Boneti, C., Gioiosa, R., Cazorla, F.J., Valero, M.: A dynamic scheduler for balancing HPC applications. In: Proc. 2008 ACM/IEEE Conference on Supercomputing, pp. 41:1–41:12 (2008)
4. Cedo, F., Cortes, A., Ripoll, A., Senar, M., Luque, E.: The convergence of realistic distributed load-balancing algorithms. Theory Comput. Syst. 41(4), 609–618 (2007)
5. Datta, K., Murphy, M., Volkov, V., Williams, S., Carter, J., Oliker, L., Patterson, D., Shalf, J., Yelick, K.: Stencil computation optimization and auto-tuning on state-of-the-art multicore architectures. In: Proc. 2008 ACM/IEEE Conference on Supercomputing, pp. 4:1–4:12 (2008)
6. Feitelson, D.G., Rudolph, L.: Gang scheduling performance benefits for fine-grain synchronization. J. Parallel Distrib. Comput. 16, 306–318 (1992)
7. Fonlupt, C., Marquet, P., Dekeyser, J.-L.: Data-parallel load balancing strategies. Parallel Comput. 24(11), 1665–1684 (1998)
8. Gupta, A., Tucker, A., Urushibara, S.: The impact of operating system scheduling policies and synchronization methods on performance of parallel applications. ACM SIGMETRICS Perform. Eval. Rev. 19(1) (1991)
9. Hofmeyr, S., Iancu, C., Blagojević, F.: Load balancing on speed. In: Proc. 15th ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming, pp. 147–158 (2010)
10. Hofmeyr, S., Colmenares, J.A., Iancu, C., Kubiatowicz, J.: Juggle: proactive load balancing on multicore computers. In: Proc. 20th ACM Int’l Symposium on High Performance and Distributed Computing, pp. 3–14 (2011)
11. Iancu, C., Hofmeyr, S., Blagojević, F., Zheng, Y.: Oversubscription on multicore processors. In: Proc. 2010 IEEE Int’l Symposium on Parallel and Distributed Processing, pp. 1–11 (2010)
12. Jones, T., Dawson, S., Neely, R., Tuel, W., Brenner, L., Fier, J., Blackmore, R., Caffrey, P., Maskell, B., Tomlinson, P., Roberts, M.: Improving the scalability of parallel jobs by adding parallel awareness to the operating system. In: Proc. 2003 ACM/IEEE Conference on Supercomputing, p. 10 (2003)
13. Khan, Z., Singh, R., Alam, J., Kumar, R.: Performance analysis of dynamic load balancing techniques for parallel and distributed systems. Int. J. Comput. Netw. Secur. 2, 2 (2010)
14. Kukanov, A., Voss, M.J.: The foundations for scalable multi-core software in Intel Threading Building Blocks. Intel Technol. J. 11(4) (2007)
15. Li, T., Baumberger, D., Hahn, S.: Efficient and scalable multiprocessor fair scheduling using distributed weighted round-robin. In: Proc. 14th ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming (2009)
16. Nishtala, R., Yelick, K.: Optimizing collective communication on multicores. In: Proc. 1st USENIX Workshop on Hot Topics in Parallelism (2009)
17. Olivier, S., Prins, J.: Scalable dynamic load balancing using UPC. In: Proc. 37th Int’l Conference on Parallel Processing, pp. 123–131 (2008)
18. Ousterhout, J.: Scheduling techniques for concurrent systems. In: Proc. 3rd Int’l Conference on Distributed Computing Systems, pp. 22–30 (1982)
19. Plastino, A., Ribeiro, C.C., Rodriguez, N.: Developing SPMD applications with load balancing. Parallel Comput. 29(6), 743–766 (2003)
20. Roberson, J.: ULE: A modern scheduler for FreeBSD. In: Proc. USENIX BSD Conference (BSDCON), pp. 17–28 (2003)
21. Sancho, J.C., Kerbyson, D.J., Lang, M.: Characterizing the impact of using spare-cores on application performance. In: Proc. 16th Int’l Euro-Par Conference on Parallel Processing, Part I. LNCS, vol. 6271, pp. 74–85 (2010)
22. Tsafrir, D., Etsion, Y., Feitelson, D.G., Kirkpatrick, S.: System noise, OS clock ticks, and fine-grained parallel applications. In: Proc. 19th ACM Annual Int’l Conference on Supercomputing (ICS), pp. 303–312 (2005)
23. Willebeek-LeMair, M., Reeves, A.: Strategies for dynamic load balancing on highly parallel computers. IEEE Trans. Parallel Distrib. Syst. 4(9) (1993)
24. Xu, C., Lau, F.C.: Load Balancing in Parallel Computers: Theory and Practice. Kluwer Academic, Dordrecht (1997)

Acknowledgements

The authors acknowledge the support of DOE Grant #DE-FG02-08ER25849. Juan Colmenares and John Kubiatowicz acknowledge support of Microsoft (Award #024263), Intel (Award #024894), matching U.C. Discovery funding (Award #DIG07-102270), and additional support from Par Lab affiliates National Instruments, NEC, Nokia, NVIDIA, Samsung, and Sun Microsystems. No part of this paper represents the views and opinions of the sponsors.

Author information

Authors and Affiliations

Lawrence Berkeley National Laboratory, Berkeley, CA, USA
Steven Hofmeyr & Costin Iancu
Parallel Computing Laboratory, UC Berkeley, Berkeley, CA, USA
Juan A. Colmenares & John Kubiatowicz

Corresponding author

Correspondence to Steven Hofmeyr.

About this article

Cite this article

Hofmeyr, S., Colmenares, J.A., Iancu, C. et al. Juggle: addressing extrinsic load imbalances in SPMD applications on multicore computers. Cluster Comput 16, 299–319 (2013). https://doi.org/10.1007/s10586-012-0204-0

Received: 12 September 2011
Accepted: 25 February 2012
Published: 14 April 2012
Issue Date: June 2013
DOI: https://doi.org/10.1007/s10586-012-0204-0
