Experience and Practice of Batch Scheduling on Leadership Supercomputers at Argonne

  • Conference paper
Job Scheduling Strategies for Parallel Processing (JSSPP 2017)

Part of the book series: Lecture Notes in Computer Science (LNTCS, volume 10773)

Abstract

The mission of the DOE Argonne Leadership Computing Facility (ALCF) is to accelerate major scientific discoveries and engineering breakthroughs for humanity by designing and providing world-leading computing facilities in partnership with the computational science community. The ALCF operates supercomputers that are generally among the Top 5 fastest machines in the world. Specifically, ALCF seeks science that is either too big to run anywhere else or would take so long elsewhere as to be impractical (i.e., “capability jobs”). At ALCF, batch scheduling plays a critical role in achieving a set of site goals within a set of constraints. While system utilization is an important goal at ALCF, its largest mission constraint is that extreme-scale parallel jobs take precedence. In this paper, we describe the specific scheduling goals and constraints, analyze the workload traces collected in 2013–2017 from the 48-rack petascale supercomputer Mira, and discuss the upcoming scheduling challenges at ALCF.

Acknowledgement

This research used resources of the Argonne Leadership Computing Facility, which is a DOE Office of Science User Facility supported under Contract DE-AC02-06CH11357. Zhiling Lan is supported in part by US National Science Foundation grants CNS-1320125 and CCF-1422009.

Author information

Corresponding author

Correspondence to Zhiling Lan.

Copyright information

© 2018 Springer International Publishing AG, part of Springer Nature

About this paper

Cite this paper

Allcock, W., Rich, P., Fan, Y., Lan, Z. (2018). Experience and Practice of Batch Scheduling on Leadership Supercomputers at Argonne. In: Klusáček, D., Cirne, W., Desai, N. (eds) Job Scheduling Strategies for Parallel Processing. JSSPP 2017. Lecture Notes in Computer Science, vol 10773. Springer, Cham. https://doi.org/10.1007/978-3-319-77398-8_1

  • DOI: https://doi.org/10.1007/978-3-319-77398-8_1

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-319-77397-1

  • Online ISBN: 978-3-319-77398-8

  • eBook Packages: Computer Science, Computer Science (R0)
