
Supporting Real-Time Jobs on the IBM Blue Gene/Q: Simulation-Based Study

  • Conference paper

Part of the book series: Lecture Notes in Computer Science (LNTCS, volume 10773)

Abstract

As the volume and velocity of data generated by scientific experiments increase, analyzing those data inevitably requires HPC resources. Successful research in a growing number of scientific fields depends on the ability to analyze data rapidly. In many situations, scientists and engineers want near-instant feedback, so that results from one experiment can guide selection of the next or even improve the course of a single experiment. Such real-time requirements are hard to meet on current HPC systems, which are typically batch-scheduled under policies in which an arriving job runs immediately only if enough resources are available and is otherwise queued. To meet their requirements, real-time jobs should sometimes receive higher priority than batch jobs submitted earlier. Accommodating more real-time jobs, however, degrades the performance of batch jobs, which may have to be preempted, and the overhead of preempting and restarting batch jobs in turn reduces system utilization. Here we evaluate various scheduling schemes that support real-time jobs alongside traditional batch jobs. We perform simulation studies using trace logs of Mira, the IBM BG/Q system at Argonne National Laboratory, to quantify the impact of real-time jobs on batch job performance for various percentages of real-time jobs in the workload. We present new insights gained by grouping the jobs into categories and studying the performance of each category. Our results show that real-time jobs in all categories can achieve an average slowdown below 1.5, that most categories achieve an average slowdown close to 1, and that with 20% or fewer real-time jobs in the workload the average slowdown of batch jobs increases by at most 20%, and only for some categories.
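
The abstract's key quantities are a job slowdown metric and a preemption decision that makes room for arriving real-time work at the expense of running batch jobs. The Python sketch below is an illustration only, not the paper's Qsim/Cobalt scheduler: it computes slowdown as (wait + run) / run and uses an assumed greedy rule (smallest batch jobs first) to pick preemption victims. All field names, values, and the victim-selection policy are hypothetical.

```python
# Illustrative sketch only (not the paper's Qsim implementation).
# Shows (a) the slowdown metric for a completed job and (b) an assumed
# greedy rule for choosing batch jobs to preempt when a real-time job
# arrives and needs nodes. Job fields and the policy are hypothetical.

from dataclasses import dataclass
from typing import List


@dataclass
class Job:
    name: str
    nodes: int          # nodes requested
    runtime: float      # runtime in seconds
    wait: float = 0.0   # time spent queued, in seconds
    realtime: bool = False


def slowdown(job: Job) -> float:
    """Slowdown = (wait + run) / run; 1.0 means the job never waited."""
    return (job.wait + job.runtime) / job.runtime


def average_slowdown(jobs: List[Job]) -> float:
    return sum(slowdown(j) for j in jobs) / len(jobs)


def pick_preemption_victims(running: List[Job], needed_nodes: int) -> List[Job]:
    """Greedily preempt the smallest running batch jobs until enough
    nodes are freed for the real-time request; return [] if impossible.
    A production scheduler would also weigh checkpoint/restart cost."""
    victims, freed = [], 0
    for job in sorted((j for j in running if not j.realtime),
                      key=lambda j: j.nodes):
        if freed >= needed_nodes:
            break
        victims.append(job)
        freed += job.nodes
    return victims if freed >= needed_nodes else []


if __name__ == "__main__":
    batch = [Job("b1", nodes=512, runtime=3600, wait=300),
             Job("b2", nodes=1024, runtime=7200, wait=900)]
    rt_request = 256  # nodes needed by an arriving real-time job
    print("average batch slowdown:", round(average_slowdown(batch), 2))
    print("preemption victims:",
          [j.name for j in pick_preemption_victims(batch, rt_request)])
```

In practice, victim selection must also weigh the checkpoint/restart overhead of the preempted jobs, which is the source of the utilization loss discussed in the abstract.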



Acknowledgments

This material is based upon work supported by the U.S. Department of Energy, Office of Science, Advanced Scientific Computing Research, under Contract DE-AC02-06CH11357. We thank the Argonne Leadership Computing Facility at Argonne National Laboratory for providing the Mira trace log used in this study.

Author information

Corresponding author

Correspondence to Eun-Sung Jung.


Copyright information

© 2018 Springer International Publishing AG, part of Springer Nature

About this paper


Cite this paper

Wang, D., Jung, E.-S., Kettimuthu, R., Foster, I., Foran, D.J., Parashar, M. (2018). Supporting Real-Time Jobs on the IBM Blue Gene/Q: Simulation-Based Study. In: Klusáček, D., Cirne, W., Desai, N. (eds.) Job Scheduling Strategies for Parallel Processing. JSSPP 2017. Lecture Notes in Computer Science, vol. 10773. Springer, Cham. https://doi.org/10.1007/978-3-319-77398-8_5


  • DOI: https://doi.org/10.1007/978-3-319-77398-8_5

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-319-77397-1

  • Online ISBN: 978-3-319-77398-8

  • eBook Packages: Computer Science, Computer Science (R0)
