Abstract
As the volume and velocity of data generated by scientific experiments increase, analyzing those data increasingly requires HPC resources. Successful research in a growing number of scientific fields depends on the ability to analyze data rapidly. In many situations, scientists and engineers want quasi-instant feedback, so that results from one experiment can guide selection of the next or even improve the course of a single experiment. Such real-time requirements are hard to meet on current HPC systems, which are typically batch-scheduled under policies in which an arriving job runs immediately only if enough resources are available and is otherwise queued. To meet their requirements, real-time jobs must sometimes take priority over batch jobs submitted earlier. Accommodating more real-time jobs, however, degrades the performance of batch jobs, which may have to be preempted; the overhead of preempting and restarting batch jobs, in turn, reduces system utilization. Here we evaluate various scheduling schemes that support real-time jobs alongside traditional batch jobs. We perform simulation studies using trace logs of Mira, the IBM BG/Q system at Argonne National Laboratory, to quantify the impact of real-time jobs on batch job performance for various percentages of real-time jobs in the workload. We present new insights gained by grouping jobs into categories and studying the performance of each category. Our results show that real-time jobs in all categories can achieve an average slowdown below 1.5, and that most categories achieve an average slowdown close to 1, with at most a 20% increase in average slowdown for some categories of batch jobs when real-time jobs make up 20% or less of the workload.
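The slowdown figures quoted above follow the conventional definition: a job's slowdown is (wait time + run time) / run time, so a job that starts the moment it is submitted has slowdown 1. The sketch below illustrates computing average slowdown per job category from trace records; the `Job` fields and category labels are hypothetical, not the actual Mira trace schema or the authors' simulator.

```python
from dataclasses import dataclass

@dataclass
class Job:
    category: str   # e.g. "realtime" or "batch" (hypothetical labels)
    wait: float     # seconds spent queued before starting
    runtime: float  # seconds spent running

def slowdown(job: Job) -> float:
    # Slowdown = (wait + runtime) / runtime; equals 1.0 for a job
    # that runs immediately upon submission.
    return (job.wait + job.runtime) / job.runtime

def average_slowdown_by_category(jobs):
    # Accumulate (sum of slowdowns, count) per category, then average.
    totals = {}
    for job in jobs:
        s, n = totals.get(job.category, (0.0, 0))
        totals[job.category] = (s + slowdown(job), n + 1)
    return {cat: s / n for cat, (s, n) in totals.items()}

jobs = [
    Job("realtime", wait=0.0, runtime=600.0),    # slowdown 1.0
    Job("realtime", wait=300.0, runtime=600.0),  # slowdown 1.5
    Job("batch", wait=1200.0, runtime=1200.0),   # slowdown 2.0
]
print(average_slowdown_by_category(jobs))
# → {'realtime': 1.25, 'batch': 2.0}
```

In an evaluation like the one described here, the per-category averages for real-time jobs would be compared against those for batch jobs as the real-time fraction of the workload varies.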
Acknowledgments
This material is based upon work supported by the U.S. Department of Energy, Office of Science, Advanced Scientific Computing Research, under Contract DE-AC02-06CH11357. We thank the Argonne Leadership Computing Facility at Argonne National Laboratory for providing the Mira trace log used in this study.
© 2018 Springer International Publishing AG, part of Springer Nature
Cite this paper
Wang, D., Jung, ES., Kettimuthu, R., Foster, I., Foran, D.J., Parashar, M. (2018). Supporting Real-Time Jobs on the IBM Blue Gene/Q: Simulation-Based Study. In: Klusáček, D., Cirne, W., Desai, N. (eds) Job Scheduling Strategies for Parallel Processing. JSSPP 2017. Lecture Notes in Computer Science(), vol 10773. Springer, Cham. https://doi.org/10.1007/978-3-319-77398-8_5