Abstract
Long-running multi-physics coupled parallel applications have gained prominence in recent years. Their high computational requirements and long simulation durations necessitate the use of multiple systems of a Grid for execution. In this paper, we present an adaptive middleware framework for executing long-running multi-physics coupled applications across multiple batch systems of a Grid. Apart from coordinating the execution of an application's component jobs on different batch systems, our framework automatically resubmits the jobs multiple times to the batch queues to continue and sustain long-running executions. As the set of active batch systems available for execution changes, the framework migrates and reschedules components using a robust rescheduling decision algorithm. We have used our framework to improve the application throughput of a prominent long-running multi-component climate modeling application, the Community Climate System Model (CCSM). Our real multi-site experiments with CCSM indicate that Grid executions can lead to improved application throughput for climate models.
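The resubmission idea described in the abstract can be illustrated with a minimal sketch. This is not the framework's actual algorithm, which the abstract does not detail. The sketch assumes that each batch system enforces a per-job wallclock limit, that the application can restart from a checkpoint after each job ends, and that progress is measured in wallclock hours regardless of which system runs the job. The names `BatchSystem`, `plan_segments`, and `outages`, and the "longest limit first" selection policy, are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class BatchSystem:
    """Hypothetical description of one batch system in the Grid."""
    name: str
    wallclock_limit_hours: float  # maximum run time of a single submitted job


def plan_segments(total_hours, systems, outages=None):
    """Split a long run into resubmitted jobs, one per batch-queue submission.

    outages maps a submission index to the set of system names that are
    inactive for that submission (an assumed input standing in for the
    changing set of active batch systems).
    Returns a list of (system name, hours run) tuples.
    """
    outages = outages or {}
    segments = []
    hours_done = 0.0
    while hours_done < total_hours:
        inactive = outages.get(len(segments), set())
        available = [s for s in systems if s.name not in inactive]
        if not available:
            raise RuntimeError("no active batch system for this submission")
        # Assumed policy: prefer the system with the longest wallclock limit.
        chosen = max(available, key=lambda s: s.wallclock_limit_hours)
        # Run until the limit expires or the simulation completes; the next
        # submission is assumed to restart from the checkpoint written here.
        run = min(chosen.wallclock_limit_hours, total_hours - hours_done)
        segments.append((chosen.name, run))
        hours_done += run
    return segments


systems = [BatchSystem("A", 4.0), BatchSystem("B", 6.0)]
# System B is inactive for the second submission, so the run moves to A.
plan = plan_segments(10.0, systems, outages={1: {"B"}})
print(plan)
```

In this toy model, a 10-hour run first occupies system B for its full 6-hour limit and then continues on system A once B becomes unavailable. A real rescheduling decision would also weigh queue wait times and migration costs, which this sketch ignores.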
Additional information
This work is supported partly by the Ministry of Information Technology, India, project ref. no. DIT/R&D/C-DAC/2(10)/2006 DT.30/04/07, and partly by the Department of Science and Technology, India, project ref. no. SR/S3/EECE/59/2005/8.6.06.
Cite this article
Murugavel, S.S., Vadhiyar, S.S. & Nanjundiah, R.S. Adaptive Executions of Multi-Physics Coupled Applications on Batch Grids. J Grid Computing 9, 455–478 (2011). https://doi.org/10.1007/s10723-011-9197-9
DOI: https://doi.org/10.1007/s10723-011-9197-9