ABSTRACT
Processing scientific workloads involves staging inputs, executing and monitoring jobs, archiving outputs, and doing all of this in a secure, repeatable way. Specialized middleware has been developed to automate this process in HPC, HTC, cloud, Kubernetes and other environments. This paper describes the Job Management (JM) design pattern used to enhance workload reliability, scalability and recovery. We discuss two implementations of JM in the Tapis Jobs service, both currently in production. We also discuss the reliability and performance of the system under load, such as when 10,000 jobs are submitted at once.