DOI: 10.1145/3491418.3535136
research-article
Open Access

A Design Pattern for Recoverable Job Management

Published: 08 July 2022

ABSTRACT

Processing scientific workloads involves staging inputs, executing and monitoring jobs, archiving outputs, and doing all of this in a secure, repeatable way. Specialized middleware has been developed to automate this process in HPC, HTC, cloud, Kubernetes and other environments. This paper describes the Job Management (JM) design pattern used to enhance workload reliability, scalability and recovery. We discuss two implementations of JM in the Tapis Jobs service, both currently in production. We also discuss the reliability and performance of the system under load, such as when 10,000 jobs are submitted at once.


  • Published in

    PEARC '22: Practice and Experience in Advanced Research Computing
    July 2022
    455 pages
    ISBN:9781450391610
    DOI:10.1145/3491418

    Copyright © 2022 ACM

    Publisher

    Association for Computing Machinery

    New York, NY, United States

    Qualifiers

    • research-article
    • Research
    • Refereed limited

    Acceptance Rates

    Overall Acceptance Rate: 133 of 202 submissions, 66%
