DOI: 10.1145/3569951.3593605
research-article

Airavata Metascheduler: A Reliable, Fault Tolerant, and Resource-Aware Job Scheduling Service

Published: 10 September 2023

ABSTRACT

Software-as-a-service science gateways provide user interfaces and middleware for accessing scientific software deployed on remote high-performance computing resources and clusters. Selecting the resource to use for a particular job submission may be left to the user, who may need more information to make good choices when selecting from multiple options. To address this problem, we have designed and developed an extensible, scalable metascheduling system that can provide automated scheduling capabilities based on resource availability and other characteristics. We develop a system model based on queuing theory to guide our implementation and provide a basis for analysis. In particular, we derive an efficiency metric from these considerations. We implement the metascheduling system within the open-source Apache Airavata framework for science gateways as a supplemental service for guiding the job submission capabilities. We measure efficiency in representative scenarios, observing efficiencies of greater than 70% even in scenarios with high input rates and low job acceptance rates.

  • Published in

    PEARC '23: Practice and Experience in Advanced Research Computing
    July 2023
    519 pages
    ISBN: 9781450399852
    DOI: 10.1145/3569951

    Copyright © 2023 ACM

    Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than the author(s) must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from permissions@acm.org.

    Publisher

    Association for Computing Machinery

    New York, NY, United States

    Qualifiers

    • research-article
    • Research
    • Refereed limited

    Acceptance Rates

    Overall Acceptance Rate: 133 of 202 submissions, 66%
