Job and data clustering for aggregate use of multiple production cyberinfrastructures

Published: 19 June 2012

Abstract

In this paper, we address the challenge of reducing the time-to-solution of the data-intensive earthquake simulation workflow "CyberShake" by supplementing the high-performance computing (HPC) resources on which it typically runs with distributed, heterogeneous resources that can be obtained opportunistically from grids and clouds. We seek to minimize time-to-solution by maximizing the amount of work that can be done efficiently on the distributed resources. We identify data movement as the main bottleneck in effectively utilizing the combined local and distributed resources. We address this by analyzing the I/O characteristics of the application, the processor acquisition rate (from a pilot-job service), and the data movement throughput of the infrastructure. With these factors in mind, we explore a combination of strategies including partitioning of computation (over HPC and distributed resources) and job clustering.
We validate our approach with a theoretical study and with preliminary measurements on the Ranger HPC system and distributed Open Science Grid resources. More complete performance results will be presented in the final submission of this paper.
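
The following is a minimal, illustrative sketch (not taken from the paper) of the kind of back-of-the-envelope model the abstract describes: given assumed values for per-task compute time, per-task input size, data-movement throughput to the distributed resources, per-job submission overhead, and the rate at which a pilot-job service acquires remote cores, it searches over how many tasks to offload and how many tasks to cluster into each remote job. All function names, parameters, and numbers below are assumptions for illustration only, not the authors' model or measurements.

    # Illustrative sketch in Python; every name and number here is assumed.

    def remote_makespan(n_jobs, cluster_size, compute_s, input_mb,
                        throughput_mb_s, acquire_cores_per_s, max_cores,
                        per_job_overhead_s):
        """Rough makespan (seconds) of running n_jobs on distributed resources,
        with cluster_size tasks bundled into each remote job."""
        if n_jobs == 0:
            return 0.0
        bundles = -(-n_jobs // cluster_size)                  # ceil division
        stage_s = cluster_size * input_mb / throughput_mb_s   # input staging per bundle
        run_s = cluster_size * compute_s                      # compute per bundle
        bundle_s = per_job_overhead_s + stage_s + run_s
        # Concurrency is limited by how many remote cores the pilot-job
        # service can acquire while one bundle runs (capped at max_cores).
        cores = max(1, min(max_cores, int(acquire_cores_per_s * bundle_s)))
        waves = -(-bundles // cores)
        return waves * bundle_s

    def best_split(n_jobs, local_makespan, **remote):
        """Search over (tasks offloaded, cluster size); local and remote parts
        run concurrently, so the overall makespan is the slower of the two."""
        best = (float("inf"), 0, 1)
        step = max(1, n_jobs // 100)
        for offload in range(0, n_jobs + 1, step):
            for cluster in (1, 4, 16, 64):
                t = max(local_makespan(n_jobs - offload),
                        remote_makespan(offload, cluster, **remote))
                if t < best[0]:
                    best = (t, offload, cluster)
        return best   # (makespan_s, tasks_offloaded, cluster_size)

    if __name__ == "__main__":
        # Hypothetical workload: 10,000 tasks of 50 s each with 100 MB inputs,
        # a 512-core local HPC allocation, ~80 MB/s to remote sites,
        # and 30 s of per-job overhead on the distributed resources.
        local = lambda n: n * 50.0 / 512
        print(best_split(10_000, local,
                         compute_s=50, input_mb=100, throughput_mb_s=80,
                         acquire_cores_per_s=0.5, max_cores=1000,
                         per_job_overhead_s=30))

In this toy model, clustering amortizes the per-job overhead across many tasks, while the throughput and core-acquisition terms bound how much work is worth moving off the HPC system; the abstract indicates that the paper analyzes these quantities for CyberShake rather than assuming them.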



      Published In

      DIDC '12: Proceedings of the Fifth International Workshop on Data-Intensive Distributed Computing
      June 2012
      68 pages
      ISBN: 9781450313414
      DOI: 10.1145/2286996

      Publisher

      Association for Computing Machinery, New York, NY, United States

      Author Tags

      1. hpc
      2. implementation
      3. parallel
      4. scec
      5. scripting
      6. swift

      Qualifiers

      • Research-article

      Conference

      HPDC'12

      Acceptance Rates

      Overall Acceptance Rate: 7 of 12 submissions, 58%
