Job and data clustering for aggregate use of multiple production cyberinfrastructures

Published: 19 June 2012

Abstract

In this paper, we address the challenge of reducing the time-to-solution of the data-intensive earthquake simulation workflow "CyberShake" by supplementing the high-performance computing (HPC) resources on which it typically runs with distributed, heterogeneous resources that can be obtained opportunistically from grids and clouds. We seek to minimize time-to-solution by maximizing the amount of work that can be done efficiently on the distributed resources. We identify data movement as the main bottleneck in effectively utilizing the combined local and distributed resources. We address this by analyzing the I/O characteristics of the application, the processor acquisition rate (from a pilot-job service), and the data movement throughput of the infrastructure. With these factors in mind, we explore a combination of strategies including partitioning of computation (over HPC and distributed resources) and job clustering.
We validate our approach with a theoretical study and with preliminary measurements on the Ranger HPC system and distributed Open Science Grid resources. More complete performance results will be presented in the final submission of this paper.
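
The following is a minimal, illustrative sketch (not taken from the paper) of the kind of back-of-the-envelope model the abstract describes: given assumed values for per-task compute time, per-task input size, data-movement throughput to the distributed resources, per-job submission overhead, and the rate at which a pilot-job service acquires remote cores, it searches over how many tasks to offload and how many tasks to cluster into each remote job. All function names, parameters, and numbers below are assumptions for illustration only, not the authors' model or measurements.

    # Illustrative sketch in Python; every name and number here is assumed.

    def remote_makespan(n_jobs, cluster_size, compute_s, input_mb,
                        throughput_mb_s, acquire_cores_per_s, max_cores,
                        per_job_overhead_s):
        """Rough makespan (seconds) of running n_jobs on distributed resources,
        with cluster_size tasks bundled into each remote job."""
        if n_jobs == 0:
            return 0.0
        bundles = -(-n_jobs // cluster_size)                  # ceil division
        stage_s = cluster_size * input_mb / throughput_mb_s   # input staging per bundle
        run_s = cluster_size * compute_s                      # compute per bundle
        bundle_s = per_job_overhead_s + stage_s + run_s
        # Concurrency is limited by how many remote cores the pilot-job
        # service can acquire while one bundle runs (capped at max_cores).
        cores = max(1, min(max_cores, int(acquire_cores_per_s * bundle_s)))
        waves = -(-bundles // cores)
        return waves * bundle_s

    def best_split(n_jobs, local_makespan, **remote):
        """Search over (tasks offloaded, cluster size); local and remote parts
        run concurrently, so the overall makespan is the slower of the two."""
        best = (float("inf"), 0, 1)
        step = max(1, n_jobs // 100)
        for offload in range(0, n_jobs + 1, step):
            for cluster in (1, 4, 16, 64):
                t = max(local_makespan(n_jobs - offload),
                        remote_makespan(offload, cluster, **remote))
                if t < best[0]:
                    best = (t, offload, cluster)
        return best   # (makespan_s, tasks_offloaded, cluster_size)

    if __name__ == "__main__":
        # Hypothetical workload: 10,000 tasks of 50 s each with 100 MB inputs,
        # a 512-core local HPC allocation, ~80 MB/s to remote sites,
        # and 30 s of per-job overhead on the distributed resources.
        local = lambda n: n * 50.0 / 512
        print(best_split(10_000, local,
                         compute_s=50, input_mb=100, throughput_mb_s=80,
                         acquire_cores_per_s=0.5, max_cores=1000,
                         per_job_overhead_s=30))

In this toy model, clustering amortizes the per-job overhead across many tasks, while the throughput and core-acquisition terms bound how much work is worth moving off the HPC system; the abstract indicates that the paper analyzes these quantities for CyberShake rather than assuming them.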



      Published In

      DIDC '12: Proceedings of the Fifth International Workshop on Data-Intensive Distributed Computing
      June 2012
      68 pages
      ISBN: 9781450313414
      DOI: 10.1145/2286996

      Publisher

      Association for Computing Machinery, New York, NY, United States

      Author Tags

      1. hpc
      2. implementation
      3. parallel
      4. scec
      5. scripting
      6. swift

      Qualifiers

      • Research-article

      Conference

      HPDC'12

      Acceptance Rates

      Overall Acceptance Rate: 7 of 12 submissions, 58%
