Powering statistical genetics with the grid: using GridWay to automate R-based workflows

Authors: John-Paul Robinson, Purushotham Bangalore, Jelai Wang, Tapan Mehta

MG '08: Proceedings of the 15th ACM Mardi Gras conference: From lightweight mash-ups to lambda grids: Understanding the spectrum of distributed computing requirements, applications, tools, infrastructures, interoperability, and the incremental adoption of key capabilities

Article No.: 40, Page 1

https://doi.org/10.1145/1341811.1341856

Published: 29 January 2008

Abstract

Many computationally intensive workflows consist of the same algorithm applied to many data sets. For example, it is good practice in statistical genetics to assess the validity of a method by simulating thousands of data sets with known properties. Further, each simulation may involve permutation tests that require repeating the analysis thousands of times per data set. The overall throughput of such a workflow can be improved simply by increasing the number of computations that run simultaneously. High-performance compute clusters have significantly improved the ability to run many such computations simultaneously and have shown that these workflows adapt well to ever-increasing processing capacity.
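
As a minimal sketch of why this kind of workflow parallelizes so readily, the Python example below treats each simulated data set, together with its permutation test, as one independent task. The test statistic (difference in group means), the Gaussian simulation, and all function names are illustrative assumptions; they are not the authors' R code or their statistical method.

```python
import random
from statistics import mean


def permutation_p_value(group_a, group_b, n_permutations=1000, seed=0):
    """Two-sided permutation test on the difference in group means."""
    rng = random.Random(seed)
    observed = abs(mean(group_a) - mean(group_b))
    pooled = list(group_a) + list(group_b)
    n_a = len(group_a)
    extreme = 0
    for _ in range(n_permutations):
        # Re-label the pooled observations at random and recompute the statistic.
        rng.shuffle(pooled)
        diff = abs(mean(pooled[:n_a]) - mean(pooled[n_a:]))
        if diff >= observed:
            extreme += 1
    # Add-one correction keeps the estimated p-value strictly positive.
    return (extreme + 1) / (n_permutations + 1)


def simulate_dataset(seed, n=20, effect=0.0):
    """Hypothetical simulation: two Gaussian groups with a known mean shift."""
    rng = random.Random(seed)
    group_a = [rng.gauss(0.0, 1.0) for _ in range(n)]
    group_b = [rng.gauss(effect, 1.0) for _ in range(n)]
    return group_a, group_b


def run_task(task_id, n_permutations=200):
    """One independent unit of work: simulate a data set, then test it."""
    group_a, group_b = simulate_dataset(seed=task_id)
    return permutation_p_value(group_a, group_b, n_permutations, seed=task_id)


if __name__ == "__main__":
    # Tasks share no state, so each call could run on a different node.
    p_values = [run_task(task_id) for task_id in range(10)]
    print(p_values)
```

Because `run_task` depends only on its `task_id`, the loop at the bottom can be replaced by any mechanism that runs the tasks concurrently, which is the property the workflow described above relies on.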

While clusters of increasing size can be constructed to improve throughput, additional hardware acquisitions impose increasing financial and operating environment burdens. Leveraging multiple clusters in distributed operating centers can dramatically increase capacity while alleviating infrastructure growth burdens, but adds significant complexity in managing workflows across heterogeneous systems and administrative domains.

Grid computing can offer immediate benefits for multi-cluster workflows by providing a consistent, full-featured, programmatic interface and a uniform identity infrastructure across cluster and administrative boundaries. The Globus Toolkit, in combination with the GridWay meta-scheduler, makes it more realistic to orchestrate workflows across large, geographically distributed, multi-cluster collections for significant gains in throughput.

Researchers in biostatistics at UAB are heavy users of the R statistical and graphical software environment, which is available on multiple clusters on campus. Their R-based statistical analysis workflow falls into the category of applications described above. Large numbers of R job runs are currently managed by manually dividing the workload across clusters based on resource availability. This is cumbersome for the end user and unlikely to result in the optimal division of labor among available resources.
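
To make the manual approach concrete, the sketch below divides a fixed number of jobs across clusters in proportion to a snapshot of their free slots. The cluster names, slot counts, and the proportional rule itself are assumptions chosen for illustration; the abstract does not say how the researchers actually split their workload.

```python
def split_jobs(total_jobs, free_slots):
    """Divide total_jobs across clusters in proportion to free slots.

    Uses the largest-remainder method so the shares sum to total_jobs.
    """
    capacity = sum(free_slots.values())
    if capacity <= 0:
        raise ValueError("no free slots on any cluster")
    # Ideal (fractional) share for each cluster.
    exact = {name: total_jobs * slots / capacity
             for name, slots in free_slots.items()}
    # Start from the rounded-down shares.
    shares = {name: int(value) for name, value in exact.items()}
    leftover = total_jobs - sum(shares.values())
    # Hand the remaining jobs to the clusters with the largest fractional parts.
    by_remainder = sorted(exact, key=lambda name: exact[name] - shares[name],
                          reverse=True)
    for name in by_remainder[:leftover]:
        shares[name] += 1
    return shares


if __name__ == "__main__":
    # Hypothetical snapshot of free slots on three campus clusters.
    snapshot = {"cluster_a": 64, "cluster_b": 128, "cluster_c": 32}
    print(split_jobs(1000, snapshot))
```

A split like this is fixed at submission time. If availability changes afterwards, some clusters sit idle while others still hold queued jobs, which is one way a static division can fall short of the optimum.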

Our effort to grid-enable R uses the GridWay meta-scheduler to access existing resources through the Globus-based campus grid platform and to improve workflow management across this set of compute resources, optimizing throughput. Because GridWay schedules jobs on multiple clusters through the uniform interface of the Globus Toolkit, this solution promises to transparently increase the throughput of the R workflow with the simple inclusion of additional compute resources. Our plans include adding a large shared-memory compute resource from the state supercomputing center, as well as other clusters through collaborations with partners in SURAgrid, a regional grid infrastructure.
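
The throughput argument can be illustrated with a toy simulation in which every job is handed to whichever slot becomes free first, regardless of which resource owns it. This is a simplified model written for this example only. It is not GridWay's scheduling policy, and it ignores data transfer, queue wait times, and differences in node speed. The resource names and slot counts are hypothetical.

```python
import heapq


def makespan(job_runtimes, slots_per_resource):
    """Simulate pull-based dispatch: each job goes to the slot that frees first.

    Returns the time at which the last job finishes.
    """
    # One heap entry per slot: (time the slot becomes free, resource name).
    free_at = [(0.0, name)
               for name, slots in slots_per_resource.items()
               for _ in range(slots)]
    if not free_at:
        raise ValueError("no slots available")
    heapq.heapify(free_at)
    finish = 0.0
    for runtime in job_runtimes:
        start, name = heapq.heappop(free_at)   # earliest available slot
        end = start + runtime
        finish = max(finish, end)
        heapq.heappush(free_at, (end, name))   # slot is busy until `end`
    return finish


if __name__ == "__main__":
    jobs = [1.0] * 12  # twelve identical jobs, one time unit each
    print(makespan(jobs, {"cluster_a": 2}))
    print(makespan(jobs, {"cluster_a": 2, "cluster_b": 4}))
```

In this model, adding a resource changes only the dictionary passed to `makespan`. The job list and the dispatch loop stay the same, which mirrors the idea that extra compute resources can be included without changing the workflow itself.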

There are many dimensions to "grid-enabling" applications, including complex re-engineering of algorithms. It is advantageous, however, to take a step-wise approach to grid adoption that first maximizes workflow throughput by leveraging the most broadly available, commodity infrastructures. By concentrating on throughput gains first, existing infrastructure investments can be maximized and time-consuming algorithmic redesigns can be delayed until more capable infrastructures exist to address coordination and latency considerations. This presentation will focus on this first step of grid-enabling R and detail experiences and performance gains.

Published In

January 2008, 178 pages

ISBN: 9781595938350

DOI: 10.1145/1341811

General Chair: Daniel S. Katz (LSU)

Program Chair: Craig Lee (Aerospace Corporation)

In-Cooperation

SIGAPP: ACM Special Interest Group on Applied Computing

Publisher

Association for Computing Machinery

New York, NY, United States

Qualifiers

Abstract

Conference

Mardi Gras'08: 15th Mardi Gras Conference on Distributed Applications

January 29 - February 3, 2008

Baton Rouge, Louisiana, USA