DOI: 10.1145/1341811.1341856

Powering statistical genetics with the grid: using GridWay to automate R-based workflows

Published: 29 January 2008

Abstract

Many computationally intensive workflows consist of the same algorithm applied to many data sets. For example, it is good practice in statistical genetics to assess the validity of a method by simulating thousands of data sets with known properties. Further, each simulation may involve permutation tests that require repeating the analysis thousands of times per data set. The overall throughput of such a workflow can be improved simply by increasing the number of computations that run simultaneously. High-performance compute clusters have significantly improved the ability to run many such computations simultaneously and have shown that these workflows adapt readily to ever-increasing processing capacity.
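The workflow shape described above can be sketched as follows. This is a minimal illustration, not the authors' R code: the test statistic (a difference in group means), the Gaussian simulation, and all function names are assumptions chosen for brevity. The point is that every call to `run_task` is independent of every other, so the work can be spread over as many processors as are available.

```python
import random
import statistics


def permutation_p_value(group_a, group_b, n_permutations, rng):
    """Two-sided permutation test on the difference in group means."""
    observed = abs(statistics.fmean(group_a) - statistics.fmean(group_b))
    pooled = list(group_a) + list(group_b)
    n_a = len(group_a)
    extreme = 0
    for _ in range(n_permutations):
        # Reassign group labels at random and recompute the statistic.
        rng.shuffle(pooled)
        diff = abs(statistics.fmean(pooled[:n_a]) - statistics.fmean(pooled[n_a:]))
        if diff >= observed:
            extreme += 1
    # Add-one correction keeps the estimated p-value strictly positive.
    return (extreme + 1) / (n_permutations + 1)


def simulate_dataset(rng, n_per_group=20, effect=0.0):
    """Hypothetical simulation: two groups of Gaussian observations."""
    group_a = [rng.gauss(0.0, 1.0) for _ in range(n_per_group)]
    group_b = [rng.gauss(effect, 1.0) for _ in range(n_per_group)]
    return group_a, group_b


def run_task(seed, n_permutations=200):
    """One independent unit of work: simulate a data set, then test it."""
    rng = random.Random(seed)
    group_a, group_b = simulate_dataset(rng)
    return permutation_p_value(group_a, group_b, n_permutations, rng)


def run_workflow(n_datasets, n_permutations=200):
    """Serial driver; each seed could equally be a separate cluster job."""
    return [run_task(seed, n_permutations) for seed in range(n_datasets)]
```

With thousands of data sets and thousands of permutations per data set, the serial loop in `run_workflow` is the part that a cluster or grid replaces with concurrent jobs.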
While clusters of increasing size can be constructed to improve throughput, additional hardware acquisitions impose increasing financial and operating environment burdens. Leveraging multiple clusters in distributed operating centers can dramatically increase capacity while alleviating infrastructure growth burdens, but adds significant complexity in managing workflows across heterogeneous systems and administrative domains.
Grid computing can offer immediate benefits in this area of multi-cluster workflows by offering a consistent, full-featured, programmatic interface and a uniform identity infrastructure across cluster and administrative boundaries. The Globus Toolkit, in combination with the GridWay meta-scheduler, makes it more practical to leverage large, geographically distributed, multi-cluster collections and to orchestrate workflows for significant gains in throughput.
Researchers in the field of biostatistics at UAB are heavy users of the R statistical and graphical software environment, which is available on multiple clusters on campus. Their R-based statistical analysis workflow falls into the category of applications described above. Large numbers of R jobs are currently managed by manually dividing the workload across clusters based on resource availability. This is clearly cumbersome for the end user to manage and unlikely to result in the optimal division of labor among available resources.
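To make the manual approach concrete, the sketch below divides a batch of jobs across clusters in proportion to the number of free slots each one reports. It is an assumed model of the manual step, not a description of the procedure used at UAB; the cluster names and slot counts in the usage note are hypothetical.

```python
def split_jobs(total_jobs, free_slots):
    """Divide total_jobs across clusters in proportion to their free slots."""
    total_slots = sum(free_slots.values())
    if total_slots <= 0:
        raise ValueError("no free slots available")
    # Ideal (fractional) share for each cluster.
    shares = {name: total_jobs * slots / total_slots
              for name, slots in free_slots.items()}
    # Start from the rounded-down share.
    allocation = {name: int(share) for name, share in shares.items()}
    leftover = total_jobs - sum(allocation.values())
    # Give the remaining jobs to the clusters with the largest fractional part.
    by_remainder = sorted(shares,
                          key=lambda name: shares[name] - allocation[name],
                          reverse=True)
    for name in by_remainder[:leftover]:
        allocation[name] += 1
    return allocation
```

For example, `split_jobs(1000, {"cluster_a": 64, "cluster_b": 128, "cluster_c": 32})` assigns 286, 571, and 143 jobs respectively. The split is fixed at submission time, so it cannot react when one cluster becomes busy or another frees up, which is one reason a static manual division is unlikely to be optimal.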
Our effort to grid-enable R uses the GridWay meta-scheduler to access existing resources through the Globus-based campus grid platform and to improve workflow management across this set of compute resources, optimizing throughput. Because GridWay schedules jobs on multiple clusters using the uniform interface of the Globus Toolkit, this solution promises to transparently increase the throughput of the R workflow with the simple inclusion of additional compute resources. Our plans include adding a large shared-memory compute resource from the state supercomputing center and other clusters through collaborations with partners in SURAgrid, a regional grid infrastructure.
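The throughput argument can be illustrated with a toy dispatch model. The sketch below is not GridWay's scheduling policy and does not use any GridWay or Globus API; it simply sends each task to whichever slot becomes free first and reports the simulated wall-clock time. Cluster names are hypothetical, and queue waits, data staging, and differences in processor speed are ignored.

```python
import heapq


def makespan(task_durations, slots_per_cluster):
    """Simulated wall-clock time when each task goes to the earliest-free slot."""
    # One heap entry per slot: (time the slot becomes free, cluster, slot index).
    slots = [(0.0, name, index)
             for name, count in slots_per_cluster.items()
             for index in range(count)]
    if not slots:
        raise ValueError("at least one slot is required")
    heapq.heapify(slots)
    finish = 0.0
    for duration in task_durations:
        free_at, name, index = heapq.heappop(slots)
        done = free_at + duration
        finish = max(finish, done)
        # The slot is busy until this task completes.
        heapq.heappush(slots, (done, name, index))
    return finish
```

Under these assumptions, 100 tasks of one time unit each take 10 units on a single 10-slot cluster and 5 units once a second 10-slot cluster is added, with no change to the tasks themselves.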
There are many dimensions to "grid-enabling" applications, including complex re-engineering of algorithms. It is advantageous, however, to take a step-wise approach to grid adoption that first maximizes workflow throughput by leveraging the most broadly available, commodity infrastructures. By concentrating on throughput gains first, existing infrastructure investments can be maximized and time-consuming algorithmic redesigns can be delayed until more capable infrastructures exist to address coordination and latency considerations. This presentation will focus on this first step of grid-enabling R and detail experiences and performance gains.

Published In

MG '08: Proceedings of the 15th ACM Mardi Gras conference: From lightweight mash-ups to lambda grids: Understanding the spectrum of distributed computing requirements, applications, tools, infrastructures, interoperability, and the incremental adoption of key capabilities
January 2008
178 pages
ISBN:9781595938350
DOI:10.1145/1341811
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

Sponsors

  • National e-Science Institute (Edinburgh, UK)
  • Louisiana State University (USA)

Publisher

Association for Computing Machinery

New York, NY, United States

Qualifiers

  • Abstract

Conference

Mardi Gras'08
Mardi Gras'08: 15th Mardi Gras Conference on Distributed Applications
January 29 - February 3, 2008
Baton Rouge, Louisiana, USA
