Comparing the performance of clusters, Hadoop, and Active Disks on microarray correlation computations

Publisher: IEEE

Abstract:

Microarray-based comparative genomic hybridization (aCGH) offers an increasingly fine-grained method for detecting copy number variations in DNA. These copy number variations can directly influence the expression of the proteins encoded by the genes in question. A useful analysis of the data produced by these microarray experiments is pairwise correlation. However, the high resolution of today's microarray technology means that this analysis requires supercomputing-class computation and storage resources. This application is an exemplar of the class of data-intensive problems that require high-throughput I/O in order to be tractable. Although the performance of these types of applications on a cluster can be improved by parallelization, storage hardware and network limitations restrict the scalability of an I/O-bound application such as this. The Hadoop software framework is designed to enable data-intensive applications on cluster architectures, and offers significantly better scalability due to its distributed file system. However, a specialized architecture adhering to the Active Disk paradigm, in which compute power is placed close to the disk instead of across a network, can further improve performance. The Netezza Corporation's database systems are designed around the Active Disk approach, and offer tremendous gains over the traditional cluster architecture in implementing this application. We present methods and performance analyses of several implementations of this application: on a cluster, on a cluster with a parallel file system, with Hadoop on a cluster, and using a Netezza data warehouse appliance. Our results offer benchmarks for the performance of data-intensive applications within these distributed computing paradigms.
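
To illustrate the pairwise correlation analysis mentioned in the abstract, the following is a minimal single-machine sketch, not the authors' implementation. It assumes Pearson correlation (the abstract does not name the coefficient) and uses hypothetical names (`pearson`, `pairwise_correlations`, and the probe identifiers). With n probes there are n(n-1)/2 pairs, which is one reason the workload grows quickly with microarray resolution.

```python
import math
from itertools import combinations


def pearson(x, y):
    """Pearson correlation of two equal-length numeric sequences.

    Assumption: Pearson is used here for illustration only; the source
    does not state which correlation measure the paper computes.
    """
    n = len(x)
    if n != len(y) or n < 2:
        raise ValueError("sequences must have equal length of at least 2")
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    dx = [a - mean_x for a in x]
    dy = [b - mean_y for b in y]
    cov = sum(a * b for a, b in zip(dx, dy))
    var_x = sum(a * a for a in dx)
    var_y = sum(b * b for b in dy)
    if var_x == 0 or var_y == 0:
        # A constant series has no defined correlation.
        return float("nan")
    return cov / math.sqrt(var_x * var_y)


def pairwise_correlations(probes):
    """Correlate every unordered pair of probes.

    `probes` maps a (hypothetical) probe identifier to its measurements
    across samples. Returns a dict keyed by (probe_a, probe_b) tuples.
    """
    return {
        (a, b): pearson(probes[a], probes[b])
        for a, b in combinations(sorted(probes), 2)
    }


if __name__ == "__main__":
    # Toy data, invented purely for illustration.
    probes = {
        "p1": [1.0, 2.0, 3.0, 4.0],
        "p2": [2.0, 4.0, 6.0, 8.0],
        "p3": [4.0, 3.0, 2.0, 1.0],
    }
    for pair, r in pairwise_correlations(probes).items():
        print(pair, round(r, 3))
```

In a distributed setting, the pairs would be partitioned across workers; how each platform compared in the paper does that is not described in the abstract.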
Date of Conference: 16-19 December 2009
Date Added to IEEE Xplore: 18 March 2010
ISBN Information:
Print ISSN: 1094-7256
Conference Location: Kochi, India
