Case Study of Scientific Data Processing on a Cloud Using Hadoop

Zhang, Chen; De Sterck, Hans; Aboulnaga, Ashraf; Djambazian, Haig; Sladek, Rob

doi:10.1007/978-3-642-12659-8_29

Part of the book series: Lecture Notes in Computer Science (LNTCS, volume 5976)

Included in the following conference series:

International Symposium on High Performance Computing Systems and Applications

Abstract

With the increasing popularity of cloud computing, Hadoop has become a widely used open-source cloud computing framework for large-scale data processing. However, few efforts have been made to demonstrate the applicability of Hadoop to real-world application scenarios outside server-side computations such as web indexing. In this paper, we use the Hadoop cloud computing framework to develop a user application for processing scientific data on clouds. We describe a simple extension to Hadoop’s MapReduce that allows it to handle scientific data processing problems with arbitrary input formats and explicit control over how the input is split. We use this approach to develop a Hadoop-based cloud computing application that processes sequences of microscope images of live cells, and we test its performance. We also discuss how the approach can be generalized to more complicated scientific data processing problems.
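
The idea of explicit control over input splitting can be illustrated with a minimal sketch. It is plain Python, not Hadoop code and not the authors' extension: the abstract does not say how the splits are chosen, so the rule used here (one split per image sequence, identified by the parent directory of each file) is an assumption, as are the names `sequence_of`, `make_splits`, `run_map_phase`, `count_frames`, and the sample file paths.

```python
from collections import defaultdict
from typing import Callable, Dict, Iterable, List, Tuple


def sequence_of(path: str) -> str:
    """Return the parent directory of a path, used here as the sequence id."""
    return path.rsplit("/", 1)[0] if "/" in path else ""


def make_splits(paths: Iterable[str]) -> List[List[str]]:
    """Group file paths into splits, one split per image sequence.

    Assumption: every frame of a sequence lives in the same directory,
    so grouping by parent directory keeps a sequence in a single split.
    """
    groups: Dict[str, List[str]] = defaultdict(list)
    for path in paths:
        groups[sequence_of(path)].append(path)
    # Sort sequences and frames so the result is deterministic.
    return [sorted(groups[key]) for key in sorted(groups)]


def run_map_phase(
    splits: List[List[str]],
    map_fn: Callable[[List[str]], List[Tuple[str, int]]],
) -> List[Tuple[str, int]]:
    """Call map_fn once per split, mimicking one map task per split."""
    results: List[Tuple[str, int]] = []
    for split in splits:
        results.extend(map_fn(split))
    return results


def count_frames(split: List[str]) -> List[Tuple[str, int]]:
    """Hypothetical map function: emit (sequence id, number of frames)."""
    return [(sequence_of(split[0]), len(split))]


if __name__ == "__main__":
    sample_paths = [
        "seqB/frame_000.tif",
        "seqA/frame_001.tif",
        "seqA/frame_000.tif",
    ]
    print(run_map_phase(make_splits(sample_paths), count_frames))
```

Because the split boundaries follow the sequences and not byte offsets, a map function that needs all frames of one sequence sees them together. In a real Hadoop job, a custom input format would play this role.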

Author information

Authors and Affiliations

David R. Cheriton School of Computer Science, University of Waterloo, Ontario, N2L 3G1, Canada
Chen Zhang & Ashraf Aboulnaga
Department of Applied Mathematics, University of Waterloo, Ontario, N2L 3G1, Canada
Hans De Sterck
McGill University and Genome Quebec Innovation Centre, Montreal, Quebec, H3A 1A4, Canada
Haig Djambazian & Rob Sladek

Editor information

Editors and Affiliations

Dept. of Psychology, Queen’s University, 62 Arch St, K7L 3N6, Kingston, Ontario, Canada
Douglas J. K. Mewhort
Dept. of Chemistry, Queen’s University, Chernoff Hall, K7L 3N6, Kingston, Ontario, Canada
Natalie M. Cann
University of Ottawa, Hagen Hall, 115 Séraphin-Marion, K1N 6N5, Ottawa, Ontario, Canada
Gary W. Slater
Oak Ridge National Laboratory, 1 Bethel Valley Road, Bldg. 5100, MS-6173, Oak Ridge, 37831-6173, TN, USA
Thomas J. Naughton

About this paper

Cite this paper

Zhang, C., De Sterck, H., Aboulnaga, A., Djambazian, H., Sladek, R. (2010). Case Study of Scientific Data Processing on a Cloud Using Hadoop. In: Mewhort, D.J.K., Cann, N.M., Slater, G.W., Naughton, T.J. (eds) High Performance Computing Systems and Applications. HPCS 2009. Lecture Notes in Computer Science, vol 5976. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-642-12659-8_29

DOI: https://doi.org/10.1007/978-3-642-12659-8_29
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-642-12658-1
Online ISBN: 978-3-642-12659-8
eBook Packages: Computer Science, Computer Science (R0)
