Experiences with Performing MapReduce Analysis of Scientific Data on HPC Platforms

Author: Diana Moise

DIDC '16: Proceedings of the ACM International Workshop on Data-Intensive Distributed Computing

Pages 11 - 18

https://doi.org/10.1145/2912152.2912154

Published: 01 June 2016

Abstract

The growing interest in applying Big Data techniques to scientific data generated by HPC simulations raises the question of whether this is achievable on the same HPC platform and, if so, what performance can be obtained on these systems. The motivation behind this approach is twofold: scientific datasets are often very large and would take a long time to transfer to external Big Data clusters; furthermore, the ability to perform live analysis on the data as it is being generated on the HPC platform can be crucial to many scientific applications. Using as a case study a Hadoop-based application that analyzes Molecular Dynamics simulation data on the same HPC platform on which it was produced, we present our experiences with performing Big Data analysis on an HPC system. This work also describes the challenges that arise when performing Hadoop-based computations on scientific data on HPC platforms: data storage, data formats, ingesting data into Hadoop, and optimizing the deployment to overcome the limitations of the HPC environment. Our work shows, in a first phase, that such an instantiation of Big Data analysis on an HPC system is both relevant and feasible; in a second phase, we greatly improve the performance through efficient configuration of HPC resources and tuning of the application. Our findings can be shared as best practices and recommendations in the context of the convergence of the HPC and Big Data environments.

Cited By

Pathak A., Pandey M., Rautaray S. (2020). Approaches of enhancing interoperations among high performance computing and big data analytics via augmentation. Cluster Computing 23:2 (953-988). DOI: 10.1007/s10586-019-02960-y. Online publication date: 1-Jun-2020.
https://dl.acm.org/doi/10.1007/s10586-019-02960-y

Published In

DIDC '16: Proceedings of the ACM International Workshop on Data-Intensive Distributed Computing

June 2016

62 pages

ISBN:9781450343527

DOI:10.1145/2912152

General Chairs: Esma Yildirim (Rutgers University, USA), Tevfik Kosar (University at Buffalo, USA)

Program Chair: Esma Yildirim (Rutgers University, USA)

Copyright © 2016 ACM.

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee.

Sponsors

University of Arizona
SIGARCH: ACM Special Interest Group on Computer Architecture

In-Cooperation

SIGHPC: ACM Special Interest Group on High Performance Computing

Publisher

Association for Computing Machinery

New York, NY, United States

Qualifiers

Research-article

Conference

HPDC'16

HPDC'16: The 25th International Symposium on High-Performance Parallel and Distributed Computing

June 1, 2016

Kyoto, Japan

Acceptance Rates

Overall Acceptance Rate 7 of 12 submissions, 58%
