DOI: 10.1145/2912152.2912154
research-article

Experiences with Performing MapReduce Analysis of Scientific Data on HPC Platforms

Published: 01 June 2016

Abstract

Growing interest in applying Big Data techniques to scientific data generated by HPC simulations has led to the question of whether such analysis is achievable on the same HPC platform and, if so, what performance can be obtained. The motivation is twofold: scientific datasets are often very large and would take a long time to transfer to external Big Data clusters; furthermore, the ability to analyze data live, as it is being generated on the HPC platform, can be crucial to many scientific applications. Using as a case study a Hadoop-based application that analyzes Molecular Dynamics simulation data on the same HPC platform on which the data was produced, we present our experiences with performing Big Data analysis on an HPC system. This work also describes the challenges of performing Hadoop-based computations on scientific data on HPC platforms: data storage, data formats, ingesting data into Hadoop, and optimizing the deployment to overcome the limitations of the HPC environment. In a first phase, our work shows that such an instantiation of Big Data analysis on an HPC system is both relevant and feasible; in a second phase, we greatly improve performance through efficient configuration of HPC resources and tuning of the application. Our findings can be shared as best practices and recommendations in the context of the convergence of the HPC and Big Data environments.

Cited By

  • (2020) Approaches of enhancing interoperations among high performance computing and big data analytics via augmentation. Cluster Computing 23:2 (953-988). DOI: 10.1007/s10586-019-02960-y. Online publication date: 1-Jun-2020

Published In

DIDC '16: Proceedings of the ACM International Workshop on Data-Intensive Distributed Computing
June 2016
62 pages
ISBN:9781450343527
DOI:10.1145/2912152

Publisher

Association for Computing Machinery

New York, NY, United States

Author Tags

  1. big data analysis
  2. hadoop mapreduce
  3. hpc systems
  4. post-simulation analysis
  5. scientific applications

Qualifiers

  • Research-article

Conference

HPDC'16

Acceptance Rates

Overall Acceptance Rate 7 of 12 submissions, 58%

Bibliometrics & Citations

Bibliometrics

Article Metrics

  • Downloads (Last 12 months): 5
  • Downloads (Last 6 weeks): 5
Reflects downloads up to 20 Jan 2025
