DOI: 10.1145/2642769.2642802
research-article

HPC in Big Data Age: An Evaluation Report for Java-Based Data-Intensive Applications Implemented with Hadoop and OpenMPI

Published: 09 September 2014

Abstract

Current IT technologies have a strong need to scale high-performance analysis up to large-scale datasets. The volume and complexity of data gathered in both public (such as the web) and enterprise (e.g. digitised internal document bases) domains have increased tremendously over the last few years and have posed new challenges to providers of high-performance computing (HPC) infrastructures, a situation the community recognises as the Big Data problem. In contrast to typical HPC applications, Big Data applications are not oriented towards reaching the peak performance of the infrastructure and thus fit the "capacity" infrastructure model better than the "capability" one, which makes Cloud infrastructures preferable to HPC. However, as the difference between these two infrastructure types, Cloud and HPC, continues to vanish, it makes sense to investigate the ability of traditional HPC infrastructure to execute Big Data applications as well, despite their relatively poor efficiency compared with traditional, highly optimised HPC applications. This paper discusses the main state-of-the-art parallelisation techniques used in both the Cloud and HPC domains and evaluates them on an exemplary text-processing application on a testbed HPC cluster.


Reviews

Long Wang

Traditionally, parallel computation has focused on high-performance computing (HPC), and many special platforms and systems have been built for this purpose. In recent years, a new parallel computing model has emerged for handling big data problems: MapReduce and its Hadoop implementation; it is becoming increasingly important. To address the rapidly increasing requirement of running Hadoop, it is desirable to use the special HPC platforms and systems that have already been built. This paper develops a design for a Hadoop-over-MPI approach that runs Hadoop on traditional HPC platforms, and implements a MapReduce benchmark application, WordCount, based on the design. Then, the paper compares this Hadoop-over-MPI performance with the Hadoop implementation of WordCount. The paper demonstrates that the Hadoop-over-MPI implementation performs much better than the Hadoop implementation. According to Cheptsov, “the nominal performance of MPI is indeed higher than the [performance] of Hadoop, [and ...] the poor performance of Hadoop could be caused by [the] small size of [the] particular experiment setup.” Besides the evaluation, the paper also gives a good introduction to the MapReduce and MPI technologies. It is a good read for those who are interested in how MapReduce and MPI work, how they differ from each other, and how their performances compare with each other. Any engineer, researcher, or scientist in information technology will find this paper interesting.

Online Computing Reviews Service
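For readers unfamiliar with the WordCount benchmark mentioned in the review, the following is a minimal single-process sketch of its map, shuffle, and reduce phases. It is not the paper's Hadoop or Hadoop-over-MPI code: the function names (`map_phase`, `partition`, `shuffle`, `reduce_phase`, `word_count`) are hypothetical, workers are simulated as plain list entries, and the crc32-based partitioning is an assumption chosen only to keep the example deterministic. In an MPI setting, the shuffle step would typically be an exchange of pairs between ranks rather than an in-memory loop.

```python
from collections import Counter, defaultdict
import zlib


def map_phase(chunk):
    """Map: emit a (word, 1) pair for every whitespace-separated token."""
    return [(word.lower(), 1) for word in chunk.split()]


def partition(word, num_workers):
    """Pick the worker that owns a word, using a stable hash (crc32)."""
    return zlib.crc32(word.encode("utf-8")) % num_workers


def shuffle(mapped_per_worker, num_workers):
    """Shuffle: route every pair to the worker that owns its key."""
    inboxes = [defaultdict(list) for _ in range(num_workers)]
    for pairs in mapped_per_worker:
        for word, count in pairs:
            inboxes[partition(word, num_workers)][word].append(count)
    return inboxes


def reduce_phase(inbox):
    """Reduce: sum the counts collected for each word."""
    return {word: sum(counts) for word, counts in inbox.items()}


def word_count(chunks):
    """Run map, shuffle and reduce with one simulated worker per chunk."""
    num_workers = len(chunks)
    mapped = [map_phase(chunk) for chunk in chunks]
    inboxes = shuffle(mapped, num_workers)
    totals = Counter()
    for inbox in inboxes:
        # Each word lives in exactly one inbox, so merging never double counts.
        totals.update(reduce_phase(inbox))
    return dict(totals)
```

Calling `word_count` with a list of text chunks, one per simulated worker, returns a dictionary of lower-cased words and their total counts.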

Published In

EuroMPI/ASIA '14: Proceedings of the 21st European MPI Users' Group Meeting
September 2014
183 pages
ISBN: 9781450328753
DOI: 10.1145/2642769
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

In-Cooperation

  • Kyoto University
  • University of Tokyo
  • University of Tsukuba

Publisher

Association for Computing Machinery

New York, NY, United States

Author Tags

  1. Cloud
  2. HPC
  3. Hadoop
  4. MPI
  5. MapReduce

Qualifiers

  • Research-article
  • Research
  • Refereed limited

Conference

EuroMPI/ASIA '14

Acceptance Rates

EuroMPI/ASIA '14 Paper Acceptance Rate: 18 of 39 submissions, 46%
Overall Acceptance Rate: 18 of 39 submissions, 46%

Bibliometrics & Citations

Bibliometrics

Article Metrics

  • Downloads (last 12 months): 6
  • Downloads (last 6 weeks): 1
Reflects downloads up to 17 Jan 2025

Citations

Cited By

  • (2021) A unified extensible architecture for efficient distributed analytics. Proceedings of the 31st Annual International Conference on Computer Science and Software Engineering, pp. 123-132. DOI: 10.5555/3507788.3507806. Online publication date: 22-Nov-2021
  • (2021) BigMPI4py: Python module for parallelization of Big Data objects discloses germ layer specific DNA demethylation motifs. IEEE/ACM Transactions on Computational Biology and Bioinformatics, pp. 1-1. DOI: 10.1109/TCBB.2020.3043979. Online publication date: 2021
  • (2019) Knowledge Discovery and Big Data Analytics. Web Services, pp. 168-183. DOI: 10.4018/978-1-5225-7501-6.ch011. Online publication date: 2019
  • (2019) Large-scale urban traffic simulation with Scala and high-performance computing system. Journal of Computational Science, 35, pp. 91-101. DOI: 10.1016/j.jocs.2019.06.002. Online publication date: Jul-2019
  • (2017) Knowledge Discovery and Big Data Analytics. Web Semantics for Textual and Visual Information Retrieval, pp. 144-164. DOI: 10.4018/978-1-5225-2483-0.ch007. Online publication date: 2017
  • (2017) An Interface for Biomedical Big Data Processing on the Tianhe-2 Supercomputer. Molecules, 22:12, p. 2116. DOI: 10.3390/molecules22122116. Online publication date: 1-Dec-2017
  • (2016) Big data applications in operations/supply-chain management. Computers and Industrial Engineering, 101:C, pp. 528-543. DOI: 10.1016/j.cie.2016.09.023. Online publication date: 1-Nov-2016
  • (2016) Big Data Analytics. Big Data Technologies and Applications, pp. 13-52. DOI: 10.1007/978-3-319-44550-2_2. Online publication date: 17-Sep-2016
