HPC in Big Data Age: An Evaluation Report for Java-Based Data-Intensive Applications Implemented with Hadoop and OpenMPI

Author:

Alexey Cheptsov

EuroMPI/ASIA '14: Proceedings of the 21st European MPI Users' Group Meeting

Pages 175 - 180

https://doi.org/10.1145/2642769.2642802

Published: 09 September 2014

Abstract

Current IT technologies have a strong need to scale high-performance analysis up to large datasets. The volume and complexity of data gathered in both public (e.g., the web) and enterprise (e.g., digitised internal document bases) domains have increased tremendously over the last few years, posing new challenges to providers of high-performance computing (HPC) infrastructures; the community recognises this as the Big Data problem. In contrast to typical HPC applications, Big Data applications are not oriented towards reaching the peak performance of the infrastructure. They therefore offer more opportunities for the "capacity" infrastructure model than for the "capability" one, which makes Cloud infrastructures preferable to HPC. However, given the steadily vanishing difference between these two infrastructure types, Cloud and HPC, it makes sense to investigate the ability of traditional HPC infrastructures to execute Big Data applications as well, despite their relatively poor efficiency compared with traditional, highly optimised HPC applications. This paper discusses the main state-of-the-art parallelisation techniques used in both the Cloud and HPC domains and evaluates them with an exemplary text-processing application on a testbed HPC cluster.

Recommendations

Big data analytics on traditional HPC infrastructure using two-level storage
DISCS '15: Proceedings of the 2015 International Workshop on Data-Intensive Scalable Computing Systems

Data-intensive computing has become one of the major workloads on traditional high-performance computing (HPC) clusters. Currently, deploying data-intensive computing software framework on HPC clusters still faces performance and scalability issues. In ...

Components and Rationale of a Big Data Toolkit Spanning HPC, Grid, Edge and Cloud Computing
UCC '17: Proceedings of the 10th International Conference on Utility and Cloud Computing

We look again at Big Data Programming environments such as Hadoop, Spark, Flink, Heron, Pregel; HPC concepts such as MPI and Asynchronous Many-Task runtimes and Cloud/Grid/Edge ideas such as event-driven computing, serverless computing, workflow, and ...

Aeromancer: A Workflow Manager for Large-Scale MapReduce-Based Scientific Workflows
TRUSTCOM '14: Proceedings of the 2014 IEEE 13th International Conference on Trust, Security and Privacy in Computing and Communications

The Hadoop framework has gained significant attention from the scientific community due to its applicability to large-scale data analysis in many areas. This analysis often involves multiple stages of processing, which in turn, constitutes a workflow. ...

Reviews

Reviewer: Long Wang

Traditionally, parallel computation has focused on high-performance computing (HPC), and many special platforms and systems have been built for this purpose. In recent years, a new parallel computing model has emerged for handling big data problems: MapReduce and its Hadoop implementation; it is becoming increasingly important. To address the rapidly increasing requirement of running Hadoop, it is desirable to use the special HPC platforms and systems that have already been built.

This paper develops a design for a Hadoop-over-MPI approach that runs Hadoop on traditional HPC platforms, and implements a MapReduce benchmark application, WordCount, based on the design. Then, the paper compares this Hadoop-over-MPI performance with the Hadoop implementation of WordCount. The paper demonstrates that the Hadoop-over-MPI implementation performs much better than the Hadoop implementation. According to Cheptsov, “the nominal performance of MPI is indeed higher than the [performance] of Hadoop, [and ...] the poor performance of Hadoop could be caused by [the] small size of [the] particular experiment setup.”

Besides the evaluation, the paper also gives a good introduction to the MapReduce and MPI technologies. It is a good read for those who are interested in how MapReduce and MPI work, how they differ from each other, and how their performances compare with each other. Any engineer, researcher, or scientist in information technology will find this paper interesting.

Online Computing Reviews Service
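For readers unfamiliar with the WordCount benchmark named in the review, the following is a minimal single-process sketch in plain Python of the two programming styles being compared. It is an illustration only, not the paper's Java code and not the Hadoop or Open MPI API: all function names (`tokenize`, `wordcount_mapreduce`, `wordcount_ranks`) are hypothetical, the "ranks" are simulated with list slices rather than real processes, and the final merge loop merely stands in for a reduction collective.

```python
from collections import Counter, defaultdict
import re


def tokenize(line):
    """Lower-case a line and return its alphabetic words."""
    return re.findall(r"[a-z]+", line.lower())


# MapReduce style: map -> shuffle (group by key) -> reduce.
def map_phase(lines):
    # Emit one (word, 1) pair per word occurrence.
    for line in lines:
        for word in tokenize(line):
            yield word, 1


def shuffle_phase(pairs):
    # Group emitted values by key, as a framework would between map and reduce.
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(groups):
    # Sum the grouped values for each word.
    return {word: sum(values) for word, values in groups.items()}


def wordcount_mapreduce(lines):
    return reduce_phase(shuffle_phase(map_phase(lines)))


# Message-passing style (simulated): each rank counts its own slice of the
# input, then the partial counts are merged in a final reduction step.
def wordcount_ranks(lines, num_ranks):
    partial_counts = []
    for rank in range(num_ranks):
        local = Counter()
        # Round-robin slicing stands in for distributing input to processes.
        for line in lines[rank::num_ranks]:
            local.update(tokenize(line))
        partial_counts.append(local)

    total = Counter()
    for local in partial_counts:  # stands in for a reduce collective
        total.update(local)
    return dict(total)


if __name__ == "__main__":
    text = ["big data on HPC", "HPC and big data", "data data"]
    print(wordcount_mapreduce(text))
    print(wordcount_ranks(text, num_ranks=2))
```

Both functions return the same counts for the same input. The difference the sketch tries to show is structural: the MapReduce style groups intermediate pairs by key before reducing, while the message-passing style aggregates locally per rank and combines the partial results at the end.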

Published In

EuroMPI/ASIA '14: Proceedings of the 21st European MPI Users' Group Meeting

September 2014

183 pages

ISBN:9781450328753

DOI:10.1145/2642769

General Chair: Jack Dongarra (University of Tennessee)
Program Chairs: Yutaka Ishikawa (University of Tokyo), Atsushi Hori (RIKEN AICS)

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

In-Cooperation

Kyoto University
University of Tokyo
University of Tsukuba

Publisher

Association for Computing Machinery

New York, NY, United States

Qualifiers

Research-article
Research
Refereed limited

Conference

EuroMPI/ASIA '14

EuroMPI/ASIA '14: 21st European MPI Users' Group Meeting

September 9 - 12, 2014

Kyoto, Japan

Acceptance Rates

EuroMPI/ASIA '14 Paper Acceptance Rate 18 of 39 submissions, 46%;

Overall Acceptance Rate 18 of 39 submissions, 46%

Bibliometrics & Citations

Bibliometrics

Article Metrics

Total Citations: 8
Total Downloads: 219
Downloads (last 12 months): 6
Downloads (last 6 weeks): 1

Reflects downloads up to 17 Jan 2025

Citations

Cited By

Orensa N, Eager D, Makaroff D, Onuţ I, Zulkernine F (2021). A unified extensible architecture for efficient distributed analytics. Proceedings of the 31st Annual International Conference on Computer Science and Software Engineering, pp. 123-132. Online publication date: 22-Nov-2021.
https://dl.acm.org/doi/10.5555/3507788.3507806

Ascension A, Arauzo-Bravo M (2021). BigMPI4py: Python module for parallelization of Big Data objects discloses germ layer specific DNA demethylation motifs. IEEE/ACM Transactions on Computational Biology and Bioinformatics, pp. 1-1. Online publication date: 2021.
https://doi.org/10.1109/TCBB.2020.3043979

Jambulingam V, Santhi V (2019). Knowledge Discovery and Big Data Analytics. Web Services, pp. 168-183. Online publication date: 2019.
https://doi.org/10.4018/978-1-5225-7501-6.ch011

Janczykowski M, Turek W, Malawski M, Byrski A (2019). Large-scale urban traffic simulation with Scala and high-performance computing system. Journal of Computational Science, 35, pp. 91-101. Online publication date: Jul-2019.
https://doi.org/10.1016/j.jocs.2019.06.002

Jambulingam V, Santhi V (2017). Knowledge Discovery and Big Data Analytics. Web Semantics for Textual and Visual Information Retrieval, pp. 144-164. Online publication date: 2017.
https://doi.org/10.4018/978-1-5225-2483-0.ch007

Yang X, Wu C, Lu K, Fang L, Zhang Y, Li S, Guo G, Du Y (2017). An Interface for Biomedical Big Data Processing on the Tianhe-2 Supercomputer. Molecules, 22:12 (2116). Online publication date: 1-Dec-2017.
https://doi.org/10.3390/molecules22122116

Addo-Tenkorang R, Helo P (2016). Big data applications in operations/supply-chain management. Computers and Industrial Engineering, 101:C, pp. 528-543. Online publication date: 1-Nov-2016.
https://dl.acm.org/doi/10.1016/j.cie.2016.09.023

Tsai C, Lai C, Chao H, Vasilakos A, Furht B, Villanustre F (2016). Big Data Analytics. Big Data Technologies and Applications, pp. 13-52. Online publication date: 17-Sep-2016.
https://doi.org/10.1007/978-3-319-44550-2_2
