Investigating Apache Hama: a bulk synchronous parallel computing framework

Siddique, Kamran; Akhtar, Zahid; Kim, Yangwoo; Jeong, Young-Sik; Yoon, Edward J.

doi:10.1007/s11227-017-1987-9

Published: 25 February 2017

Volume 73, pages 4190–4205 (2017)

The Journal of Supercomputing

Kamran Siddique¹,
Zahid Akhtar²,
Yangwoo Kim¹,
Young-Sik Jeong¹ &
Edward J. Yoon³

Abstract

The quantity of digital data is growing exponentially, and the task of efficiently processing such massive data is becoming increasingly challenging. Recently, academia and industry have recognized the limitations of the predominant Hadoop framework in several application domains, such as complex algorithmic computation, graph, and streaming data. Unfortunately, this widely known map-shuffle-reduce paradigm has become a bottleneck in addressing the challenges of big data trends. The demand for research and development of novel massive computing frameworks is increasing rapidly, and systematic illustration, analysis, and highlights of potential research areas are vital and very much in demand among researchers in the field. Therefore, we explore one of the emerging and promising distributed computing frameworks, Apache Hama. It is a top-level project under the Apache Software Foundation and a pure bulk synchronous parallel model for processing massive scientific computations, e.g., graph, matrix, and network algorithms. The objectives of this contribution are twofold. First, we outline the current state of the art, distinguish the challenges, and frame some research directions for researchers and application developers. Second, we present real-world use cases of Apache Hama to illustrate its potential, specifically to the industrial community.

References

Anagnostopoulos I, Zeadally S, Exposito E (2016) Handling big data: research challenges and future directions. J Supercomput 72(4):1494–1516. doi:10.1007/s11227-016-1677-z
Gebara FH, Hofstee HP, Nowka KJ (2015) Second-generation big data systems. IEEE Comput 48(1):36–41. doi:10.1109/MC.2015.25
Yu N, Yu Z, Li B, Gu F, Pan Y (2016) A comprehensive review of emerging computational methods for gene identification. J Inf Process Syst 12(1):1–34. doi:10.3745/JIPS.04.0023
Kolici V, Herrero A, Xhafa F (2014) On the performance of oracle grid engine queuing system for computing intensive applications. J Inf Process Syst 10(4):491–502. doi:10.3745/JIPS.01.0004
Apache Hama. https://hama.apache.org/. Accessed 25 March 2016
Kalavri V, Vlassov V (2013) MapReduce limitations, optimizations and open issues. In: The IEEE 12th International Conference on Trust, Security and Privacy in Computing and Communications, pp 1031–1038
Fortune. http://fortune.com/2015/09/09/cloudera-spark-mapreduce/. Accessed 25 March 2016
InformationWeek. http://www.informationweek.com/cloud/software-as-a-service/google-i-o-hello-dataflow-goodbye-mapreduce/d/d-id/1278917. Accessed 25 March 2016
Elser B, Montresor A (2013) An evaluation study of BigData frameworks for graph processing. In: IEEE Big Data pp 60–67
Apache Software Foundation blogging in action. https://blogs.apache.org/Hama/. Accessed 10 January 2016
Mailing list archives. https://hama.apache.org/mail-lists.html. Accessed 10 January 2016
Zotero. https://www.zotero.org/. Accessed 15 October 2015
Friedman R, Portnoy A (2015) A generic decentralized trust management framework. Softw Pract Exp 45(4):435–454. doi:10.1002/spe.2226
Zhang X, Wang R, Chen X, Wang J, Lukasiewicz T, Han D (2015) Achieving up to zero communication delay in BSP based graph processing via vertex categorization. In: International Conference on Networking, Architecture, and Storage, IEEE, Boston, pp 112–121. doi:10.1109/NAS.2015.7255213
Ratnaparkhi AA, Pilli E, Joshi RC (2015) Scaling GMM expectation maximization algorithm using bulk synchronous parallel approach. In: International Conference on Green Computing and Internet of Things, IEEE, Noida, pp 558–562. doi:10.1109/ICGCIoT.2015.7380527
Zhou W, Han J, Gao Y, Xu Z (2016) An efficient graph data processing system for large-scale social network service applications. Concurr Comput 28(3):729–747. doi:10.1002/cpe.3393
Luo S, Liu L, Wang H, Wu B, Liu Y (2014) Implementation of a parallel graph partitioning algorithm to speed up BSP computing. In: The 11th International Conference on Fuzzy Systems and Knowledge Discovery. IEEE, China, pp 740–744
Chen R, Ding X, Wang P, Chen H, Zang B, Guan H (2014) Computation and communication efficient graph processing with distributed immutable view. In: The 23rd International ACM Symposium on High Performance Parallel and Distributed Computing. Vancouver, Canada, pp 215–226
McColl R, Ediger D, Poovey J, Campbell D, Bader DA (2014) A performance evaluation of open source graph databases. In: The Proceedings of the First Workshop on Parallel Programming for Analytics Applications. Orlando, Florida, pp 11–17
Wang Z, Bao Y, Gu Y, Leng F, Yu G, Deng C, Guo L (2013) A BSP based parallel iterative processing system with multiple partition strategies for big graphs. In: IEEE International Congress on Big Data, CA, pp 173–180
Ho LY, Li TH, Wu JJ, Liu P (2013) Kylin: an efficient and scalable graph data processing system. In: IEEE International Conference on Big Data, CA, USA, pp 193–198
Khayyat Z, Awara K, Alonazi A, Jamjoom H, Williams D, Kalnis P (2013) Mizan: a system for dynamic load balancing in large-scale graph processing. In: Proceedings of the 8th ACM European Conference on Computer Systems. Prague, Czech Republic, pp 169–182
Zhang J, Ge S (2012) A parallel algorithm to find overlapping community structure in directed and weighted complex networks. In: 2nd International Conference on Instrumentation and Measurement, Computer, Communication and Control, IEEE, Harbin City, Heilongjiang, China, pp 1561–1564. doi:10.1109/IMCCC.2012.364
Chen R, Weng X, He B, Yang M, Choi B, Li X (2012) Improving large graph processing on partitioned graphs in the cloud. In: ACM Symposium on Cloud Computing, San Jose, CA. doi:10.1145/2391229.2391232
Ting IH, Lin CH, Wang CS (2011) Constructing a cloud computing based social networks data warehousing and analyzing system. In: International Conference on Advances in Social Networks Analysis and Mining. IEEE, Kaohsiung, Taiwan, pp 735–740
Seo S, Yoon EJ, Kim J, Jin S, Kim JS, Maeng S (2010) HAMA: an efficient matrix computation with the MapReduce framework. In: Proceedings of the IEEE Second International Conference on Cloud Computing Technology and Science (CloudCom). Greece, Athens, pp 721–726
Valiant LG (1990) A bridging model for parallel computation. Commun ACM 33(8):103–111
Hama Graph Tutorial. https://hama.apache.org/hama_graph_tutorial.html. Accessed 10 January 2016
Apache Horn. http://horn.incubator.apache.org/index.html. Accessed 10 January 2016
Apache Hama Design Document V0.6. http://people.apache.org/~tjungblut/downloads/hamadocs/ApacheHamaDesign_06.pdf. Accessed 20 December 2015
Apache Hama Pipes Development Repository. https://github.com/millecker/hama-0.5.0-gpu. Accessed 10 January 2016
Golghate AA, Shende SW (2014) Parallel K-means clustering based on hadoop and hama. Int J Comput Technol 1(3):33–37
Li S, Xu B (2015) Performance comparison between hama and hadoop. Int J Database Theory Appl 8(3):77–84
Jin S, Yang S, Jia Y (2012) Optimization of task assignment strategy for map-reduce. In: 2nd International Conference on Computer Science and Network Technology. Changchun, China, pp 57–61
Module for Monte Carlo Pi. http://mathfaculty.fullerton.edu/mathews/n2003/montecarlopimod.html. Accessed 10 January 2016
Sogou Inc. https://www.sogou.com. Accessed 10 January 2016
KT Corporation. https://www.kt.com/eng. Accessed 10 January 2016
Samsung Electronics. https://www.samsung.com. Accessed 10 January 2016

Acknowledgements

This research was supported by the MSIP (Ministry of Science, ICT and Future Planning), Korea, under the University Information Technology Research Center support program (IITP-2016-R2720-16-0004 and IITP-2016-H8501-16-1015) supervised by the IITP (Institute for Information & Communications Technology Promotion).

Author information

Authors and Affiliations

Dongguk University, Seoul, South Korea
Kamran Siddique, Yangwoo Kim & Young-Sik Jeong
University of Quebec, Montreal, Canada
Zahid Akhtar
Samsung Electronics, Seoul, South Korea
Edward J. Yoon

Corresponding author

Correspondence to Yangwoo Kim.

About this article

Cite this article

Siddique, K., Akhtar, Z., Kim, Y. et al. Investigating Apache Hama: a bulk synchronous parallel computing framework. J Supercomput 73, 4190–4205 (2017). https://doi.org/10.1007/s11227-017-1987-9

Published: 25 February 2017
Issue Date: September 2017
DOI: https://doi.org/10.1007/s11227-017-1987-9
