Abstract
The quantity of digital data is growing exponentially, and the task to efficiently process such massive data is becoming increasingly challenging. Recently, academia and industry have recognized the limitations of the predominate Hadoop framework in several application domains, such as complex algorithmic computation, graph, and streaming data. Unfortunately, this widely known map-shuffle-reduce paradigm has become a bottleneck to address the challenges of big data trends. The demand for research and development of novel massive computing frameworks is increasing rapidly, and systematic illustration, analysis, and highlights of potential research areas are vital and very much in demand by the researchers in the field. Therefore, we explore one of the emerging and promising distributed computing frameworks, Apache Hama. This is a top level project under the Apache Software Foundation and a pure bulk synchronous parallel model for processing massive scientific computations, e.g. graph, matrix, and network algorithms. The objectives of this contribution are twofold. First, we outline the current state of the art, distinguish the challenges, and frame some research directions for researchers and application developers. Second, we present real-world use cases of Apache Hama to illustrate its potential specifically to the industrial community.




Similar content being viewed by others
References
Anagnostopoulos I, Zeadally S, Exposito E (2016) Handling big data: research challenges and future directions. J Supercomput 72(4):1494–1516. doi:10.1007/s11227-016-1677-z
Gebara FH, Hofstee HP, Nowka KJ (2015) Second-generation big data systems. IEEE Comput 48(1):36–41. doi:10.1109/MC.2015.25
Yu N, Yu Z, Li B, Gu F, Pan Y (2016) A comprehensive review of emerging computational methods for gene identification. J Inf Process Syst 12(1):1–34. doi:10.3745/JIPS.04.0023
Kolici V, Herrero A, Xhafa F (2014) On the performance of oracle grid engine queuing system for computing intensive applications. J Inf Process Syst 10(4):491–502. doi:10.3745/JIPS.01.0004
Apache Hama. https://hama.apache.org/. Accessed 25 March 2016
Kalavri V, Vlassov V (2013) MapReduce limitations, optimizations and open issues. In: The IEEE 12th International Conference on Trust, Security and Privacy in Computing and Communications, pp 1031–1038
Fortune. http://fortune.com/2015/09/09/cloudera-spark-mapreduce/. Accessed 25 March 2016
InformationWeek. http://www.informationweek.com/cloud/software-as-a-service/google-i-o-hello-dataflow-goodbye-mapreduce/d/d-id/1278917. Accessed 25 March 2016
Elser B, Montresor A (2013) An evaluation study of BigData frameworks for graph processing. In: IEEE Big Data pp 60–67
Apache Apache Software Foundation blogging in action. https://blogs.apache.org/Hama/. Accessed 10 January 2016
Mailing list archives. https://hama.apache.org/mail-lists.html. Accessed 10 January 2016
Zotero. https://www.zotero.org/. Accessed 15 October 2015
Friedman R, Portnoy A (2015) A generic decentralized trust management framework. Softw Pract Exp 45(4):435–454. doi:10.1002/spe.2226
Zhang X, Wang R, Chen X, Wang J, Lukasiewicz T, Han D (2015) Achieving up to zero communication delay in BSP based graph processing via vertex categorization. In: International Conference on Networking, Architecture, and Storage, IEEE, Boston, pp 112–121. doi:10.1109/NAS.2015.7255213
Ratnaparkhi AA, Pilli E, Joshi RC (2015) Scaling GMM expectation maximization algorithm using bulk synchronous parallel approach. In: International Conference on Green Computing and Internet of Things, IEEE, Noida, pp 558–562. doi:10.1109/ICGCIoT.2015.7380527
Zhou W, Han J, Gao Y, Xu Z (2016) An efficient graph data processing system for large-scale social network service applications. Concurr Comput 28(3):729–747. doi:10.1002/cpe.3393
Luo S, Liu L, Wang H, Wu B, Liu Y (2014) Implementation of a parallel graph partitioning algorithm to speed up BSP computing. In: The 11th International Conference on Fuzzy Systems and Knowledge Discovery. IEEE, China, pp 740–744
Chen R, Ding X, Wang P, Chen H, Zang B, Guan H (2014) Computation and communication efficient graph processing with distributed immutable view. In: The 23rd International ACM Symposium on High Performance Parallel and Distributed Computing. Vancouver, Canada, pp 215–226
McColl R, Ediger D, Poovey J, Campbell D, Bader DA (2014) A performance evaluation of open source graph databases. In: The Proceedings of the First Workshop on Parallel Programming for Analytics Applications. Orlando, Florida, pp 11–17
Wang Z, Bao Y, Gu Y, Leng F, Yu G, Deng C, Guo L (2013) A BSP based parallel iterative processing system with multiple partition strategies for big graphs. In: IEEE International Congress on Big Data, CA, pp 173–180
Ho LY, Li TH, Wu JJ, Liu P (2013) Kylin: an efficient and scalable graph data processing system. In: IEEE International Conference on Big Data, CA, USA, pp 193–198
Khayyat Z, Awaraz K, Alonaziz A, Jamjoomy H, Williamsy D, Kalnis P (2013) Mizan: a system for dynamic load balancing in large-scale graph processing. In: Proceedings of the 8th ACM European Conference on Computer Systems. Czech Republic, Prague, pp 169–182
Zhang J, Ge S (2012) A parallel algorithm to find overlapping community structure in directed and weighted complex networks. In: 2nd International Conference on Instrumentation and Measurement, Computer, Communication and Control, IEEE, Harbin City, Heilongjiang, China, pp 1561–1564. doi:10.1109/IMCCC.2012.364
Chen R, Weng X, He B, Yang M, Choi B, Li X (2012) Improving large graph processing on partitioned graphs in the cloud. In: ACM Symposium on Cloud Computing, San Jose, CA. doi:10.1145/2391229.2391232
Ting IH, Lin CH, Wang CS (2011) Constructing a cloud computing based social networks data warehousing and analyzing system. In: International Conference on Advances in Social Networks Analysis and Mining. IEEE, Kaohsiung, Taiwan, pp 735–740
Seo S, Yoon EJ, Kim J, Jin S, Kim JS, Maeng S (2010) HAMA: an efficient matrix computation with the MapReduce framework. In: Proceedings of the IEEE Second International Conference on Cloud Computing Technology and Science (CloudCom). Greece, Athens, pp 721–726
Valiant LG (1990) A bridging model for parallel computation. Commun ACM 33(8):103–111
Hama Graph Tutorial. https://hama.apache.org/hama_graph_tutorial.html. Accessed 10 January 2016
Apache Horn. http://horn.incubator.apache.org/index.html. Accessed 10 January 2016
Apache Hama Design Document V0.6. http://people.apache.org/~tjungblut/downloads/hamadocs/ApacheHamaDesign_06.pdf. Accessed 20 December 2015
Apache Hama Pipes Development Repository. https://github.com/millecker/hama-0.5.0-gpu. Accessed 10 January 2016
Golghate AA, Shende SW (2014) Parallel K-means clustering based on hadoop and hama. Int J Comput Technol 1(3):33–37
Li S, Xu B (2015) Performance comparison between hama and hadoop. Int J Database Theory Appl 8(3):77–84
Jin S, Yang S, Jia Y (2012) Optimization of task assignment strategy for map-reduce. In: 2\(^{nd}\) International Conference on Computer Science and Network Technology. Changchun, China, pp 57–61
Module for Monte Carlo Pi. http://mathfaculty.fullerton.edu/mathews/n2003/montecarlopimod.html. Accessed 10 January 2016
Sogou Inc. https://www.sogou.com. Accessed 10 January 2016
KT Corporation. https://www.kt.com/eng. Accessed 10 January 2016
Samsung Electronics. https://www.samsung.com. Accessed 10 January 2016
Acknowledgements
This research was supported by the MSIP (Ministry of Science, ICT and Future Planning), Korea, under the University Information Technology Research Center support program (IITP-2016-R2720-16-0004 and IITP-2016-H8501-16-1015) supervised by the IITP (Institute for Information & Communications Technology Promotion).
Author information
Authors and Affiliations
Corresponding author
Rights and permissions
About this article
Cite this article
Siddique, K., Akhtar, Z., Kim, Y. et al. Investigating Apache Hama: a bulk synchronous parallel computing framework. J Supercomput 73, 4190–4205 (2017). https://doi.org/10.1007/s11227-017-1987-9
Published:
Issue Date:
DOI: https://doi.org/10.1007/s11227-017-1987-9