Abstract
This paper proposes a Hadoop system that considers both the processing capacity of slave servers and the network delay in wide area networks to reduce job processing time. The task allocation scheme in the proposed Hadoop system divides each job into multiple tasks using suitable splitting ratios and then allocates the tasks to different slaves according to the computational capability of each server and the availability of network resources. We incorporate software-defined networking into the proposed Hadoop system to manage path computation elements and network resources. The performance of the proposed Hadoop system is experimentally evaluated on fourteen machines located in different parts of the globe using a scale-out approach. The scale-out experiment runs both a single job and multiple jobs on the proposed and conventional Hadoop systems. The testbed and simulation results indicate that the proposed Hadoop system is more effective than the conventional Hadoop system in terms of processing time.
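The idea of choosing splitting ratios from both server capacity and network delay can be illustrated with a minimal sketch. It is not the allocation scheme of the paper, whose formulation is not given in the abstract. It assumes a deliberately simple model in which slave i finishes at delay[i] + ratio[i] * work / capacity[i], and it picks ratios so that all slaves that receive work finish at the same time. The function name `splitting_ratios` and the parameters `work`, `capacity`, and `delay` are hypothetical.

```python
from typing import List


def splitting_ratios(work: float, capacity: List[float], delay: List[float]) -> List[float]:
    """Return the fraction of a job assigned to each slave.

    Assumed model (illustrative only): slave i finishes at
    delay[i] + ratio[i] * work / capacity[i], with work in abstract units,
    capacity in units per second, and delay in seconds.
    """
    if work <= 0 or not capacity or len(capacity) != len(delay):
        raise ValueError("need positive work and matching non-empty lists")
    if any(c <= 0 for c in capacity):
        raise ValueError("capacities must be positive")

    active = list(range(len(capacity)))
    while True:
        total_capacity = sum(capacity[i] for i in active)
        # Common finish time that makes the ratios of active slaves sum to 1.
        finish = (work + sum(capacity[i] * delay[i] for i in active)) / total_capacity
        # A slave whose delay alone reaches the finish time cannot contribute.
        usable = [i for i in active if delay[i] < finish]
        if len(usable) == len(active):
            break
        active = usable

    ratios = [0.0] * len(capacity)
    for i in active:
        ratios[i] = capacity[i] * (finish - delay[i]) / work
    return ratios
```

For instance, under this model two equally fast slaves share the job unequally when one of them sits behind a longer network path, and a slave whose delay exceeds the time the others need to finish the whole job receives nothing.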
Additional information
This work was supported in part by the National Institute of Information and Communications Technology (NICT), Japan, and by an NSF Grant (CNS-1405171), USA. Parts of this paper were presented at the 21st Asia-Pacific Conference on Communications (APCC 2015) and the 2016 IEEE 17th International Conference on High Performance Switching and Routing (HPSR 2016).
Cite this article
Matsuno, T., Chatterjee, B.C., Kitsuwan, N. et al. Designing a Hadoop system based on computational resources and network delay for wide area networks. Telecommun Syst 70, 13–25 (2019). https://doi.org/10.1007/s11235-018-0464-y
DOI: https://doi.org/10.1007/s11235-018-0464-y