Abstract
Because of competition for cluster resources and the task scheduling policy, some map tasks are assigned to nodes that do not hold their input data, which causes significant data access delay. Data locality has therefore become one of the most critical factors affecting the performance of MapReduce clusters. Machines in MapReduce clusters have large memory capacities that are often underutilized, so prefetching input data into memory is an effective way to improve data locality. However, deciding what to prefetch and when to prefetch it still poses serious challenges to cluster designers. To use prefetching effectively, we have built HPSO (High Performance Scheduling Optimizer), a task scheduler based on a prefetching service that improves data locality for MapReduce jobs. The basic idea is to predict the most appropriate nodes to which future map tasks should be assigned and then preload their input data into memory without delaying the launch of new tasks. We have implemented HPSO in Hadoop-1.1.2. Experimental results show that the method reduces the number of map tasks that incur remote data access delay and improves the performance of Hadoop clusters.
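The predict-then-preload idea can be illustrated with a minimal sketch. It is not HPSO's implementation: the abstract does not describe the prediction method, so the sketch assumes the node whose running task has the smallest estimated remaining time will free a slot next. All names here (`Node`, `MapTask`, `next_free_node`, `plan_prefetch`) are hypothetical, and adding a block to an in-memory set stands in for actually loading data.

```python
from dataclasses import dataclass, field
from typing import List, Optional, Set, Tuple


@dataclass
class Node:
    name: str
    # Estimated seconds until each running map task on this node finishes
    # (assumed to be available; how HPSO estimates this is not stated).
    remaining: List[float]
    # Blocks already preloaded into this node's memory.
    cached_blocks: Set[str] = field(default_factory=set)


@dataclass
class MapTask:
    task_id: str
    block: str          # identifier of the input block
    replicas: Set[str]  # names of nodes that store a replica on disk


def next_free_node(nodes: List[Node]) -> Optional[Node]:
    """Predict the node whose slot frees up soonest (assumed heuristic)."""
    busy = [n for n in nodes if n.remaining]
    return min(busy, key=lambda n: min(n.remaining)) if busy else None


def is_local(task: MapTask, node: Node) -> bool:
    """A task is local if the node holds a replica or a prefetched copy."""
    return node.name in task.replicas or task.block in node.cached_blocks


def plan_prefetch(nodes: List[Node],
                  pending: List[MapTask]) -> Optional[Tuple[str, str]]:
    """Return (node name, task id) for one prefetch, or None if not needed."""
    node = next_free_node(nodes)
    if node is None or not pending:
        return None
    # If some pending task is already local to that node, nothing to prefetch.
    if any(is_local(t, node) for t in pending):
        return None
    task = pending[0]
    # Stand-in for loading the block into memory ahead of task launch.
    node.cached_blocks.add(task.block)
    return node.name, task.task_id
```

Calling `plan_prefetch` while tasks are still running models preloading that happens in the background, so the data is already in memory by the time the slot becomes free and the task launch itself is not delayed.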
Copyright information
© 2014 Springer International Publishing Switzerland
About this paper
Cite this paper
Sun, M., Zhuang, H., Zhou, X., Lu, K., Li, C. (2014). HPSO: Prefetching Based Scheduling to Improve Data Locality for MapReduce Clusters. In: Sun, Xh., et al. Algorithms and Architectures for Parallel Processing. ICA3PP 2014. Lecture Notes in Computer Science, vol 8631. Springer, Cham. https://doi.org/10.1007/978-3-319-11194-0_7
DOI: https://doi.org/10.1007/978-3-319-11194-0_7
Publisher Name: Springer, Cham
Print ISBN: 978-3-319-11193-3
Online ISBN: 978-3-319-11194-0
eBook Packages: Computer Science (R0)