HPSO: Prefetching Based Scheduling to Improve Data Locality for MapReduce Clusters

Sun, Mingming; Zhuang, Hang; Zhou, Xuehai; Lu, Kun; Li, Changlong

doi:10.1007/978-3-319-11194-0_7

Part of the book series: Lecture Notes in Computer Science (LNTCS, volume 8631)

Included in the following conference series:

International Conference on Algorithms and Architectures for Parallel Processing

Abstract

Because of competition for cluster resources and the task scheduling policy, some map tasks are assigned to nodes that do not hold their input data, which causes significant data access delay. Data locality is therefore becoming one of the most critical factors affecting the performance of MapReduce clusters. Machines in MapReduce clusters have large memory capacities that are often underutilized, so prefetching input data into memory is an effective way to improve data locality. However, deciding what to prefetch and when to prefetch it still poses serious challenges to cluster designers. To use prefetching effectively, we have built HPSO (High Performance Scheduling Optimizer), a task scheduler based on a prefetching service that improves data locality for MapReduce jobs. The basic idea is to predict the most appropriate nodes to which future map tasks should be assigned and then preload the input data into memory on those nodes without delaying the launch of new tasks. We have implemented HPSO in Hadoop-1.1.2. Experimental results show that the method reduces the number of map tasks that incur remote data access delay and improves the performance of Hadoop clusters.
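
The abstract does not say how HPSO predicts target nodes, so the following Python sketch only illustrates the general idea of pairing future map tasks with nodes and preloading non-local input. It assumes, hypothetically, that the scheduler has an estimate of the remaining time of each running map task and treats the node whose slot frees soonest as the most appropriate one. The names `Node`, `MapTask`, and `plan_prefetch` are invented for illustration and are not part of HPSO or Hadoop.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    name: str
    # Assumed input: estimated seconds until each running map task finishes.
    remaining: list
    # Ids of blocks already preloaded into this node's memory.
    prefetched: set = field(default_factory=set)


@dataclass
class MapTask:
    task_id: str
    block: str           # id of the task's input block
    replicas: frozenset  # names of nodes that store the block on local disk


def plan_prefetch(nodes, pending):
    """Pair pending map tasks with the slots predicted to free up soonest.

    Returns (assignments, prefetches): a dict of task_id -> node name, and a
    list of (node name, block id) pairs to preload before the slot frees up.
    """
    # Each running task frees one slot; order slots by predicted free time.
    slots = sorted((t, n.name) for n in nodes for t in n.remaining)
    by_name = {n.name: n for n in nodes}
    todo = list(pending)
    assignments, prefetches = {}, []
    for _, name in slots:
        if not todo:
            break
        # Prefer a task whose input is already local to this node;
        # otherwise fall back to the first pending task.
        task = next((t for t in todo if name in t.replicas), todo[0])
        todo.remove(task)
        assignments[task.task_id] = name
        node = by_name[name]
        # Non-local input: schedule a preload so the task reads from memory.
        if name not in task.replicas and task.block not in node.prefetched:
            node.prefetched.add(task.block)
            prefetches.append((name, task.block))
    return assignments, prefetches


# Two nodes with one busy slot each; n2 is predicted to finish first.
nodes = [Node("n1", [5.0]), Node("n2", [2.0])]
pending = [
    MapTask("m1", "b1", frozenset({"n2"})),  # input is local to n2
    MapTask("m2", "b2", frozenset({"n3"})),  # input is on a node with no free slot
]
assignments, prefetches = plan_prefetch(nodes, pending)
print(assignments)  # {'m1': 'n2', 'm2': 'n1'}
print(prefetches)   # [('n1', 'b2')]
```

In this toy run, task m1 is placed on n2, where its block is already local, while m2 has no local slot available and is placed on n1, so block b2 is queued for preloading into n1's memory. The preload is planned while the current tasks are still running, which is how the sketch mirrors the goal of not delaying task launch.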

Author information

Authors and Affiliations

Computer Science, University of Science and Technology of China, Hefei, China
Mingming Sun, Hang Zhuang, Xuehai Zhou, Kun Lu & Changlong Li

Editor information

Editors and Affiliations

Department of Computer Science, Illinois Institute of Technology, 60616-3793, Chicago, IL, USA
Xian-he Sun
School of Computer Science and Technology, Dalian Maritime University, 1 Linghai Road, 116026, Dalian, China
Wenyu Qu
SEECS, University of Ottawa, 8, King Edward Ave, K1N 6N5, Ottawa, ON, Canada
Ivan Stojmenovic
Deakin University, 221 Burwood Highway, 3125, Burwood, VIC, Australia
Wanlei Zhou
Dalian Maritime University, No. 1 Linghai Road, Dalian, 116026, China
Zhiyang Li
BeiHang University, XueYuan Road No.37, HaiDian District, Beijing, China
Hua Guo
University of Bradford, BD7 1DP, Bradford, West Yorkshire, United Kingdom
Geyong Min
Dalian Maritime University, No. 1 Linghai Road, Dalian, 116026, China
Tingting Yang
Computer Network Information Center, Chinese Academy of Sciences, 100190, Beijing, China
Yulei Wu
Shandong University, 27 Shanda Nanlu, 250100, Jinan City, Shandong Province, China
Lei Liu

Cite this paper

Sun, M., Zhuang, H., Zhou, X., Lu, K., Li, C. (2014). HPSO: Prefetching Based Scheduling to Improve Data Locality for MapReduce Clusters. In: Sun, Xh., et al. Algorithms and Architectures for Parallel Processing. ICA3PP 2014. Lecture Notes in Computer Science, vol 8631. Springer, Cham. https://doi.org/10.1007/978-3-319-11194-0_7

DOI: https://doi.org/10.1007/978-3-319-11194-0_7
Publisher Name: Springer, Cham
Print ISBN: 978-3-319-11193-3
Online ISBN: 978-3-319-11194-0
eBook Packages: Computer Science, Computer Science (R0)
