HPSO: Prefetching Based Scheduling to Improve Data Locality for MapReduce Clusters

  • Conference paper
Algorithms and Architectures for Parallel Processing (ICA3PP 2014)

Part of the book series: Lecture Notes in Computer Science (LNTCS, volume 8631)

Abstract

Because of competition for cluster resources and the task scheduling policy, some map tasks are assigned to nodes that do not hold their input data, which causes significant data access delay. Data locality is therefore becoming one of the most critical factors affecting the performance of MapReduce clusters. Machines in MapReduce clusters have large memory capacities that are often underutilized, so prefetching input data into memory is an effective way to improve data locality. However, deciding what to prefetch and when still poses serious challenges to cluster designers. To use prefetching effectively, we have built HPSO (High Performance Scheduling Optimizer), a task scheduler based on a prefetching service that improves data locality for MapReduce jobs. The basic idea is to predict the most appropriate nodes to which future map tasks should be assigned and then preload the input data into memory on those nodes without delaying the launch of new tasks. We have implemented HPSO in Hadoop-1.1.2. The experimental results show that the method reduces the number of map tasks that incur remote data access delay and improves the performance of Hadoop clusters.
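The basic idea in the abstract (predict where future map tasks will run, then preload their input) can be illustrated with a minimal sketch. This is not HPSO's algorithm, which the abstract does not detail. The names `Node`, `next_free_time`, and `plan_prefetch` are hypothetical, and the sketch assumes that an estimate of each running task's remaining time is available and that each node frees one slot.

```python
from dataclasses import dataclass, field
from typing import List, Set, Tuple


@dataclass
class Node:
    name: str
    # Assumed input: estimated seconds until each running map task finishes.
    remaining: List[float]
    # Identifiers of input blocks already stored on this node.
    local_blocks: Set[str] = field(default_factory=set)


def next_free_time(node: Node) -> float:
    """Predicted time until the node frees a map slot (0 if idle)."""
    return min(node.remaining) if node.remaining else 0.0


def plan_prefetch(nodes: List[Node], pending_blocks: List[str]) -> List[Tuple[str, str]]:
    """Return (node name, block) pairs whose input should be preloaded.

    Nodes are visited in the order they are predicted to free a slot.
    A node that already holds a pending block needs no prefetch; otherwise
    it is given the pending block held locally by the fewest nodes.
    """
    plan = []
    unassigned = list(pending_blocks)
    for node in sorted(nodes, key=next_free_time):
        if not unassigned:
            break
        local = [b for b in unassigned if b in node.local_blocks]
        if local:
            # The future task will be data-local, so nothing to preload.
            unassigned.remove(local[0])
            continue
        block = min(unassigned, key=lambda b: sum(b in n.local_blocks for n in nodes))
        unassigned.remove(block)
        plan.append((node.name, block))
    return plan
```

In this sketch, a node expected to free a slot soon but holding none of the pending blocks is the one that receives a prefetch request, so the transfer can overlap with the task it is still running.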

Copyright information

© 2014 Springer International Publishing Switzerland

About this paper

Cite this paper

Sun, M., Zhuang, H., Zhou, X., Lu, K., Li, C. (2014). HPSO: Prefetching Based Scheduling to Improve Data Locality for MapReduce Clusters. In: Sun, Xh., et al. Algorithms and Architectures for Parallel Processing. ICA3PP 2014. Lecture Notes in Computer Science, vol 8631. Springer, Cham. https://doi.org/10.1007/978-3-319-11194-0_7

  • DOI: https://doi.org/10.1007/978-3-319-11194-0_7

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-319-11193-3

  • Online ISBN: 978-3-319-11194-0

  • eBook Packages: Computer Science, Computer Science (R0)
