CATS: cache-aware task scheduling for Hadoop-based systems

Cluster Computing

Abstract

With the explosion of big data, data-intensive cluster computing systems have given rise to a new data processing paradigm. Because Hadoop, one of the most widely used data processing frameworks, achieves high performance by running multiple tasks in parallel across the nodes of large clusters, task scheduling is one of the most important factors affecting overall performance. In modern operating systems, caching improves local disk access times by serving data from main memory without disk accesses. Existing task scheduling methods for Hadoop-based systems, however, make poor use of this cache, mainly because cached data cannot be tracked in shared-nothing distributed environments. In this paper, we propose cache-aware task scheduling (CATS), a method for Hadoop-based systems that exploits the operating system’s buffer cache and assigns tasks to nodes in consideration of the cached data. Through comprehensive experiments, we show that the proposed cache-aware scheduling improves overall job execution time for various workload types and data sizes.
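The general idea of assigning tasks in consideration of cached data can be illustrated with a minimal sketch. This is not the CATS algorithm itself, whose details are not given in the abstract; it is a simple greedy placement under stated assumptions. The names `Node`, `assign_tasks`, `cached_blocks`, and `local_blocks` are hypothetical, and the sketch assumes the scheduler already knows which blocks each node holds in its buffer cache, which is exactly the tracking problem the abstract identifies as hard.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    name: str
    free_slots: int
    # Assumption: blocks currently held in this node's OS buffer cache.
    cached_blocks: set = field(default_factory=set)
    # Assumption: blocks stored on this node's local disk.
    local_blocks: set = field(default_factory=set)


def assign_tasks(task_blocks, nodes):
    """Greedy placement: prefer a cached copy, then a local disk copy, then any free node."""
    assignment = {}
    for task, block in task_blocks.items():
        candidates = [n for n in nodes if n.free_slots > 0]
        if not candidates:
            break  # no capacity left; remaining tasks would wait for a later round

        def cost(n):
            # 0 = block in memory, 1 = block on local disk, 2 = remote read.
            if block in n.cached_blocks:
                return 0
            if block in n.local_blocks:
                return 1
            return 2

        # Break ties in favour of the node with more free slots.
        best = min(candidates, key=lambda n: (cost(n), -n.free_slots))
        best.free_slots -= 1
        assignment[task] = best.name
    return assignment


nodes = [
    Node("node-a", free_slots=1, cached_blocks={"b1"}, local_blocks={"b1", "b2"}),
    Node("node-b", free_slots=2, cached_blocks={"b3"}, local_blocks={"b2", "b3"}),
]
plan = assign_tasks({"t1": "b1", "t2": "b2", "t3": "b3"}, nodes)
print(plan)
```

In this toy run, the task reading `b1` goes to the node that has `b1` in memory, and the remaining tasks fall back to disk locality or cache hits on the node that still has free slots. A real scheduler would also have to account for cache eviction and stale cache information, which this sketch ignores.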

Acknowledgements

This work was supported by the National Research Foundation of Korea (NRF) grant funded by the Korea government (MSIP) (No. NRF-2014R1A2A1A11053657).

Author information

Corresponding author

Correspondence to Yon Dohn Chung.

About this article

Cite this article

Lim, B., Kim, J.W. & Chung, Y.D. CATS: cache-aware task scheduling for Hadoop-based systems. Cluster Comput 20, 3691–3705 (2017). https://doi.org/10.1007/s10586-017-0920-6

  • DOI: https://doi.org/10.1007/s10586-017-0920-6
