Abstract
With the explosion of big data, data-intensive cluster computing systems have given rise to a new data processing paradigm. Hadoop, one of the most widely used data processing frameworks, achieves high performance by running multiple tasks in parallel across the nodes of a large cluster, so task scheduling is considered one of the most important factors affecting overall performance. In modern operating systems, caching is used to improve local disk access times by serving data from main memory without disk accesses. The existing task scheduling methods of Hadoop-based systems, however, make poor use of this cache, mainly because cached data cannot be tracked in shared-nothing distributed environments. In this paper, we propose cache-aware task scheduling (CATS), a task scheduling method for Hadoop-based systems that exploits the operating system’s buffer cache and assigns tasks to nodes in consideration of the cached data. Through comprehensive experiments, we show that the proposed cache-aware scheduling improves the overall job execution time for various workload types and data sizes.
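The abstract does not spell out the CATS algorithm, so the following is only a minimal sketch of the general idea of cache-aware assignment, not the authors' method. It assumes the scheduler already knows, for each data block, which nodes hold it in their buffer cache and which nodes store a replica on disk. The function name `pick_task` and the dictionaries `cached_on` and `stored_on` are hypothetical names introduced for illustration.

```python
from typing import Dict, List, Optional, Set


def pick_task(node: str,
              pending: List[str],
              cached_on: Dict[str, Set[str]],
              stored_on: Dict[str, Set[str]]) -> Optional[str]:
    """Choose a pending block for an idle node (illustrative sketch only).

    Preference order (an assumption, not taken from the paper):
      0. the block is in this node's buffer cache (no disk access needed)
      1. the block has a replica on this node's local disk
      2. the block must be read from a remote node
    """
    if not pending:
        return None

    def rank(block: str) -> int:
        if node in cached_on.get(block, set()):
            return 0
        if node in stored_on.get(block, set()):
            return 1
        return 2

    # min() returns the first block with the lowest rank, so ties are
    # broken by the order of the pending list.
    return min(pending, key=rank)


# Hypothetical cluster state: block b3 is cached in memory on node n1.
pending_blocks = ["b1", "b2", "b3"]
cached = {"b3": {"n1"}}
stored = {"b1": {"n2"}, "b2": {"n1"}, "b3": {"n2"}}

choice_n1 = pick_task("n1", pending_blocks, cached, stored)
choice_n2 = pick_task("n2", pending_blocks, cached, stored)
```

In this toy state, node `n1` is given the block it already holds in memory even though another block sits on its local disk, which is the behavior a scheduler that only considers disk locality would miss.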
Acknowledgements
This work was supported by the National Research Foundation of Korea (NRF) Grant funded by the Korea Government (MSIP) (No. NRF-2014R1A2A1A11053657).
About this article
Cite this article
Lim, B., Kim, J.W. & Chung, Y.D. CATS: cache-aware task scheduling for Hadoop-based systems. Cluster Comput 20, 3691–3705 (2017). https://doi.org/10.1007/s10586-017-0920-6
DOI: https://doi.org/10.1007/s10586-017-0920-6