CATS: cache-aware task scheduling for Hadoop-based systems

Lim, Byungnam; Kim, Jong Wook; Chung, Yon Dohn

doi:10.1007/s10586-017-0920-6

Published: 24 May 2017

Cluster Computing, volume 20, pages 3691–3705 (2017)

Byungnam Lim¹,
Jong Wook Kim² &
Yon Dohn Chung¹

Abstract

With the explosion of big data, data-intensive cluster computing systems have driven a new data processing paradigm. Because Hadoop, one of the best-known data processing frameworks, achieves high performance by running multiple tasks in parallel across the nodes of large clusters, task scheduling is considered one of the most important factors affecting overall performance. Modern operating systems use caching to improve local disk access times by serving data from main memory without accessing the disk. Existing task scheduling methods for Hadoop-based systems, however, make poor use of this cache, mainly because they cannot track cached data in shared-nothing distributed environments. In this paper, we propose cache-aware task scheduling (CATS), a task scheduling method for Hadoop-based systems that exploits the operating system’s buffer cache and assigns tasks to nodes with the cached data taken into account. Through comprehensive experiments, we show that the proposed cache-aware scheduling improves the overall job execution time for various workload types and data sizes.
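
To make the idea of assigning tasks "in consideration of the cached data" concrete, the following is a minimal illustrative sketch, not the CATS algorithm itself, whose details are not given in the abstract. It assumes the scheduler already knows which input blocks are likely resident in each node's buffer cache; how that information is tracked is outside the sketch. The function name `assign_tasks` and all of its parameters are hypothetical. Each task is greedily placed on the free node that has its input block cached, falling back to a node with an on-disk replica and then to any node with spare capacity.

```python
from typing import Dict, Set, Tuple


def assign_tasks(
    task_blocks: Dict[str, str],
    replicas: Dict[str, Set[str]],
    cached: Dict[str, Set[str]],
    free_slots: Dict[str, int],
) -> Dict[str, str]:
    """Greedily map each task to a node, preferring cached input blocks.

    All inputs are hypothetical bookkeeping structures for illustration:
    task_blocks: task id -> id of the input block the task reads
    replicas:    block id -> nodes holding an on-disk replica of the block
    cached:      node -> block ids assumed to be in that node's buffer cache
    free_slots:  node -> number of tasks the node can still accept
    """
    slots = dict(free_slots)  # work on a copy so the caller's dict is untouched
    assignment: Dict[str, str] = {}

    for task in sorted(task_blocks):
        block = task_blocks[task]
        candidates = [node for node in sorted(slots) if slots[node] > 0]
        if not candidates:
            break  # no capacity left; remaining tasks stay unassigned

        def score(node: str) -> Tuple[bool, bool, int]:
            in_cache = block in cached.get(node, set())
            on_disk = node in replicas.get(block, set())
            # Tuples compare element-wise: a cache hit ranks first, then
            # disk locality, then the node with more free slots.
            return (in_cache, on_disk, slots[node])

        best = max(candidates, key=score)
        assignment[task] = best
        slots[best] -= 1

    return assignment
```

With three single-slot nodes where only `n2` has block `b1` cached, a task reading `b1` is placed on `n2`; a second task reading `b1` then falls back to a node holding a disk replica. A production scheduler would also have to weigh fairness and stale cache information, which this sketch ignores.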

Acknowledgements

This work was supported by the National Research Foundation of Korea (NRF) Grant funded by the Korea Government (MSIP) (No. NRF-2014R1A2A1A11053657).

Author information

Authors and Affiliations

Korea University, Seoul, Republic of Korea
Byungnam Lim & Yon Dohn Chung
Sangmyung University, Seoul, Republic of Korea
Jong Wook Kim

Corresponding author

Correspondence to Yon Dohn Chung.

About this article

Cite this article

Lim, B., Kim, J.W. & Chung, Y.D. CATS: cache-aware task scheduling for Hadoop-based systems. Cluster Comput 20, 3691–3705 (2017). https://doi.org/10.1007/s10586-017-0920-6

Received: 25 October 2016
Revised: 23 March 2017
Accepted: 09 May 2017
Published: 24 May 2017
Issue Date: December 2017
DOI: https://doi.org/10.1007/s10586-017-0920-6
