Abstract
Workloads with precedence constraints due to data dependencies are common in various applications. These workloads can be represented as directed acyclic graphs (DAGs) and are often data-intensive: data loading cost is the dominant factor, so cache misses should be minimized. We address the problem of scheduling a DAG of data-intensive tasks in parallel to minimize makespan. To do so, we propose greedy online scheduling algorithms that take load balancing, data dependencies, and data locality into account. Simulations and an experimental evaluation on an Apache Spark cluster demonstrate the advantages of our solutions.
Notes
1. We assume a storage hierarchy with significant speed gaps between levels, and use the term cache generally to refer to SRAM cache memory, RAM, or distributed memory in a platform such as Spark, as appropriate.
2. Reference Distance (RD) is a related metric that counts the total number of data accesses between two accesses to the same item, rather than the number of distinct data accesses. SD was shown to be more accurate than RD in quantifying data locality [8], so we do not consider RD any further.
3. We only report GCS results using weighted SD; results using WTMB were worse and are omitted from the figures.
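The difference between the two metrics in note 2 can be illustrated with a short sketch. It assumes that SD denotes stack distance, taken here as the number of distinct data items accessed between two consecutive accesses to the same item, and that RD counts every access in that window. The function name `reuse_distances` and the sample trace are hypothetical and are not taken from the paper.

```python
from typing import Hashable, List, Optional, Sequence, Tuple

Distance = Tuple[Optional[int], Optional[int]]


def reuse_distances(trace: Sequence[Hashable]) -> List[Distance]:
    """Return (reference_distance, stack_distance) for each access in trace.

    Both values are None for the first access to an item, since there is
    no earlier access to measure from. Illustrative only: the window is
    rescanned for every access, so this is quadratic in the worst case.
    """
    last_seen = {}  # item -> index of its most recent access
    result: List[Distance] = []
    for i, item in enumerate(trace):
        if item in last_seen:
            # Accesses strictly between the previous and current access.
            between = trace[last_seen[item] + 1 : i]
            rd = len(between)       # RD: every access in between
            sd = len(set(between))  # SD: distinct items in between
            result.append((rd, sd))
        else:
            result.append((None, None))
        last_seen[item] = i
    return result


# Hypothetical access trace: "a" is reused after b, b, c.
trace = ["a", "b", "b", "c", "a"]
distances = reuse_distances(trace)
print(distances[-1])  # (3, 2): three accesses in between, two distinct items
```

Under these definitions, the repeated access to `b` inflates RD but not SD. Under LRU replacement, an access hits exactly when fewer distinct items than the cache capacity were touched since the previous access to the same item, which is one way to see why a distinct-item count tracks cache behavior more closely than a raw access count.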
References
Pegasus. https://confluence.pegasus.isi.edu/display/pegasus/WorkflowGenerator
Pylru 1.2.0. https://pypi.org/project/pylru/
Spark standalone. https://spark.apache.org/docs/latest/spark-standalone.html
Allahverdi, A.: The third comprehensive survey on scheduling problems with setup times/costs. Eur. J. Oper. Res. 246(2), 345–378 (2015)
Arras, P.A., Fuin, D., Jeannot, E., Stoutchinin, A., Thibault, S.: List scheduling in embedded systems under memory constraints. Int. J. Parallel Prog. 43, 1103–1128 (2015)
Bär, A., Golab, L., Ruehrup, S., Schiavone, M., Casas, P.: Cache-oblivious scheduling of shared workloads. In: IEEE International Conference on Data Engineering, pp. 855–866 (2015)
Canon, L.C., Jeannot, E., Sakellariou, R., Zheng, W.: Comparative evaluation of the robustness of DAG scheduling heuristics. In: Grid Computing, pp. 73–84 (2008)
Coffman, E.G., Denning, P.J.: Operating Systems Theory. Prentice-Hall, New Jersey (1973)
Deslauriers, F., McCormick, P., Amvrosiadis, G., Goel, A., Brown, A.D.: Quartet: harmonizing task scheduling and caching for cluster computing. In: USENIX Workshop on Hot Topics in Storage and File Systems (2016)
Kwok, Y.K., Ahmad, I.: Static scheduling algorithms for allocating directed task graphs to multiprocessors. ACM Comput. Surv. (CSUR) 31(4), 406–471 (1999)
Marchal, L., Simon, B., Vivien, F.: Limiting the memory footprint when dynamically scheduling DAGs on shared-memory platforms. J. Parallel Distrib. Comput. 128, 30–42 (2019). https://doi.org/10.1016/j.jpdc.2019.01.009
Meng, X., Golab, L.: Optimal reducer placement to minimize data transfer in MapReduce-style processing. In: 2017 IEEE International Conference on Big Data, pp. 339–346 (2017)
Nambiar, R.O., Poess, M.: The making of TPC-DS. In: International Conference on Very Large Data Bases, pp. 1049–1058 (2006)
Xu, E., Saxena, M., Chiu, L.: Neutrino: revisiting memory caching for iterative data analytics. In: USENIX Workshop on Hot Topics in Storage and File Systems (2016)
Yang, Z., Jia, D., Ioannidis, S., Mi, N., Sheng, B.: Intermediate data caching optimization for multi-stage and parallel big data frameworks. In: IEEE International Conference on Cloud Computing, pp. 277–284 (2018)
Zaharia, M., et al.: Apache Spark: a unified engine for big data processing. Commun. ACM 59(11), 56–65 (2016). https://doi.org/10.1145/2934664
Copyright information
© 2020 Springer Nature Switzerland AG
About this paper
Cite this paper
Meng, X., Golab, L. (2020). Parallel Scheduling of Data-Intensive Tasks. In: Malawski, M., Rzadca, K. (eds) Euro-Par 2020: Parallel Processing. Euro-Par 2020. Lecture Notes in Computer Science, vol 12247. Springer, Cham. https://doi.org/10.1007/978-3-030-57675-2_8
DOI: https://doi.org/10.1007/978-3-030-57675-2_8
Publisher Name: Springer, Cham
Print ISBN: 978-3-030-57674-5
Online ISBN: 978-3-030-57675-2
eBook Packages: Computer Science (R0)