Abstract
High Performance Computing systems are increasingly based on heterogeneous cores, but their effectiveness depends on how well the scheduler allocates workloads to appropriate computing devices and how well communication and computation can be overlapped. As more types of resources are integrated into one system, the complexity of the scheduler increases correspondingly. Moreover, for applications with varying problem sizes on different heterogeneous resources, the optimal scheduling approach may vary accordingly. We thus present PDAWL, an event-driven, profile-based Iterative Dynamic Adaptive Work-Load balance scheduling approach that dynamically and adaptively adjusts the workload to utilize heterogeneous resources efficiently. It combines online scheduling (DAWL), which adaptively adjusts the workload based on the heterogeneous resources available in real time, with an offline machine-learning component (a profile-based estimation model), which builds a device-specific communication and computation estimation model. Our scheduling approach is tested on control-regular applications, a stencil kernel (based on the Jacobi algorithm) and Sparse Matrix-Vector Multiplication (SpMV), in an event-driven runtime system. Experimental results show that PDAWL either is on par with or far outperforms whichever device (CPU or GPU) yields the best results.
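To make the idea of adaptively adjusting the workload across iterations more concrete, here is a minimal sketch in Python. It is not the PDAWL algorithm: it only illustrates a generic throughput-proportional rebalancing step between two devices. The function names (`rebalance`, `simulate`), the constant device speeds, and the row-based work unit are all assumptions made for illustration, and communication cost is deliberately ignored.

```python
def rebalance(cpu_share, cpu_time, gpu_time):
    """Return a new CPU share of the work, proportional to measured throughput.

    Assumption (illustrative only): each device's time scales linearly with
    the amount of work it receives, so throughput = share / time.
    """
    gpu_share = 1.0 - cpu_share
    cpu_rate = cpu_share / cpu_time
    gpu_rate = gpu_share / gpu_time
    return cpu_rate / (cpu_rate + gpu_rate)


def simulate(cpu_speed, gpu_speed, total_rows, iterations, cpu_share=0.5):
    """Run a toy iterative computation and rebalance after every iteration.

    cpu_speed and gpu_speed are hypothetical rows-per-time-unit figures that
    stand in for real measurements. Returns one tuple per iteration:
    (cpu_rows, gpu_rows, iteration_time).
    """
    history = []
    for _ in range(iterations):
        cpu_rows = round(total_rows * cpu_share)
        gpu_rows = total_rows - cpu_rows
        # Simulated "measurements"; a real runtime would time the kernels.
        cpu_time = cpu_rows / cpu_speed
        gpu_time = gpu_rows / gpu_speed
        # Devices run concurrently, so the slower one sets the iteration time.
        history.append((cpu_rows, gpu_rows, max(cpu_time, gpu_time)))
        if cpu_rows and gpu_rows:
            cpu_share = rebalance(cpu_rows / total_rows, cpu_time, gpu_time)
    return history


if __name__ == "__main__":
    for step in simulate(cpu_speed=1.0, gpu_speed=3.0,
                         total_rows=1000, iterations=3):
        print(step)
```

With these made-up speeds, the even split of the first iteration is dominated by the slower device, and the next iteration shifts rows toward the faster one. A real scheduler would also have to account for data-transfer time and for speeds that change with problem size, which is where a profile-based estimation model of the kind the abstract describes would come in.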
Copyright information
© 2020 Springer Nature Switzerland AG
About this paper
Cite this paper
Geng, T., Amaris, M., Zuckerman, S., Goldman, A., Gao, G.R., Gaudiot, JL. (2020). PDAWL: Profile-Based Iterative Dynamic Adaptive WorkLoad Balance on Heterogeneous Architectures. In: Klusáček, D., Cirne, W., Desai, N. (eds) Job Scheduling Strategies for Parallel Processing. JSSPP 2020. Lecture Notes in Computer Science(), vol 12326. Springer, Cham. https://doi.org/10.1007/978-3-030-63171-0_8
DOI: https://doi.org/10.1007/978-3-030-63171-0_8
Publisher Name: Springer, Cham
Print ISBN: 978-3-030-63170-3
Online ISBN: 978-3-030-63171-0
eBook Packages: Computer Science (R0)