PDAWL: Profile-Based Iterative Dynamic Adaptive WorkLoad Balance on Heterogeneous Architectures

Geng, Tongsheng; Amaris, Marcos; Zuckerman, Stéphane; Goldman, Alfredo; Gao, Guang R.; Gaudiot, Jean-Luc

doi:10.1007/978-3-030-63171-0_8

Part of the book series: Lecture Notes in Computer Science (LNTCS, volume 12326)

Included in the following conference series:

Workshop on Job Scheduling Strategies for Parallel Processing

Abstract

While High Performance Computing systems are increasingly based on heterogeneous cores, their effectiveness depends on how well the scheduler can allocate workloads onto appropriate computing devices and how well communication and computation can be overlapped. With different types of resources integrated into one system, the complexity of the scheduler correspondingly increases. Moreover, for applications with varying problem sizes on different heterogeneous resources, the optimal scheduling approach may vary accordingly. We thus present PDAWL, an event-driven, profile-based Iterative Dynamic Adaptive Work-Load balance scheduling approach that dynamically and adaptively adjusts the workload to utilize heterogeneous resources efficiently. It combines online scheduling (DAWL), which adaptively adjusts the workload based on the heterogeneous resources available in real time, with an offline machine-learning component (a profile-based estimation model), which builds a device-specific communication-computation estimation model. Our scheduling approach is tested on control-regular applications, namely a Stencil kernel (based on a Jacobi algorithm) and Sparse Matrix-Vector Multiplication (SpMV), in an event-driven runtime system. Experimental results show that PDAWL either performs on par with or far outperforms whichever single device (CPU or GPU) yields the best results.
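The general idea of iteratively shifting work between devices until they finish at about the same time can be illustrated with a small simulation. The sketch below is not the PDAWL algorithm itself, whose details are not given in this abstract; it assumes a hypothetical linear cost model (fixed overhead plus work divided by rate), and the names `Device` and `rebalance` as well as all numeric values are illustrative assumptions. The initial GPU share stands in for whatever starting point an offline estimation model might suggest.

```python
from dataclasses import dataclass


@dataclass
class Device:
    """Hypothetical device with an assumed linear cost model."""
    name: str
    rate: float      # work units processed per second (assumed value)
    overhead: float  # fixed communication/launch cost in seconds (assumed value)

    def run(self, work: float) -> float:
        """Return the simulated time needed to process `work` units."""
        if work <= 0:
            return 0.0
        return self.overhead + work / self.rate


def rebalance(total: float, gpu_share: float, cpu: Device, gpu: Device,
              iterations: int = 20, tol: float = 0.01) -> float:
    """Adjust the GPU share of `total` until both devices take similar time."""
    # Keep both devices busy so the observed throughputs are always defined.
    share = min(max(gpu_share, 0.05), 0.95)
    for _ in range(iterations):
        work_cpu = total * (1.0 - share)
        work_gpu = total * share
        t_cpu = cpu.run(work_cpu)
        t_gpu = gpu.run(work_gpu)
        # Stop once the two finishing times are within the relative tolerance.
        if abs(t_cpu - t_gpu) <= tol * max(t_cpu, t_gpu):
            break
        # Throughput observed in this iteration, including fixed overheads.
        thr_cpu = work_cpu / t_cpu
        thr_gpu = work_gpu / t_gpu
        # Next split is proportional to the observed throughputs.
        share = min(max(thr_gpu / (thr_cpu + thr_gpu), 0.05), 0.95)
    return share


if __name__ == "__main__":
    cpu = Device("cpu", rate=100.0, overhead=0.0)
    gpu = Device("gpu", rate=400.0, overhead=0.5)
    share = rebalance(1000.0, 0.5, cpu, gpu)
    print(f"GPU share after rebalancing: {share:.3f}")
```

Under this assumed cost model, the update rule has its fixed point where both devices finish at the same time, so the combined run is shorter than running the whole workload on either device alone. A real scheduler would replace `Device.run` with measured execution and transfer times.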

Author information

Authors and Affiliations

University of California, Irvine, CA, USA
Tongsheng Geng & Jean-Luc Gaudiot
University of São Paulo, São Paulo, Brazil
Marcos Amaris & Alfredo Goldman
Laboratoire ETIS, CY Paris Universités, ENSEA, CNRS, Paris, France
Stéphane Zuckerman
University of Delaware, Delaware, USA
Guang R. Gao

Corresponding author

Correspondence to Tongsheng Geng.

Editor information

Editors and Affiliations

CESNET, Prague, Czech Republic
Dalibor Klusáček
Google, Mountain View, CA, USA
Walfredo Cirne
Google, Seattle, WA, USA
Narayan Desai

About this paper

Cite this paper

Geng, T., Amaris, M., Zuckerman, S., Goldman, A., Gao, G.R., Gaudiot, JL. (2020). PDAWL: Profile-Based Iterative Dynamic Adaptive WorkLoad Balance on Heterogeneous Architectures. In: Klusáček, D., Cirne, W., Desai, N. (eds) Job Scheduling Strategies for Parallel Processing. JSSPP 2020. Lecture Notes in Computer Science, vol 12326. Springer, Cham. https://doi.org/10.1007/978-3-030-63171-0_8

DOI: https://doi.org/10.1007/978-3-030-63171-0_8
Published: 16 November 2020
Publisher Name: Springer, Cham
Print ISBN: 978-3-030-63170-3
Online ISBN: 978-3-030-63171-0
eBook Packages: Computer Science, Computer Science (R0)
