Optimizing Memory Bandwidth Efficiency with User-Preferred Kernel Merge

Jumah, Nabeeh; Kunkel, Julian

doi:10.1007/978-3-030-48340-1_6

Part of the book series: Lecture Notes in Computer Science (LNTCS, volume 11997)

Included in the following conference series:

European Conference on Parallel Processing

Abstract

Earth system modeling computations use stencils extensively and run many kernels. Because stencil computations are memory bound, coding the stencils well is essential to use the memory bandwidth of the underlying hardware efficiently.

Even when the code within one kernel is written to use the memory bandwidth optimally, there are still opportunities for further optimization at the inter-kernel level. Stencils naturally exhibit data locality, and executing a sequence of stencils in separate kernels can waste caching capabilities. Interprocedural optimizations such as kernel merging have the potential to improve cache use. However, due to semantic restrictions, such optimizations are difficult to achieve in general-purpose languages.
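As a minimal sketch of what merging two stencil kernels means, consider the following one-dimensional example. The kernels, the function names `run_separate` and `run_merged`, and the coefficient `c` are hypothetical illustrations and are not taken from the paper or from GGDML. The separate version writes an intermediate `flux` array in one loop and reads it back in a second loop; the merged version keeps each flux value in a local variable, so the intermediate field never has to travel through memory.

```python
def run_separate(u, c):
    """Two kernels, each sweeping over the grid on its own."""
    n = len(u) - 1
    flux = [0.0] * n
    for i in range(n):              # kernel 1: stencil over u
        flux[i] = u[i + 1] - u[i]
    out = [0.0] * n
    for i in range(n):              # kernel 2: re-reads u and flux
        out[i] = u[i] + c * flux[i]
    return out


def run_merged(u, c):
    """One merged kernel: a single sweep, no intermediate array."""
    n = len(u) - 1
    out = [0.0] * n
    for i in range(n):
        flux_i = u[i + 1] - u[i]    # stays in a local variable
        out[i] = u[i] + c * flux_i
    return out
```

The merge is valid here only because the second kernel reads `flux` at the same index `i` that the first kernel has just produced. Python is used purely to show the transformation; the bandwidth effect discussed in the abstract concerns compiled code on real hardware.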

Some tools have been developed to fuse loops automatically, replacing manual optimization. However, scientists still implement fusion manually at different levels of the loop nests to find the best performance. To let scientists apply loop fusions equivalent to manual ones, we develop a technique that automatically analyzes the code and lets scientists select their preferred fusions, providing automatic dependency analysis and code transformation; this also opens the way for automatic tools that make smart choices on behalf of the user. Our work uses the GGDML language extensions, which enable performance portability across different architectures from a single source code.
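The idea of combining user-selected fusions with an automatic dependency check can be sketched as follows. This is an assumption-laden illustration, not the analysis implemented for GGDML: the `Kernel` record, `can_fuse`, and `select_fusions` are hypothetical names, each kernel is assumed to write one field, and both kernels are assumed to run sequentially over the same ascending index range. Under those assumptions, fusing a producer with a consumer is rejected when the consumer would read a value the producer has not yet written, or when the producer would read a value the consumer has already overwritten.

```python
from dataclasses import dataclass


@dataclass
class Kernel:
    name: str
    writes: str   # name of the single field this kernel writes (at index i)
    reads: dict   # field name -> set of integer index offsets read


def can_fuse(first, second):
    """Conservative legality check for fusing `first` then `second`
    into one ascending sequential loop (hypothetical rule set)."""
    # Flow dependence: `second` must not read values of first's output
    # that lie ahead of the current index (not yet computed).
    if any(off > 0 for off in second.reads.get(first.writes, ())):
        return False
    # Anti dependence: `first` must not read values of second's output
    # that lie behind the current index (already overwritten).
    if any(off < 0 for off in first.reads.get(second.writes, ())):
        return False
    return True


def select_fusions(kernels, preferred):
    """Keep only the user-preferred (producer, consumer) pairs that pass the check."""
    by_name = {k.name: k for k in kernels}
    return [(a, b) for a, b in preferred if can_fuse(by_name[a], by_name[b])]
```

With a `flux` kernel feeding an `update` kernel at offset 0, the pair is accepted. A following `smooth` kernel that reads the updated field at offsets -1, 0, and +1 is rejected for direct fusion, since the +1 neighbor would not yet have been computed.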

Notes

1. Refer to https://github.com/aimes-project/ShallowWaterEquations.
2. Refer to https://github.com/aimes-project/ShallowWaterEquations.
3. The streaming benchmark ‘stream_sp_mem_avx’ from the ‘Likwid’ tools measured 67 GBytes/s on the processor.

Acknowledgements

This work was supported in part by the German Research Foundation (DFG) through the Priority Programme 1648 Software for Exascale Computing SPPEXA (GZ: LU 1353/11-1). We also thank the ‘Regionales Rechenzentrum Erlangen’ (RRZE) at Friedrich-Alexander-Universität Erlangen-Nürnberg (FAU), the Swiss National Supercomputing Center (CSCS), and NEC Deutschland, who provided access to their machines to run the experiments. We also thank Prof. John Thuburn (University of Exeter) for his help in developing the code of the shallow water equations.

Author information

Authors and Affiliations

Universität Hamburg, Hamburg, Germany
Nabeeh Jumah
University of Reading, Reading, UK
Julian Kunkel

Corresponding author

Correspondence to Nabeeh Jumah.

Editor information

Editors and Affiliations

Gesellschaft für Wissenschaftliche Datenverarbeitung mbH, Göttingen, Germany
Ulrich Schwardmann
Gesellschaft für Wissenschaftliche Datenverarbeitung mbH, Göttingen, Germany
Christian Boehme
CiTIUS, Santiago de Compostela, Spain
Dora B. Heras
University of Rome "Tor Vergata", Rome, Italy
Valeria Cardellini
Inria Bordeaux Sud-Ouest, Talence, France
Emmanuel Jeannot
Engineering Sardegna, Cagliari, Italy
Antonio Salis
University of Turin, Torino, Italy
Claudio Schifanella
University College Dublin, Dublin, Ireland
Ravi Reddy Manumachu
DLR-AS, Göttingen, Germany
Dieter Schwamborn
University of Pisa, Pisa, Italy
Laura Ricci
Ajou University, Suwon, Korea (Republic of)
Oh Sangyoon
RRZE Friedrich-Alexander-Universität, Erlangen, Germany
Thomas Gruber
ICAR-CNR, Napoli, Italy
Laura Antonelli
Tennessee Technological University, Cookeville, TN, USA
Stephen L. Scott

About this paper

Cite this paper

Jumah, N., Kunkel, J. (2020). Optimizing Memory Bandwidth Efficiency with User-Preferred Kernel Merge. In: Schwardmann, U., et al. Euro-Par 2019: Parallel Processing Workshops. Euro-Par 2019. Lecture Notes in Computer Science, vol 11997. Springer, Cham. https://doi.org/10.1007/978-3-030-48340-1_6

DOI: https://doi.org/10.1007/978-3-030-48340-1_6
Published: 29 May 2020
Publisher Name: Springer, Cham
Print ISBN: 978-3-030-48339-5
Online ISBN: 978-3-030-48340-1
eBook Packages: Computer Science, Computer Science (R0)
