Optimizing Memory Bandwidth Efficiency with User-Preferred Kernel Merge

  • Conference paper
Euro-Par 2019: Parallel Processing Workshops (Euro-Par 2019)

Part of the book series: Lecture Notes in Computer Science (LNTCS, volume 11997)

Abstract

Earth system modeling computations use stencils extensively and run many kernels. Optimal coding of the stencils is essential for efficient use of the memory bandwidth of the underlying hardware, because stencil computations are memory bound.

Even when the code within one kernel is written to use the memory bandwidth optimally, there are still opportunities for further optimization at the inter-kernel level. Stencils naturally exhibit data locality, and executing a sequence of stencils in separate kernels can waste caching capabilities. Interprocedural optimizations such as kernel merging have the potential to improve cache use. However, due to semantic restrictions, such optimizations are difficult to achieve in general-purpose languages.
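As an illustration of the inter-kernel opportunity described above, the following minimal sketch contrasts two separate kernels with a merged one. It is written in plain Python for readability, not in GGDML, and the kernel names, the 1D grid, and the stencil itself are assumptions chosen for the example rather than code from the paper. In the merged version, each value of the intermediate field is consumed immediately after it is produced, which is the access pattern that lets a cache serve the second use.

```python
def stencil_kernel(a):
    """Kernel 1 (hypothetical): b[i] = a[i-1] + a[i+1] on interior points."""
    n = len(a)
    b = [0.0] * n
    for i in range(1, n - 1):
        b[i] = a[i - 1] + a[i + 1]
    return b


def scale_kernel(b):
    """Kernel 2 (hypothetical): c[i] = 0.5 * b[i] on interior points."""
    n = len(b)
    c = [0.0] * n
    for i in range(1, n - 1):
        c[i] = 0.5 * b[i]
    return c


def merged_kernel(a):
    """Both kernels in one sweep: b[i] is read right after it is written."""
    n = len(a)
    b = [0.0] * n
    c = [0.0] * n
    for i in range(1, n - 1):
        b[i] = a[i - 1] + a[i + 1]
        c[i] = 0.5 * b[i]
    return b, c


# Separate kernels traverse the grid twice; the merged kernel traverses it once.
a = [float(i * i) for i in range(8)]
b_separate = stencil_kernel(a)
c_separate = scale_kernel(b_separate)
b_merged, c_merged = merged_kernel(a)
```

The merge is valid here only because the second kernel reads the intermediate field at the same index at which the first kernel writes it.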

Some tools have been developed to fuse loops automatically, replacing manual optimization. However, scientists still implement fusion manually at different levels of the loop nests to find the best performance. To let scientists apply fusions equivalent to manual loop fusion, we develop a technique that automatically analyzes the code and lets them select their preferred fusions, providing automatic dependency analysis and code transformation; this also opens the way for automatic tools that make smart choices on behalf of the user. Our work uses the GGDML language extensions, which enable performance portability across different architectures from a single source code.
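The role of dependency analysis in such a workflow can be sketched as follows. This is a deliberately conservative toy check, not the analysis implemented in the GGDML tools: the kernel description format (a `writes` set and a `reads` map of index offsets) and the function name `can_merge_directly` are hypothetical. The rule assumed here is that two kernels over the same index range may be merged into one loop body only if every field written by one kernel is read by the other at offset 0.

```python
def can_merge_directly(first, second):
    """Conservative legality check for merging two kernels over the same range.

    Each kernel is a dict (hypothetical format) with:
      "writes": set of field names written at index i
      "reads":  dict mapping a field name to the set of offsets read (i + offset)
    """
    # Check both directions: values produced by `first` and read by `second`,
    # and values overwritten by `second` that `first` still needs to read.
    for writer, reader in ((first, second), (second, first)):
        for field in writer["writes"]:
            offsets = reader["reads"].get(field, set())
            if any(offset != 0 for offset in offsets):
                return False
    return True


# Hypothetical kernel descriptions.
gradient = {"writes": {"b"}, "reads": {"a": {-1, 1}}}
scale = {"writes": {"c"}, "reads": {"b": {0}}}
smooth = {"writes": {"d"}, "reads": {"b": {-1, 1}}}

scale_ok = can_merge_directly(gradient, scale)    # b is read at offset 0 only
smooth_ok = can_merge_directly(gradient, smooth)  # b[i+1] is not yet computed
```

A check of this kind could be used to present only the legal merges to the user, who then selects the preferred ones; merges rejected by this simple rule may still be possible with additional transformations that the sketch does not attempt.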

Notes

  1. Refer to https://github.com/aimes-project/ShallowWaterEquations.

  2. Refer to https://github.com/aimes-project/ShallowWaterEquations.

  3. The streaming benchmark ‘stream_sp_mem_avx’ from the ‘Likwid’ tools measured 67 GBytes/s on the processor.

Acknowledgements

This work was supported in part by the German Research Foundation (DFG) through the Priority Programme 1648 Software for Exascale Computing SPPEXA (GZ: LU 1353/11-1). We also thank the ‘Regionales Rechenzentrum Erlangen’ (RRZE) at Friedrich-Alexander-Universität Erlangen-Nürnberg (FAU), the Swiss National Supercomputing Center (CSCS), and NEC Deutschland, who provided access to their machines to run the experiments. We also thank Prof. John Thuburn (University of Exeter) for his help in developing the code of the shallow water equations.

Author information

Corresponding author

Correspondence to Nabeeh Jumah.

Copyright information

© 2020 Springer Nature Switzerland AG

About this paper

Cite this paper

Jumah, N., Kunkel, J. (2020). Optimizing Memory Bandwidth Efficiency with User-Preferred Kernel Merge. In: Schwardmann, U., et al. Euro-Par 2019: Parallel Processing Workshops. Euro-Par 2019. Lecture Notes in Computer Science, vol 11997. Springer, Cham. https://doi.org/10.1007/978-3-030-48340-1_6

  • DOI: https://doi.org/10.1007/978-3-030-48340-1_6

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-030-48339-5

  • Online ISBN: 978-3-030-48340-1

  • eBook Packages: Computer Science, Computer Science (R0)
