A Predictive Performance Model for Stencil Codes on Multicore CPUs

Schäfer, Andreas; Fey, Dietmar

doi:10.1007/978-3-642-38718-0_40

Conference paper

Part of the book series: Lecture Notes in Computer Science (LNTCS, volume 7851)

Abstract

In this paper we present an analytical performance model that yields performance estimates for stencil-based simulations. Unlike previous models, ours neither relies on prototype implementations nor examines computational intensity alone. The model accounts for memory optimizations such as cache blocking and non-temporal stores. Multi-threading, loop unrolling, and vectorization are covered, too. The model is built from a sequence of 1D loops. For each loop we map the different parts of the instruction stream to the corresponding CPU pipelines and estimate their throughput. The load/store streams may be affected not only by their destination (the cache level or NUMA domain they target), but also by concurrent accesses from other threads. An evaluation using a Jacobi solver and the Himeno benchmark shows that the model is accurate enough to capture real-life kernels.
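
To make the idea of mapping a 1D loop onto pipelines more concrete, the following is a minimal sketch of a generic bottleneck estimate. It is not the authors' model, whose details are not given in the abstract. It assumes that pipelines work in parallel, so the slowest one bounds the in-core time, and that main-memory bandwidth is split evenly among threads. All names (`Loop1D`, `Machine`, `seconds_per_iteration`) and all numbers are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Loop1D:
    """Per-iteration instruction mix of one 1D loop (hypothetical values)."""
    adds: float       # floating-point additions per iteration
    muls: float       # floating-point multiplications per iteration
    loads: float      # load instructions per iteration
    stores: float     # store instructions per iteration
    mem_bytes: float  # bytes per iteration transferred to or from main memory


@dataclass
class Machine:
    """Toy machine description; values are assumptions, not measurements."""
    clock_hz: float
    add_per_cycle: float
    mul_per_cycle: float
    load_per_cycle: float
    store_per_cycle: float
    mem_bandwidth: float  # bytes per second, shared by all threads


def seconds_per_iteration(loop, machine, threads=1):
    """Estimate the time of one loop iteration as the slowest resource."""
    # Cycles each pipeline needs for its share of the instruction stream.
    core_cycles = max(
        loop.adds / machine.add_per_cycle,
        loop.muls / machine.mul_per_cycle,
        loop.loads / machine.load_per_cycle,
        loop.stores / machine.store_per_cycle,
    )
    core_time = core_cycles / machine.clock_hz
    # Assumption: concurrent threads share the memory bandwidth evenly.
    mem_time = loop.mem_bytes / (machine.mem_bandwidth / threads)
    return max(core_time, mem_time)


# Hypothetical 5-point Jacobi update: 4 adds, 1 mul, 5 loads, 1 store,
# with 16 bytes of main-memory traffic per updated cell.
jacobi = Loop1D(adds=4, muls=1, loads=5, stores=1, mem_bytes=16)
cpu = Machine(clock_hz=2e9, add_per_cycle=1, mul_per_cycle=1,
              load_per_cycle=2, store_per_cycle=1, mem_bandwidth=16e9)

for n in (1, 4):
    print(n, "thread(s):", seconds_per_iteration(jacobi, cpu, threads=n))
```

With these made-up numbers the single-threaded loop is limited by the addition pipeline, while four threads sharing the same bandwidth become memory bound. A fuller model would also have to distinguish cache levels and NUMA domains, as the abstract indicates.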

Author information

Authors and Affiliations

Friedrich-Alexander-Universität Erlangen-Nürnberg, Germany
Andreas Schäfer & Dietmar Fey

Editor information

Editors and Affiliations

INPT (ENSEEIHT) - IRIT, University of Toulouse, 31062, Toulouse, France
Michel Daydé
Lawrence Berkeley National Laboratory, 94720-8139, Berkeley, CA, USA
Osni Marques
Information Technology Center, The University of Tokyo, 113-8658, Tokyo, Japan
Kengo Nakajima

About this paper

Cite this paper

Schäfer, A., Fey, D. (2013). A Predictive Performance Model for Stencil Codes on Multicore CPUs. In: Daydé, M., Marques, O., Nakajima, K. (eds) High Performance Computing for Computational Science - VECPAR 2012. VECPAR 2012. Lecture Notes in Computer Science, vol 7851. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-642-38718-0_40

DOI: https://doi.org/10.1007/978-3-642-38718-0_40
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-642-38717-3
Online ISBN: 978-3-642-38718-0
eBook Packages: Computer Science, Computer Science (R0)
