
A Predictive Performance Model for Stencil Codes on Multicore CPUs

  • Conference paper

Part of the book series: Lecture Notes in Computer Science (LNTCS, volume 7851)

Abstract

In this paper we present an analytical performance model that yields performance estimates for stencil-based simulations. Unlike previous models, ours neither relies on prototype implementations nor examines computational intensity alone. The model accounts for memory optimizations such as cache blocking and non-temporal stores, and it also covers multi-threading, loop unrolling, and vectorization. It is built from a sequence of 1D loops. For each loop we map the different parts of the instruction stream to the corresponding CPU pipelines and estimate their throughput. The load/store streams may be affected not only by their destination (the cache level or NUMA domain they target), but also by concurrent accesses from other threads. An evaluation using a Jacobi solver and the Himeno benchmark shows that the model is accurate enough to capture real-life kernels.
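To make the idea of a per-loop estimate concrete, the following is a minimal sketch of how a single 1D loop might be bounded by its busiest pipeline and by its memory traffic. It is not the authors' model: the class `Loop1D`, the function `estimate_seconds`, the rule of taking the maximum of compute and memory time, the even split of bandwidth among threads, and every machine parameter below are assumptions chosen for illustration only.

```python
from dataclasses import dataclass


@dataclass
class Loop1D:
    """One innermost 1D loop, described per iteration (hypothetical container)."""
    iterations: int
    pipeline_ops: dict   # pipeline name -> operations issued per iteration
    bytes_moved: float   # bytes exchanged with the targeted memory level per iteration


def estimate_seconds(loop, ops_per_cycle, bandwidth, clock_hz, threads_sharing=1):
    """Rough time estimate: the slower of the busiest pipeline and the memory stream.

    ops_per_cycle   -- pipeline name -> operations the pipeline retires per cycle
    bandwidth       -- bytes per second of the targeted cache level or NUMA domain
    threads_sharing -- threads competing for that bandwidth (assumed to share it evenly)
    """
    # Cycles per iteration are dictated by the most heavily loaded pipeline.
    cycles = max(n / ops_per_cycle[name] for name, n in loop.pipeline_ops.items())
    compute_time = loop.iterations * cycles / clock_hz
    # Concurrent threads reduce the bandwidth available to each thread.
    memory_time = loop.iterations * loop.bytes_moved / (bandwidth / threads_sharing)
    return max(compute_time, memory_time)


# Assumed machine: 2 GHz clock, 4-wide SIMD add/mul/store, two load ports, 10 GB/s.
machine = {"add": 4.0, "mul": 4.0, "load": 8.0, "store": 4.0}
ops = {"add": 6, "mul": 1, "load": 7, "store": 1}  # 7-point Jacobi update, double precision

# A regular store moves 8 B read + 8 B written + 8 B write-allocate per update;
# a non-temporal store avoids the write-allocate transfer.
regular = Loop1D(iterations=1_000_000, pipeline_ops=ops, bytes_moved=24.0)
streaming = Loop1D(iterations=1_000_000, pipeline_ops=ops, bytes_moved=16.0)

t_write_allocate = estimate_seconds(regular, machine, bandwidth=10e9, clock_hz=2e9)
t_non_temporal = estimate_seconds(streaming, machine, bandwidth=10e9, clock_hz=2e9)
t_four_threads = estimate_seconds(regular, machine, bandwidth=10e9, clock_hz=2e9,
                                  threads_sharing=4)
```

With these assumed numbers the loop is limited by memory traffic, so the non-temporal variant comes out faster and four threads sharing the same memory interface each see a longer per-loop time. A full model in the spirit of the abstract would chain several such loops and distinguish cache levels and NUMA domains.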




Copyright information

© 2013 Springer-Verlag Berlin Heidelberg

About this paper

Cite this paper

Schäfer, A., Fey, D. (2013). A Predictive Performance Model for Stencil Codes on Multicore CPUs. In: Daydé, M., Marques, O., Nakajima, K. (eds) High Performance Computing for Computational Science - VECPAR 2012. VECPAR 2012. Lecture Notes in Computer Science, vol 7851. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-642-38718-0_40


  • DOI: https://doi.org/10.1007/978-3-642-38718-0_40

  • Publisher Name: Springer, Berlin, Heidelberg

  • Print ISBN: 978-3-642-38717-3

  • Online ISBN: 978-3-642-38718-0

  • eBook Packages: Computer Science, Computer Science (R0)
