ABSTRACT
In this paper, we explain the steps we have taken to port a large, industry-grade computational fluid dynamics application to the Intel® Xeon Phi™ coprocessor using the C/C++ Array Notation extensions of Intel® Cilk™ Plus. An essential part of performance refactoring for the Xeon Phi coprocessor is achieving high-quality SIMD vectorization. Although there are other ways to vectorize code, the Array Notation extensions have proven to work best for our application. We have encapsulated the Array Notation syntax in a C++ wrapper class, which drastically reduces the refactoring effort. In addition, the architecture independence of the Array Notation extensions further reduces porting and tuning effort. We study how our approach helps the compiler to generate vectorized code. From that study, we summarize our key lessons and findings as well as the current limitations. Finally, we present a performance evaluation of the ported computational fluid dynamics application using the introduced C++ wrapper class and distinguish our solution from related solutions.
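The idea of hiding the vector syntax behind a C++ wrapper class can be illustrated with a minimal sketch. It is not the class from the paper: the names `SimdPack` and `axpy`, the fixed lane count `W`, and the requirement that the array length be a multiple of `W` are all assumptions made for illustration. The element-wise loops are written in standard C++ so the sketch compiles anywhere; comments mark where a Cilk Plus compiler could use Array Notation instead.

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical wrapper around a fixed-width group of W values.
// Names and layout are illustrative assumptions, not the paper's class.
template <typename T, std::size_t W>
struct SimdPack {
    T lane[W];

    // Copy W consecutive elements from memory into the pack.
    static SimdPack load(const T* src) {
        assert(src != nullptr);
        SimdPack r;
        for (std::size_t i = 0; i < W; ++i) r.lane[i] = src[i];
        return r;
    }

    // Copy the W lanes back to memory.
    void store(T* dst) const {
        assert(dst != nullptr);
        for (std::size_t i = 0; i < W; ++i) dst[i] = lane[i];
    }

    // Element-wise addition. With Cilk Plus Array Notation the loop
    // could be written as: r.lane[:] = a.lane[:] + b.lane[:];
    friend SimdPack operator+(const SimdPack& a, const SimdPack& b) {
        SimdPack r;
        for (std::size_t i = 0; i < W; ++i) r.lane[i] = a.lane[i] + b.lane[i];
        return r;
    }

    // Element-wise multiplication. Array Notation equivalent:
    // r.lane[:] = a.lane[:] * b.lane[:];
    friend SimdPack operator*(const SimdPack& a, const SimdPack& b) {
        SimdPack r;
        for (std::size_t i = 0; i < W; ++i) r.lane[i] = a.lane[i] * b.lane[i];
        return r;
    }
};

// Computes y[i] = a * x[i] + y[i], W elements at a time.
// Assumes n is a multiple of W (no remainder loop in this sketch).
template <typename T, std::size_t W>
void axpy(T a, const T* x, T* y, std::size_t n) {
    assert(n % W == 0);
    SimdPack<T, W> va;
    for (std::size_t i = 0; i < W; ++i) va.lane[i] = a;  // broadcast scalar
    for (std::size_t i = 0; i < n; i += W) {
        SimdPack<T, W> vx = SimdPack<T, W>::load(x + i);
        SimdPack<T, W> vy = SimdPack<T, W>::load(y + i);
        (va * vx + vy).store(y + i);
    }
}
```

A call such as `axpy<double, 8>(2.0, x, y, 16)` keeps the kernel code free of vector syntax: only the operators inside the wrapper would need to change when switching between plain loops and Array Notation.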