ABSTRACT
Weather and climate simulations are a major application driver in high-performance computing (HPC). With the end of Dennard scaling and Moore's law, the HPC industry increasingly employs specialized accelerators to increase computational throughput. Manycore architectures, such as Intel's Knights Landing (KNL), are a representative example of future processing devices. However, software has to be modified to use these devices efficiently. In this work, we demonstrate how an existing domain-specific language (DSL) designed for CPUs and GPUs can be extended to manycore architectures such as KNL. On hand-tuned representative stencils from the dynamical core (dycore) of the COSMO weather model and from its radiation code, we achieve performance comparable to that of the NVIDIA Tesla P100 GPU. Further, for the full DSL-based, GPU-optimized COSMO dycore code, we obtain performance within a factor of two of the P100. We find that optimizing code to full performance on modern manycore architectures requires effort and hardware knowledge similar to that required for GPUs. Finally, we show the limitations of the present approaches and outline our lessons learned, along with possible principles for the design of future DSLs for accelerators in the weather and climate domain.
Index Terms
- Porting the COSMO Weather Model to Manycore CPUs