ABSTRACT
High-performance computing has developed rapidly over the past decades. Most supercomputers today still rely on a traditional CPU-only architecture or on a narrow range of heterogeneous devices such as NVIDIA GPUs. Devices from other vendors are also reliable, for example Intel GPUs, which were first introduced a few years ago. Because programming models for most heterogeneous devices are tightly tied to a single vendor, porting an application from one platform to another is usually difficult and time-consuming. With oneAPI, applications can be ported to Intel GPUs and Intel FPGAs at almost zero cost, rather than rewriting kernel code in CUDA. This paper demonstrates the porting of an important and widely used application, the GW approximation used in first-principles calculations. We describe the complete workflow for porting an application to Intel oneAPI, which allows the application to run across platforms. To make full use of all devices, we introduce an automatic heterogeneous device detection and workload balancing method. Moreover, we develop a series of methods for balancing the workload across individual GPU cores, and we design a reduction method that outperforms those in previous work. Tests were run on two Intel Xe-HP ATS-P GPUs, and the results show that our method achieves a further 65x speedup over a traditional OpenMP + MPI implementation.
Index Terms
- An Accelerated First Principle Method Implemented on IntelGPU