
An Accelerated First Principle Method Implemented on IntelGPU

Authors:
Le Xu
School of Computer Science and Technology, University of Science and Technology of China, China

Hong An
School of Computer Science and Technology, University of Science and Technology of China, China

CSAE '22: Proceedings of the 6th International Conference on Computer Science and Application Engineering, October 2022, Article No.: 56, Pages 1–7, https://doi.org/10.1145/3565387.3565443

Published: 13 December 2022


ABSTRACT

High-performance computing has developed rapidly over the past decades. Today, most supercomputers still use a traditional CPU-only architecture or a very limited set of heterogeneous devices, such as NVIDIA GPUs. Devices from other vendors are also reliable, such as the Intel GPU, which was first introduced a few years ago. Because the programming models for most heterogeneous devices are tightly bound to their vendors, porting from one platform to another is usually difficult and time-consuming. With oneAPI, applications can be ported to Intel GPUs and Intel FPGAs at almost zero cost, rather than rebuilding kernel code in CUDA. This paper demonstrates the porting of an important and widely used application, the GW approximation used in first-principles calculations. We describe the complete workflow for porting applications to Intel oneAPI, which allows the application to run on all platforms. To make full use of all devices, we introduce an automatic heterogeneous device detection and workload-balancing method. Moreover, we develop a series of methods for balancing the workload across the individual cores of a GPU, and we design a reduction method that improves on previous work. The tests were run on 2 Intel Xe-HP ATS-P GPUs, and the results show that our method achieves a further 65x speedup over the traditional OpenMP + MPI implementation.
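
The abstract does not spell out how the workload is balanced across devices. As a rough illustration only, the sketch below shows one generic way to split a fixed number of work items across detected devices in proportion to a measured throughput figure. The function name `split_workload`, the device labels, and the throughput numbers are all hypothetical and are not taken from the paper; the authors' actual method may differ.

```python
from typing import Dict


def split_workload(total_items: int, throughput: Dict[str, float]) -> Dict[str, int]:
    """Split total_items across devices in proportion to their throughput.

    Hypothetical helper for illustration; not the paper's algorithm.
    Uses the largest-remainder rule so the shares always sum to total_items.
    """
    if total_items < 0:
        raise ValueError("total_items must be non-negative")
    if not throughput or any(rate <= 0 for rate in throughput.values()):
        raise ValueError("every device needs a positive throughput")

    total_rate = sum(throughput.values())
    # Ideal (fractional) share of each device.
    ideal = {dev: total_items * rate / total_rate for dev, rate in throughput.items()}
    # Start from the rounded-down shares.
    shares = {dev: int(value) for dev, value in ideal.items()}
    leftover = total_items - sum(shares.values())
    # Give the remaining items to the devices with the largest fractional parts.
    by_remainder = sorted(ideal, key=lambda dev: ideal[dev] - shares[dev], reverse=True)
    for dev in by_remainder[:leftover]:
        shares[dev] += 1
    return shares


if __name__ == "__main__":
    # Assumed relative throughputs: one CPU and two identical GPUs.
    print(split_workload(100, {"cpu": 1.0, "gpu0": 4.5, "gpu1": 4.5}))
```

In practice the throughput figures would come from a short calibration run on each device, and the split could be recomputed if the measured rates drift during execution.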


Index Terms

1. Computing methodologies
  1. Parallel computing methodologies
    1. Parallel algorithms


Published in
CSAE '22: Proceedings of the 6th International Conference on Computer Science and Application Engineering
October 2022
411 pages
ISBN:9781450396004
DOI:10.1145/3565387
Editor:
Ali Emrouznejad
Copyright © 2022 ACM
Publisher
Association for Computing Machinery
New York, NY, United States
Author Tags
First Principle Method
GPU
High Performance Computing
Memory Optimization
ONEAPI
Qualifiers
- research-article
- Research
- Refereed limited

Acceptance Rates
Overall Acceptance Rate: 368 of 770 submissions, 48%