Proceeding Downloads
Reducing shared memory footprint to leverage high throughput on Tensor Cores and its flexible API extension library
Matrix-matrix multiplication is used for various linear algebra algorithms such as matrix decomposition and tensor contraction. NVIDIA Tensor Core is a mixed-precision matrix-matrix multiplication and addition computing unit, where the theoretical peak ...
Efficient Large Integer Multiplication with Arm SVE Instructions
In this study, we implement large integer multiplication with the Arm Scalable Vector Extension (SVE) instructions. SVE is a single instruction, multiple data (SIMD) instruction set for the Arm AArch64 architecture. We use a reduced-radix representation ...
Effectiveness of the Oversubscribing Scheduling on Supercomputer Systems
High responsiveness is essential for user satisfaction in supercomputer systems. Recently, the use of interactive jobs in addition to traditional batch jobs has been attracting attention. It is becoming important to handle those jobs consolidated for ...
A new data conversion method for mixed precision Krylov solvers with FP16/BF16 Jacobi preconditioners
Mixed precision Krylov solvers with the Jacobi preconditioner often show significant convergence degradation when the Jacobi preconditioner is computed in low precision such as FP16 and BF16. It is found that this convergence degradation is attributed ...
Fault Tolerance for Ensemble-based Molecular-Continuum Flow Simulations
Molecular dynamics (MD) simulations require substantial computational effort, which makes them very time-consuming. This particularly holds for molecular-continuum simulations in fluid dynamics, which rely on the simulation of MD ensembles that are coupled to ...
Comparison of Reproducible Parallel Preconditioned BiCGSTAB Algorithm Based on ExBLAS and ReproBLAS
Krylov subspace algorithms are important methods for solving linear systems. In order to efficiently solve large-scale linear systems, parallelism techniques are often applied. However, parallelism often enlarges the non-associativity of floating-point ...
A Case Study on DaCe Portability & Performance for Batched Discrete Fourier Transforms
With the emergence of new computer architectures, portability and performance-portability become significant concerns for developing HPC applications. This work reports our experience and lessons learned using DaCe to create and optimize batched ...
Memory Usage Prediction of HPC Workloads Using Feature Engineering and Machine Learning
In High Performance Computing (HPC) systems, numerous applications of varying scale and domain are scheduled to run concurrently, and share the available CPU and memory capacities among themselves. Applications whose run-time memory usage is not known ...
Associative Operator Precedence Parsing: A Method To Increase Data Parsing Parallelism
Data often arrive in high volume in textual formats (JSON, XML, CSV). Because parsing can easily dominate data analysis time, researchers have been working on parallelizing parsing. Operator Precedence Parsing (OPP), among candidate parsing methods, ...
Fault-Tolerant LOBPCG for Nuclear CI Calculations
Exascale computing platforms with millions of compute units and with thousands of nodes are predicted to experience frequent faults which interrupt applications’ execution. In this context resilience against faults becomes important. We examine user and ...
Parallelization of Automatic Tuning for Hyperparameter Optimization of Pedestrian Route Prediction Applications using Machine Learning
We study software automatic tuning. Automatic tuning tools using iterative one-dimensional search estimate hyperparameters of machine learning programs. Iterative one-dimensional search searches the parameter space consisting of possible values of the ...
LibCOS: Enabling Converged HPC and Cloud Data Stores with MPI
Recently, federated HPC and cloud resources are becoming increasingly strategic for providing diversified and geographically available computing resources. However, accessing data stores across HPC and cloud storage systems is challenging. Many cloud ...
GPU–FPGA-accelerated Radiative Transfer Simulation with Inter-FPGA Communication
- Ryohei Kobayashi
- Norihisa Fujita
- Yoshiki Yamaguchi
- Taisuke Boku
- Kohji Yoshikawa
- Makito Abe
- Masayuki Umemura
The complementary use of graphics processing units (GPUs) and field programmable gate arrays (FPGAs) is a major topic of interest in the high-performance computing (HPC) field. GPU–FPGA-accelerated computing is an effective tool for multiphysics ...
Exploiting Data Parallelism in Graph-Based Simultaneous Localization and Mapping: A Case Study with GPU Accelerations
Graph-based simultaneous localization and mapping (G-SLAM) is an intuitive SLAM implementation where graphs are used to represent poses, landmarks and sensor measurements when a mobile robot builds a map of the environment and locates itself in it. ...
ESSPER: Elastic and Scalable FPGA-Cluster System for High-Performance Reconfigurable Computing with Supercomputer Fugaku
FPGA clusters have yet to become mainstream in HPC, even as accelerators, and several challenges exist in their architecture and system organization. This work presents ESSPER, a flexible and scalable FPGA cluster prototype system for reconfigurable HPC ...
Index Terms
- Proceedings of the International Conference on High Performance Computing in Asia-Pacific Region
Acceptance Rates
| Year | Submitted | Accepted | Rate |
|---|---|---|---|
| HPCAsia '23 | 34 | 15 | 44% |
| HPCAsia '23 Workshops | 10 | 9 | 90% |
| HPCAsia '19 | 32 | 15 | 47% |
| HPCAsia '18 | 67 | 30 | 45% |
| Overall | 143 | 69 | 48% |