
1 Introduction

Tsunamis are natural disasters that may cause many fatalities and extensive damage to infrastructure in coastal areas. In 2011, the Tohoku earthquake tsunami, which was generated by a 9.0-magnitude earthquake, struck the northeastern part of the Pacific coast of Japan. The inundation extended along roughly 2,000 km of the Japanese coastline and covered more than 500 km\(^2\) [4, 13]. The number of fatalities exceeded 19,000, and many houses and infrastructure facilities were destroyed [4, 18]. Because the damaged area was so widespread, the Japanese government could not grasp the full extent of the damage shortly after the tsunami, which made it difficult to respond promptly to the disaster [11]. Therefore, a forecast system that can quickly estimate the damage caused by a tsunami inundation after an earthquake is needed worldwide in order to provide an early understanding of the damage situation.

We have developed a real-time tsunami inundation forecast system using the vector supercomputer SX-ACE [15] to quickly estimate tsunami inundation and infrastructure damage [9]. The forecast system is automatically activated by a large earthquake that may generate a tsunami, simulates the tsunami inundation, and finally visualizes the results of the damage estimation on a Web site. The total processing time is within 20 min for a coastal city region at the resolution of a 10-m grid size [7, 15]. The forecast system has also been implemented as the disaster information system of the Cabinet Office of Japan; however, the forecast area had to be expanded from a city region to the southwestern part of the Pacific coast of Japan, corresponding to a coastal length of 5,500 km. Therefore, in the implementation for the Cabinet Office, the grid size and the total processing time were relaxed from 10 m to 30 m and from 20 min to 30 min, respectively, owing to the processing time achievable on the SX-ACE system. In addition, since the entire Pacific coast of Japan has a high probability of damage from large tsunamis, it is necessary to further expand the forecast areas, and the resolution of a 10-m grid size is required for accurate forecasting. Therefore, we need to improve the performance of the tsunami inundation simulation in order to enhance disaster prevention and mitigation by expanding the forecast areas and increasing the resolution.

Modern supercomputers consist of multiple nodes, each of which is equipped with multi-core processors. These processors have recently employed vector (SIMD) instructions to further exploit loop-level parallelism. Intel Corporation released the Xeon Phi processor (Knights Landing) in 2016 and the Xeon Gold processor (Skylake) in 2017, and NEC Corporation developed the SX-Aurora TSUBASA processor in 2018. We previously evaluated the performance of our tsunami inundation simulation on the Xeon Phi processor and found that Xeon Phi requires vectorization optimizations to achieve high performance [14]. However, since the performance of Xeon Phi did not exceed that of SX-ACE, the forecast system cannot increase its resolution or expand its forecast areas by adopting Xeon Phi. In this paper, we discuss the performance of the tsunami inundation simulation on the latest vector supercomputer SX-Aurora TSUBASA [20] and the Xeon Gold processor (Skylake), and the possibility of increasing the forecast resolution and expanding the forecast areas.

The rest of this paper is organized as follows. In Sect. 2, we present related work on performance evaluations using supercomputers. Section 3 explains the tsunami inundation simulation. Section 4 describes the experimental setup of the performance evaluation, which includes an overview of SX-Aurora TSUBASA. Section 5 presents the results of the performance evaluations. We conclude this paper in Sect. 6 with a brief summary and mention of future work.

2 Related Work

A real-time tsunami forecast system requires a short execution time in order to provide adequate disaster prevention and mitigation. To this end, there have been studies on reducing the execution time of tsunami simulations using high-performance computing. Christgau et al. [3] evaluated the performance of the Intel Xeon Phi (Knights Corner), the Xeon (Ivy Bridge), and the NVIDIA Tesla K40m GPGPU (general-purpose computing on graphics processing units) using the tsunami simulation EasyWave, which is based on the TUNAMI-F1 model [6]. They optimized the program to increase the reusability of cached data. Specifically, the computational domain was divided into smaller blocks that fit into the last-level cache on Xeon and Xeon Phi. The program was also vectorized by the Intel compiler. They parallelized it using OpenMP on Xeon and MPI on Xeon Phi, and the program has a CUDA version for the GPGPU. They compared the performance of the program with an 18,001 \(\times \) 16,301 grid on Xeon, Xeon Phi, and Tesla K40m. The performance of a single Xeon processor (10 cores) was 1.23 and 1.8 times higher than those of a single Xeon Phi processor (61 cores) and a single Tesla K40m GPGPU (2880 cores), respectively. They found that EasyWave was memory intensive and required a reduction in memory load to achieve higher performance, and that vector processing on Xeon and Xeon Phi was useful for accelerating the tsunami simulation. In this paper, we discuss the potential of vector processing and high memory bandwidth by comparing the performance of the vector supercomputers SX-Aurora TSUBASA and SX-ACE as well as Xeon Gold.

Nagasu et al. [16] developed a field-programmable gate array (FPGA)-based tsunami simulation system using the method of splitting tsunami (MOST) model, which solves shallow water equations for ocean-wide tsunami propagation. The machine contained an Altera Arria 10 FPGA, two DDR3 PC12800 SDRAM modules, and a PCI-Express Gen2 interface. They evaluated the performance and power consumption of a tsunami propagation simulation with a 2581 \(\times \) 2879 grid in the Pacific Ocean on the FPGA system, an NVIDIA Tesla K20c, and an AMD Radeon R9 280X GPGPU. The sustained performances were 383.0 Gflop/s on the FPGA system, 44.6 Gflop/s on Tesla K20c, and 219.2 Gflop/s on Radeon R9 280X. The performance per unit of power on the FPGA system was 17.2 and 7.1 times higher than those on Tesla K20c and Radeon R9 280X, respectively. This research demonstrated the significance of power efficiency. However, their grid resolutions were several kilometers, and tsunami inundation behavior was not simulated. This paper addresses the tsunami inundation simulation at the higher resolutions that are necessary for the forecast system.

Arce Acuña et al. [1] developed a non-linear spherical shallow water model to compute tsunami propagation across an ocean on GPGPUs. The model utilized a tree-based mesh refinement method to generate the computational mesh and was used to replicate the 2004 Indian Ocean tsunami off Indonesia (1,500 km \(\times \) 500 km) with grid sizes from 1852 m down to 50 m. The execution time for ten hours of tsunami behavior was 8.19 min on eight NVIDIA Tesla P100 GPGPUs. These results indicated that their model has potential for real-time tsunami forecasting using GPGPUs. However, the area covered at the 50-m grid size was a city region. In contrast, we evaluate the performance of our tsunami inundation simulation using a 10-m grid size along the coastlines.

3 Tsunami Inundation Simulation

Our tsunami inundation simulation is based on the TUNAMI model [10] that solves non-linear shallow water equations with the staggered leap-frog finite difference method. The governing equations are

$$\begin{aligned}&\frac{\partial \eta }{\partial t}+\frac{\partial M}{\partial x}+\frac{\partial N}{\partial y}=0, \end{aligned}$$
(1)
$$\begin{aligned}&\begin{aligned} \frac{\partial M}{\partial t}+\frac{\partial }{\partial x}\Bigl (\frac{M^2}{D}\Bigr )+\frac{\partial }{\partial y}\Bigl (\frac{MN}{D}\Bigr )+gD\frac{\partial \eta }{\partial x} \\ +\frac{gn^2}{D^{7/3}}M\sqrt{M^2+N^2}=0, \\ \end{aligned} \end{aligned}$$
(2)
$$\begin{aligned}&\begin{aligned} \frac{\partial N}{\partial t}+\frac{\partial }{\partial x}\Bigl (\frac{MN}{D}\Bigr )+\frac{\partial }{\partial y}\Bigl (\frac{N^2}{D}\Bigr )+gD\frac{\partial \eta }{\partial y} \\ +\frac{gn^2}{D^{7/3}}N\sqrt{M^2+N^2}=0, \end{aligned} \end{aligned}$$
(3)

where \(\eta \) is the vertical displacement of the water surface, \(D\) is the total water depth, \(g\) is the gravitational acceleration, \(n\) is Manning's roughness coefficient, and \(x\) and \(y\) are the horizontal coordinates. \(M\) and \(N\) are the discharge fluxes in the \(x\) and \(y\) directions, respectively, and are given by

$$\begin{aligned} M=\int _{-h}^{\eta }udz=\bar{u}(\eta +h), \end{aligned}$$
(4)
$$\begin{aligned} N=\int _{-h}^{\eta }vdz=\bar{v}(\eta +h), \end{aligned}$$
(5)

where \(h\) is the still water depth, \(u\) and \(v\) are the water velocities in the \(x\) and \(y\) directions, respectively, and \(\bar{u}\) and \(\bar{v}\) are the corresponding depth-averaged water velocities. The tsunami source is calculated using Okada's equation [19], and the damage estimation is based on the study by Koshimura et al. [10].
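To make the discretization concrete, the following is a minimal Fortran sketch, not the actual simulation code, of one staggered leap-frog update of the continuity equation (1); the array names, loop bounds, and omitted boundary and wet/dry treatment are illustrative assumptions.

! Minimal sketch of one leap-frog step for the continuity equation (1).
! eta(i,j): water surface displacement at cell centers
! m(i,j), n(i,j): discharge fluxes M, N on the staggered cell faces
subroutine mass_update(eta, m, n, nx, ny, dt, dx, dy)
  implicit none
  integer, intent(in) :: nx, ny
  real, intent(in)    :: dt, dx, dy
  real, intent(in)    :: m(nx, ny), n(nx, ny)
  real, intent(inout) :: eta(nx, ny)
  integer :: i, j
  do j = 2, ny - 1
    do i = 2, nx - 1
      eta(i, j) = eta(i, j)                          &
                - dt / dx * (m(i, j) - m(i - 1, j))  &
                - dt / dy * (n(i, j) - n(i, j - 1))
    end do
  end do
end subroutine mass_update

Even in this simplified form, each grid-point update touches several array elements per floating-point operation, which is why the Bytes/Flop analysis in Sect. 5 is central to the performance discussion.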

The tsunami inundation simulation uses a nested grid system to capture the multiscale nature of a tsunami. The computational domain is a polygonal domain [5], which reduces the number of computations in the simulation [14]. In the Cabinet Office system, which covers a coastal length of 5,500 km, the simulated area is divided into fifteen subareas; each subarea is simulated individually because the subareas have different geographical coordinates and a different tidal level is used for each region.

4 Experimental Setup for Performance Evaluation

We evaluate the performance of the tsunami inundation simulation by using SX-Aurora TSUBASA, SX-ACE, and Xeon Gold. Table 1 lists the specifications of each system. Here, SP and DP indicate single precision and double precision floating-point operations, respectively.

Table 1. Specifications of systems used in the evaluation

4.1 SX-Aurora TSUBASA

SX-Aurora TSUBASA is a new vector supercomputer released in 2018 [8, 20]. It consists of one or more card-type vector engines (VEs) and a vector host (VH). The VE is the main component of SX-Aurora TSUBASA and contains a vector processor, a 16 MB shared last-level cache (LLC), and six HBM2 memory modules, as shown in Fig. 1. The vector processor has eight vector cores and is available at two clock frequencies, 1.6 GHz and 1.4 GHz; in this evaluation, we use the 1.4 GHz VE. The peak single precision floating-point performances of a vector core and a vector processor are 537.6 Gflop/s and 4.3 Tflop/s, respectively. The memory bandwidth is 358 GB/s per vector core and 1.22 TB/s per vector processor. The ratio of memory bandwidth to floating-point performance, the so-called Bytes/Flop (B/F) ratio, is 0.67 for a vector core and 0.28 for a vector processor. The VH is a standard x86 Linux server that executes the operating system (OS) functions of the VE. While the conventional SX series runs an OS on the vector processor itself, SX-Aurora TSUBASA offloads OS-related processing to the VH, since the VH is better suited to such scalar processing. Figure 2 shows the maximum configuration of one VH, which contains eight VEs, two InfiniBand interfaces, and two Xeon processors. Moreover, a larger SX-Aurora TSUBASA system can be composed by connecting VHs via an InfiniBand switch.
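For reference, these B/F values follow directly from the peak figures above (a worked check, not additional measured data):

$$\begin{aligned} \mathrm{B/F}_{\mathrm{core}}=\frac{358\ \mathrm{GB/s}}{537.6\ \mathrm{Gflop/s}}\approx 0.67, \qquad \mathrm{B/F}_{\mathrm{processor}}=\frac{1.22\ \mathrm{TB/s}}{4.3\ \mathrm{Tflop/s}}\approx 0.28. \end{aligned}$$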

Fig. 1. Block diagram of a vector engine

Fig. 2. Block diagram of the SX-Aurora TSUBASA system

The SX-Aurora TSUBASA system supports Red Hat Enterprise Linux and CentOS as the OS, and the VH provides OS functions such as process scheduling and the handling of system calls invoked by applications running on the VEs. The development environment comprises NEC MPI and the NEC Software Development Kit for Vector Engine, which contains the NEC Fortran compiler, NEC C/C++ compiler, NEC Numeric Library Collection, NEC Parallel Debugger, and NEC Ftrace Viewer. The Fortran and C/C++ compilers support automatic optimization, automatic vectorization, automatic parallelization, and OpenMP. NEC MPI is implemented in accordance with MPI-3.1.

4.2 SX-ACE

SX-ACE, which is also a vector supercomputer, was released in 2014. Each node consists of one vector processor and 64 GB of memory [12]. The vector processor has four vector cores, each of which comprises a scalar processing unit, a vector processing unit, and a 1 MB on-chip private cache. The peak performances of a vector core and a vector processor are 64 Gflop/s and 256 Gflop/s, respectively. The memory bandwidth is 256 GB/s per vector core and per vector processor. The B/F ratios of a vector core and a vector processor are therefore 4.0 and 1.0, respectively. The SX-ACE system is composed of 512 nodes connected via a custom interconnect network (IXS) with an 8 GB/s bandwidth.

The OS of SX-ACE is SUPER-UX, which is based on UNIX System V with extensions for performance and functionality. The Fortran compiler for SX-ACE, FORTRAN90/SX, supports automatic optimization, automatic vectorization, automatic parallelization, and OpenMP. The MPI library, MPI2/SX, implements the Message Passing Interface specifications MPI-1.3 and MPI-2.1.

4.3 Xeon Gold

The Xeon Gold system consists of two Intel Xeon Gold 6126 processors (released in 2017) with 192 GB of memory and a memory bandwidth of 128 GB/s. Each processor contains 12 cores and a 19.5 MB shared last-level cache. The peak single precision floating-point performances of a core and a processor are 147.2 Gflop/s and 1766.4 Gflop/s, respectively. The B/F ratio of a processor is 0.07. The nodes are connected with InfiniBand EDR at 12.5 GB/s.

The OS of the Xeon Gold system is Red Hat Enterprise Linux, and the programming environment is Intel Parallel Studio XE Cluster Edition, which contains Fortran and C/C++ compilers and an MPI library. The Fortran compiler has the functions of automatic optimization, automatic vectorization, automatic parallelization, and OpenMP.

Fig. 3. Computational domains for Kochi prefecture

Table 2. Computational domain

4.4 Evaluated Model and Compiler Options

We evaluate the performance of the tsunami inundation simulation for the Kochi prefecture region in Japan with a 10-m grid size, as shown in Fig. 3. The coastal length of this region is about 720 km, and the region is one of the fifteen subareas in the system of the Cabinet Office of Japan. Table 2 lists the grid size and the number of grid points in each domain. Domains 4 and 5 are the polygonal domains. The simulation case is the tsunami generated by a 9.0-magnitude Nankai Trough earthquake [2], and the tsunami inundation behavior is calculated with a time interval of 0.2 s for six hours after the earthquake. The simulation is executed in single precision floating-point arithmetic. Figure 4 shows the simulation results for the maximum tsunami inundation depth and the structural damage. Traditional tsunami forecasts provided only tsunami arrival times and tsunami heights along coastlines, whereas our forecast system can also estimate tsunami inundation and infrastructure damage in coastal land areas.
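As a rough indication of the computational load of this case, derived from the settings above, the six-hour simulation with a 0.2-s time interval corresponds to

$$\begin{aligned} \frac{6\ \mathrm{h}\times 3600\ \mathrm{s/h}}{0.2\ \mathrm{s}}=108{,}000 \end{aligned}$$

time steps over the nested domains listed in Table 2.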

Fig. 4. Simulation results. a Maximum tsunami inundation depth. b Structural damage (houses)

Table 3. Fortran compiler options and directives on each system

Table 3 lists the Fortran compiler versions, options, and directives on each system. The options -Chopt, -O3, and -msched-block are high-level optimization options, and -finline-functions, -pi, and -ipo are inlining options. As the compiler for Xeon Gold is unable to vectorize two time-consuming do-loops with these options alone, we insert the vector always directive into the simulation code. The -xCORE-AVX512 option additionally generates object code that includes AVX-512 instructions.
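The directive is placed immediately before each of the two loops; the fragment below is a hedged illustration of the form such an insertion takes for the Intel Fortran compiler, with a placeholder loop body rather than one of the actual simulation loops.

! Illustration only (not the actual simulation code): an Intel Fortran
! vectorization hint of the kind described above, placed directly
! before a do-loop. Names and the loop body are placeholders.
subroutine momentum_update_x(m, eta, d, nx, dt, dx, g)
  implicit none
  integer, intent(in) :: nx
  real, intent(in)    :: dt, dx, g
  real, intent(in)    :: eta(nx), d(nx)
  real, intent(inout) :: m(nx)
  integer :: i
!DIR$ VECTOR ALWAYS
  do i = 1, nx - 1
    ! placeholder linearized momentum update: dM/dt = -g D d(eta)/dx
    m(i) = m(i) - dt * g * d(i) * (eta(i + 1) - eta(i)) / dx
  end do
end subroutine momentum_update_x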

5 Performance Results

5.1 Single Core Performance

Figure 5 shows the execution times on a single core of SX-Aurora TSUBASA, SX-ACE, and Xeon Gold. The CPU time is the running time of the CPU, and the memory time indicates the stall time of the CPU due to data loads from memory. Directive and Non-directive indicate the cases with and without the vectorization directive on Xeon Gold, respectively. The vector always directive enables the vectorization of the two time-consuming do-loops, which take up about 80% of the total execution time even when the loops are vectorized. The CPU time of the directive case is less than half that of the non-directive case. This result shows that vector processing is key to achieving high performance on the Xeon Gold processor.

Fig. 5. Execution times on a single core. CPU: running time of the CPU, Memory: stall time of the CPU due to data loads

Table 4. Code B/F and actual B/F on SX-ACE and SX-Aurora TSUBASA

The execution times of both the CPU and the memory on SX-Aurora TSUBASA are the shortest among all the systems. SX-Aurora TSUBASA can execute the tsunami inundation simulation in 58% and 36% of the execution times of SX-ACE and Xeon Gold (directive), respectively. The memory time on SX-Aurora TSUBASA is 40% and 11% of those on SX-ACE and Xeon Gold, respectively. This is because SX-Aurora TSUBASA has the highest LLC bandwidth among all the systems. Moreover, the memory accesses on SX-Aurora TSUBASA can be hidden by vector arithmetic operations, since SX-Aurora TSUBASA has long vector pipelines with a vector length of 256. SX-ACE can also hide memory accesses and requires 61% of the execution time of Xeon Gold (directive); the memory time on SX-ACE is 25% of that on Xeon Gold (directive).

To analyze the performance using B/F ratios, we use two metrics: the code B/F and the actual B/F. The code B/F is defined as the number of bytes of data required by the code divided by the number of floating-point operations in the code; the numbers of memory operations and floating-point operations are counted from the simulation code. Since a processor contains a cache memory, the amount of data actually loaded from memory is smaller than the amount required by the code. Thus, we also use the actual B/F, which is calculated from the number of block memory accesses and the number of floating-point operations observed when the simulation is executed. These execution statistics on SX-Aurora TSUBASA and SX-ACE are obtained by using the NEC profiler tool proginf [17]. Table 4 shows the code B/F and the actual B/F on SX-Aurora TSUBASA and SX-ACE. The code B/F of 1.8 is reduced to actual B/F values of 0.48 on SX-Aurora TSUBASA and 0.66 on SX-ACE. This result shows that the cache memory reduces the memory loads of the tsunami inundation simulation on these processors.
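The effect of the cache can be read directly from these numbers. For example, on SX-Aurora TSUBASA,

$$\begin{aligned} \frac{\text{actual B/F}}{\text{code B/F}}=\frac{0.48}{1.8}\approx 0.27, \end{aligned}$$

i.e., only about 27% of the data required by the code is actually loaded from memory, the remainder being supplied by the LLC; the corresponding ratio on SX-ACE is \(0.66/1.8\approx 0.37\).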

Fig. 6. Multi-core performance. a Execution times. b Scalability

The B/F ratio of a vector core on SX-Aurora TSUBASA is 0.67, which is smaller than the code B/F. This indicates that the tsunami inundation simulation can be considered a memory-bandwidth-limited program for SX-Aurora TSUBASA. In addition, since the actual B/F of 0.48 is smaller than the core B/F of 0.67, the bandwidth between the LLC and the memory is sufficient. Therefore, the bandwidth between the vector core and the LLC (the LLC bandwidth) poses the performance bottleneck, and the performance of the simulation is limited by the performance of the LLC. Although the LLC bandwidth becomes a bottleneck on SX-Aurora TSUBASA, its execution time is the shortest among all the systems because its LLC bandwidth is the highest.

On SX-ACE, the B/F ratio of a vector core is 4.0, which is greater than the code B/F. Therefore, the tsunami inundation simulation can be considered a computation-bound program for SX-ACE and does not require the full memory bandwidth. Although the peak performance of the core is a mere 64 Gflop/s, the CPU time is shorter than that of Xeon Gold (directive). Here, the sustained calculation performance is 12.8 Gflop/s, and the computational efficiency reaches 20%. This is because the B/F ratio of the vector core on SX-ACE is high and the vector core can therefore execute the simulation efficiently.

In the case of Xeon Gold, as the per-core memory bandwidth is not officially disclosed, we assume that one core can utilize the full memory bandwidth of 128 GB/s. The resulting core B/F is 0.87, which is smaller than the code B/F. Therefore, the tsunami inundation simulation can be considered a memory-bandwidth-limited program for Xeon Gold as well, and the bandwidth of the core or the memory causes a performance bottleneck. As the memory system of Xeon Gold has the lowest performance among the three systems, its execution time is the longest.
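The three cases can be summarized by the single comparison used throughout this analysis: when the B/F ratio of a core is smaller than the code B/F, the memory system cannot feed the arithmetic units and the simulation is memory-bandwidth-limited; otherwise it is computation-bound.

$$\begin{aligned} &\text{SX-Aurora TSUBASA: } 0.67 < 1.8 \Rightarrow \text{memory-bandwidth-limited},\\ &\text{SX-ACE: } 4.0 > 1.8 \Rightarrow \text{computation-bound},\\ &\text{Xeon Gold: } 0.87 < 1.8 \Rightarrow \text{memory-bandwidth-limited}. \end{aligned}$$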

From these evaluations and analyses, we clarify that SX-Aurora TSUBASA has the shortest execution time among all the systems owing to its high LLC bandwidth. This performance offers the possibility of improving the capabilities of the forecast system.

Fig. 7. MPI communication and idle times with 512 cores on SX-Aurora TSUBASA and SX-ACE

5.2 Multi-core Performance

We evaluate the parallel performance of the tsunami inundation simulation using 512 cores of SX-Aurora TSUBASA and SX-ACE and 576 cores of Xeon Gold. The peak single precision floating-point performances of the SX-Aurora TSUBASA, SX-ACE, and Xeon Gold systems are 275.3 Tflop/s, 32.8 Tflop/s, and 84.8 Tflop/s, respectively. Figure 6a shows the execution times of the tsunami inundation simulation. The execution time of SX-Aurora TSUBASA is the shortest among all the systems owing to its highest per-core sustained performance. Specifically, it takes three minutes on 512 cores, which corresponds to 64 VEs and eight nodes. Figure 6b depicts the scalability of each system. SX-ACE and Xeon Gold achieve super-linear scalability at 256 cores on SX-ACE and at 288 and 576 cores on Xeon Gold. This is due to the load imbalance of the tsunami inundation simulation on the polygonal domain, which causes the number of grid points to vary among MPI processes. Since this variation per MPI process is larger at small core counts than at large ones, the load imbalance at small core counts is large. Therefore, the performances at 128 and 144 cores are low, and super-linear scalability appears at the higher core counts. Figure 6b also shows that the scalability of SX-Aurora TSUBASA is the lowest among all the systems. The performance of 512 cores on SX-Aurora TSUBASA is 2.4 times higher than that of 128 cores, whereas 512 cores on SX-ACE achieve 3.6 times the performance of 128 cores, and 576 cores on Xeon Gold achieve 4.4 times the performance of 144 cores. This is because the network bandwidth per core between the nodes of SX-Aurora TSUBASA is lower than those of SX-ACE and Xeon Gold: the peak network bandwidths per core are 0.4 GB/s on SX-Aurora TSUBASA, 2 GB/s on SX-ACE, and 0.5 GB/s on Xeon Gold. Figure 7 shows the MPI communication times and the idle times of communications for 512 cores on SX-Aurora TSUBASA and SX-ACE. The communication time on SX-Aurora TSUBASA is 1.96 times longer than that on SX-ACE, and the idle time on SX-Aurora TSUBASA is also longer than that on SX-ACE.
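These speedups correspond to the following parallel efficiencies relative to the smallest measured configurations (a derived reading of the figures above, not an additional measurement); in each case the core count increases by a factor of four:

$$\begin{aligned} \frac{2.4}{4}=0.60\ \text{(SX-Aurora TSUBASA)},\qquad \frac{3.6}{4}=0.90\ \text{(SX-ACE)},\qquad \frac{4.4}{4}=1.10\ \text{(Xeon Gold)}, \end{aligned}$$

where the value above 1.0 for Xeon Gold reflects the super-linear behavior caused by the load imbalance at 144 cores.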

We set a criterion of five minutes for the execution time of the tsunami inundation simulation so that the total processing of the forecast system, from system start-up to visualization of the results, finishes within 30 min. In this evaluation, SX-Aurora TSUBASA, SX-ACE, and Xeon Gold achieve this time with 200 cores (25 processors), 384 cores (96 processors), and 492 cores (41 processors), respectively. In the system used by the Cabinet Office of Japan, the southwestern part of the Pacific coast of Japan is currently simulated with a 30-m grid size. When the grid size is reduced to 10 m, the number of grid points increases by a factor of nine and the time step must be about one third of that in the 30-m case, so the overall calculation amount becomes 27 times larger than that of this evaluation case. Moreover, when the forecast system covers the entire Pacific coast of Japan, we assume that the coastal length triples, and thus the calculation amount becomes about 81 times larger than that of this evaluation case. Table 5 lists the number of processors required for the forecast system with the 10-m grid size within five minutes. SX-Aurora TSUBASA can build the system with the smallest number of processors: 85 nodes for the southwestern part of the Pacific coast and 254 nodes for the entire Pacific coast of Japan. A forecast system for the entire Pacific coast of Japan with the 10-m grid size is therefore feasible with SX-Aurora TSUBASA, which thus has a high potential for increasing the resolution and expanding the forecast areas of the forecast system.
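The factors of 27 and 81 used above follow from this reasoning (a worked check of the stated estimate):

$$\begin{aligned} \underbrace{\Bigl (\frac{30\ \mathrm{m}}{10\ \mathrm{m}}\Bigr )^2}_{\text{grid points}}\times \underbrace{\frac{30\ \mathrm{m}}{10\ \mathrm{m}}}_{\text{time steps}}=9\times 3=27, \qquad 27\times \underbrace{3}_{\text{coastal length}}=81. \end{aligned}$$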

Table 5. Number of processors required for the forecast system

6 Summary

We have developed a real-time tsunami inundation forecast system in order to provide an early understanding of tsunami damage situations. Our system has been operated as a Japanese disaster information system for the southwestern part of the Pacific coast of Japan. In order to increase the resolution of the simulation and expand the forecast areas, higher-performance supercomputers are required. In this paper, we evaluated the performance of our tsunami inundation simulation on SX-Aurora TSUBASA, SX-ACE, and Xeon Gold. SX-Aurora TSUBASA has the highest sustained performance among all the systems; specifically, its core performance is 1.7 and 2.8 times higher than those of SX-ACE and Xeon Gold, respectively. SX-Aurora TSUBASA can simulate a tsunami inundation in Kochi prefecture with the 10-m grid within five minutes using 200 cores, whereas SX-ACE and Xeon Gold need 384 and 492 cores, respectively, for the same scenario. We found that SX-Aurora TSUBASA has a high potential to simulate the entire Pacific coast of Japan within the required time and resolution constraints. However, the performance of SX-Aurora TSUBASA is limited by the LLC bandwidth, and its network bandwidth is also lower than those of SX-ACE and Xeon Gold.

In future work, we plan to re-evaluate the performance in the multi-core environment by changing the domain decomposition to enhance the parallel performance of the polygonal domain method [5]. We will also implement the tsunami inundation simulation on GPGPUs and investigate its performance characteristics on them.