# Test-driving RISC-V Vector hardware for HPC

Joseph K. L. Lee<sup>1</sup>[0000-0002-1648-2740], Maurice Jamieson<sup>1</sup>[0000-0003-1626-4871], Nick Brown<sup>1</sup>[0000-0003-2925-7275], and Ricardo Jesus<sup>1</sup>[0000-0002-9651-4756]

EPCC, University of Edinburgh, Bayes Centre, 47 Potterrow, Edinburgh, United Kingdom {j.lee,m.jamieson,n.brown}@epcc.ed.ac.uk, rjj@ed.ac.uk

Abstract. Whilst the RISC-V Vector extension (RVV) has been ratified, at the time of writing both hardware implementations and open source software support are still limited for vectorisation on RISC-V. This is important because vectorisation is crucial to obtaining good performance for High Performance Computing (HPC) workloads and, as of April 2023, the Allwinner D1 SoC, containing the XuanTie C906 processor, is the only mass-produced and commercially available hardware supporting RVV. This paper surveys the current state of RISC-V vectorisation as of 2023, reporting the landscape of both the hardware and software ecosystem. Driving our discussion from experiences in setting up the Allwinner D1 as part of the EPCC RISC-V testbed, we report the results of benchmarking the Allwinner D1 using the RAJA Performance Suite, which demonstrated reasonable vectorisation speedup using the vendor-provided compiler, as well as favourable performance compared to the StarFive VisionFive V2 with SiFive's U74 processor.

## 1 Introduction

Vector instructions bring many benefits to an Instruction Set Architecture (ISA); for instance, they enable applications to exploit data parallelism, reduce code size, increase instruction bandwidth and improve energy efficiency. Many modern applications, including machine learning, graphics, digital signal processing, and cryptography, are built around algorithms that are designed to take heavy advantage of vector instructions. Indeed, vectorisation was a traditional way in which HPC was undertaken on the likes of the Cray-1 and Thinking Machines' CM series before distributed memory parallelism became widespread. Modern day variants of these ideas, such as AVX-512, the NEC SX-Aurora Vector Engine and the flexibility provided by Arm SVE in the A64FX, are highly successful.

Over the past years RISC-V has become a well-established open ISA standard. It is the fifth major RISC ISA design from the University of California, Berkeley, preceded by RISC-I, RISC-II, SOAR, and SPUR. The most powerful feature of RISC-V in comparison to other RISC designs, such as SPARC, PowerPC, MIPS and Arm, is its modular design. In practice this means that a small base integer ISA is specified and then ISA extensions, such as floating-point and vector support, can be chosen and added to the CPU implementation. Vector support has been a key extension for RISC-V since its inception: "We also pun on the use of the Roman numeral "V" to signify "variations" and "vectors", as support for a range of architecture research, including various data-parallel accelerators, is an explicit goal of the ISA design." [23]

Version 1.0 [13] of the RISC-V vector extension (RVV) was ratified in late 2021. Similarly to Arm SVE, it is inherently vector length agnostic (VLA): the same code can be executed on implementations with different vector lengths, and the element size and vector length can also be reconfigured at run time. By contrast, x86 AVX and Arm NEON use the vector length specific (VLS) approach of packed SIMD, where code needs to be re-optimised and recompiled for each vector processor, whereas VLA code remains portable across different vector processor designs and generations.
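The vector length agnostic idea can be illustrated with a small model of a strip-mined loop. The sketch below is plain Python rather than RVV code; the names `vla_axpy`, `strip_count`, `vlen_bits` and `sew_bits` are hypothetical, and the rule that each strip processes `min(remaining, VLMAX)` elements is a simplifying assumption about how the hardware grants a vector length.

```python
def vla_axpy(a, x, y, vlen_bits, sew_bits=32):
    """Return a*x + y, processed in hardware-sized strips.

    vlen_bits (register width) and sew_bits (element width) are illustrative
    parameters of this model, not names taken from any real API.
    """
    vlmax = vlen_bits // sew_bits          # elements per register at LMUL = 1
    out = list(y)
    i, n = 0, len(x)
    while i < n:
        vl = min(n - i, vlmax)             # simplified model of the granted vector length
        for j in range(i, i + vl):         # stands in for one vector load/FMA/store group
            out[j] = a * x[j] + out[j]
        i += vl
    return out


def strip_count(n, vlen_bits, sew_bits=32):
    """Number of strip-mined iterations needed to cover n elements."""
    vlmax = vlen_bits // sew_bits
    return -(-n // vlmax)                  # ceiling division
```

In this model, ten single-precision elements take three strips with 128-bit registers and one strip with 512-bit registers, while the loop itself is unchanged and the results are identical. A VLS version would instead hard-code the strip width into the loop.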

RVV has already been used in production for physical RISC-V hardware, for example T-Head's XuanTie C906 core provides RVV v0.7.1 and has made a submission for MLPerf Tiny Inference [16], a benchmark designed to measure trained neural network performance for low power devices. However, as an emerging standard it is not entirely straightforward to utilise and test the RISC-V vector extension. This paper aims to evaluate the current landscape when it comes to RISC-V vectorisation and assess the potential gain from utilising RISC-V vectors for HPC applications. Ultimately our objective is to provide guidance for users interested in testing or adopting available vector hardware using experiences we have gained from setting up the EPCC RISC-V testbed [4]. The key contributions of this paper are:

- 1. We review the state of play of the RISC-V vector extension and available processor implementations
- 2. We evaluate the availability of open source software such as compiler toolchains and Linux kernels to support running vectorised code on available hardware
- 3. We perform benchmarks and evaluate vectorisation efficiency using a currently available compiler and commercially available RISC-V vector processor

## 2 Background and related work

### 2.1 V Extension

The RISC-V 'V' standard extension introduces 32 new vector registers, and requires a minimum vector register length (VLEN) of 128 bits, up to a maximum of 65,536 bits.<sup>1</sup> This can be compared to SVE, which also has a minimum vector length of 128 bits, but a maximum of only 2048 bits. Another feature of the vector instruction set is that multiple vector registers can be grouped together as a single combined vector, with the grouping factor known as *LMUL*. Whilst previously one could only group 2, 4 or 8 registers, in RVV v1.0 fractional groupings of $\frac{1}{2}$, $\frac{1}{4}$ and $\frac{1}{8}$ are also allowed, where only part of a single vector register is used. These features of the instruction set provide great flexibility because, within a single code, the vector length can be varied dynamically by different groupings of vector registers, which is particularly useful when operating on mixed-width values. Combined with the fact that the same compiled code can run on hardware implementations with significantly different vector widths and automatically exploit the widest vector lengths, RVV encourages portable code with greater utilisation of vector register resources without the need for platform-specific optimisation.

<sup>1</sup> The Zvl32b and Zvl64b extensions allow for a smaller minimum VLEN of 32 and 64 bits respectively.

Prior to the ratification of v1.0 of the V extension, the beta version of RVV, v0.7.1, was adopted in production, for example by the XuanTie C906 processor and BSC's Vitruvius+ [27], which is part of the European Processor Initiative (EPI) project. Even though the differences between v1.0 and v0.7.1 are fairly minimal, the two versions are incompatible at both the source code and binary level. One major difference is the lack of support for fractional LMUL in version 0.7.1.
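The register grouping arithmetic in this section can be made concrete with the relation VLMAX = LMUL × VLEN / SEW, where SEW is the selected element width in bits. The following is a minimal sketch of that relation only; the function name `vlmax` and its error handling are illustrative choices, not part of any RVV toolchain.

```python
from fractions import Fraction


def vlmax(vlen_bits, sew_bits, lmul):
    """Elements held by one register group: VLMAX = LMUL * VLEN / SEW.

    lmul may be an integer (1, 2, 4, 8) or a Fraction such as Fraction(1, 2)
    for the fractional groupings allowed in RVV v1.0.
    """
    value = Fraction(lmul) * vlen_bits / sew_bits
    if value.denominator != 1:
        # The group would hold only part of an element in this model.
        raise ValueError("configuration holds no whole element")
    return int(value)
```

With 128-bit registers, `vlmax(128, 32, 1)` gives 4 single-precision elements and `vlmax(128, 32, 8)` gives 32. For mixed-width code, 16-bit elements at LMUL = 1/2 and 32-bit elements at LMUL = 1 both yield 4 elements, so narrow and wide operands can share one vector length. The same relation gives 256 double precision elements for a 16384-bit register.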

### 2.2 Intrinsics

At the time of writing, the official RISC-V task group is converging towards v1.0 of the C intrinsics API [20], which is expected to be released later in 2023. Currently, LLVM supports v0.10 of the intrinsics specification and mainline GCC provides no support at all. It is on the roadmap of both compilers to support v1.0 once it is ratified. However, the XuanTie 900 series toolchain, which is a modified version of the GCC 8.4 compiler targeting the C906 and C910, supports a custom set of intrinsics for v0.7.1 and v1.0, as does the LLVM compiler from BSC for the EPI project's RISC-V Toolchain [15], which provides its own set of v0.7.1 and v1.0 intrinsics. These bespoke compiler versions can be useful when developing for vectorisation due to limitations in the mainline compilers.

### 2.3 P Extension

It should also be noted that there is a packed SIMD 'P' extension to the base ISA, which uses the floating point registers and is aimed at embedded cores and low-power digital signal processing (DSP) applications, such as audio and video encoding/decoding, image interpretation and computer vision. The extension has not yet been ratified; the latest version is v0.9.11 [8], which provides a large number of SIMD and partial-SIMD instructions, such as 8/16-bit minimum and maximum instructions (including SMIN8, UMIN8, SMAX16 and UMAX16), and 16/32-bit multiply with 64-bit add/subtract instructions (including SMAL, SMALBB and SMAR64).

### 2.4 Related work

Even though RVV has been ratified relatively recently, studies focusing on other (scalable) vector ISAs can be applicable when wishing to improve vector performance for RISC-V. For example, there have been studies comparing the performance of Arm SVE against NEON [31] and AVX [35], and evaluating the vectorisation efficiency and usage on mini-apps for available SVE compilers [30]. Another parameter which has a significant impact on performance with the VLA programming model is the implementation vector length: [28] and [32] study the performance of a variety of vectorised applications with different vector lengths using the gem5 simulator for Arm SVE and RVV respectively.

There is currently rapid development of research-based RVV-enabled hardware underway; for example, ETH Zurich have introduced Ara [22] and its upgrade [29], and BSC introduced Vitruvius+ [27]. Whilst none are yet mass-produced or widely available, these RISC-V vector accelerator designs have been taped out and their performance compared in [27].

## 3 RVV CPU Implementations

There is a broad selection of IP cores which have implemented RVV, summarised in Table 1. RISC-V cores on this list target a wide range of applications, including edge artificial intelligence/machine learning (SiFive X280), general high-performance applications (SiFive P series), and decoupled vector accelerators (Ara/Vitruvius+). The decoupled accelerator approach is especially interesting because it allows vector instructions to be offloaded from the scalar pipeline. Paired with support for long vectors (for instance, Vitruvius+ supports 256 double precision elements per vector register), these designs present high performance RISC-V vector accelerators for HPC workloads. In taped-out implementations the New Ara core reports achieving 37.1 GFLOPS per Watt [29] and Vitruvius+ reports 47.3 GFLOPS per Watt [27] on matrix multiplication benchmarks.

Table 1. List of available RVV processors. The last three entries are open source.

| Processor                    | Vector Length                     | RVV version                       |
|------------------------------|-----------------------------------|-----------------------------------|
| SiFive P270/P470/P670 [10]   | 256-bit/128-bit/dual 128-bit      | 1.0                               |
| SiFive X280 [9]              | 512-bit                           | 1.0                               |
| Andes NX27V [7]              | Configurable from 128 to 512-bit  | 1.0                               |
| Andes AX45MPV [6]            | Configurable from 128 to 1024-bit | 1.0                               |
| Vitruvius+ [27]              | 16384-bit                         | 0.7.1 (update to 1.0 in future)   |
| Hwacha [34] (V4 [33])        | 512-bit                           | custom                            |
| New Ara [29]                 | Configurable e.g. 4096-bit        | 1.0                               |
| Tenstorrent BOOM-ocelot [17] | Configurable from 128-bit         | 1.0                               |
| T-Head XuanTie C906 [18]     | 128-bit                           | 0.7.1                             |

These energy efficiency numbers delivered by the New Ara and Vitruvius+ cores are impressive, especially considering that they are still research prototypes rather than production parts. For comparison, whilst the Green 500 reports whole systems rather than the individual machine components, based on the November 2022 list those HPC machines that are able to achieve greater than 50 GFLOPS per Watt are based around either the AMD Instinct or Nvidia Grace Hopper GPUs. These represent mature technologies with a rich lineage, whereas by comparison the New Ara and Vitruvius+ are the first generation of RISC-V vector accelerators and therefore as time progresses are likely to significantly increase their performance and energy efficiency.

**Physical cores** At the time of writing, the only mass-produced and commercially available physical RISC-V vector core is the XuanTie C906 from T-Head, which is the chip division of Alibaba. This contains 128-bit wide vector registers, and supports vector element sizes of 8, 16, and 32 bits. Noticeable by its absence, however, is support for 64-bit elements, meaning that the XuanTie C906 does not support 64-bit double precision floating point in its vector unit. This is a major disadvantage for HPC, where the vast majority of our workloads are in double precision. Nevertheless, it is still interesting to benchmark with single precision workloads, as understanding the performance and software ecosystem can provide insights around RVV. The XuanTie C906 core is available as part of the Allwinner D1 SoC, which is part of the EPCC RISC-V testbed and the main system on which we perform our vector benchmarks in Section 5.

## 4 Toolchain and software support

In this section we review the current status of the RISC-V open source software ecosystem which supports compiling and running vectorised code on RVV processors.

### 4.1 Compiler toolchain

**GNU** At the time of writing, the upstream GNU compiler toolchain does not support the vector extension. There is a branch, *rvv-next* [24], which provides limited support for RVV v1.0, and an older deleted branch, *rvv-0.7.1*, which targeted RVV v0.7.1. T-Head provides a modified GNU toolchain which targets their C906 CPU [11], and contains optimised vectorisation for v0.7.1. This is the compiler used in this paper to benchmark the C906 CPU. Since the compiler is optimised for the C906, it generates code specifically for 128-bit vector width.

However, it should be noted that in recent weeks the T-Head GNU compiler has been removed from their download page and so is no longer available there. Because the compiler is under the GNU licence, it has been mirrored at [4].

**LLVM** LLVM 15 and 16 support RVV v1.0, and several of the auto-vectorisation characteristics have been studied in [21]. LLVM supports compiling vector length agnostic RVV code via the *scalable-vectorization=on* flag, as well as vector length specific code via the *riscv-v-vector-bits-min=N* flag (where N is the fixed vector width in bits). LLVM also supports standard extensions with minimum vector length $Zvl^*$ and its counterpart for embedded processors $Zve^*$. Since LLVM only targets RVV v1.0, its output cannot run natively on the physical hardware available, and so it is not tested in this paper. A rollback tool that translates generated RVV v1.0 to v0.7.1 has been developed and is reported, along with a performance comparison against GCC, in [26] for both VLS and VLA modes.
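To make the two LLVM modes concrete, the sketch below assembles clang argument lists for VLA and VLS builds. It assumes that the two options named above are LLVM backend options passed through `-mllvm`, and that `--target=riscv64-unknown-linux-gnu` with `-march=rv64gcv` selects RVV v1.0; the helper name `clang_rvv_args` and the source file name are hypothetical, and the exact spelling of these options may differ between LLVM releases.

```python
def clang_rvv_args(source, mode, vector_bits=None):
    """Return a clang argument list for VLA or VLS RVV v1.0 code generation.

    Assumption: the vectoriser options are forwarded to LLVM via -mllvm.
    """
    args = ["clang", "--target=riscv64-unknown-linux-gnu", "-march=rv64gcv", "-O3"]
    if mode == "vla":
        # Vector length agnostic: let the loop vectoriser emit scalable vectors.
        args += ["-mllvm", "-scalable-vectorization=on"]
    elif mode == "vls":
        # Vector length specific: promise the backend a minimum register width.
        if vector_bits is None:
            raise ValueError("vls mode needs vector_bits")
        args += ["-mllvm", f"-riscv-v-vector-bits-min={vector_bits}"]
    else:
        raise ValueError(f"unknown mode: {mode}")
    return args + ["-c", source]
```

For example, `clang_rvv_args("kernel.c", "vls", 128)` would describe a build fixed to 128-bit vectors, which is the width the C906 implements.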

### 4.2 Linux kernel

Whilst there is now general availability of common Linux distributions for RISC-V boards, including Debian, Ubuntu and Fedora [2], many are early developer variants [3] or unsupported releases [1]. The Sipeed Linux image for the Allwinner D1 is easy to deploy using the proprietary tools and supports vectorisation out of the box. However, due to the proprietary, protected format of the bootloader, Linux images must be built using cross-compilation tools on another host and vendor-specific patches must be applied to *buildroot*. Furthermore, the T-Head specific GCC compiler version must also be used for this to ensure that the resulting image is RVV compatible.

This requirement to rebuild the bootloader and apply vendor patches is not only time consuming but also requires considerable knowledge and expertise. This is definitely an area in which the vendors of these boards could improve, opening up their systems further and lowering the barrier to entry.

### 4.3 Performance analysis tooling and instrumentation

The RISC-V hardware ecosystem is moving very quickly, and the HiFive Unmatched, released in late May 2021, and the Allwinner Nezha D1, released in April 2021, are examples of where software support sometimes struggles to keep up, especially when board and/or CPU specific support is required by tooling. Profiling tools are an example of this problem, where support for tools such as *perf* has lagged behind the hardware.

For instance, with the HiFive Unmatched, the Linux kernel version 5.18 only supports instruction and cycle count hardware events for *perf*, and in order to obtain further events then one must patch the kernel and OpenSBI [5]. With the Allwinner D1, containing the XuanTie C906 core, official support for *perf* was only released in the Linux kernel version 6.2 on February 19th, 2023, almost two years after the hardware was made available.

This lack of performance analysis tooling is a major drawback for HPC workloads, where it is imperative that programmers can gain insights around performance bottlenecks in codes and use this feedback to then optimise their applications.

### 4.4 Emulation

Given the limited physical hardware currently available that supports RVV, and none that supports v1.0, an obvious alternative is to run RVV-based codes under emulation. There are two main emulators for RISC-V, QEMU and Spike. Current upstream QEMU supports RVV v1.0 along with the *zve32f* and *zve64f* standards, which provide 32-bit and 64-bit vectorisation floating point support for embedded RISC-V CPUs respectively. Versions of QEMU prior to December 20th, 2021 supported RVV v0.7.1 only.

Likewise, Spike also supports RVV v1.0 and releases prior to November 12th, 2019 support v0.7.1. However, whilst emulation might appear to be a good choice for those wishing to experiment with RISC-V vectorisation in their applications, in absolute terms the application will run far slower than on physical hardware. Even for exploratory purposes this could be an issue as it will potentially limit the scale of testcases that can be executed.

**Vehave** Developed by Barcelona Supercomputing Center (BSC), Vehave [14] is a functional emulator based on QEMU which is able to dynamically handle and emulate vector instructions when running vectorised binaries on hardware that does not support the vector extension. There are separate versions supporting RVV v1.0 and v0.7. Whilst this provides a convenient way of supporting RVV on hardware that is not equipped with this RISC-V extension, it is far slower than the performance that would be provided by a physical CPU.

### 4.5 Softcores

Whilst the C906 is the only RVV hard CPU core readily available, there are a number of RVV softcores, such as the Andes NX27V [7], Andes AX45MPV [6] and Tenstorrent BOOM-ocelot [17] that can be included in field-programmable gate array (FPGA) designs to test RISC-V vectorisation codes. However, creating soft-core FPGA designs requires comprehensive knowledge of the FPGA tooling and logic circuit design, such as *negative slack* [12].

### 4.6 Libraries

Most HPC libraries can be cross-compiled for RISC-V, but there tends to be limited vectorisation optimisation applied within these. One library which already includes vector optimisation is OpenBLAS, which has been optimised for RVV v0.7.1 (specifically for the XuanTie C906/C910) [36]. At the time of writing there are numerous efforts on-going across the community to optimise HPC libraries for RVV, and within the next year we will likely see significantly increased support in this regard.

## 5 Benchmarks

### 5.1 System

The main RISC-V system that we benchmark in this paper is the Allwinner D1, which contains a C906 processor and supports RVV v0.7.1 with 128-bit vector registers. For comparison against a scalar-only RISC-V CPU, we use the StarFive VisionFive V2 board (VF2), which contains a StarFive JH7110 processor (quad core SiFive U74). In order to provide some context with similar vector designs already in use for HPC, we also performed runs on a Fujitsu A64FX system (Armv8), which supports fixed length SIMD (NEON), as well as vector length agnostic (SVE), instruction sets.

Because the C906 only contains a single core, all benchmarks are run on a single core to enable direct comparison across CPUs, and only NEON with 128-bit vector width is used on A64FX for an objective evaluation (the XuanTie GCC compiler only generates fixed 128-bit vector instructions). These systems are summarised in Table 2. It should be noted that we recognise the A64FX processor is designed for HPC applications and is completely different in nature to the RISC-V cores, which are designed for embedded and single-board computers (SBC). However, a comparison against the A64FX is still valuable as it can highlight important differences and potential design improvements for an HPC-class RISC-V processor in the future.

|              | Allwinner D1                     | StarFive JH7110<br>(VF2)                  | A64FX                                                                                        |
|--------------|----------------------------------|-------------------------------------------|----------------------------------------------------------------------------------------------|
| Processor    | XuanTie C906                     | SiFive U74                                | Fujitsu A64FX                                                                                |
| Clock speed  | 1.0GHz                           | 1.5GHz                                    | 1.8GHz                                                                                       |
| Cores        | 1                                | 4                                         | 48                                                                                           |
| Cache        | 32 KB I-cache + 32 KB<br>D-cache | 32 KB I-cache + 32 KB<br>D-cache + 2MB L2 | 64 KB I-cache + 64KB<br>D-cache, 8 MB shared L2<br>cache per 12 cores (core<br>memory group) |
| Memory       | 512MB DDR3                       | 8GB DDR4                                  | 32GB HBM2                                                                                    |
| ISA          | RV64GC+V0.7                      | RV64GC                                    | ARMv8.2 with SVE                                                                             |
| Vector width | 128bit                           | N/A                                       | dual 128-bit (NEON) /<br>dual 512-bit (SVE)                                                  |

 Table 2. Compute system specifications

### 5.2 Methodology

To evaluate the vectorisation performance we use the RAJA Performance Suite (RAJAPerf) [19], which comprises the following sets of benchmarks: ALGORITHM, APPS, BASIC, LCALS (Livermore Compiler Analysis Loop Suite), POLYBENCH, and STREAM (Babel Stream). Since the C906 only supports vector element sizes up to 32-bit, we configure the benchmark to use the single-precision floating point data type. The compilers and respective compiler flags for RISC-V and Arm systems are specified in Table 3. The benchmark timings are averaged over three runs.

### 5.3 Results

Table 3. Compiler specifications

| Name               | Compiler        | Vector width | Compiler flags                                            |
|--------------------|-----------------|--------------|-----------------------------------------------------------|
| RV-GCC8.4-scalar   | XuanTie GCC 8.4 | N/A          | -O3 -march=rv64gc -ffast-math                             |
| RV-GCC8.4-vector   | XuanTie GCC 8.4 | 128-bit      | -O3 -march=rv64gcv0p7 -ffast-math                         |
| ARM-GCC11.2-scalar | GCC 11.2        | N/A          | -O3 -ffast-math -mcpu=a64fx -march=armv8.2-a+nosimd+nosve |
| ARM-GCC11.2-vector | GCC 11.2        | 128-bit      | -O3 -ffast-math -mcpu=a64fx -march=armv8.2-a+simd+nosve   |

Table 4 summarises the list of kernels which are vectorised by the XuanTie GCC 8.4 compiler. It can be seen that 30 of the 64 kernels are successfully vectorised by the compiler, but for 7 of these only the scalar code and no vector instructions were executed at runtime. This is due to the compiler's oversensitivity to loop ranges: the scalar branch is preferred and executed even when a vectorised branch is available. Clang 15.0, which generates RVV v1.0 assembly, is capable of vectorising more kernels than GCC 8.4; for a full comparison, see [26].

Figure 1 reports runtimes for the RAJAPerf kernels, each normalised against the kernel's scalar runtime. For the A64FX, normalisation is against running in scalar mode on the A64FX, whereas for the Allwinner D1 and StarFive JH7110 it is against running in scalar mode on the D1. The orange and purple bars show the vectorisation performance difference on the A64FX and D1 respectively, and the green bars show a comparison of the scalar performance between the JH7110 (VF2) and the D1.
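The normalisation can be sketched as follows. This is a minimal illustration that assumes the plotted quantity is the mean runtime of a configuration divided by the mean runtime of its scalar baseline (the figure may equally plot the reciprocal); the function name, configuration labels and timing values are made up for the example and are not measurements from this paper.

```python
from statistics import mean


def normalised_runtimes(timings, baseline):
    """Average repeated timings per configuration and divide by the baseline mean.

    timings maps a configuration label to a list of runtimes in seconds.
    """
    means = {name: mean(runs) for name, runs in timings.items()}
    return {name: m / means[baseline] for name, m in means.items()}


# Hypothetical timings (seconds) for one kernel, three runs each.
example = {
    "D1-scalar": [2.0, 2.0, 2.0],
    "D1-vector": [1.0, 1.0, 1.0],
    "VF2-scalar": [0.5, 0.5, 0.5],
}
ratios = normalised_runtimes(example, baseline="D1-scalar")
```

With these invented numbers the baseline maps to 1.0 and values below 1.0 indicate a configuration that ran faster than scalar code on the D1.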

It can be observed from these plots that for most linear algebra kernels, the vectorised code on the RISC-V D1 is faster than its scalar counterpart, at around 84% faster for AXPY, 53% for GEMM, 45% for GEMVER, 40% for ATAX, and 46% for MVT. Vectorised code also sustains much higher bandwidth for streaming kernels such as Stream ADD, COPY, DOT, MUL, and TRIAD. In only one case, the FIR kernel, is the vectorised code slower than its scalar counterpart. Whilst in most cases the speedup from RVV on the D1 is not as significant as from NEON on A64FX, there are some exceptions; for example, matrix multiplication kernels on the A64FX compiled with ARM-GCC11.2-vector did not execute the vector instructions, so the runtime performance was the same as the scalar executable. Furthermore, the vectorised A64FX PRESSURE kernel was almost three times slower than the scalar version.

When comparing the RISC-V processors Allwinner D1 and StarFive JH7110, it can be observed that for high arithmetic intensity kernels the JH7110 (VF2), which has a higher clock frequency, significantly outperforms the D1. For example, GEMM is six times faster on the VF2 compared to running scalar on the D1, and four times faster than the vectorised version of this benchmark on the D1. However, even though the theoretical memory bandwidth of the VF2 is higher than that of the D1, these benchmarking results demonstrate that with vectorisation the D1 executes the streaming kernels faster than the VF2. For example, Stream ADD is 82% faster and COPY is 77% faster on the D1. This is the reason why we observe that the D1 can perform low arithmetic intensity operations faster than the VF2; for example, AXPY on the D1 with vectorisation enabled is 71% faster than on the VF2, which is running in scalar mode.
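The distinction between low and high arithmetic intensity kernels can be quantified as floating point operations per byte moved. The sketch below uses standard back-of-envelope estimates that are assumptions of this illustration rather than measurements from the benchmarks: AXPY performs 2 operations per element for 2 loads and 1 store, and an n × n matrix multiply performs 2n³ operations while, in the best case, moving each of its three matrices once.

```python
def axpy_intensity(bytes_per_element=4):
    """FLOPs per byte for y[i] = a*x[i] + y[i]: 2 FLOPs, 2 loads and 1 store."""
    return 2 / (3 * bytes_per_element)


def gemm_intensity(n, bytes_per_element=4):
    """Best-case FLOPs per byte for an n x n matrix multiply C = A*B.

    Assumes 2*n**3 FLOPs and that each of the three matrices moves between
    memory and the core exactly once (perfect cache reuse), so this is an
    upper bound rather than what a small cache will achieve.
    """
    return (2 * n**3) / (3 * n**2 * bytes_per_element)
```

In single precision, AXPY stays at 1/6 FLOP per byte regardless of problem size, so it is limited by how fast data can be streamed, whereas the bound for GEMM grows as n/6 and the kernel is limited mainly by compute throughput.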

Table 4. RAJA Performance Suite kernels vectorised by RV-GCC8.4-vector

| Group                                 | Kernels                                                                                                                                                                  | Total |
|---------------------------------------|--------------------------------------------------------------------------------------------------------------------------------------------------------------------------|-------|
| **Vectorised and executed**           |                                                                                                                                                                          | 23    |
| Algorithm                             | MEMCPY, MEMSET, REDUCE_SUM                                                                                                                                               |       |
| Apps                                  | ENERGY, FIR, PRESSURE                                                                                                                                                    |       |
| Basic                                 | AXPY, AXPY_ATOMIC, REDUCE3_INT                                                                                                                                           |       |
| Lcals                                 | GEN_LIN_RECUR                                                                                                                                                            |       |
| Polybench                             | 2MM, 3MM, ATAX, FDTD_2D, GEMM, GEMVER, GESUMMV, MVT                                                                                                                      |       |
| Stream                                | ADD, COPY, DOT, MUL, TRIAD                                                                                                                                               |       |
| **Vectorised (scalar code executed)** |                                                                                                                                                                          | 7     |
| Lcals                                 | FIRST_SUM, FIRST_DIFF, HYDRO_1D, HYDRO_2D, TRIDIAG_ELIM                                                                                                                  |       |
| Polybench                             | JACOBI_1D, JACOBI_2D                                                                                                                                                     |       |
| **Scalar**                            |                                                                                                                                                                          | 34    |
| Algorithm                             | SCAN, SORT, SORTPAIRS                                                                                                                                                    |       |
| Apps                                  | CONVECTION3DPA, DEL_DOT_VEC_2D, DIFFUSION3DPA, HALOEXCHANGE, HALOEXCHANGE_FUSED, LTIMES, LTIMES_NOVIEW, MASS3DPA, NODAL_ACCUMULATION_3D, VOL3D                           |       |
| Basic                                 | IF_QUAD, INDEXLIST, INDEXLIST_3LOOP, INIT_VIEW1D, INIT_VIEW1D_OFFSET, INIT3, MAT_MAT_SHARED, MULADDSUB, NESTED_INIT, PI_ATOMIC, PI_REDUCE, REDUCE_STRUCT, TRAP_INT       |       |
| Lcals                                 | DIFF_PREDICT, EOS, FIRST_MIN, INT_PREDICT, PLANCKIAN                                                                                                                     |       |
| Polybench                             | ADI, FLOYD_WARSHALL, HEAT_3D                                                                                                                                             |       |







Figure 1, panel (c): Polybench kernels.

## 6 Conclusions and recommendations

At the time of writing, generating and testing RVV codes on the currently available physical CPUs is problematic due to the mismatch between the available tooling, such as GCC and Clang, and the RVV version (v0.7.1) implemented in hardware. However, as demonstrated in Section 5.3, compiling for RVV on the D1 can result in codes being up to 80% faster than the scalar alternative (RAJAPerf AXPY and Stream ADD). The standardisation of tooling with v1.0 RVV and intrinsics will greatly simplify the development of vectorised codes in the future, running on RVV v1.0 compliant CPUs. Therefore our view is that, whilst at the time of writing there are challenges around developing and running vectorised code on RISC-V due to the immaturity of tooling and hardware, in the medium term these challenges will be solved and RVV provides a strong foundation for leveraging RISC-V for high performance workloads. Furthermore, the improved auto-vectorisation of LLVM, coupled with increased VLEN in future CPUs, is expected to increase kernel runtime performance even further.

Although the later versions of the T-Head GCC toolchain support both RVV v0.7 and v1.0, neither the mainstream GCC nor LLVM toolchains support v0.7. Whilst it is understandable that the toolchain development teams only want to support the ratified version of RVV, the currently available RVV hard CPU cores only support v0.7 and the runtime performance benefits of leveraging RVV on the C906-based devices are tangible, as shown in Section 5.3. Furthermore, T-Head have proven that it is possible to provide RVV v0.7 and RVV v1.0 support within the GCC toolchain, providing the -march=rv64gcv0p7 and -march=rv64gcv1p0 compiler options. With the large volume of RVV v0.7 devices in circulation we would like to see support for both v0.7 and v1.0 RVV in mainstream GCC and Clang / LLVM toolchains.

### 6.1 Recommendations

In order to leverage the runtime performance benefits of vectorisation on current RISC-V hardware and to minimise the impact of the code incompatibilities between RVV v0.7 and v1.0 [25], we recommend using T-Head GCC 8.4 auto-vectorisation rather than the T-Head RVV v0.7 intrinsic API. This will ensure that codes can simply be recompiled, without modification, to target RVV v1.0 compatible hardware. Another option is to generate code for RVV v1.0 using GCC or Clang / LLVM auto-vectorisation or the v1.0 intrinsics API, and utilise a conversion tool such as [26] to create binaries for RVV v0.7 hardware.
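As an illustration of recompiling unmodified source for different targets, the sketch below generates one compile command per RVV variant. The optimisation flags and `-march` strings are those reported in this paper for the T-Head GCC toolchain; the compiler executable name, the helper name `rvv_build_commands` and the output file naming are hypothetical.

```python
from pathlib import PurePosixPath


def rvv_build_commands(source, compiler="riscv64-unknown-linux-gnu-gcc"):
    """Return one compile command per target for the same source file.

    The compiler name is an assumed cross-compiler prefix; substitute the
    executable shipped with the toolchain actually in use.
    """
    common = ["-O3", "-ffast-math"]
    marches = {"scalar": "rv64gc", "rvv0p7": "rv64gcv0p7", "rvv1p0": "rv64gcv1p0"}
    stem = PurePosixPath(source).stem
    return {
        name: [compiler, *common, f"-march={march}", "-c", source, "-o", f"{stem}_{name}.o"]
        for name, march in marches.items()
    }
```

Only the `-march` string differs between the three commands, which is the property that relying on auto-vectorisation rather than version-specific intrinsics is intended to preserve.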

We would also recommend building RVV-enabled Linux images with a patched mainstream *buildroot* using the T-Head GCC 8.4 compiler, as support for the Allwinner D1 has recently been added.

## 7 Acknowledgement

The authors would like to thank the ExCALIBUR H&ES RISC-V testbed for access to compute resource used in this work.

## References

- 1. Architectures/RISC-V/Allwinner, Fedora Project Wiki, https://fedoraproject.org/wiki/Architectures/RISC-V/Allwinner
- 2. Architectures/RISC-V/Installing, Fedora Project Wiki, https://fedoraproject.org/wiki/Architectures/RISC-V/Installing
- 3. Download Ubuntu for RISC-V platforms, https://ubuntu.com/download/risc-v
- 4. ExCALIBUR H&ES RISC-V testbed, http://riscv.epcc.ed.ac.uk/
- 5. How to setup additional 'perf' events on the HiFive unmatched, https://arch.cs.ucdavis.edu/blog/2022-09-15-perf-hifive
- 6. RISC-V: AX45MPV, https://www.andestech.com/en/products-solutions/andescore-processors/riscv-ax45mpv/
- 7. RISC-V: NX27V, https://www.andestech.com/en/products-solutions/andescore-processors/riscv-nx27v/
- 8. riscv-p-spec/P-ext-proposal.pdf at master · riscv/riscv-p-spec · GitHub, https://github.com/riscv/riscv-p-spec/blob/master/P-ext-proposal.pdf
- 9. SiFive Intelligence X280, https://www.sifive.com/cores/intelligence-x280
- 10. SiFive Performance, https://www.sifive.com/cores/performance
- 11. T-Head Open Chip Community Download, https://occ.t-head.cn/community/download
- 12. Timing analyzer clock analysis, https://www.intel.com/content/www/us/en/programmable/support/support-resources/design-examples/designsoftware/timinganalyzer/clocking/tq-clock.html
- 13. RISC-V "V" Vector Extension 1.0 (2021), https://github.com/riscv/riscv-v-spec/releases/tag/v1.0
- 14. Vehave User Guide · Wiki · EPI-public / RISC-V Vector Environment · GitLab (Nov 2021), https://repo.hca.bsc.es/gitlab/epi-public/risc-vvector-simulation-environment/-/wikis/Vehave-User-Guide
- 15. BSC RISC-V Vector Toolchain · Wiki · EPI-public / RISC-V Vector Environment · GitLab (Feb 2022), https://repo.hca.bsc.es/gitlab/epi-public/riscv-vector-simulation-environment/-/wikis/BSC-RISC%E2%80%90V-Vector-Toolchain
- 16. MLCommons MLPerf Inference Tiny v0.7 Results (Apr 2022), https://mlcommons.org/
- 17. Ocelot: The Berkeley Out-of-Order RISC-V Processor with Vector Support (Mar 2023), https://github.com/tenstorrent/riscv-ocelot
- 18. OpenC906 (Mar 2023), https://github.com/T-head-Semi/openc906
- 19. RAJA Performance Suite (Feb 2023), https://github.com/LLNL/RAJAPerf
- 20. RISC-V Vector Extension Intrinsic Document (Mar 2023), https://github.com/riscv-non-isa/rvv-intrinsic-doc
- 21. Adit, N., Sampson, A.: Performance Left on the Table: An Evaluation of Compiler Autovectorization for RISC-V. IEEE Micro 42(5), 41–48 (Sep 2022). https://doi.org/10.1109/MM.2022.3184867
- 22. Cavalcante, M., Schuiki, F., Zaruba, F., Schaffner, M., Benini, L.: Ara: A 1-GHz+ Scalable and Energy-Efficient RISC-V Vector Processor With Multiprecision Floating-Point Support in 22-nm FD-SOI. IEEE Transactions on Very Large Scale Integration (VLSI) Systems 28(2), 530–543 (2020). https://doi.org/10.1109/TVLSI.2019.2950087
- 23. Waterman, A., Asanović, K. (eds.): The RISC-V Instruction Set Manual, Volume I: User-Level ISA, Document Version 20191213. RISC-V Foundation (Dec 2019)

- 24. GNU, RISC-V International: RISC-V GNU compiler toolchain (rvv-next branch), https://github.com/riscv-collab/riscv-gnu-toolchain/tree/rvv-next
- 25. Hsiangkai Wang, Zakk Chen, Kito Cheng, Yi-Hsiu, Roger Ferrer Ibanez, Nick Knight, Mingjie Xing: RISC-V vector extension intrinsic API reference manual, https://occ-oss-prod.oss-cn-hangzhou.aliyuncs.com/resource//1663142187133/Xuantie+900+Series+RVV-0.7.1+Intrinsic+Manual.pdf#section\*.243
- 26. Lee, J.K.L., Jamieson, M., Brown, N.: Backporting RISC-V vector assembly. Proceedings for the First International workshop on RISC-V for HPC (Mar 2023), under peer review
- 27. Minervini, F., Palomar, O., Unsal, O., Reggiani, E., Quiroga, J., Marimon, J., Rojas, C., Figueras, R., Ruiz, A., Gonzalez, A., Mendoza, J., Vargas, I., Hernandez, C., Cabre, J., Khoirunisya, L., Bouhali, M., Pavon, J., Moll, F., Olivieri, M., Kovac, M., Kovac, M., Dragic, L., Valero, M., Cristal, A.: Vitruvius+: An Area-Efficient RISC-V Decoupled Vector Coprocessor for High Performance Computing Applications. ACM Transactions on Architecture and Code Optimization (Dec 2022). https://doi.org/10.1145/3575861, Just Accepted
- 28. Odajima, T., Kodama, Y., Sato, M.: Performance and power consumption analysis of Arm Scalable Vector Extension. The Journal of Supercomputing 77(6), 5757–5778 (Jun 2021). https://doi.org/10.1007/s11227-020-03495-5
- 29. Perotti, M., Cavalcante, M., Wistoff, N., Andri, R., Cavigelli, L., Benini, L.: A "New Ara" for Vector Computing: An Open Source Highly Efficient RISC-V V 1.0 Vector Processor Design. In: 2022 IEEE 33rd International Conference on Application-specific Systems, Architectures and Processors (ASAP). pp. 43–51 (Jul 2022). https://doi.org/10.1109/ASAP54787.2022.00017
- 30. Poenaru, A., McIntosh-Smith, S.: Evaluating the Effectiveness of a Vector-Length-Agnostic Instruction Set. In: Malawski, M., Rzadca, K. (eds.) Euro-Par 2020: Parallel Processing. pp. 98–114. Lecture Notes in Computer Science, Springer International Publishing, Cham (2020). https://doi.org/10.1007/978-3-030-57675-2\_7
- 31. Pohl, A., Greese, M., Cosenza, B., Juurlink, B.: A Performance Analysis of Vector Length Agnostic Code. In: 2019 International Conference on High Performance Computing & Simulation (HPCS). pp. 159–164. IEEE, Dublin, Ireland (Jul 2019). https://doi.org/10.1109/HPCS48598.2019.9188238
- 32. Ramírez, C., Hernández, C.A., Palomar, O., Unsal, O., Ramírez, M.A., Cristal, A.: A RISC-V Simulator and Benchmark Suite for Designing and Evaluating Vector Architectures. ACM Transactions on Architecture and Code Optimization 17(4), 1–30 (Dec 2020). https://doi.org/10.1145/3422667
- 33. Schmidt, C., Ou, A., Asanović, K.: Hwacha V4: Decoupled Data Parallel Custom Extension, https://riscv.org/wp-content/uploads/2018/12/Hwacha-A-Data-Parallel-RISC-V-Extension-and-Implementation-Schmidt-Ou-.pdf
- 34. Schmidt, C., Wright, J., Wang, Z., Chang, E., Ou, A., Bae, W., Huang, S., Milovanović, V., Flynn, A., Richards, B., Asanović, K., Alon, E., Nikolić, B.: An Eight-Core 1.44-GHz RISC-V Vector Processor in 16-nm FinFET. IEEE Journal of Solid-State Circuits 57(1), 140–152 (Jan 2022). https://doi.org/10.1109/JSSC.2021.3118046
- 35. Soria-Pardos, V., Armejach, A., Suárez, D., Moretó, M.: On the use of many-core Marvell ThunderX2 processor for HPC workloads. The Journal of Supercomputing 77(4), 3315–3338 (Apr 2021). https://doi.org/10.1007/s11227-020-03397-6
- 36. Xianyi, Z.: OpenBLAS (Mar 2023), https://github.com/xianyi/OpenBLAS