Abstract
Time-of-Flight (ToF) camera systems are increasingly capable of analyzing larger 3D spaces and providing more detailed and precise results. To increase the speed-to-solution during development, testing and validation of such systems, light propagation simulation is employed. One such simulation, RSim, was previously performed on single workstations; however, the increase in detail required for newer ToF hardware necessitates cluster-level parallelism in order to maintain an experiment latency that enables productive design work. Celerity is a high-level parallel API and runtime system for clusters of accelerators intended to simplify the development of domain science applications. It automatically manages data and work distribution, while also transparently enabling asynchronous overlapping of computation and communication. In this paper, we present a use case study of porting the full RSim application to GPU clusters using the Celerity system. In order to improve scalability, a new parallelization scheme was employed for the core simulation task, and Celerity was extended with a high-level split constraints feature which enables this scheme. We present strong- and weak-scaling experiments for the resulting application on three accelerator clusters and up to 128 GPUs, and also evaluate the relative programming effort required to distribute the application across multiple GPUs using different APIs.
1 Introduction
Time-of-Flight (ToF) cameras generate pulses of light of a specific wavelength—usually not visible to the human eye—and measure how long it takes for these pulses of light to reflect off the surrounding surfaces and return to a built-in sensor. This allows such systems to create a reasonably accurate virtual 3D reconstruction of the room or scenery in front of the camera [13]. These systems need to be tested both during research and development as well as, eventually, for validation in safety-critical situations. Real-world testing can be expensive and time consuming, and it is generally not possible to provide rapid real-world feedback across a wide variety of testing scenarios for evaluating design changes during product development. Therefore, room impulse response simulations [29] are employed, which simulate the propagation of light through an arbitrary 3D scene, usually provided as a polygonal model. We will refer to the particular algorithmic implementation of this simulation which is used and extended in this paper as RSim [26].
With increasing detail requirements for modern ToF systems and their use cases, the corresponding simulations need to process complex 3D scenes. This results in increased memory and compute requirements, which can no longer be satisfied by a single GPU or shared memory node. As such, a distributed memory accelerator cluster version of RSim is needed.
Targeting a traditional GPU cluster HPC software stack—i.e. MPI+X—with this type of completely new porting effort of a real-world application necessitates a high level of expertise in parallelization and distributed software optimization on the part of the developers. The resulting implementation is also often quite inflexible due to the low level of abstraction provided by the underlying systems, which makes further experimentation—both at the domain science level as well as the parallelization level—difficult and labor-intensive.
Meanwhile, modern HPC runtime systems seek to automate aspects such as work and data distribution, in an attempt to bridge the gap between relatively straightforward implementation of domain science on the one hand and the complexities of large heterogeneous distributed memory clusters on the other. While systems such as Celerity [25] can greatly reduce the burden on the application programmer, whether or not sufficient scalability can be achieved in a larger, real-world application is often unclear.
The main focus of this paper is on the algorithms used in RSim, how they can be adapted to a parallel distributed memory topology, and how the high-level Celerity API enables a high-performance implementation of the resulting application. We will also discuss some minimal but effective API extensions which resulted from this work. The mathematical modeling, proof of accuracy and background of the room response simulation method [16] are not part of the scope of this work.
The remainder of this paper is structured as follows: Sect. 2 provides necessary background information on the RSim application and its core algorithms (Sect. 2.1) as well as the Celerity API and runtime system (Sect. 2.2). Section 3 describes the core of the port implementation and various optimizations, including algorithmic parallelization changes, which were implemented. In particular, Sect. 3.2.2 discusses an API addition made to Celerity to provide high-level user control over how work is split across GPUs in the cluster. A detailed performance evaluation across three GPU clusters, with several real-world scenes as well as weak scaling experiments, is presented in Sect. 4. We also provide some insight into the comparative productivity of the Celerity API in this application in Sect. 4.6. Related work is discussed in Sect. 5, before Sect. 6 concludes the paper.
2 Background
2.1 Room Response Simulation
In the type of room response simulation performed by the RSim application, the scene geometry (G) is represented as a set of triangle meshes with surface material information (\(\rho \)). A light source (L) then generates a pulse of modulated light, which is reflected in G until it returns to the sensor (S), or the user-defined time limit of the simulation is reached. Based on the time series information of light being returned to each sensor pixel, a 3D approximation of the scene can be reconstructed.
For the distributed memory porting of this algorithm, the mathematical details of the simulation [16] are not directly relevant; however, they influence the computational phases of the algorithm and, in particular, their data requirements. The latter constrain the potential parallelization schemes, and thus we will now briefly outline the overall simulation structure and the core algorithms involved.
2.1.1 Algorithmic Overview
Figure 1 provides an overview of the RSim application and its most important algorithmic steps, as well as the data they require, generate and operate upon. The computational steps can be summarized as follows:
(1) Reading the geometric input data and building a triangle mesh representation. This may optionally include building a geometric acceleration structure [10, 11, 14] in order to speed up subsequent computations.

(2) Pre-computing three kinds of derived geometry data which are required by the main simulation loop:

(a) The area of each triangle, stored in \(A_I\) for triangle I.

(b) The mutual signal delay, a quantity derived from the distance between each pair of triangles, stored in \(T_{IJ}\) for triangles I and J.

(c) The visibility between each pair of triangles, which stochastically estimates the potential energy transfer between them, stored in a matrix \(K_{IJ}\) indexed in the same fashion as \(T_{IJ}\).

(3) The main simulation loop: for each radiosity time step \(t \in [0, T)\), propagate radiosity, computing \(R_{It}\) for each triangle I based on \(A_I\), \(T_{IJ}\) and \(K_{IJ}\) for all triangles J, as well as \(R_{It_p}\) for \(t_p \in [0, t)\).

(4) Finally, deriving an approximation of the observed geometry by inverse distance computation based on the radiosity time series for each sensor element stored in \(R_{It}\).
To illustrate which of these algorithmic steps are most relevant for distributed memory parallelization and optimization, we will now briefly outline their computational complexity, for a number of scene triangles \(N = |G|\) and a time step count T. In this context it is important to note that, in all realistic use cases, \(N \gg T\), that is, the number of geometric primitives in the scene is far higher than the time step count.
Steps 1 and 2a are \(\mathcal {O}(N)\) and therefore do not contribute meaningfully to the overall computation time. Step 2b is trivially identified as \(\mathcal {O}(N^2)\), but its constant factor is low and its overall contribution to the simulation time is still negligible. The final distance computation (step 4) performs a cross-correlation across the per-timestep radiosity values, with a complexity of \(\mathcal {O}(N \cdot T^2)\). As T is much smaller than N, this step is also ultimately negligible in overall execution time.
The remaining steps, visibility computation (step 2c) and the main simulation loop (step 3), are highly relevant to performance and require in-depth treatment.
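To make these relative magnitudes concrete, consider an illustrative configuration with \(N = 10^5\) triangles, \(T = 10^3\) time steps and \(R = 10\) rays per triangle pair (hypothetical values chosen only for the sake of this back-of-the-envelope comparison; the visibility complexity is derived in Sect. 2.1.2):

\[
\begin{aligned}
&\text{steps 1, 2a: } \mathcal{O}(N) \approx 10^{5}, \qquad \text{step 2b: } \mathcal{O}(N^2) \approx 10^{10},\\
&\text{step 4: } \mathcal{O}(N \cdot T^2) \approx 10^{11}, \qquad \text{step 2c: } \mathcal{O}(N^2 \cdot R \cdot \log N) \approx 10^{12},\\
&\text{step 3: } \mathcal{O}(N^2 \cdot T) \approx 10^{13}.
\end{aligned}
\]

Even under these rough assumptions, the visibility computation and the main simulation loop dominate all other steps by at least an order of magnitude.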
2.1.2 Mutual Visibility Computation
As the goal of this step is to produce an approximation of the amount of energy which can be exchanged by each pair of triangles, the simplest possible approach is very computationally intensive: for each triangle, cast R rays to all the other \(N-1\) triangles, and verify for each ray whether any of the remaining \(N-2\) triangles intersect that ray. This naïve approach has a complexity of \(\mathcal {O}(N^3 \cdot R)\). In real-world use of the simulation application, R is generally limited to the order of tens, while N can range from thousands to millions, and as such we can assume that the complexity of a naïve function populating \(K_{IJ}\) is \(\mathcal {O}(N^3)\).
This approach is comparatively easy to parallelize in a distributed memory setting; however, it is not meaningful to do so: while parallel efficiency would be excellent, actual work efficiency is much higher with more involved schemes. Generally, geometric acceleration structures are employed to optimize ray-scene queries. The scene is recursively subdivided into smaller regions, which are stored in some form of tree structure. Commonly used acceleration structures include Bounding Volume Hierarchies (BVHs) [17], Octrees [10], and KD-trees [14]. The crucial point for our purposes is that these structures reduce the average-case complexity of ray queries from \(\mathcal {O}(N)\) to \(\mathcal {O}(\log N)\), resulting in an overall complexity for mutual visibility computation of \(\mathcal {O}(N^2 \cdot \log N)\). However, both GPU parallelization and cluster parallelization of this process are non-trivial, and we will discuss this further in Sect. 3.
2.1.3 Radiosity Time Step Computation
In this central part of the simulation, in each time step \(t \in [0,T)\), for each triangle \(I \in G\), the light energy potentially transmitted from every other triangle J needs to be computed. This computation of the radiosity for I in timestep t (\(R_{It}\)) requires access to all of the following: \(A_I\), \(A_J\), \(\rho _I\), \(\rho _J\), \(T_{IJ}\), \(K_{IJ}\), and \(R_{J\tau }: \tau = \lfloor t - T_{IJ} \rfloor \).
From a performance and parallelization perspective, of these data accesses, the first four are not particularly interesting: they are simple 1D arrays of N elements, each of which can be easily replicated across the cluster and stored in full on each GPU.
The matrices \(T_{IJ}\) and \(K_{IJ}\) are \(N^2\) in size, which quickly becomes impossible to store on a single GPU, and also induces significant memory access costs in the main computation phase. It has previously been demonstrated that, particularly on GPUs with their high ratio of floating point throughput to external memory bandwidth, \(T_{IJ}\) can be re-materialized on demand, thus not requiring any storage [26]. Conversely, computing each element of \(K_{IJ}\) is an expensive process and re-materialization is therefore not an option. However, this matrix can be stored in less precise data formats without significant loss in overall simulation accuracy [27]. As such, we use an FP16 data type to store these values. Even so, \(K_{IJ}\) needs to be distributed carefully across the cluster and individual GPUs to achieve meaningful problem size scaling on large parallel systems.
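To put these storage requirements into perspective, consider the largest real-world scene evaluated in Sect. 4 (\(N = 139{,}249\) triangles); the following arithmetic is purely illustrative:

\[
N^2 \approx 1.94 \times 10^{10} \text{ entries} \;\Rightarrow\; \approx 77.6\,\text{GB in FP32} \quad \text{vs.} \quad \approx 38.8\,\text{GB in FP16}.
\]

Even in FP16, \(K_{IJ}\) alone approaches or exceeds the memory capacity of a single current GPU, which is why the matrix must additionally be partitioned across the cluster.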
Finally, the most crucial limiting factor for cluster parallelization of RSim is the access to \(R_{J\tau }\) with \(\tau = \lfloor t - T_{IJ} \rfloor \). What this formula expresses is that the new radiosity value (at timestep t) of a given triangle may depend on any previously computed radiosity value from a timestep up to \(t-1\)—the actual element which will be accessed is data-dependent. This implies a direct data dependency from each simulation time step to all subsequent ones, with each node requiring data produced across the entire cluster. While the total amount of radiosity data computed in a single time step is only \(\mathcal {O}(N)\), the latency of its exchange is nonetheless a limiting factor, as we will elaborate in Sect. 3 and evaluate in Sect. 4.
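The resulting access pattern can be summarized schematically as follows, where f stands for the (here unspecified) radiosity transfer factor of the underlying model [16]; this is a sketch of the data dependencies, not the exact update rule:

\[
R_{It} \mathrel{+}= \sum_{J \ne I} f\!\left(A_I, A_J, \rho_I, \rho_J, K_{IJ}\right) \cdot R_{J\tau}, \qquad \tau = \lfloor t - T_{IJ} \rfloor .
\]

Since \(\tau\) depends on \(T_{IJ}\), the specific radiosity history entries read for a given pair are data-dependent, which is what forces the cluster-wide exchange of each time step's results.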
2.2 SYCL and Celerity
Given the modern landscape of compute hardware available for highly data-parallel and bandwidth-intensive applications such as RSim, it seems obvious that accelerators, or more specifically GPUs, should be targeted. Previous implementations of RSim have accomplished this goal by leveraging low-level APIs such as Vulkan [26]. However, developing software with low-level GPU APIs at a full accelerator cluster level is a challenging endeavor, particularly while relying on the established de-facto standard approach to developing distributed GPU applications: “MPI + X”, where the Message Passing Interface (MPI) [20] is combined with a data parallel programming model such as Vulkan, CUDA, OpenCL or OpenMP.
Celerity is a high-level C++ runtime system for accelerator clusters which specifically focuses on programmability in the complex environment of distributed-memory GPU cluster computing [25]. It presents a fundamentally dataflow-based parallelism model reminiscent of single-GPU programming, where computation is transparently split and distributed across compute nodes. In order to ease adoption and leverage existing standards as far as possible, its programming interface is closely related to the industry-standard SYCL API [24], with minimal extensions required for operation on distributed memory [18]. Celerity internally performs fully distributed and asynchronous task and command graph generation, and has previously been demonstrated to scale up to 128 GPUs for traditional, compute-intensive HPC algorithms [22]. Figure 2 provides a high-level perspective on how SYCL and Celerity are involved in the execution of a program using the Celerity API on a GPU cluster.
A full overview of the SYCL and Celerity APIs is beyond the scope of this paper (Footnote 1), so in this section we will focus on how Celerity extends the data parallelism of SYCL kernels to distributed multi-GPU execution, and in particular those features which affect the porting of RSim.
A typical SYCL program is centered around buffers of data and kernels which manipulate them. The latter are wrapped in command groups and submitted to a queue, which is processed asynchronously with respect to the host process. All buffers need to be accessed through accessors, which additionally require meta-information on how a buffer will be accessed, i.e., for reading, writing or both. This allows the SYCL runtime to construct a task graph based on the dataflow of buffers through kernels. In a similar fashion to CUDA, SYCL abstracts the concept of a (GPU) thread: users express their programs as linear-looking kernel code, which is actually executed on an N-dimensional range of work items.
Celerity extends this concept to distributed data-parallel computation. Celerity kernels are written in the same way as in SYCL, but can be executed across multiple devices on different nodes, with all resulting data transfers handled completely transparently to the user.
The most fundamental extension to SYCL introduced by the Celerity API is the concept of range mappers. These are functors providing additional information about how buffers are accessed from a kernel. By evaluating these range mappers on sub-domains of the execution range, the Celerity runtime system infers which parts of a buffer will be read, and which ones will be written—at arbitrary granularity. This information in turn allows the Celerity runtime system to build graph representations of the computational steps of the application and their required data transfers, which are scheduled asynchronously on the nodes of the cluster.
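As a minimal illustration of these concepts, consider the following sketch, modeled on the public Celerity getting-started examples rather than on RSim code; the buffer size and kernel contents are arbitrary:

```cpp
#include <celerity.h>

int main() {
    celerity::distr_queue q;  // implicitly initializes the distributed runtime
    celerity::buffer<float, 1> buf(celerity::range<1>(1024));

    q.submit([&](celerity::handler& cgh) {
        // The built-in one_to_one range mapper declares that work item i accesses
        // element i, so the runtime may split this kernel across nodes freely.
        celerity::accessor acc{buf, cgh, celerity::access::one_to_one{},
                               celerity::write_only, celerity::no_init};
        cgh.parallel_for<class init_buffer>(celerity::range<1>(1024),
                                            [=](celerity::item<1> it) {
            acc[it] = static_cast<float>(it.get_id(0));
        });
    });
    return 0;
}
```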
3 Implementation and Optimizations
Porting the RSim code from a single-GPU version to Celerity follows a relatively straightforward recipe. For each GPU kernel, rather than submitting it to a local GPU work queue, it is instead submitted to a Celerity distributed queue. This allows the Celerity runtime system to distribute the parallel work across the cluster, without any communication code being required in the user program.
The most challenging part of the porting process is correctly setting up the range mappers for the more complex computational steps. In this context, we would like to mention the accessor out-of-bounds detection option first introduced in Celerity 0.4 (Footnote 2). It enables detailed detection and reporting of out-of-bounds buffer accesses and proved invaluable during development.
For our detailed discussion, we will focus on the implementation and optimization of the two most important algorithmic steps, mutual visibility computation and radiosity simulation time stepping, as previously described in Sect. 2.1.2 and Sect. 2.1.3 respectively. As we discussed before, the impact of all other elements of the full simulation application is negligible in terms of overall performance, and they were implemented in a straightforward manner.
3.1 Mutual Visibility Computation
In this section we will discuss the various optimizations, in both data storage and algorithms, which contribute to reducing the runtime and computational complexity of populating the \(K_{IJ}\) matrix.
3.1.1 Spatial Acceleration Structure—Octrees
The greatest algorithmic performance improvement for mutual visibility computation can be derived from using a spatial acceleration structure in order to speed up the ray-scene queries required to stochastically estimate the energy transfer between each pair of triangles. As previously discussed (Sect. 2.1.2), ray tracing and 3D space search are widely studied topics, especially in computer graphics. For the purpose of RSim, Octrees [10] have proven effective.
The basic idea of the octree data structure is that space is recursively subdivided into progressively smaller cuboids, with each being subdivided into 8 equally-sized partitions. This recursive process is stopped once the total number of primitives remaining in a cuboid is below some threshold.
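A minimal pointer-based node, as used conceptually for the CPU-side construction, could look as follows; this is an illustrative sketch only, and the field names as well as the exact RSim data layout are assumptions:

```cpp
#include <array>
#include <cstdint>
#include <memory>
#include <vector>

struct AABB { float min[3], max[3]; };  // axis-aligned bounding cuboid

// Illustrative CPU-side octree node.
struct OctreeNode {
    AABB bounds;                                          // cuboid covered by this node
    std::array<std::unique_ptr<OctreeNode>, 8> children;  // all null for leaf nodes
    std::vector<uint32_t> triangles;                      // triangle indices, leaves only
};
// Subdivision stops once a node contains fewer primitives than a chosen threshold.
```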
While high-performance octree construction on GPUs has previously been demonstrated [15], the translation from a list of triangles to an octree representation is always done on the CPU in RSim for simplicity of implementation. Even for the largest real-world geometric scenes we tested (139k triangles) the total time for octree construction was measured at 0.29 s, which is completely negligible compared to the actual runtime of such a large simulation for any useful number of time steps. The calculation is also performed independently on all cluster nodes, eliminating the need for transferring this data structure.
Flattening the Octree Representation
As outlined in Sect. 2.1.2, the stochastic estimation of the final value \(K_{IJ}\) required by the main radiosity time step simulation involves querying the octree data structure for R rays between every pair of triangles, or \(N^2 \cdot R\) times, resulting in a total complexity of \(\mathcal {O}(N^2 \cdot R \cdot \log N)\). Therefore, in Celerity RSim, this step needs to be parallelized across all the GPUs in the cluster.
In the CPU implementation, traversing the octree usually involves following pointers in a node-based data structure. An option to make this structure more GPU-friendly is to convert the node-based octree into a flattened tree. The latter consists of two densely packed lists, one with the node indices and the other with all triangles. This conversion was implemented via two dedicated Celerity buffers, with the index buffer ordered such that it represents a linear and cache-friendly mapping of the octree structure, i.e. in the order of nodes, subsequent child nodes, and leaf nodes. Figure 3 illustrates the conversion from a node-based tree to this flat representation: an initial node with 8 children points to a later location in the same index buffer, and a leaf node child of one of these nodes refers to the first of its 8 triangles in the second buffer dedicated to triangles.
Note that the general problem of representing a tree data structure in a memory-friendly way is not new or unique [3]—the RSim case is actually comparatively simple, as we do not need to modify the data structure once it has been built.
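The flat layout described above can be sketched as follows; this illustrates the two-buffer idea, while the exact encoding and field names used in RSim are assumptions:

```cpp
#include <cstdint>
#include <vector>

// Flattened octree: two densely packed arrays replace the pointer-based tree.
// An internal node stores the index of its first child (children are stored
// contiguously); a leaf stores the offset and count of its triangles in the
// separate triangle buffer.
struct FlatNode {
    uint32_t first_child     = 0;  // index into `nodes` (internal nodes only)
    uint32_t triangle_offset = 0;  // index into `triangles` (leaf nodes only)
    uint16_t triangle_count  = 0;  // 0 for internal nodes
    uint16_t is_leaf         = 0;
};

struct FlatOctree {
    std::vector<FlatNode> nodes;      // linearized: node, then its children, then leaves
    std::vector<uint32_t> triangles;  // triangle indices, grouped per leaf
};
```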
3.1.2 Storing the Mutual Visibility Matrix
As the individual entries in the \(K_{IJ}\) matrix represent stochastically determined radiosity exchange factors between two triangles, the storage precision of their values only has a very small impact on overall simulation accuracy—up to a point [27]. As such, for Celerity RSim, instead of storing \(K_{IJ}\) in standard IEEE single precision (32 bits), we leveraged Celerity and SYCL's CUDA interoperability to directly access an FP16 data format, a 16-bit floating point type with full arithmetic support on our target hardware.
Given that the function of the \(K_{IJ}\) matrix is to represent the mutual visibility between pairs of triangles, it is symmetric (\(K_{IJ} = K_{JI}\)), and furthermore the diagonal is unused, as it would store the energy transfer from a given triangle to itself. As such, in previous CPU and single-node shared memory implementations of RSim, this matrix was stored in a dense linearized format, which has the advantage of only requiring storage for \(\frac{N^2-N}{2}\) rather than \(N^2\) elements.
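One common linearization of the strict lower triangle, given here purely for illustration (the exact indexing used in earlier RSim versions may differ), maps each pair to

\[
\text{idx}(i, j) = \frac{i\,(i-1)}{2} + j, \qquad 0 \le j < i < N,
\]

which enumerates exactly the \(\frac{N^2-N}{2}\) off-diagonal pairs once.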
However, in a distributed GPU cluster setting, such a storage choice leads to sub-optimal indexing reducing GPU performance, as well as non-contiguous distribution of the required data. As such, for the distributed RSim implementation, we chose to store the full matrix \(K_{IJ}\) despite its symmetry. Note that distributing the matrix in the cluster greatly alleviates storage concerns which originally drove the choice of linearization—two nodes are already sufficient to make up for the storage disadvantage.
When choosing this full 2D storage option, it is also remarkably easy to split the workload of populating \(K_{IJ}\) such that each node only computes the part of the matrix it subsequently requires during radiosity propagation—all that is required is ensuring that the Celerity range mappers are set correctly for accesses to \(K_{IJ}\).
3.1.3 Custom Range Mappers
Due to the flexibility of Celerity's custom range mappers, we are able to specify the precise minimal ranges of data that are needed per work item in a given kernel, even for non-default access patterns. This enables Celerity to load only that subset into the local device memory, for arbitrary splits of work that the runtime system may decide upon. For example, the code in Listing 1 shows the full definition of the range mapper used to access the rows of \(K_{IJ}\), which informs Celerity that each GPU will need the rows of the matrix corresponding to the set of triangles whose radiosity it will propagate.

This range mapper functor maps from a 2D chunk c to a subrange labelled ret. The chunk, or more precisely its range and offset members, represents the execution range of a given part of the overall kernel execution. Celerity selects a split and subsequently invokes this functor with each involved sub-range to determine the required data accesses. In this particular case, the kernel needs to read the corresponding rows of the buffer, and the range mapper therefore sets the members of the returned sub-range accordingly.
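A sketch of such a row-selecting range mapper is given below; the identifier names (k_rows, num_triangles) are assumptions, and the actual code of Listing 1 may differ in detail:

```cpp
// Row-wise range mapper for the N x N matrix K (cf. Listing 1).
// For a kernel chunk covering triangles [c.offset[0], c.offset[0] + c.range[0]),
// request exactly those rows, across all columns.
const auto k_rows = [num_triangles](celerity::chunk<2> c) -> celerity::subrange<2> {
    celerity::subrange<2> ret{celerity::id<2>{c.offset[0], 0},
                              celerity::range<2>{c.range[0], num_triangles}};
    return ret;
};
```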
3.2 Radiosity Propagation
The core radiosity propagation time step kernel is rather simple algorithmically, with the complexity in porting and optimizing it for distributed memory clusters primarily resulting from two factors: (i) individual chunks of work becoming too small to fully utilize GPUs, particularly in strong scaling scenarios, and (ii) the direct, cluster-wide and relatively low-latency data dependencies introduced by each subsequent time step depending on the radiosity results of all other nodes.
3.2.1 Parallelization Strategy for Many-GPU Scaling
The approach to parallelizing the radiosity propagation adopted by all previous parallel versions of RSim is to, in each time step t, distribute chunks of triangles to each parallel compute unit, such that each compute unit works on some exclusive subset of \(R_{It}\). This is a type of owner-computes output data decomposition, and allows for fully independent work on each chunk, with the only interaction being required at the boundary between two time steps.
Figure 4 illustrates this parallelization approach and the most crucial associated data access requirements. The timestep currently being computed produces one new row in R, and the elements within it can be distributed to different compute units, represented by distinct colored shading. Regardless of the extent of a given block of work, each requires access to all previously computed radiosities (shaded grey). However, the work distribution affects the data access requirements in K—e.g. the first colored block computing \(i=2,3\) in the illustration accesses the corresponding rows in K. Note that all elements additionally require access to the full extent of the buffers A and \(\rho \), but as previously discussed these are both invariant and comparatively tiny (containing only N elements), so we can ignore them for this analysis.
This parallelization strategy is optimal in the sense that it minimizes communication to only that implied by hard algorithmic constraints—the true dependencies across time steps. It allows for a total of N parallel elements. As N is generally at least on the order of 10,000 in real-world examples, this is not a substantial drawback in previous parallelizations of RSim targeting only a single node with multi-core CPUs or a limited number of GPUs.
However, when scaling to clusters of GPUs, this limitation to N parallel elements can quickly become a bottleneck: the total computational effort of the simulation scales with \(N^2\), but increasing the number of GPUs used in the simulation in accordance with this scaling will leave increasingly fewer parallel elements per GPU. As a representative example, assuming \(N=30000\) and 128 GPUs, each GPU could only compute \(\approx 234\) parallel elements per time step, while GPU parallelization generally requires a degree of parallelism in the order of tens of thousands to effectively utilize the hardware.
Input Data Decomposition
As such, an alternative approach was required to achieve meaningful scalability on the GPU cluster target hardware of Celerity-RSim. Instead of the more obvious output data decomposition, an input data decomposition scheme was adopted, which is illustrated in Fig. 5. By allowing parallelization across the entire \(N^2\) input matrix, this decomposition provides ample parallelism for GPU clusters, however, it also introduces several new algorithmic implementation complexities:
1. As several parallel work items contribute to a single output value, some synchronization or parallel aggregation scheme is necessary.

2. Performing a full 2D parallelization leaves very little work per element while requiring significant output synchronization, resulting in very low efficiency.

3. Launching this computation as a single 2D data-parallel kernel in the Celerity system would, in principle, allow the runtime to distribute parts of a row to separate GPUs or even distinct nodes in the cluster, extending the synchronization problem outlined above from a single GPU to a cluster-wide issue.
While it is possible to treat each of these issues in isolation, the solution we finally adopted is a result of combining mitigation strategies for all three, and ensuring that they do not substantially affect overall performance.
Fine-grained Input-parallelism Chunking We start our discussion with issue 2. The individual work per input element in a full 2D parallelization is not sufficient to effectively utilize the available hardware, memory bandwidth, and caching. A natural solution to this issue is chunking the computation, also frequently referred to as kernel coarsening.
In this scheme, each actual work item of the GPU kernel computes the resulting radiosity of the interaction of the given triangle I with a set of C (the fine-grained chunk size, set to 2 in the illustration in Fig. 5) paired triangles. In practice, this means that each work item still loops over a range of inputs, but unlike the output data decomposition case, C can be chosen to be far smaller than N, thereby still providing ample parallelism.
Output Aggregation This chunking approach has the additional advantage of greatly reducing the impact of the output synchronization or aggregation bottleneck (issue 1 above). Due to input parallelization, multiple parallel work items contribute to a single output value, and therefore some synchronization is required. Specifically, in the RSim algorithm, the output radiosity \(R_{It}\) is the sum of the radiosity contributions of all paired triangles.
Without chunking, the very fine granularity of the operations would necessitate a high-performance GPU-optimized reduction algorithm. However, as we can set C sufficiently high in practice, an implementation based simply on atomic addition operations is not only possible but also effective: each work item only writes back its result once, after computing several hundred individual triangle radiosity contributions, so only one atomic operation is necessary every several thousand GPU clock cycles. One drawback is that this limits functional performance portability to GPU architectures which feature efficient support for atomic additions on the data type used for the radiosity buffer.
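A schematic device-side sketch of the combined chunking and aggregation scheme is given below. It is illustrative only: `contribution` stands in for the full radiosity model (including the re-materialization of \(T_{IJ}\)), `r_acc` is assumed to be an accessor to the current time step's radiosity row, N is assumed to be divisible by the chunk size C, and all names are assumptions rather than the actual RSim code.

```cpp
// Work item (i, b) pairs output triangle i with the input triangles [b*C, (b+1)*C).
cgh.parallel_for<class propagate>(celerity::range<2>(N, N / C),
                                  [=](celerity::item<2> it) {
    const size_t i  = it.get_id(0);       // output triangle I
    const size_t j0 = it.get_id(1) * C;   // first paired triangle J of this chunk
    float partial = 0.f;
    for (size_t j = j0; j < j0 + C; ++j) {  // C is large enough to amortize the
        if (j == i) continue;               // single atomic operation below
        partial += contribution(i, j);      // reads A, rho, K[i][j] and past R values
    }
    // One atomic add per work item aggregates all partial sums into R[i] for step t.
    sycl::atomic_ref<float, sycl::memory_order::relaxed,
                     sycl::memory_scope::device> out(r_acc[i]);
    out.fetch_add(partial);
});
```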
While this scheme is very effective within a single GPU, it obviously does not scale to cluster data distribution, where the equivalent of an atomic addition operation would imply incurring a latency penalty which is several orders of magnitude higher. However, for use cases of this type, such synchronization is not necessary at the cluster level—as long as the program has a means by which to inform the runtime system of coarse-grained data parallelism constraints.
3.2.2 Split Constraints in Celerity
To enable this type of use case in the Celerity system, we have designed and implemented a relatively simple new API, which we refer to as split constraints (Footnote 3). It provides enough flexibility to also cover a variety of similar optimizations.

Listing 2 shows the split constraints API and a small usage example. A task is submitted to the distributed Celerity work queue, and its command group function receives a handler which manages all interaction of this task with the runtime system. An additional split-constraint call on this handler then constrains the subsequent data-parallel invocation to be split across the cluster only in chunks which at minimum respect the provided granularity (\(2 \times 4\) elements in the example).
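A sketch of such a submission is shown below; it follows the public Celerity examples, and the exact entry point (assumed here to be celerity::experimental::constrain_split) as well as the buffer extents are assumptions rather than a verbatim copy of Listing 2:

```cpp
celerity::distr_queue q;
celerity::buffer<float, 2> buf(celerity::range<2>(128, 256));

q.submit([&](celerity::handler& cgh) {
    celerity::accessor acc{buf, cgh, celerity::access::one_to_one{},
                           celerity::write_only, celerity::no_init};
    // Constrain any cluster-level split of this kernel to multiples of 2 x 4 work items.
    celerity::experimental::constrain_split(cgh, celerity::range<2>(2, 4));
    cgh.parallel_for<class constrained_kernel>(celerity::range<2>(128, 256),
                                               [=](celerity::item<2> it) {
        acc[it] = 0.f;
    });
});
```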
Note that these constraints compound with the already existing constraint that Celerity may only split kernels at work group boundaries. As such, in each kernel dimension, the effective split constraint is the least common multiple of the work group size and the optional user-defined split constraints.
In Celerity-RSim, this new API is employed to limit the system to performing row-wise cluster distribution of the input parallelism in \(K_{IJ}\), as illustrated in Fig. 5. Within a single GPU, the full 2D parallelism can still be used. This allows the implementation to leverage efficient local atomic operations for result aggregation, while still providing a high degree of flexibility to the runtime system in terms of data distribution, and only requiring one additional line of code at the application level.
4 Performance Evaluation
4.1 Experiment Setup
We evaluated the performance of Celerity-RSim across three types of scene geometries and three GPU clusters, performing both strong and weak scaling experiments. Table 1 lists the scene geometries we evaluated in our benchmarks. The traffic and rvl scenes represent two real-world use cases, available at three fixed levels of detail each, while the synth scene is algorithmically generated and therefore allows for arbitrary triangle counts, which enables weak scaling experiments. For all benchmarks, we computed 1000 light propagation simulation time steps.
The hardware and software stack for this evaluation is detailed in Table 2. We benchmarked the performance of Celerity-RSim on three accelerator clusters: IFI, a local cluster based on consumer hardware used for development and medium-scale testing at the University of Innsbruck, as well as M100 (Marconi100) and LEO (Leonardo), former and current tier-0 European supercomputer systems hosted at CINECA.
While the underlying OS and CUDA versions differed between the three systems as detailed in the table, the higher-level software stack was uniform. For all systems, LLVM 12.0.1 (built using the Spack package manager when not natively available on a given platform) was used to compile hipSYCL (now known as AdaptiveCpp, Footnote 4) at commit 1046a78777. This served as the basis for compiling Celerity and Celerity-RSim. For Celerity, an in-development version between releases v0.5.0 and v0.6.0 was used, specifically corresponding to the git revision d91cfea. We used the pre-installed version of OpenMPI on each target platform.
All data points were measured 10 times, and the median result is reported.
4.2 Baseline Celerity Efficiency
Before discussing cluster scalability, which is of course the main goal of the Celerity version of RSim, we first demonstrate that the baseline single-GPU and single-node performance of the Celerity version—which cluster-wide efficiency metrics will be derived from—is competitive.
Figure 6 shows a comparison of the execution time of the core radiosity simulation across four different versions of RSim: (i) Cuda-RSim, a baseline CUDA implementation of the algorithm supporting only a single GPU, (ii) Celerity-RSim, the Celerity version presented in this paper, which supports distributed memory cluster parallelism, (iii) Celerity-RSim-1D, which is our Celerity version without the input data decomposition scalability optimization discussed in Sect. 3.2.1, and (iv) RTX-RSim, a highly optimized hardware-specific implementation which supports multiple GPUs on a single shared memory node [27].
On a single GPU, we can observe that Celerity-RSim performs close to the CUDA baseline implementation. The output-parallel “1D” version actually outperforms the input-parallel variant by a notable margin. This is expected on a single or small number of GPUs, and is a tradeoff for the much higher level of parallelism available in the input-parallel scheme. RTX-RSim is a highly-optimized and hardware-specific low-level implementation, and is faster than all other versions on a single GPU.
Moving to four GPUs (the maximum available on a single node of the target cluster, and thus the limit for RTX-RSim), we note that Celerity-RSim with the new decomposition scheme we outlined in Sect. 3.2.1 scales more effectively than previous schemes even at a relatively small number of total GPUs. It reaches a parallel efficiency of above 95%, compared to 71% and 76% for Celerity-RSim-1D and RTX-RSim respectively.
Overall, the distributed-memory-ready, high-level Celerity-RSim compares favorably to the CUDA baseline on a single GPU. While it cannot fully match the performance of the more complex, highly-optimized and vendor-specific RTX-RSim code, it nonetheless reaches within 10% of its performance in the 4-GPU case thanks to a greater focus on scalability.
4.3 Strong Scaling
For evaluating the cluster scalability of Celerity-RSim, we will first present results for the more challenging strong scaling scenarios. Note that some of the real-world geometries do not fit into the memory of a single GPU on a given cluster. For discussing scalability and efficiency in these cases, we use relative metrics compared to the smallest number of GPUs which can compute the simulation.
Figures 7 and 8 illustrate the speedup for various real-world scenes on the IFI and M100 clusters respectively. On IFI, only the smallest traffic32251 geometry fits into a single GPU. For this small test case, near-linear parallel scaling is maintained up to 4 GPUs. On 8 GPUs, parallel efficiency is 88%, dropping to 75% on 16 GPUs and to slightly above 50% on 32 GPUs. Note that with 32 GPUs, for this small simulation, the total absolute time is down to less than 4 s, leaving less than 4 milliseconds per time step, each of which algorithmically requires global data exchange.
This small absolute amount of compute time between required global radiosity data updates causes the drop in efficiency experienced by the small scenes. This problem is exacerbated further on M100, as the individual GPUs in this cluster are much faster, which increases the impact of communication overhead when benchmarking equivalent problem sizes. Note that larger geometric scenes scale to larger GPU counts, with the rvl139249 case showing positive scaling up to 128 GPUs.
A particularly noteworthy aspect of these results is the fact that, for some specific scenes at relatively low GPU counts, super-linear speedup can be observed. The most pronounced case is traffic124124 on IFI, reaching a relative efficiency of 1.2 on 16 GPUs. This behaviour is consistent across repeated runs, and is related to the caching behaviour of the various data structures involved in the simulation. As the per-GPU workload drops, each GPU only needs to access a smaller sub-range of the overall data set; as long as the computation time per time step remains large enough to compensate for the required global communication, this improved locality makes the per-GPU computation slightly more efficient.
4.4 Weak Scaling
While the strong scaling behavior of existing test scenes is important, one of the main motivations for developing a GPU cluster version of RSim is the ability to process more detailed simulations on larger systems. To quantify the ability of Celerity-RSim to achieve this goal, we performed a weak scaling experiment with a synthetic scene geometry that can be scaled arbitrarily.
Figures 9 and 10 provide two perspectives on the result of this weak scaling evaluation. Note that the number of triangles in the scene as indicated in these plots grows with the square root of the number of GPUs, as the computational workload of the simulation scales with \(N^2\), as detailed in Sect. 2.1.
With the workload of the simulated geometry scaling with the number of GPUs involved in the simulation, Celerity-RSim achieves near-linear scaling at \(>92\%\) efficiency up to 64 GPUs, dropping to \(75\%\) for 128 GPUs. Particularly in Fig. 10, the previously discussed super-linear speedup is once again visible, with the 4- and 16-GPU configurations showing slightly better per-GPU throughput than the single-GPU case. The reasons for this behaviour are similar to the strong scaling case—while the workload scales with \(N^2\), only some data structures do, leading to improved caching at some threshold points in the evaluation.
4.5 Weak Scaling on the Leonardo System
Figures 11 and 12 depict the results of performing the same weak scaling experiment shown in Sect. 4.4 for Marconi100 on the Leonardo supercomputer, in order to validate the performance on a second large-scale system.
The overall scalability picture is very similar, with a slightly earlier drop below 90% efficiency, but remaining above 70% at 128 GPUs. We believe that this slight reduction in parallel efficiency can be explained by the local baseline per-GPU performance of LEO being significantly higher (39 MTris/s compared to 26 MTris/s), resulting in lower per-time-step computation times. While communication latency on the new system is also improved, it has not improved to the same extent as the per-GPU performance, and therefore the algorithmically required all-to-all data exchange becomes more of a limiting factor. The super-linear scaling specifically at 4 GPUs which was already observed and discussed in previous experiments on other systems is visible here again, confirming that it is a property of the algorithmic relationship between data access and compute requirements, and largely independent of the hardware platform.
Note that we purposefully used the same problem sizes as for the previous weak scaling experiment, to allow for a 1:1 comparison. The problem size could be substantially increased on this system due to increased VRAM, which would likely result in slight improvements to the weak scaling results.
4.6 Development Effort
Quantifying application development effort or productivity is difficult in general, but particularly so for a comparative analysis of existing versions of the RSim application, for two main reasons: (i) no two versions of the application have a fully equivalent feature set; for example, the CUDA version only supports a single GPU, the Vulkan version supports multiple GPUs and hardware raytracing but not distributed memory cluster computing, and the Celerity version is fully distributed across a cluster; (ii) existing tools for measuring software complexity metrics, such as cyclomatic complexity, fail to capture the true semantics and complexity of the application. For example, they cannot follow the relationship between host and device code in the Vulkan version, and underestimate the complexity impact of inline functors provided as C++ lambda expressions in Celerity code.
Nevertheless, we want to provide some data that, while not conclusive evidence, supports our claim that Celerity greatly reduces development effort for porting real-world applications to multi-GPU accelerator clusters. To enable as much of a one-to-one comparison as possible while focusing on a core feature enabled by Celerity, we will investigate the number of lines of code (LoC) required to provide multi-GPU support in the Vulkan version compared to the equivalent necessary code in Celerity. Furthermore, we will look specifically only at the main radiosity loop, as a comparison of the effort involved in \(K_{ij}\) computation might unfairly disadvantage the Vulkan version, given the complexity involved in utilizing hardware raytracing.
Note that we explicitly excluded any and all initialization, auxiliary, validation, and allocation code from these data points. This includes aspects such as dealing with descriptor sets, loading, and render pipeline issues in Vulkan, which total more than 1000 LoC and would greatly skew the comparison. We do not consider the lack of this complexity a particular achievement of the Celerity system, and therefore chose to focus only on the core data coherence and parallel workload distribution aspects.
For the core radiosity simulation, there are two notable differences: first, Celerity is a single-source infrastructure, while separate host and device code are necessary for Vulkan. Second, more than twice as many lines of code are necessary to implement this component in Vulkan compared to the Celerity version. These differences are primarily due to more expressive arithmetic types in the Celerity version, and less boilerplate being required to launch GPU kernels and transfer parameters between the host and GPU address spaces. We have mostly included this data to illustrate that the simplicity of Celerity when it comes to data distribution, coherence and exchange is not enabled by additional complexity in the specification of the computation itself.
However, the most interesting results are revealed by the next category of functionality. “Chunking for multi-GPU parallelism” counts code required to be able to distribute the calculation to an arbitrary number of GPUs. For Celerity, this is entirely implicit, and no code is required at all. “Data transfers for coherence”, for the Vulkan version, counts functionality strictly necessary to explicitly up/download chunks of the radiosity buffer to/from participating GPU memories in order to maintain coherence. For Celerity, all of this is managed by the runtime system, and the only user code necessary is a definition of range mappers in a declarative manner. Note that 21 lines of code is actually a rather large amount of code for this purpose compared to most Celerity applications, due to the unusual data access patterns required by RSim.
Overall, the Vulkan implementation, even with the previously outlined favorable selection criteria, requires 3.8 times as many lines of code to implement multi-GPU radiosity simulation compared to Celerity RSim. Crucially, this substantial effort still only achieves multi-GPU functionality on a single host system, while the Celerity version transparently scales to distributed memory clusters.
5 Related Work
As Celerity is a relatively young runtime system, there are comparatively few already-published works that have employed it for real-world applications. One of them is the port of Cronos [12], a gamma-ray emission simulation from astrophysics. However, the application in question implements a completely different kind of scientific problem (a finite volume scheme in a regularly-structured 3D grid) and therefore exhibits very different programming model requirements and performance characteristics. For this reason, we extend our related work analysis also to comparable programming models.
Many works have studied the room response algorithm, such as [26], where the authors focus on the performance of the algorithm on a single GPU using Vulkan and hardware raytracing. However, that work lacks distributed memory execution and a discussion of the high-level programming model requirements induced by this type of application. While it can achieve higher performance than our Celerity implementation on a single GPU by about 15%, it requires more knowledge and tuning compared to the high level of abstraction enabled by the use of Celerity. Room-response simulations have also been applied to other scientific fields, such as image reconstruction from sound propagation [19]. In this work, the authors focus on the mathematical groundwork used to model sound propagation, while Fu and Li [9] elaborate on low-level architecture details that are significant when porting such use cases to GPUs. Similar works include sound propagation simulation through ray tracing using CPU multi-threading and SIMD parallelism [23], or fundamental research in solving large-scale radiosity problems on GPU-accelerated clusters [7]. None of these works focus on high-level distributed-memory GPU parallelism or the associated API requirements, nor do they provide a qualitative analysis of API extensions that meet those requirements.
In terms of the programming model landscape, SYCL has gained popularity among HPC developers, motivating scientists to employ it for a variety of applications. For example, in [6] SYCL is compared to other GPU-based programming models, showing it is a viable option to achieve decent performance without requiring in-depth knowledge of the hardware. Similar approaches have been taken in [1] where SYCL is used to accelerate unstructured meshes, or in [4] where SYCL is used, supplemented with MPI for multi-GPU communication, to accelerate the processing of vast data sets for machine learning algorithms. Nevertheless, all of these works either are single-node only or depend on a communication API such as MPI for distributed-memory execution, with all the consequences as outlined in the introduction.
Looking beyond direct or indirect use of SYCL, similar high-level programming model endeavors include dOCAL [21] (based on OpenCL and CUDA), EPSILOD [5] (domain-specific, focuses on stencil codes), SkePU [8] (skeleton-based model using OpenCL and CUDA), Legion [2] (comparatively low-level with increased user effort), and many others. A comprehensive comparative analysis of these programming models is beyond the scope of this application-focused work; however, we will note that skeleton frameworks, when they support the use of distributed memory, generally take the approach of having the user express a specific parallel pattern explicitly. They then map these patterns to a given execution platform. Conversely, in Celerity, users only specify data dependencies using range mappers, and the runtime system implicitly derives a graph representation and a parallel execution plan from this dependency information, which can be evaluated at the required granularity for any desired degree of parallelism.
Kokkos [28] is arguably one of the closest related works, offering single-source programming for heterogeneous architectures while striving for performance portability through compute and memory hierarchy abstractions. However, distributed memory support is provided through its Remote Spaces extension, which is stated to be in an experimental development stage at this time.
6 Conclusion
We have presented Celerity-RSim, a room response simulation application ported to clusters of accelerators via the high-level Celerity runtime system and API. Without any explicit work or data distribution code, the Celerity version of this application achieves near-linear weak scaling on up to 64 GPUs, and a parallel efficiency of 75% on 128 GPUs while simulating a scene with 616K triangles. This demonstrates the real-world applicability of the Celerity system in a case of relatively unusual data generation and dependency patterns. Note that these performance results are achieved while requiring substantially less coding effort for multi-GPU parallelization of the core algorithm (\(\sim 1/4\) LoC) than an implementation based on a lower-level GPU compute API.
Of particular interest is a new parallelization strategy for RSim designed to greatly increase the available parallelism at only minimal cost to sequential efficiency. By adding a simple split constraint API, we can instruct the runtime system to constrain the distribution of a 2D input-parallel execution range in such a way as to not require any additional cluster-wide output synchronization, while benefitting from much higher parallelism at the individual GPU level.
Because the extremely high throughput of modern supercomputer-class GPUs reduces the per-time-step execution time to the range of milliseconds, while global communication is still required between time steps, there is a practical limit to the number of GPUs which can be leveraged efficiently in a strong scaling scenario. Future work towards mitigating this issue would need to investigate fundamental algorithmic changes to reduce direct data dependencies on results computed in the prior time step.
Data availability
No datasets were generated or analysed during the current study.
Notes
Readers may refer to [18, 22, 24, 25], as well as the Celerity documentation at https://celerity.github.io/docs/getting-started.
Added to the Celerity system in PR212, https://github.com/celerity/celerity-runtime/pull/212.
References
Afzal, A., Schmitt, C., Alhaddad, S., Grynko, Y., Teich, J., Forstner, J., Hannig, F.: Solving Maxwell's equations with modern C++ and SYCL: a case study. In: 2018 IEEE 29th International Conference on Application-specific Systems, Architectures and Processors (ASAP), pp. 1–8 (2018). https://doi.org/10.1109/ASAP.2018.8445127
Bauer, M., Treichler, S., Slaughter, E., Aiken, A.: Legion: expressing locality and independence with logical regions. In: SC’12: Proceedings of the International Conference on High Performance Computing, Networking, Storage and Analysis, pp. 1–11 (2012). https://doi.org/10.1109/SC.2012.71
Bender, M., Demaine, E., Farach-Colton, M.: Cache-oblivious b-trees. In: Proceedings 41st Annual Symposium on Foundations of Computer Science, pp. 399–409 (2000). https://doi.org/10.1109/SFCS.2000.892128
Breyer, M., Daiß, G., Pflüger, D.: Performance-portable distributed k-nearest neighbors using locality-sensitive hashing and SYCL. In: Proceedings of the 9th International Workshop on OpenCL. IWOCL'21, Association for Computing Machinery, New York, NY, USA (2021). https://doi.org/10.1145/3456669.3456692
de Castro, M., Santamaria-Valenzuela, I., Torres, Y., Gonzalez-Escribano, A., Llanos, D.R.: EPSILOD: efficient parallel skeleton for generic iterative stencil computations in distributed GPUs. J. Supercomput. 79(9), 9409–9442 (2023). https://doi.org/10.1007/s11227-022-05040-y
Deakin, T., McIntosh-Smith, S.: Evaluating the performance of HPC-style SYCL applications. In: Proceedings of the International Workshop on OpenCL. IWOCL '20, Association for Computing Machinery, New York, NY, USA (2020). https://doi.org/10.1145/3388333.3388643
D'Azevedo, E., Hu, Z., Su, S.Q., Wong, K.: Solving a large scale radiosity problem on GPU-based parallel computers. J. Comput. Appl. Math. 270, 109–120 (2014). https://doi.org/10.1016/j.cam.2014.02.011
Ernstsson, A., Li, L., Kessler, C.: SkePU 2: flexible and type-safe skeleton programming for heterogeneous parallel systems. Int. J. Parallel Prog. 46(1), 62–80 (2018). https://doi.org/10.1007/s10766-017-0490-5
Fu, Z., Li, J.: GPU-based image method for room impulse response calculation. Multimedia Tools Appl. 75, 5205–5221 (2016)
Glassner, A.S.: Space subdivision for fast ray tracing. IEEE Comput. Graph. Appl. 4(10), 15–24 (1984)
Glassner, A.S.: An Introduction to Ray Tracing. Elsevier, Amsterdam (1989)
Gschwandtner, P., Kissmann, R., Huber, D., Salzmann, P., Knorr, F., Thoman, P., Fahringer, T.: Porting real-world applications to GPU clusters: a celerity and cronos case study. In: 2021 IEEE 17th International Conference on eScience (eScience), pp. 90–98 (2021). https://doi.org/10.1109/eScience51609.2021.00019
Hansard, M., Lee, S., Choi, O., Horaud, R.P.: Time-of-Flight Cameras: Principles, Methods and Applications. Springer, Berlin (2012)
Hapala, M., Havran, V.: KD-tree traversal algorithms for ray tracing. Comput. Graph. Forum 30(1), 199–213 (2011)
Hossain, M.M., Tucker, T.M., Kurfess, T.R., Vuduc, R.W.: A GPU-parallel construction of volumetric tree. In: Proceedings of the 5th Workshop on Irregular Applications: Architectures and Algorithms, pp. 1–4 (2015)
Hranitzky, R.: A Scalable multi-DSP System for Room Impulse Response Simulation. Ph.D. thesis, Technical University of Graz (1997)
Karras, T., Aila, T.: Fast parallel construction of high-quality bounding volume hierarchies. In: Proceedings of the 5th High-Performance Graphics Conference, pp. 89–99. ACM (2013)
Knorr, F., Thoman, P., Fahringer, T.: Declarative data flow in a graph-based distributed memory runtime system. Int. J. Parallel Program. 1–22 (2022)
Lehmann, E.A., Johansson, A.M.: Prediction of energy decay in room impulse responses simulated with an image-source model. J. Acoust. Soc. Am. 124(1), 269–277 (2008). https://doi.org/10.1121/1.2936367
Message Passing Interface Forum: MPI: A Message-Passing Interface Standard, Version 3.1 (2015). https://www.mpi-forum.org/docs/mpi-3.1/mpi31-report.pdf
Rasch, A., Bigge, J., Wrodarczyk, M., Schulze, R., Gorlatch, S.: dOCAL: high-level distributed programming with OpenCL and CUDA. J. Supercomput. 76(7), 5117–5138 (2020). https://doi.org/10.1007/s11227-019-02829-2
Salzmann, P., Knorr, F., Thoman, P., Gschwandtner, P., Cosenza, B., Fahringer, T.: An asynchronous dataflow-driven execution model for distributed accelerator computing. In: 2023 IEEE/ACM 23rd International Symposium on Cluster, Cloud and Internet Computing (CCGrid), pp. 82–93. IEEE (2023)
Schissler, C., Manocha, D.: Interactive sound propagation and rendering for large multi-source scenes. ACM Trans. Graph. (2017). https://doi.org/10.1145/3072959.2943779
The Khronos Group: SYCL Specification, Version 2020 Revision 8 (2023) https://registry.khronos.org/SYCL/specs/sycl-2020/html/sycl-2020.html
Thoman, P., Salzmann, P., Cosenza, B., Fahringer, T.: Celerity: High-level C++ for accelerator clusters. In: Euro-Par 2019: Parallel Processing: 25th International Conference on Parallel and Distributed Computing, Göttingen, Germany, August 26–30, 2019, Proceedings 25, pp. 291–303. Springer (2019)
Thoman, P., Wippler, M., Hranitzky, R., Fahringer, T.: RTX-RSim: accelerated Vulkan room response simulation for time-of-flight imaging. In: Proceedings of the International Workshop on OpenCL. IWOCL'20, Association for Computing Machinery, New York, NY, USA (2020). https://doi.org/10.1145/3388333.3388662
Thoman, P., Wippler, M., Hranitzky, R., Gschwandtner, P., Fahringer, T.: Multi-GPU room response simulation with hardware raytracing. Concurr. Comput. Pract. Exp. 34(4), e6663 (2022). https://doi.org/10.1002/cpe.6663, (https://onlinelibrary.wiley.com/doi/abs/10.1002/cpe.6663).
Trott, C.R., Lebrun-Grandié, D., Arndt, D., Ciesko, J., Dang, V., Ellingwood, N., Gayatri, R., Harvey, E., Hollman, D.S., Ibanez, D., Liber, N., Madsen, J., Miles, J., Poliakoff, D., Powell, A., Rajamanickam, S., Simberg, M., Sunderland, D., Turcksin, B., Wilke, J.: Kokkos 3: programming model extensions for the exascale era. IEEE Trans. Parallel Distrib. Syst. 33(4), 805–817 (2022). https://doi.org/10.1109/TPDS.2021.3097283
Vorländer, M.: Simulation of the transient and steady-state sound propagation in rooms using a new combined ray-tracing/image-source algorithm. J. Acoust. Soc. Am. 86(1), 172–178 (1989)
Acknowledgements
This research is supported by the Austrian Research Promotion Agency (FFG) via the UMUGUC Project (FFG #4814683), as well as the European High-Performance Computing Joint Undertaking (JU) project LIGATE under Grant Agreement #956137.
Funding
Open access funding provided by University of Innsbruck and Medical University of Innsbruck.
Author information
Contributions
P.T. and P.G. conceptualized the original research idea and wrote the main manuscript text. F.M.H. carried out implementation and experimentation work. T.F. provided feedback.
Ethics declarations
Conflict of interest
The authors declare no competing interests.
Additional information
Publisher's Note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Rights and permissions
Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article's Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article's Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/.
About this article
Cite this article
Thoman, P., Gschwandtner, P., Molina Heredina, F. et al. Celerity-RSim: Porting Light Propagation Simulation to Accelerator Clusters Using a High-Level API. Int J Parallel Prog 53, 17 (2025). https://doi.org/10.1007/s10766-025-00787-2