# Instant-NeRF: <u>Instant</u> On-Device <u>Neural Radiance Field</u> Training via Algorithm-Accelerator Co-Designed Near-Memory Processing

Yang (Katie) Zhao<sup>1</sup>, Shang Wu<sup>2</sup>, Jingqun Zhang<sup>1</sup>, Sixu Li<sup>1</sup>, Chaojian Li<sup>1</sup>, Yingyan (Celine) Lin<sup>1</sup>

<sup>1</sup>Georgia Institute of Technology, <sup>2</sup>Rice University

{eiclab.gatech, jzhang3368, sli941, cli851, celine.lin}@gatech.edu, {sw99}@rice.edu

Abstract: Instant on-device Neural Radiance Fields (NeRFs) are in growing demand for unleashing the promise of immersive AR/VR experiences, but are still limited by their prohibitive training time. Our profiling analysis reveals a memory-bound inefficiency in NeRF training. To tackle this inefficiency, near-memory processing (NMP) promises to be an effective solution, but also faces challenges due to the unique workloads of NeRFs, including the random hash table lookup, random point processing sequence, and heterogeneous bottleneck steps. Therefore, we propose the first NMP framework, Instant-NeRF, dedicated to enabling instant on-device NeRF training. Experiments on eight datasets consistently validate the effectiveness of Instant-NeRF.

Index Terms: Neural Radiance Field, Algorithm-Accelerator Co-Design, Near-Memory Processing, On-Device Training

# I. INTRODUCTION

3D scene reconstruction is crucial for numerous Augmented and Virtual Reality (AR/VR) applications [11]. Neural radiance fields (NeRFs) [13], [14] have yielded state-of-the-art (SOTA) rendering quality. Therefore, many researchers have tried to speed up NeRF training toward instant NeRF-based 3D reconstruction in many emerging AR/VR applications. Despite the success achieved in accelerating NeRF training on cloud GPUs [14], NeRF-based 3D reconstruction on edge devices [12] is still not feasible.

To close the aforementioned gap between the desired instant on-device 3D scene reconstruction and the currently achievable NeRF training efficiency on edge devices, we first conduct extensive profiling measurements of the SOTA efficient NeRF training method, iNGP [14], on a SOTA edge GPU, XNX [17], to identify the bottlenecks. Specifically, iNGP represents a 3D scene with a multiresolution hash table of trainable embedding vectors, followed by two small multi-layer perceptrons (MLPs) for capturing the density and RGB colors, respectively. Our profiling analysis reveals that computing the embedding vectors and executing the MLPs mentioned above are the efficiency bottlenecks. Furthermore, we identify that these bottlenecks are caused by the limited bandwidth of dynamic random-access memory (DRAM). Specifically, the memory bandwidth utilization is $5.24 \times \sim 21.44 \times$ higher than the corresponding Floating-Point Unit/Arithmetic-Logic Unit (FPU/ALU) utilization. The causes of this memory-bound inefficiency are that (1) the random hash table lookup requires a high memory bandwidth to fetch embedding vectors and (2) both the hash table and the intermediate data of the MLPs require a much larger memory capacity than the on-chip cache provides.

To overcome the aforementioned bottlenecks, emerging near-memory processing (NMP) architectures [1], [3], [9] are promising solutions. This is because they can provide higher memory bandwidth by integrating computation logic units closer to the memory. For example, recent works deploy computation logic units at the bank level in DRAM and achieve around 10× peak bandwidth improvement [3]. Additionally, their per-bank memory capacity can be as large as hundreds of megabytes (MB); it thus can provide sufficient on-chip memory for NeRF training.

Fig. 1: (a) Training time and (b) its breakdown, when running the SOTA efficient NeRF training method [14] on a cloud GPU (2080Ti [16]) and an edge GPU (XNX [17]). Here HT and HT\_b denote the hash table accesses and the corresponding back-propagation, MLP<sub>c</sub> and MLP<sub>c\_b</sub> denote the MLP processing for the color features and the corresponding back-propagation, and MLP<sub>d</sub> and MLP<sub>d\_b</sub> denote the MLP processing for the density features and the corresponding back-propagation, respectively. (See more details in Sec. II-A.)

Despite their promise in alleviating the bottlenecks of NeRF training, directly applying NMP architectures to train iNGP [14] would not be efficient due to the following three challenges. First, the required random hash table lookups in iNGP can result in reduced effective memory bandwidth for NMP architectures. This is because the memory requests adopt a row-wise granularity (e.g., 1KB [3], [18]), whereas each hash table entry (i.e., one embedding vector) only uses 32 bits. Furthermore, the random hash table lookups can cause bank conflicts if two memory requests access the same bank with different addresses, leading to serialized computations and increased latency. Second, the random processing sequence of points in a 3D scene can lead to non-sequential accesses to the same hash table entries, and thus incur long-latency memory accesses. Third, there exist heterogeneous bottleneck steps (e.g., index calculation via hash mapping function, hash table lookup, and MLP) as well as varying data types (e.g., 32-bit integer (INT32) and 32-bit floating-point (FP32)) in iNGP, which require dedicated support.

To address the identified bottlenecks hindering instant on-device NeRF training, we make the following contributions:

- We conduct extensive profiling measurements of the SOTA efficient NeRF training method [14] on SOTA edge devices over eight datasets, and identify the corresponding memory-bound efficiency bottlenecks (Sec. II). Our profiling results can inspire future innovative NeRF training techniques.
- We propose Instant-NeRF, an algorithm-accelerator co-design framework, to tackle the challenges of leveraging the promising NMP architecture to alleviate the memory-bound bottlenecks in the NeRF training process. To the best of our knowledge, Instant-NeRF is the first to leverage an NMP architecture for achieving instant on-device NeRF-based 3D reconstruction.
- Our Instant-NeRF algorithm (Sec. III) integrates a locality-sensitive 3D hash mapping function, which maps neighboring vertices in a 3D scene to neighboring hash table entries to tackle the memory bandwidth bottleneck, and adopts a ray-first point streaming order to enhance the local register hit rates and reduce the required memory access requests.
- Our Instant-NeRF accelerator (Sec. IV) integrates a dedicated mapping scheme optimized for Instant-NeRF's algorithm and a mixed-precision computation logic to cope with the different data types involved. Furthermore, we propose a heterogeneous inter-bank parallelism design, orchestrating the different computation and memory patterns in the heterogeneous bottleneck steps with the inter-bank parallelism opportunities while minimizing the inter-bank data movement overhead.
- Comprehensive experiments (Sec. V) show that Instant-NeRF provides up to 266.1× speedup over SOTA edge GPU baselines while maintaining a similar rendering quality.

Fig. 2: An illustration of vanilla NeRFs' [13] training process.

### II. BACKGROUND AND MOTIVATION

# A. iNGP with SOTA NeRF Training Efficiency

**Vanilla NeRFs' Training Pipeline and Cost.** Given images from sparsely sampled views of a scene, NeRFs learn to reconstruct the scene to generate images from any arbitrary view [13]. Fig. 2 shows vanilla NeRFs' training process, involving six steps.
Specifically, **Step** (a) randomly selects pixels from the input images as a batch, where the selected pixels' coordinates and viewing directions serve as NeRFs' inputs and their RGB colors serve as the corresponding ground-truth labels during training. In **Step** (b), for each selected pixel, multiple 3D points are sampled along the ray that is formulated as $\mathbf{r} = \mathbf{o} + t\mathbf{d}$ $(t \in \{t_i\}, i \in [1, N])$. Here $\mathbf{o}$ is the coordinate of the camera's position, $\mathbf{d}$ is the unit vector that points to the pixel from $\mathbf{o}$, $N$ is the total number of sampled points along each ray, and $\{t_i\}$ denotes the set of distances between $\mathbf{o}$ and the points $\mathbf{o} + t_i \mathbf{d}$. In **Step** (c), given the $i$-th point on the ray $\mathbf{r}$, the corresponding spatial location $\mathbf{o} + t_i \mathbf{d}$ and direction $\mathbf{d}$ are applied to an MLP model ($F_{\Theta}$ in Fig. 2), which then outputs the color $\mathbf{c}_i$ and density $\sigma_i$ of this sampled point. **Step** (d) synthesizes each pixel's color via volume rendering [10]:

$$\hat{\mathbf{C}}(\mathbf{r}) = \sum_{i=1}^{N} T_i \left(1 - \exp(-\sigma_i (t_{i+1} - t_i))\right) \mathbf{c}_i \quad (1)$$

where $T_i = \exp\left(-\sum_{j=1}^{i-1} \sigma_j(t_{j+1} - t_j)\right)$. **Step** (e) calculates the loss $\mathcal{L} = \sum_{\mathbf{r} \in \mathcal{R}} \left\| \hat{\mathbf{C}}(\mathbf{r}) - \mathbf{C}(\mathbf{r}) \right\|_2^2$, where $\mathcal{R}$ is the ray set of the current training batch and $\mathbf{C}(\mathbf{r})$ is the corresponding ground-truth color. **Step** (f) performs back-propagation. As the MLP model requires >1 million FLOPs per input point, vanilla NeRF [13] training typically requires >1 day per scene even on a SOTA cloud GPU [16].

Fig. 3: How iNGP [14] implements Step (c) of vanilla NeRFs.

**iNGP's Training Pipeline.**
To reduce the training cost of vanilla NeRFs, iNGP [14] replaces the MLP model in the above dominant Step (c) with a multi-resolution hash table of trainable embedding vectors and two much smaller MLPs, enabling efficient training (e.g., 305.8s/scene in Fig. 1(a)). Here the hash table encodes multi-resolution grids (i.e., a total of $L$ resolutions) into $T$ vectors per level, each having a length of $F$. Fig. 3 shows how iNGP [14] implements Step (c) of vanilla NeRFs with five steps:

- **Step** (1) - Hashing of cube vertices: Given an input point location $\mathbf{x} = (x_0, x_1, x_2)$, $L$ surrounding 3D cubes (one cube per level) are first found.
- **Step** (2) - Lookup embedding vectors: Based on the surrounding cubes, iNGP fetches the corresponding embedding vectors from the hash table, which has $T$ entries per level.
- **Step** (3) - Trilinear interpolation: The embedding of the point at each level is computed by trilinearly interpolating the embeddings of the corresponding eight surrounding vertices.
- **Step** (4) - Concatenation: iNGP concatenates the resulting embeddings of all levels as the inputs of the subsequent MLP models.
- **Step** (5) - Execute MLPs: The density and RGB color features are generated via the above MLP models.

In this work, we use "HT" to denote **Steps** (1~3) and "MLP<sub>d</sub>"/"MLP<sub>c</sub>" to denote the forward process of the density/color MLP in **Step** (5). The corresponding back-propagation processes, which update the embedding vectors and MLP parameters, are denoted as "HT\_b" and "MLP<sub>c\_b</sub>"/"MLP<sub>d\_b</sub>".

# B. Profiling SOTA Efficient NeRF Training Method on GPUs

To understand the bottleneck of iNGP [14] training, we first profile its training process on GPUs.

**Profiling Setup.** Profiling Platform: We use SOTA GPUs including two edge GPUs (XNX [17] and TX2 [15]) and one cloud GPU (2080Ti [16]). Tab. I summarizes the device specifications of these GPUs as well as one SOTA edge GPU adopted by Meta's latest VR headset, Quest Pro [12].
Since the edge GPUs adopted for profiling have an on-chip cache size and FP32/INT32/FP16 computation performance comparable to those of the Quest Pro GPU, the profiling results can reflect the bottlenecks of NeRF-based on-device 3D reconstruction on VR/AR devices. Two-Stage Profiling Method: Stage (1) characterizes the runtime of each step (or kernel) in the training process to locate the dominant steps, and Stage (2) profiles the DRAM bandwidth utilization and computation resource utilization of the located dominant steps to identify the source of inefficiency. The GPU runtime and resource utilization are measured by NVIDIA's nvprof toolbox. Algorithm & Datasets: We evaluate iNGP [14] on eight datasets of Synthetic-NeRF [13]. Each dataset takes 35,000 iterations with a batch size of 256K sampled points/iteration.

TABLE I: A summary of the considered SOTA GPUs' specs.

| Spec. | XNX [17] (edge GPU) | TX2 [15] (edge GPU) | Quest Pro\* [12] (edge GPU) | 2080Ti [16] (cloud GPU) |
|---------------|-------------------------------------|-----------------------------------|-----------------------------------|----------------------------------|
| Tech. | 16nm | 16nm | 7nm | 12nm |
| Power | 20W | 15W | 5W | 250W |
| DRAM | 128-bit 16GB<br>LPDDR4x<br>59.7GB/s | 128-bit 8GB<br>LPDDR4<br>25.6GB/s | 64-bit 12GB<br>LPDDR5<br>44.0GB/s | 352-bit 11GB<br>GDDR6<br>616GB/s |
| GPU L2 Cache | 512KB | 512KB | 1MB | 5.5MB |
| FP32/INT32 | 885 GFLOPS | 750 GFLOPS | 955 GFLOPS | 13.45 TFLOPS |
| FP16 | 1.69 TFLOPS | 1.50 TFLOPS | 1.85 TFLOPS | 26.9 TFLOPS |
| Training Time | 7088s/scene | 44653s/scene | N/A | 306s/scene |

<sup>\*</sup>: Specs. of the Qualcomm Adreno 650 GPU in Meta's Quest Pro VR headset [12].

**Profiling Result Analysis.** Although iNGP reduces the training time on cloud GPUs to <6 minutes per scene, it still requires >1 hour per scene on the edge GPUs, as shown in Fig. 1(a). From the training time breakdown in Fig.
1(b), we can locate **four** efficiency-bottleneck steps/kernels: **HT**, **HT**<sub>b</sub>, **MLP**<sub>d</sub>, and **MLP**<sub>c</sub>. These steps/kernels (with their back-propagation processes) account for 76.4% of the total training time. Note that as the training on XNX is 2.9× faster than that on TX2, we only visualize the profiling results on XNX in Fig. 1.

After locating the dominant steps/kernels, we measure their DRAM read/write throughput and FPU/ALU performance (i.e., FP or INT operations per second). Here the DRAM bandwidth utilization is calculated as the ratio of the achieved DRAM throughput to the maximum bandwidth provided by the GPU. Similarly, we can calculate the computation resource utilization for the FPU/ALU. Our profiling results lead to the following three observations.

First, the steps/kernels exhibit a DRAM bandwidth-bound bottleneck, where the DRAM bandwidth utilization is $5.24 \times \sim 21.44 \times$ higher than the FPU/ALU utilization (see Fig. 4). Specifically, HT/MLP<sub>d</sub>/MLP<sub>d\_b</sub>/MLP<sub>c</sub>/MLP<sub>c\_b</sub> achieve 61.3%/47.5%/73.7%/47.5%/73.7% DRAM bandwidth utilization (given the 59.7GB/s maximum DRAM bandwidth), while the FP32/FP16/INT32 utilization of these five steps/kernels is $\leq 1.5\%/\leq 1.6\%/\leq 6.4\%$, respectively. Note that both the DRAM and FPU/ALU utilization are relatively low for HT<sub>b</sub>, as HT<sub>b</sub> involves frequent write-after-read operations to update the embedding vector gradients, where idleness exists between the read and write operations.

Second, the causes of the memory-bound inefficiency above are that (1) random lookups to the hash table, which stores the multi-resolution grids' embedding vectors, require a high memory bandwidth and (2) the on-chip GPU cache capacity is too small for handling the hash table storage requirements and processing the MLPs.
Specifically, each individual level of the hash table is 2MB, which is $2 \times \sim 4 \times$ larger than the available edge GPU cache capacity, let alone the 64MB intermediate data for the MLP processing, as suggested in Tab. II.

Third, the index calculation via the hash mapping function [14], an important part of the hash table lookups, consumes a large portion of the total INT32 ALU utilization. Specifically, we observe that the INT32 ALU utilization, which is caused by the index calculation, is $4.2 \times \sim 160.7 \times$ higher than the FP32/FP16 utilization, which is caused by the computations of other steps/kernels. This calls for dedicated architecture support for the index calculation in iNGP.

Fig. 4: The DRAM read/write throughput and computation logic utilization of the efficiency-bottleneck steps/kernels (and their corresponding back-propagation processes) when running the iNGP [14] training method on a SOTA edge GPU [17].

# C. Identified Opportunities for NMP-based NeRF Training

As analyzed in Sec. II-B, offloading the detected memory-bound bottleneck steps to NMP architectures is promising in reducing the total training latency. We consider a type of DRAM widely used by edge devices [15], [17], Low-Power Double Data Rate 4 (LPDDR4) [18], as an example to discuss the opportunities of using NMP to accelerate the training process. First, as shown in Fig. 5, an LPDDR4 channel typically has one rank with one die per rank, and one LPDDR4 die contains 16 physical banks, which share a common I/O interface. While the I/O interface width is 16-bit and the internal prefetch structure has a width of 128-bit/physical bank (i.e., 16n prefetch structure [18]), the row buffer within each bank provides a data width of 1KB. This organization offers an intrinsic parallelism opportunity [3] for addressing the **memory bandwidth** bottleneck in iNGP training.
Second, for a typically adopted 8/16GB 128-bit LPDDR4 memory system in edge devices [15], [17], each bank has 128MB~256MB capacity, providing a sufficient **memory capacity** for NeRF training. Finally, as each bank contains subarrays, different subarrays can be accessed mostly independently [7]. Therefore, our proposed Instant-NeRF adopts a near-bank NMP architecture with subarray parallelism for enabling on-device NeRF training, as illustrated in Sec. IV.

# III. INSTANT-NERF ALGORITHM

We introduce two algorithmic techniques in Instant-NeRF to address the memory-bound bottlenecks of iNGP arising from (1) the need for random hash table lookups and (2) the point processing sequence for the randomly selected pixels in a batch.

# A. Developed Locality-sensitive 3D Hash Mapping Function

The embedding interpolation in iNGP always fetches the embeddings of the eight surrounding vertices in the 3D cube (see Fig. 3). Leveraging this to enhance the locality of hash table lookups, we propose to adopt Morton code [4], which maps neighboring vertices in a 3D scene to neighboring hash table entries, as a locality-sensitive 3D location hash mapping function. This hash mapping function can be formulated as:

$$h(\mathbf{x}) = (f(x_0) + (f(x_1) \ll 1) + (f(x_2) \ll 2)) \mod T \quad (2)$$

where $T$ is the number of entries per hash table level and $f(x)$ is a separate-one-by-two function that inserts two zero bits between every pair of adjacent bits (e.g., $f(\underline{1011}_2) = \underline{1000001001}_2$). In this way, data locality during hash table lookups for one point's 3D cube is greatly enhanced. As shown in Fig. 6, with Morton encoding, 82.0% of the index distances between two neighboring vertices of one 3D cube are less than 16 entries in the hash table and none is larger than 5000; in contrast, for the original design in [14], only 55.4% of the index distances between neighboring vertices are $\leq$ 16 and 22.7% are >5000.
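
Eq. (2) can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the authors' implementation: the function names and the 21-bit coordinate width are hypothetical choices, and the table size $T = 2^{19}$ is inferred from a 2MB hash table level with 32-bit entries.

```python
from itertools import product


def separate_one_by_two(x: int, bits: int = 21) -> int:
    """f(x) in Eq. (2): insert two zero bits between adjacent bits of x."""
    out = 0
    for i in range(bits):
        # Bit i of the input moves to bit 3*i of the output.
        out |= ((x >> i) & 1) << (3 * i)
    return out


def morton_hash(x0: int, x1: int, x2: int, table_size: int, bits: int = 21) -> int:
    """h(x) in Eq. (2): interleave the three coordinates, then wrap to T entries."""
    code = (separate_one_by_two(x0, bits)
            + (separate_one_by_two(x1, bits) << 1)
            + (separate_one_by_two(x2, bits) << 2))
    return code % table_size


def cube_vertex_indices(corner, table_size):
    """Hash table indices of the eight vertices of the cube with lowest corner `corner`."""
    x0, x1, x2 = corner
    return [morton_hash(x0 + d0, x1 + d1, x2 + d2, table_size)
            for d0, d1, d2 in product((0, 1), repeat=3)]


T = 2 ** 19  # assumed table size: a 2MB level with 32-bit entries

# The separate-one-by-two example given with Eq. (2).
assert separate_one_by_two(0b1011) == 0b1000001001

# The eight vertices of the cube at the origin land in eight consecutive entries.
print(sorted(cube_vertex_indices((0, 0, 0), T)))  # [0, 1, 2, 3, 4, 5, 6, 7]
```

Not every cube is this compact: for the cube whose lowest corner is (1, 1, 1), the same sketch yields indices between 7 and 56, which is why some index distances still exceed 16 entries.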
Additionally, since the memory requests adopt a row-wise granularity with a commonly used row size of 1KB [18], our hash mapping function needs 1.58 memory requests on average for one 3D cube, while the original design requires 4.02 on average.

Fig. 5: High-level LPDDR4 [18] DRAM organization and Instant-NeRF microarchitecture's integration location at each bank.

Fig. 6: The breakdown of index distances between two neighboring vertices.

Fig. 7: (a) The number of points sharing the same cube and (b) the normalized effective memory bandwidth improvement thanks to the proposed algorithmic techniques.

# B. Proposed Ray-first Point Streaming Order

Instant-NeRF's algorithm further incorporates a ray-first point streaming order, where points along one ray are streamed into the accelerator for processing before moving on to the next ray. This streaming order offers two benefits. **First**, it enhances the local register hit rates and reduces unnecessary memory requests, since neighboring points along a ray with the same surrounding cube will look up the same embeddings (as shown in Fig. 7(a)). **Second**, based on the fact that neighboring points along a ray tend to have neighboring surrounding cubes, we can combine the locality-sensitive hash mapping function with the ray-first point streaming order to further enhance the locality of hash table lookups. Our evaluation shows that this combination leads to $3.27 \times \sim 35.9 \times$ effective memory bandwidth improvement (as shown in Fig. 7(b)).

# IV. INSTANT-NERF ACCELERATOR

Our Instant-NeRF accelerator can comprise one or several DRAM dies, where each bank is equipped with its own Instant-NeRF microarchitecture, as shown in Fig. 5. In this section, we first introduce Instant-NeRF's microarchitecture per bank that integrates a mixed-precision computation logic to cope with different data types in iNGP.
Then, we present our optimized hash table mapping scheme for Instant-NeRF's algorithm. After that, we describe our heterogeneous inter-bank parallelism design, which orchestrates the heterogeneous steps with the inter-bank parallelism opportunities to minimize the costly inter-bank data movements.

# A. Instant-NeRF's Microarchitecture

As illustrated in Fig. 8, Instant-NeRF's microarchitecture comprises a compute engine (in blue) and a controller (in brown).

Compute Engine: This engine computes iNGP's bottleneck steps and consists of a processing element (PE) array, a scratchpad memory, a crossbar, and hash registers for storing pre-defined parameters of the hashing function. Specifically, the PE array consists of separate (1) INT32 and (2) FP32 PE groups for the corresponding training arithmetic: INT32 PEs for index calculations via the hash mapping function and FP32 PEs for other computations. The scratchpad memory feeds input data to the PEs from the crossbar and stores the output data of the PEs. In addition, the INT32 PEs allow direct parameter access from the hash registers.

Controller: The controller has two main functionalities: (1) controlling the processing of the compute engine and (2) generating read/write commands/addresses for the memory banks. It includes an instruction FIFO, an instruction decoder, an address buffer, a compute engine control signal generator, a bank command generator, and a bank address generator. Here the instruction decoder reads instructions from the FIFO and controls the other blocks to generate proper signals to implement the required functionalities.

Fig. 8: Instant-NeRF's microarchitecture per bank.

To read memory data into the compute engine, write data to the memory banks, or load instructions into the controller, Instant-NeRF's microarchitecture adopts a commonly used design where a data transfer MUX is connected to each bank's global row buffer via a row-buffer-sized register (i.e., r0 in Fig. 8) [3].

# B. Proposed Hash Table Mapping Scheme

To ensure satisfactory throughput, multiple points (e.g., 32 in our evaluation) are processed in parallel in HT/HT<sub>b</sub>. Even with Instant-NeRF's algorithmic techniques, bank conflicts due to the random hash table lookups can still cause processing stalls. To further mitigate bank conflicts, we develop an optimized hash table mapping scheme that leverages subarray parallelism. Our mapping scheme is divided into *intra-level hash table mapping* and *inter-level hash table mapping*.

Fig. 9: The normalized number of bank conflicts.

Intra-level Hash Table Mapping: Leveraging the statistics that >50% of the bank conflicts for one hash table level are incurred by memory requests with sequential addresses, we distribute the sequential addresses across multiple subarrays. This allows these memory addresses to be requested in parallel, avoiding bank conflicts.

Inter-level Hash Table Mapping: Fig. 9 shows the normalized number of bank conflicts for the 16 hash table levels after adopting the proposed intra-level hash table mapping scheme. We can observe that the processing time of different levels is unbalanced due to the unbalanced bank conflicts. To alleviate the accelerator resource under-utilization caused by these unbalanced processing times, we further adopt inter-level hash table mapping, where Levels 0~4, Levels 5~8, and Levels 9~10 are clustered into three groups. We further distribute these three groups and the other levels to different memory banks for balancing the overall processing time.

Inter-bank data movements per step ("No" is better than "Yes"):

| Steps | Inter-bank Parallelism | Cat. 1: Parameter/Data Duplication for Parallelism | Cat. 2: Input/Output Data Transfer Between Sequential Steps | Cat. 3: Intermediate Data<br>Transfer for A Single Step | Cat. 4: Param. Gradient<br>Partial Sum Transfer |
|-------|------------------------|-----------------|-----|----|-----|
| HT | Parameter Parallelism | Yes (Data) | No | No | No |
| MLP | Data Parallelism | Yes (Parameter) | Yes | No | No |
| MLP\_b | Data Parallelism | No | No | No | Yes |
| HT\_b | Parameter Parallelism | No | Yes | No | No |

Fig. 10: An example that illustrates the proposed heterogeneous inter-bank parallelism design on 2 physical memory banks, i.e., *parameter parallelism* and *data parallelism* for HT/HT\_b and MLP/MLP\_b, respectively. Here, Cat. is short for Category, and "Yes" and "No" denote whether the corresponding step incurs inter-bank data movements or not.

# C. Proposed Heterogeneous Inter-Bank Parallelism Design

There are typically two approaches for designing inter-bank parallelism: (1) data parallelism, where each memory bank duplicates the parameters and processes different input data in parallel, and (2) parameter parallelism, where each bank keeps a part of the parameters and performs a fraction of the computations based on the same inputs duplicated across banks. Due to the limitations imposed by the I/O interface and the internal prefetch width (see Fig. 5), the memory latency for accessing data from other banks is much higher than from the local bank. Therefore, minimizing data movement sizes across different banks is critical for maximizing the overall efficiency. We classify the causes of inter-bank data movements into four categories: Category ① parameter/data duplication due to the adopted parallelism approaches, Category ② input/output data transfer between sequential steps, Category ③ intermediate data transfer within a single step, and Category ④ parameter gradient partial sum transfer for gradient accumulations. Tab.
II lists the parameter and data sizes of the bottleneck steps in iNGP training. Based on the causes of inter-bank data movements and the different data sizes of these steps, we propose a heterogeneous inter-bank parallelism design to minimize the overall inter-bank data movements: we adopt *parameter parallelism* for HT/HT\_b (i.e., distributing the multi-resolution hash table to multiple banks), and leverage *data parallelism* for MLP/MLP\_b (we denote the sequential MLP<sub>d</sub> $\rightarrow$ MLP<sub>c</sub> as MLP hereafter).

TABLE II: Parameter/data sizes for iNGP's bottleneck steps.

| Steps | Param.<sup>⊲</sup> | Input Data<sup>⋄</sup> | Output Data<sup>⋄</sup> | Intermediate Data<sup>⋄,†</sup> |
|---------|---------|-------|-------|------|
| HT | 25MB | 3MB | 16MB | 0 |
| MLP\* | 0.014MB | 16MB | 1.5MB | 32MB |
| MLP\_b\* | 0.014MB | 1.5MB | 16MB | 32MB |
| HT\_b | 25MB | 16MB | 0 | 0 |

- \*: MLP stands for applying MLP<sub>d</sub> and MLP<sub>c</sub> sequentially.
- <sup>⊲</sup>: The multiresolution hash table size and the two MLPs' weight size for HT/HT\_b and MLP/MLP\_b, respectively.
- <sup>⋄</sup>: For a batch size of 256K sampled points.
- <sup>†</sup>: The max intermediate data when doing level-by-level hash table lookups or layer-by-layer MLP processing.

**Proposed Inter-bank Parallelism Analysis.** Fig. 10 exemplifies the bottleneck steps run on an Instant-NeRF accelerator with the proposed inter-bank parallelism design. This figure demonstrates how our parallelism design minimizes the inter-bank data movements for the four categories mentioned above. First, the sizes of parameter/data duplication (Category ①) are minimized by duplicating the much smaller parameters/input data, such as the parameters in MLP (Tab. II) and the input data in HT. Second, we only need one set of data transferred between sequential steps (Category ②), e.g., the output data of HT, which is the input data of MLP. Therefore, the inter-bank movement sizes incurred in Category ② are also largely reduced.
Third, there is no intermediate data associated with Category ③. Finally, the partial sum transfer for the parameter gradient accumulations in Category ④ is now constrained to handle only those for the small MLPs, leading to reduced inter-bank gradient movement sizes.

# V. EVALUATION

## A. Evaluation Setup

Datasets: Eight datasets of Synthetic-NeRF [13].

Algorithm Baselines: The original NeRF [13] and three SOTA NeRF training methods [2], [5], [14].

Hardware Baselines: Two SOTA edge GPU baselines, XNX [17] and TX2 [15], whose specifications are shown in Tab. I.

Implementation: We implement the Instant-NeRF microarchitecture in RTL, synthesize it with Design Compiler, and design the layout using Cadence Innovus based on a commercial 28nm CMOS technology. The Instant-NeRF layout only uses 3 metal layers since a DRAM die usually has 3 metal layers. The timing and power information of the Instant-NeRF microarchitecture is derived from the post-layout simulation, and is further used to simulate the whole Instant-NeRF accelerator with DRAM.

Configuration: Tab. III summarizes the configuration. We implement the Instant-NeRF accelerator using one DRAM die.

Evaluation Methodology: We build a cycle-accurate simulator extended from Ramulator [8] to derive the timing and power results.

TABLE III: Instant-NeRF's accelerator parameters.

| DRAM Configuration [17], [18] | | | |
|--------------|------|-------------|---------------|
| Timing | LPDDR4-2400<br>LCL-tRCD-tRPpb: 4-4-6<br>tRAS=9, tCCD=8, tRRD=2, tRCD=4<br>tFAW=9, tWR=6, tRA=2\*, tWA=7\* | | |
| Organization | 16GB total capacity<br>128-bit I/O interface, 16-bit I/O interface/channel<br>8 channels, 1 rank/channel<br>1 chip/rank, 16 physical banks/chip<br>1-2-4-8-16-32-64 subarrays/bank\*<br>1KB local\*/global row buffer | | |
| **Instant-NeRF Microarchitecture Configuration per Bank** | | | |
| Tech. | 28nm | Frequency | 200 MHz |
| Scratchpad Memory | 2IZD \* /KB | Computation | 256×INT32 PEs<br>256×FP32 PEs |

<sup>\*</sup>: Parameters for subarray parallelism.

### B. Algorithm Evaluation

To verify the performance of our Instant-NeRF algorithm, we compare the PSNR scores (the higher the better) of SOTA efficient NeRF training algorithms and ours in Tab. IV. On average, our Instant-NeRF algorithm achieves $0.76 \sim 2.86$ higher PSNR than the baselines other than iNGP.

TABLE IV: Benchmark our proposed Instant-NeRF algorithm and SOTA efficient NeRF algorithms in terms of the PSNR [6] (a higher value represents better rendering quality).

| Methods | Avg. | Chair | Drums | Ficus | Hotdog | Lego | Materials | Mic | Ship |
|--------------|-------|-------|-------|-------|--------|-------|-----------|-------|-------|
| NeRF [13] | 31.01 | 33.00 | 25.01 | 30.13 | 36.18 | 32.54 | 29.62 | 32.91 | 28.65 |
| FastNeRF [5] | 29.90 | 32.32 | 23.74 | 27.79 | 34.72 | 32.27 | 28.88 | 31.76 | 27.68 |
| TensoRF [2] | 32.00 | 34.68 | 25.37 | 32.30 | 36.30 | 35.42 | 29.30 | 33.21 | 29.46 |
| iNGP [14] | 32.99 | 34.75 | 25.81 | 33.28 | 37.31 | 36.27 | 29.51 | 36.14 | 30.89 |
| Ours | 32.76 | 34.47 | 25.69 | 33.12 | 37.06 | 35.94 | 29.33 | 35.86 | 30.61 |
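
For readers less familiar with the metric, the sketch below states the standard PSNR definition and recomputes the average gaps between our algorithm and each baseline from the "Avg." column of Tab. IV. The function and variable names are illustrative assumptions, and pixel values are assumed to be normalized to [0, 1].

```python
import math


def psnr(mse: float, max_value: float = 1.0) -> float:
    """Standard PSNR in dB for a mean squared error `mse` and peak value `max_value`."""
    return 10.0 * math.log10(max_value ** 2 / mse)


# "Avg." column of Tab. IV (PSNR in dB).
avg_psnr = {
    "NeRF": 31.01,
    "FastNeRF": 29.90,
    "TensoRF": 32.00,
    "iNGP": 32.99,
    "Ours": 32.76,
}

# Gap of the proposed algorithm to each baseline (positive means ours is higher).
gaps = {name: round(avg_psnr["Ours"] - value, 2)
        for name, value in avg_psnr.items() if name != "Ours"}
for name, gap in gaps.items():
    print(f"Ours vs. {name}: {gap:+.2f} dB")

# A PSNR gap of g dB corresponds to an MSE ratio of 10 ** (g / 10).
mse_ratio_vs_ingp = 10 ** (-gaps["iNGP"] / 10)
print(f"MSE ratio (ours / iNGP): {mse_ratio_vs_ingp:.3f}")
```

Read this way, the gap to iNGP in Tab. IV corresponds to roughly 5% higher mean squared error.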

Compared with iNGP, our proposed algorithm only degrades the average PSNR by 0.23. In return, our algorithm boosts the training efficiency by $1.15 \times$ on a commercial 2080Ti GPU [16].

### C. Hardware Evaluation

Area and Power: The area of one Instant-NeRF microarchitecture is $3.6\,\text{mm}^2$, which is only 1.5% of one DRAM bank area [18]. The power of one Instant-NeRF microarchitecture is 596.3mW.

Speedup: Fig. 11(a) presents the training time improvement achieved by the proposed Instant-NeRF accelerator in comparison with the two SOTA edge GPU baselines, i.e., TX2 [15] and XNX [17], on the eight datasets [13]. Our proposed Instant-NeRF accelerator offers $109.5 \times \sim 266.1 \times$ and $22.0 \times \sim 49.3 \times$ speedup over TX2 [15] and XNX [17], respectively.

Energy Efficiency: Fig. 11(b) presents the energy efficiency improvements. The proposed Instant-NeRF accelerator provides $172.9 \times \sim 420.3 \times$ and $46.4 \times \sim 103.7 \times$ energy efficiency improvement over TX2 [15] and XNX [17], respectively.

Fig. 11: The normalized (a) speedup and (b) energy efficiency (over TX2 GPU [15]) achieved by the Instant-NeRF accelerator.

# VI. RELATED WORKS

Near-Memory Processing. Prior studies have utilized NMP architectures to accelerate general hash table lookups [20] and MLP workloads [19]. Our Instant-NeRF differs from prior works in that we propose an algorithm-hardware co-designed NMP framework tailored for iNGP's unique multi-resolution hash table lookups and enable dedicated inter-bank parallelism to support iNGP's heterogeneous steps, including both hash table lookups and MLPs.

### VII. CONCLUSION

We propose Instant-NeRF, the first NMP framework for enabling instant on-device NeRF training through dedicated algorithm-accelerator co-design. Extensive experiments on eight datasets verify that Instant-NeRF provides $22.0 \times \sim 266.1 \times$ speedup over SOTA edge GPUs while maintaining the rendering quality.
### ACKNOWLEDGEMENT

This work was supported by the National Science Foundation (NSF) SCH program (Award number: 1838873) and CoCoSys, one of the seven centers in JUMP 2.0, a Semiconductor Research Corporation (SRC) program sponsored by DARPA.

### REFERENCES

- [1] B. Asgari et al., "Fafnir: Accelerating sparse gathering by using efficient near-memory intelligent reduction," in 27th HPCA, 2021.
- [2] A. Chen et al., "TensoRF: Tensorial Radiance Fields," arXiv preprint arXiv:2203.09517, 2022.
- [3] A. Devic et al., "To pim or not for emerging general purpose processing in ddr memory systems," in 49th ISCA, 2022.
- [4] C. Ericson, Ed., Real-Time Collision Detection, 1st ed. CRC Press, 2004, ch. 7, pp. 316–318.
- [5] S. J. Garbin et al., "Fastnerf: High-fidelity neural rendering at 200fps," in ICCV 2021, 2021, pp. 14346–14355.
- [6] A. Hore et al., "Image quality metrics: PSNR vs. SSIM," in 20th ICPR. IEEE, 2010, pp. 2366–2369.
- [7] Y. Kim et al., "A case for exploiting subarray-level parallelism (SALP) in DRAM," in 39th ISCA, 2012.
- [8] Y. Kim et al., "Ramulator: A fast and extensible dram simulator," IEEE Computer architecture letters, vol. 15, no. 1, pp. 45–49, 2015.
- [9] Y. Kwon et al., "Tensordimm: A practical near-memory processing architecture for embeddings and tensor operations in deep learning," in Proceedings of the 52nd MICRO, 2019, pp. 740–753.
- [10] N. Max, "Optical models for direct volume rendering," *IEEE Transactions on Visualization and Computer Graphics*, vol. 1, no. 2, pp. 99–108, 1995.
- [11] Meta, "Introducing Horizon Workrooms: Remote Collaboration Reimagined," https://about.fb.com/news/2021/08/introducing-horizon-workrooms-remote-collaboration-reimagined/, 2021-08-01.
- [12] Meta, "Meta Quest Pro," 2022, www.meta.com/quest/quest-pro/, 2022-11-01.
- [13] B. Mildenhall et al., "Nerf: Representing scenes as neural radiance fields for view synthesis," in ECCV 2020. Springer, 2020, pp. 405–421.
- [14] T. Müller et al., "Instant neural graphics primitives with a multiresolution hash encoding," in SIGGRAPH 2022, vol. 41, no. 4, Jul. 2022.
- [15] NVIDIA, "NVIDIA Jetson TX2," 2020, www.nvidia.com/en-us/autonomous-machines/embedded-systems/jetson-tx2/.
- [16] NVIDIA, "GeForce RTX 2080 TI Graphics Card," 2022, www.nvidia.com/en-me/geforce/graphics-cards/rtx-2080-ti/.
- [17] NVIDIA, "Jetson Xavier NX Series 16GB," 2022, www.nvidia.com/en-us/autonomous-machines/embedded-systems/jetson-xavier-nx/.
- [18] T.-Y. Oh et al., "A 3.2 Gbps/pin 8 Gbit 1.0 V LPDDR4 SDRAM with integrated ECC engine for sub-1 V DRAM core operation," *IEEE JSSC*, vol. 50, no. 1, pp. 178–190, 2014.
- [19] H. Shin et al., "Mcdram: Low latency and energy-efficient matrix computations in dram," IEEE TCAD, pp. 2613–2622, 2018.
- [20] S. F. Yitbarek et al., "Exploring specialized near-memory processing for data intensive operations," in 19th DATE. IEEE, 2016, pp. 1449–1452.