# Dataflow-Architecture Co-Design for 2.5D DNN Accelerators using Wireless Network-on-Package

Robert Guirado\* rguirado@ac.upc.edu Universitat Politècnica de Catalunya Barcelona, Spain Hyoukjun Kwon\* hyoukjun@gatech.edu Georgia Institute of Technology Atlanta, GA, USA Sergi Abadal abadal@ac.upc.edu Universitat Politècnica de Catalunya Barcelona, Spain

Eduard Alarcón eduard.alarcon@upc.edu Universitat Politècnica de Catalunya Barcelona, Spain

## **Abstract**

Deep neural network (DNN) models continue to grow in size and complexity, demanding higher computational power to enable real-time inference. To efficiently deliver such computational demands, hardware accelerators are being developed and deployed across scales. This naturally requires an efficient scale-out mechanism for increasing compute density as required by the application. 2.5D integration over interposer has emerged as a promising solution, but as we show in this work, the limited interposer bandwidth and multiple hops in the Network-on-Package (NoP) can diminish the benefits of the approach. To cope with this challenge, we propose WIENNA, a wireless NoP-based 2.5D DNN accelerator. In WIENNA, the wireless NoP connects an array of DNN accelerator chiplets to the global buffer chiplet, providing high-bandwidth multicasting capabilities. Here, we also identify the dataflow style that most efficiently exploits the wireless NoP's high-bandwidth multicasting capability on each layer. With modest area and power overheads, WIENNA achieves 2.2X-5.1X higher throughput and 38.2% lower energy than an interposer-based NoP design.

## 1 Introduction

Deep Neural Networks (DNNs) are currently able to solve a myriad of tasks with superhuman accuracy [13, 20]. To achieve these outstanding results, DNNs have become larger and deeper, reaching up to billions of parameters. Due to the enormous amount of calculations, the hardware running DNN inference has to be extremely energy efficient, a goal that CPUs and even GPUs do not live up to easily. This has led to the rapid development of specialized hardware.

Research in DNN accelerators [6, 17] is a bustling topic. DNNs exhibit plenty of data reuse and parallelism opportunities that can be exploited via custom memory hierarchies and a sea of processing elements (PEs), respectively. However, as DNN models continue to scale, the compute capabilities of DNN accelerators need to scale as well.

Tushar Krishna tushar@ece.gatech.edu Georgia Institute of Technology Atlanta, GA, USA

At a fundamental level, DNN accelerators can either be scaled up (i.e., adding more PEs) or scaled out (i.e., connecting multiple accelerator chips together). There is a natural limit to scale-up due to cost and yield issues. Scale-out is typically done via board-level integration, which comes with the overheads of high communication latency and low bandwidth. However, the recent appeal of 2.5D integration of multiple *chiplets* interconnected via an interposer on the same package [15] (e.g., AMD's Ryzen CPU) offers opportunities for enabling efficient scale-out. This has proven effective in the DNN accelerator domain via recent works [22, 30].

Unfortunately, as we identify in this work, DNN accelerator chiplets have an insatiable appetite for data to keep the PEs utilized, which is a challenge for interposers due to their limited bandwidth. This limitation originates from the large microbumps ( $\sim 40 \mu m$  [15]) at the I/O ports of each chiplet, which naturally reduce the bandwidth by orders of magnitude compared to the nm-pitch wires within the chiplet. The limited I/O bandwidth also means that chiplets are typically connected only to their neighbours, so the average number of intermediate chiplets (hops) between source and destination increases with the chiplet count. This introduces significant delay to data communication.

To address these challenges, we explore, for the first time, the opportunities that integrated wireless technology [2, 5, 10, 27] provides for data distribution in 2.5D DNN accelerators. We show that wireless links can (i) provide higher bandwidth than electrical interposers and (ii) naturally support broadcast. We then identify dataflow styles that use these features to exploit the reuse opportunities across accelerator chiplets.

Our main contribution is WIENNA (WIreless-Enabled communications in Neural Network Accelerators), a 2.5D on-package scale-out accelerator architecture that leverages single-cycle wireless broadcasts to distribute weights and inputs, depending on the partitioning (i.e., dataflow) strategy. We implement three partitioning strategies (batch, filter, and activation) for the scale-out architecture, leveraging the parallelism in DNNs across these three dimensions. We evaluate an instance of WIENNA with 256 NVDLA [1]-like accelerator chiplets, and observe a 2.5-4.4× throughput improvement and an average 38.2% energy reduction over a baseline 2.5D accelerator that uses an electrical-only interposer.

<sup>\*</sup>Both authors contributed equally to this research.

## 2 Background and Related Work

**DNN Accelerators.** A DNN accelerator is specialized hardware for running DNN algorithms, which provides higher throughput and energy-efficiency than GPUs and CPUs via parallelism and dedicated circuitry. The abstract architecture of most DNN accelerators [6, 17, 18] consists of an off-chip memory, a global shared memory, and an array of PEs connected via a Network-on-Chip (NoC).

When mapping the targeted DNN into the PEs, one can apply different strategies or *dataflows*. Dataflows consist of three key components: loop ordering, tiling, and parallelization, which define how the dimensions are distributed across the PEs to run in parallel. The dataflow space is huge and its exploration is an active research topic [11, 12, 16, 28], as it directly determines the amount of data reuse and movement.
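As a minimal sketch of the parallelization component of a dataflow, the following pure-Python snippet splits the filter (output-channel) dimension of a small convolution across a hypothetical set of chiplets and checks that the concatenated results match the unpartitioned layer. The function names, tensor sizes, and chiplet count are illustrative assumptions and are not taken from any accelerator toolchain.

```python
def conv2d(inputs, weights):
    """Direct convolution (stride 1, no padding).

    inputs:  nested list indexed [c][y][x]
    weights: nested list indexed [k][c][r][s]
    returns: nested list indexed [k][y][x]
    """
    C, Y, X = len(inputs), len(inputs[0]), len(inputs[0][0])
    R, S = len(weights[0][0]), len(weights[0][0][0])
    return [
        [
            [
                sum(inputs[c][y + r][x + s] * w[c][r][s]
                    for c in range(C) for r in range(R) for s in range(S))
                for x in range(X - S + 1)
            ]
            for y in range(Y - R + 1)
        ]
        for w in weights  # one output channel per filter
    ]


def filter_partitioned_conv2d(inputs, weights, num_chiplets):
    """Hypothetical filter partitioning: each chiplet gets its own contiguous
    slice of filters, while every chiplet sees the same (shared) inputs."""
    per_chiplet = -(-len(weights) // num_chiplets)  # ceiling division
    outputs = []
    for start in range(0, len(weights), per_chiplet):
        chiplet_weights = weights[start:start + per_chiplet]  # partitioned tensor
        outputs.extend(conv2d(inputs, chiplet_weights))       # replicated tensor
    return outputs


# Small integer tensors so the comparison is exact (sizes are arbitrary).
C, Y, X, K, R, S = 2, 5, 5, 4, 3, 3
inputs = [[[(c + 2 * y + 3 * x) % 5 for x in range(X)]
           for y in range(Y)] for c in range(C)]
weights = [[[[(k + c + r + s) % 3 - 1 for s in range(S)]
             for r in range(R)] for c in range(C)] for k in range(K)]

reference = conv2d(inputs, weights)
partitioned = filter_partitioned_conv2d(inputs, weights, num_chiplets=2)
print(partitioned == reference)
```

In this sketch the inputs are the replicated tensor and the filters are the partitioned tensor, so how the parallel dimension is chosen directly determines which data must be shared among compute units.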

DNN accelerators involve three phases of communication [17] handled by the NoC. (1) *Distribution* is a few-to-many flow that involves sending inputs and weights to the PEs via unicasts, multicasts, or broadcasts, depending on the partitioning strategy of the dataflow. (2) *Local* forwarding involves data reuse within the PE array via inter-PE input/weight/partial-sum forwarding. (3) *Collection* involves writing the final outputs back to the global memory.

**2.5D Integration.** 2.5D systems refer to the integration of multiple discrete *chiplets* within a package, either over a silicon interposer [15] or other technologies. 2.5D integration is promising due to higher yield (smaller chiplets have better yields than large monolithic dies), reusability (chipletization can enable plug-and-play designs, enabling IP reusability), and cost (it is more cost-effective to integrate multiple chiplets than design a complex monolithic die). It has been shown to be effective for DNN accelerators [22, 30].

A silicon interposer is effectively a large chip upon which other smaller chiplets can be stacked. Thus, wires on the interposer can operate at the same latency as on-chip wires. However, the micro-bumps used to connect the chiplet to the interposer are  ${\sim}40\,\mu m$  in state-of-the-art TSMC processes [15]. This is much finer than board-level C4 bumps ( ${\sim}180\,\mu m$ ), but much coarser than the wire pitch of NoCs. This limits the number of microbumps that fit over the chiplet area and, thus, limits the effective inter-chiplet bandwidth.

In a scale-out DNN accelerator, limited interposer bandwidth slows down reads and writes from the global SRAM. While collection (i.e., write) can be hidden behind compute delay, distribution (i.e., read) is in the critical path [22]. Provisioning wires to enable fast distribution can be prohibitive and, even then, multiple hops are unavoidable. Enhancing this bandwidth is the aim of this work.



**Figure 1.** Transceiver area and power as functions of the datarate for [24, 26, 27] and references therein. Power is normalized to transmission range and to  $10^{-9}$  error rate. Energy per bit can be obtained dividing the power by the datarate.

**Wireless Network-on-Package.** Fully integrated on-chip antennas [5] and transceivers (TRXs) [26, 27] have appeared recently, enabling short-range wireless communication at up to 100 Gb/s with low resource overheads. Fig. 1 shows how the area and power scale with the datarate, based on the analysis of 70+ short-range TRXs with different modulations and technologies [24, 26, 27].

Wireless NoPs are among the applications for this technology. In a wireless NoP, processor or memory chiplets are provided with antennas and TRXs that are used to communicate within the chiplet, or to other chiplets, using the system package as the propagation medium. This in-package channel is static and controlled and, thus, it can be optimized. In [25], it is shown that system-wide attenuation below 30 dB is achievable. These figures are compatible with the 65-nm CMOS TRX specs from [27]: 48 Gb/s, 1.95 pJ/bit at 25mm distance with error rates below  $10^{-12}$ , and  $0.8 \text{ mm}^2$  of area.
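As a quick worked conversion, and following the relation noted in the caption of Fig. 1 (energy per bit equals power divided by datarate), the quoted TRX figures imply a power draw of roughly 94 mW. The helper name below is an illustrative assumption.

```python
def trx_power_mw(datarate_gbps, energy_pj_per_bit):
    """Power implied by a datarate and an energy per bit.

    Gb/s * pJ/bit = (1e9 bit/s) * (1e-12 J/bit) = 1e-3 W, i.e. the
    product is already expressed in milliwatts.
    """
    return datarate_gbps * energy_pj_per_bit


# Figures quoted in the text for the 65-nm CMOS TRX of [27].
power_mw = trx_power_mw(datarate_gbps=48, energy_pj_per_bit=1.95)
print(f"{power_mw:.1f} mW")
```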

By not needing to lay down wires between TRXs, wireless NoP offers scalable broadcast support and low latency across the system. It is scalable because additional chiplets only need to incorporate a TRX, and not extra off-chip wiring, to participate in the communication. The bandwidth is high because it is not limited by I/O pin constraints, and latency is low as transfers bypass intermediate hops. Wireless NoP also allows the topology to be changed dynamically via reconfigurable medium access control or network protocols [21], as receivers can decide at run-time whether to process incoming transfers.

## 3 Motivation

To determine the design requirements for a 2.5D fabric for accelerator scale-out, we assess the impact of the data distribution bandwidth on the overall system throughput. Sec. 5.1 details the simulation methodology and system parameters.



**Figure 2.** Three tensor partitioning strategies across chiplets (a,b,c). Cp refers to chiplet. Based on these partitionings, we construct three strategies as shown in (d). The replicated tensors are broadcast, while the partitioned tensors are unicast using the distribution network.



**Figure 3.** The impact of bandwidth on throughput. We analyze a classification network, Resnet50 [13], and a segmentation network, UNet [20] varying partitioning strategies. High-res and low-res layers indicate layers with larger/smaller activation height/width compared to the number of channels. Residual, FC, and Up-Conv indicate residual links, fully-connected layer, and up-scale convolutions.

| Layer       | Description                                                     |
|-------------|-----------------------------------------------------------------|
| High-res    | CONV2D layer with fewer channels than width of input activation |
| Low-res     | CONV2D layer with more channels than width of input activation  |
| Residual    | Skip connections [13]                                           |
| Fully-conn. | GEMM layer present in CNNs, MLPs, RNNs, and so on               |
| UpCONV      | Variant of CONV2D that increases the resolution of activation   |

**Table 1.** Layer types.

We model a baseline 2.5D accelerator with 256 chiplets connected via a Mesh NoP. A global SRAM interfaces with DRAM on one end, and performs data distribution to the chiplets on the other. We implement three partitioning strategies [11], as shown in Fig. 2. Each chiplet is a 64-PE accelerator, implementing a dataflow optimized for that partitioning strategy. We sweep the global SRAM read bandwidth and plot the observed throughput in Fig. 3. We run two state-of-the-art DNNs, ResNet [13] and UNet [20], for image classification and image segmentation workloads, respectively. Although we focus on two CNNs, they include a variety of layer operations and shapes, which form a representative set of modern DNN layers and significantly affect performance and energy [16].

We categorize layer types based on their operations and shapes in Table 1 and, for each layer type, we plot the impact of bandwidth across the three partitioning strategies.

**Observation I: Different layer types favor different partition strategies.** For instance, the high-resolution layers (i.e., input dim > channel dim) favor activation partitioning (i.e., YP-XP) across chiplets, where both inputs and weights can be broadcast. Meanwhile, low-resolution layers and fully connected layers do not exhibit sufficient parallelism in activations, and favor filter partitioning (i.e., KP-CP) instead.

**Observation II: Different layer types saturate to peak throughput at different bandwidth values.** High-res layers with YP-XP saturate to the peak best-case throughput of 16K MACs/cycle at 64 Bytes/cycle (i.e., 64 unique inputs or weights delivered per cycle across the 256 chiplets) because broadcasting inputs and weights effectively amplifies the bandwidth. Low-res layers in ResNet saturate to 8K MACs/cycle with KP-CP beyond 128 Bytes/cycle.
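The saturation figures above can be related with some simple arithmetic. The sketch below assumes one MAC per PE per cycle and the 500 MHz clock listed in the evaluation settings (Table 4); both are assumptions for this illustration, as is every name in the code.

```python
NUM_CHIPLETS = 256       # baseline system in this section
PES_PER_CHIPLET = 64     # each chiplet is a 64-PE accelerator
CLOCK_HZ = 500e6         # assumed: clock frequency from Table 4

# Peak throughput, assuming every PE completes one MAC per cycle.
peak_macs_per_cycle = NUM_CHIPLETS * PES_PER_CHIPLET


def bandwidth_gbps(bytes_per_cycle, clock_hz=CLOCK_HZ):
    """Convert a distribution bandwidth in Bytes/cycle to Gb/s."""
    return bytes_per_cycle * 8 * clock_hz / 1e9


def required_reuse(macs_per_cycle, bytes_per_cycle):
    """Average number of MACs each delivered byte must feed to sustain
    the given throughput (a proxy for the reuse the dataflow must provide)."""
    return macs_per_cycle / bytes_per_cycle


print(peak_macs_per_cycle)                      # 16384, i.e. "16K"
print(bandwidth_gbps(64))                       # Gb/s at 64 Bytes/cycle
print(required_reuse(peak_macs_per_cycle, 64))  # MACs per delivered byte
```

Under these assumptions, sustaining the peak at 64 Bytes/cycle requires each delivered byte to feed 256 MACs on average, which is why broadcast-driven reuse matters so much for the distribution network.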

**Takeaways.** Three takeaways from the above observations are that (i) the communication fabric for data distribution plays a key role in performance, (ii) broadcast support and high bandwidth are critical for scalability, and (iii) supporting adaptive partitioning strategies for each layer, rather than picking a fixed one for all layers, is crucial for performance.

**Challenges with Electrical NoPs.** As described in Sec. 2, bandwidth is the Achilles heel of interposer wires because it is limited by the microbump size. For example, based on the 55-μm microbump size in one of the latest technologies [14], only 21 wires can be placed along an edge of an accelerator [17] chiplet with 256 PEs. According to that technology [14], those 21 wires provide 42 Gbps of bandwidth, 12.95× lower than the on-chip bandwidth in a chiplet from a recent work [22]. Table 2 compares these figures with other technologies.
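The wire-count example can be unpacked with a short calculation. The snippet only rearranges the numbers quoted in the text; the implied on-chip bandwidth is derived from the stated 12.95× ratio and is not a figure reported separately.

```python
WIRES_PER_EDGE = 21        # wires that fit along one chiplet edge
AGGREGATE_GBPS = 42.0      # bandwidth those wires provide
ON_CHIP_RATIO = 12.95      # on-chip bandwidth / interposer bandwidth

# Per-wire signalling rate implied by the two figures above.
per_wire_gbps = AGGREGATE_GBPS / WIRES_PER_EDGE

# On-chip bandwidth implied by the quoted ratio (derived, not reported).
implied_on_chip_gbps = AGGREGATE_GBPS * ON_CHIP_RATIO

print(f"{per_wire_gbps:.1f} Gb/s per wire")
print(f"{implied_on_chip_gbps:.1f} Gb/s implied on-chip bandwidth")
```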

Moreover, it is hard to design broadcast fabrics connecting hundreds of chiplets; broadcast will have to be supported via point-to-point forwarding, requiring multiple hops to deliver data to all chiplets, adding significant latency. This also has a subtle side-effect in terms of synchronization since different chiplets will receive data at different times.

| Technology              | Node<br>(nm) | BWD            | Energy<br>(pJ/bit) | LL<br>(mm) | Avg #<br>of Hops |
|-------------------------|--------------|----------------|--------------------|------------|------------------|
| Silicon Interposer [8]  | 45           | 450            | 5.3                | 40         | $O(\sqrt{N_C})$  |
| Silicon Interposer [22] | 16           | 80             | 0.82-1.75          | 6.5*       | $O(\sqrt{N_C})$  |
| EMIB (AIB) [14]         | 14           | 36.4           | 0.85               | 3          | $O(\sqrt{N_C})$  |
| Optical Interposer [29] | 40           | 8000           | 4.23               | N/A        | $O(\sqrt{N_C})$  |
| Wireless (unicast)**    | 65           | 26.5           | 4.01               | 40         | 1                |
| Wireless (broadcast)**  | 65           | $64\sqrt{N_C}$ | $1.4N_C$           | 40         | 1                |

**Table 2.** 2.5D interconnect technologies. BWD refers to bandwidth density in Gbps/mm.  $N_C$  represents the number of chiplets. LL refers to link length in mm. \*Estimated based on package and chiplet dimensions. \*\*Estimated based on Fig. 1.



**Figure 4.** Average per-bit energy of a multicast transmission in a silicon interposer with direct connections, mesh NoP with multicast support, and wireless NoP for two BER values.

**Promise of Wireless NoPs.** As discussed in Sec. 2, wireless NoPs are promising for 2.5D accelerators due to their broadcast support, independence from I/O pitch constraints, reconfigurability, and single-hop communication.

The single-hop communication is a key benefit in scale-out designs because the number of hops, which is a multiplier to latency and energy for interposer NoPs, increases with the communication fanout and the number of chiplets. Therefore, although some technologies provide higher bandwidth and lower energy per bit for a single hop, as the number of chiplets or broadcast transmissions increases, the efficiency of wireless NoP surpasses that of other technologies. This is illustrated in Table 2 and Fig. 4. This presents a co-design opportunity to design dataflows with multicast to leverage wireless, as we describe for our 2.5D accelerator next.

## 4 WIENNA Architecture

Fig. 5 illustrates the WIENNA architecture. In essence, WIENNA contains a High Bandwidth Memory (HBM) that feeds a global SRAM memory chiplet, which is in turn connected to an array of accelerator chiplets. Each accelerator chiplet contains a local memory and an array of PEs, each of which is composed of a multiplier, an adder, and buffers that store inputs, weights, and outputs momentarily.

WIENNA implements a two-level hierarchy. On the one hand, the HBM, the global SRAM, and the array of chiplets follow a 2.5D integration scheme and are interconnected by means of a hybrid wired/wireless NoP. In the NoP, the wireless side is used for data distribution and the wired side for data collection. On the other hand, each chiplet implements its own internal microarchitecture. In this work, we use NVDLA [1] and Shidiannao [9] style accelerators depending on the chosen workload partitioning strategy. From a logical perspective, WIENNA's architecture allows us to partition the DNN dimensions across the chiplets via multiple mechanisms (see Fig. 2). It also allows adaptive switching between these strategies for every layer of the DNN, building upon the reconfigurability of the wireless NoP.

We walk through a brief example in Fig. 6, showing how the KP-CP partition in Fig. 2(a) runs on WIENNA. In the example, the partitioned filters are first unicast ( $t_{0\_0}$ ) to each chiplet, exploiting the wide bandwidth of the wireless network. Then, inputs are streamed by broadcasting them one by one ( $t_{0\_1}$ ), exploiting the low-latency broadcasting of the wireless network. Next ( $t_{0\_2}$ ), each chiplet internally distributes the inputs and weights following the intra-chiplet dataflow and computes the output activation. Finally ( $t_{0\_3}$ ), WIENNA utilizes the wired network to collect outputs from each chiplet.

**Interconnection Network.** WIENNA uses a hybrid wired-wireless NoP. The wireless plane is *only* used to distribute data from the SRAM to the chiplets, whereas the wired plane is used to collect the processed data back to the global SRAM.

The wireless network is asymmetric (i.e., only distributes data) to keep it simple. If both distributions and reductions were to be performed through the wireless plane, full TRXs would be needed at each chiplet. Instead, WIENNA only requires a single TX located at the global SRAM and one RX per chiplet. This avoids collisions completely, thus eliminating the need for a wireless arbiter and rendering flow and congestion control trivial because distributions are scheduled beforehand. The size and power of the TX and RX will depend on the required bandwidth (see Fig. 1). Finally, note that TSV-based vertical monopoles [25] are assumed at both transmitting and receiving ends, as the use of such antennas reduces the losses at the chip plane.

In addition to the wireless plane, the chiplets are also connected via a wired NoP through the interposer for output collection. To combat pin limitations and wiring complexity, a mesh NoP is assumed [22, 30]. We consider two design points with different bandwidths, as listed in Table 4, to account for conservative and aggressive baselines of comparison. In WIENNA, the wired NoP is only used for the collection phase. In the baseline, it is used for both distribution and collection.

In summary, WIENNA's key feature is the proposal of an architecturally simple, but very powerful wireless NoP for low-cost, low-latency, and high-bandwidth data distribution from memory to the chiplets via unicast/broadcast. WIENNA thus enables 2.5D chiplet scale-out. Note that, while we evaluate a homogeneous chiplet array in this work, WIENNA makes no assumptions about the chiplet architecture and can thus accommodate heterogeneous combinations of chiplets with different architectures and networks.

**Figure 5.** Overview of the WIENNA architecture.

**Figure 6.** WIENNA timeline for a filter partitioning example.

**Area and Power Overheads.** To assess the implementation overhead of WIENNA, Table 3 shows the estimated area and power of an example WIENNA system with 256 chiplets and 64 PEs per chiplet at 65-nm CMOS. We observe that the area overhead of a wireless RX is 16% of a chiplet, which can be decreased by employing a larger chiplet. Although the wireless RX consumes 25% of each chiplet's power, the delay benefits discussed in Sec. 5.2, which are up to 5.1×, compensate for the power and eventually provide energy benefits (an average reduction of 38.2%). Therefore, the overhead of the WIENNA system is acceptable considering the benefits.
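The overhead percentages can be cross-checked against the per-chiplet entries of Table 3. This is a minimal sketch using the rounded table values, so the recomputed shares only match the reported ones to within rounding; the dictionary keys are illustrative names.

```python
NUM_CHIPLETS = 256

# Per-chiplet entries from Table 3 (rounded values as printed there).
chiplet_area_mm2 = {"pes_and_mem": 5.0, "wireless_rx": 1.0, "collection_router": 0.43}
chiplet_power_mw = {"pes_and_mem": 90.0, "wireless_rx": 90.0, "collection_router": 170.0}

area_per_chiplet = sum(chiplet_area_mm2.values())    # mm^2
power_per_chiplet = sum(chiplet_power_mw.values())   # mW

# Share of a chiplet taken by the wireless receiver.
rx_area_share = chiplet_area_mm2["wireless_rx"] / area_per_chiplet
rx_power_share = chiplet_power_mw["wireless_rx"] / power_per_chiplet

print(f"all chiplets: {NUM_CHIPLETS * area_per_chiplet:.0f} mm^2, "
      f"{NUM_CHIPLETS * power_per_chiplet:.0f} mW")
print(f"RX share: {100 * rx_area_share:.1f}% area, {100 * rx_power_share:.1f}% power")
```

With the rounded entries, the RX accounts for about 15.6% of the chiplet area and about 25.7% of its power, in line with the 16% and 25% quoted above.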

## 5 Evaluation

## 5.1 Methodology

We list the hardware parameters, workloads, partitioning strategies, and NoP characteristics in Table 4. To compute the throughput, we use an open-source DNN accelerator cost model, MAESTRO [16], which is validated with an average accuracy of 96.1% against RTL simulation and measurements. MAESTRO takes into consideration the latency, bandwidth, and multicasting characteristics of the different NoPs. To compute the energy of an electrical NoP, we multiply the average number of hops by the per-hop energy from Table 2. To estimate the energy of the wireless NoP, we select conservative (C) and aggressive (A) design points from Fig. 1 at the required transmission rates. Note that Fig. 1 assumes a single transmitter and receiver with a 50%/50% ratio, but this is actually a design choice. This allows us to model the energy of both unicasts, where only the required receiver is active while the others remain powered off to save energy, and broadcasts (multicasts), where all receivers (a set of receivers) are active.
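The electrical-NoP energy estimate described above can be sketched in a few lines, using the mesh hop count from Table 4 ($\sqrt{N_{chiplets}}/2$) and the per-hop energies from Table 2. The function names are illustrative assumptions, and the sketch covers a single transfer to an average-distance destination only.

```python
import math


def mesh_avg_hops(num_chiplets):
    """Average hop count of the mesh NoP, as listed in Table 4."""
    return math.sqrt(num_chiplets) / 2


def electrical_energy_pj_per_bit(num_chiplets, per_hop_pj_per_bit):
    """Energy per bit of an electrical NoP transfer: average number of
    hops multiplied by the per-hop energy."""
    return mesh_avg_hops(num_chiplets) * per_hop_pj_per_bit


# Per-hop energies taken from Table 2 (pJ/bit).
PER_HOP = {"interposer_16nm_low": 0.82, "interposer_16nm_high": 1.75, "emib_aib": 0.85}

for chiplets in (64, 256, 1024):
    hops = mesh_avg_hops(chiplets)
    energies = {name: round(electrical_energy_pj_per_bit(chiplets, e), 2)
                for name, e in PER_HOP.items()}
    print(chiplets, hops, energies)
```

The hop count, and with it the per-bit energy, grows with the square root of the chiplet count, whereas Table 2 lists a single hop for the wireless NoP regardless of system size.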

## 5.2 Results

We compare the performance of the interposer and WIENNA accelerators in Fig. 7 and Fig. 8. The energy is compared in Fig. 9, and the cause of the differences is illustrated in Fig. 10.

| Component / sub-element | Area (mm<sup>2</sup>) | Area (% of system) | Area (% of component) | Power (mW) | Power (% of system) | Power (% of component) |
|-------------------------|-----------------------|--------------------|-----------------------|------------|---------------------|------------------------|
| Chiplets (256×)         | 1646                  | 97                 |                       | 89600      | 89                  |                        |
| PEs (64×) + Mem         | 5                     |                    | 78                    | 90         |                     | 26                     |
| Wireless RX             | 1                     |                    | 16                    | 90         |                     | 25                     |
| Collection NoP Router   | 0.43                  |                    | 6                     | 170        |                     | 49                     |
| Memory (1×)             | 53                    | 3                  |                       | 10167      | 11                  |                        |
| Global SRAM             | 51                    |                    | 96                    | 10000      |                     | 99                     |
| Wireless TX             | 2                     |                    | 4                     | 167        |                     | 1                      |
| Total                   | 1699                  | 100                |                       | 99767      | 100                 |                        |

**Table 3.** WIENNA area and power breakdown for 256 chiplets, each with 64 PEs (16K MACs). The global SRAM is 13MB. The PE and SRAM data are based on Eyeriss [6]. Wireless TX and RX are estimated from Fig. 1, based on a 10<sup>-9</sup> BER. All data at 65-nm CMOS.

| Parameter               | Value                                                  |
|-------------------------|--------------------------------------------------------|
| Total Number of PEs     | 16384                                                  |
| Global SRAM Size        | 13 MiB                                                 |
| Clock Frequency         | 500 MHz                                                |
| Number of Chiplets      | 32-1024                                                |
| PEs per Chiplet         | 64-512                                                 |
| Workloads               | Resnet50 (Classification), UNet (Segmentation)         |
| Partitioning Strategy   | KP-CP, NP-CP, and YP-XP                                |
| Chiplet Architecture    | KP-CP and NP-CP: NVDLA-like [1]                        |
| Chiplet Architecture    | YP-XP: Shidiannao-like [9]                             |
| Interposer Bandwidth    | 8–16 Bytes/cycle/link (conservative–aggressive)        |
| WIENNA Bandwidth        | 16–32 Bytes/cycle (conservative–aggressive)            |
| Average Hops in NoP     | Interposer (mesh): $\sqrt{N_{chiplets}}/2$ , WIENNA: 1 |
| Multicasting Capability | Interposer: No, WIENNA: Yes                            |

**Table 4.** Evaluation settings.

**Throughput Improvements.** Fig. 7 presents the throughput analysis of WIENNA, from which we highlight several key results. First, WIENNA improves the end-to-end throughput by 2.7-5.1× on Resnet50 and 2.2-3.8× on UNet. Second, WIENNA can achieve better results than the interposer with the same relative bandwidth. As listed in Table 4, the aggressive interposer (interposer A) and the conservative WIENNA (WIENNA C) have the same bandwidth, but WIENNA C provides 2.58× and 2.21× higher throughput than interposer A. The difference stems from the single-cycle broadcasting of the wireless NoP, which is much faster than the multi-hop wired baseline and is critical in most partition methods, as we observed in Fig. 2 and later quantify in Fig. 10. Third, the optimal partitioning strategy in terms of throughput and the impact of having higher bandwidth depend on the layer being processed. This is especially clear in Resnet50, where KP-CP works better in low-res and FC layers, NP-CP in residual layers, and YP-XP in high-res layers. Based on this, in the end-to-end charts of Fig. 7, we present results based on adaptive partitioning, where we select the best strategy for each layer. We find that adaptive partitioning improves throughput by an extra 4.7% and 9.1% on Resnet50 and UNet, respectively, compared to keeping KP-CP across layers.

We further evaluate throughput by studying the impact of varying the number of chiplets assuming a fixed total of 16384 PEs. For the interposer NoP, we adapt the number of hops to the resulting number of chiplets. Fig. 8 shows the results. Since the total number of PEs is fixed, fewer chiplets lead to more traffic per chiplet. Also, the utilization of the chiplets and of the PEs in each chiplet depends on the layer type and the partitioning strategy, as observed previously. As a consequence, we do not observe a monotonic change of the throughput for all the cases and, thus, we conclude that the chiplet size is an important and optimizable design parameter. In any case, WIENNA is consistently faster and also more affected by the cluster size (77.5% average difference from 64 to 512 PEs per chiplet) than the interposer (62.5%).

**Figure 7.** Throughput analysis of conservative (C) and aggressive (A) designs of interposer- and WIENNA-based 2.5D accelerators.

**Figure 8.** Impact of cluster size for three partitioning strategies in (a) Resnet50 and (b) UNet.

**Figure 9.** Energy analysis of the distribution of input activations and filters from SRAM to chiplets in interposer- and WIENNA-based 2.5D accelerators. Inset (c) summarizes the end-to-end energy reduction by WIENNA compared to the interposer-based 2.5D accelerator.

**Figure 10.** Average multicast factor (number of received data across all the chiplets / number of sent data from global SRAM) for each layer type in the evaluated DNN models, (a) Resnet50 and (b) UNet. In this analysis, we apply the cluster size of 64, which results in 256 chiplets.

**Energy Improvements.** The improvement in broadcast offered by WIENNA also affects the energy consumption. Fig. 9 compares the energy consumed in the distribution (from SRAM to chiplets) of input activations and filters in both systems. Across all the partitioning strategies and layers, WIENNA always reduces energy consumption (by 38.2% on average). The energy saving is due to both the efficient multicast support via wireless NoP and the ample multicast opportunities of each partitioning strategy. Fig. 10 quantifies the multicast opportunities by plotting the multicast factor, which is the average number of destinations of each transfer from the global SRAM. The multicast factor varies widely across partitioning strategies and layers, as they define the spatial reuse opportunities that determine the amount of multicast. In general, we observe that the energy reduction provided by WIENNA is high when the multicast factor is high. A clear example is the KP-CP partitioning strategy, which leads to both the highest multicast factor in Fig. 10 and the highest energy reduction in Fig. 9.
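The multicast factor defined in the caption of Fig. 10 (data received across all chiplets divided by data sent from the global SRAM) can be computed as in the sketch below. The transfer sizes are hypothetical numbers chosen only to illustrate the ratio; they do not come from the evaluated layers.

```python
def multicast_factor(transfers):
    """Ratio of data received across all chiplets to data sent by the
    global SRAM.

    transfers: iterable of (elements_sent, num_destinations) pairs, where
    a unicast has one destination and a broadcast has one per chiplet.
    """
    transfers = list(transfers)
    sent = sum(elements for elements, _ in transfers)
    received = sum(elements * dests for elements, dests in transfers)
    return received / sent


NUM_CHIPLETS = 256

# Hypothetical layer under a filter-partitioned strategy: filters are
# unicast (each element reaches one chiplet) and inputs are broadcast.
example = [
    (1000, 1),             # partitioned filters: unicast
    (1000, NUM_CHIPLETS),  # replicated inputs: broadcast to every chiplet
]
print(multicast_factor(example))
```

A layer that is all unicast gives a factor of 1, and the factor approaches the chiplet count as broadcast traffic dominates.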

## 6 Related Works

**2.5D Chiplet Scaleout.** DNN accelerator on-package scale-out is a nascent research area. NVIDIA recently demonstrated a multi-chip-module accelerator [30]. WIENNA can augment such a design via the single-cycle wireless broadcast. Gao *et al.* proposed Tetris [11], a DNN accelerator using a 3D memory that exploits wide bandwidth via TSVs. Tetris solves the bandwidth problem by leveraging TSVs to deliver data to every compute tile. WIENNA solves the same problem for a 2.5D system via wireless NoP. Simba [22] taped out a 2.5D accelerator with 36K MAC units in 36 chiplets, which demonstrates the ability of 2.5D technology to scale out a DNN accelerator. However, the silicon interposer-based NoP among chiplets provides  $10.88 \times$  less bandwidth than the on-chip bandwidth, which can be the system bottleneck depending on the workload and dataflow. WIENNA can replace the NoP and provide higher bandwidth for higher performance.

**Wireless NoC.** The use of WNoCs for deep learning has been analyzed in [7, 23], but unlike WIENNA, the proposed architectures are single-chip and do not scale out. In [23], a single dataflow is adapted to leverage the WNoC to broadcast weights in a CNN accelerator. The heterogeneous CPU/GPU platform from [7] is not an accelerator, but is evaluated for deep learning workloads only. A multi-chip architecture enhanced with wireless interconnects is proposed in [4], but a single short-range TRX is taken as a baseline and only a single fixed dataflow is analyzed. WIENNA is the first scaleout-friendly architecture that leverages a wireless NiP, using concepts such as TSV antennas [19], asymmetric wireless multi-chip design [3] and reconfigurable dataflows.

## 7 Conclusion

This paper proposes a new scalable design methodology of 2.5D DNN accelerators based on wireless technology. We identify the required capabilities for the interconnect in 2.5D DNN accelerators and demonstrate that in those environments, wireless NoPs lead to 2.5-4.4× higher throughput and 38.2% lower energy than interposer-based systems.

## 8 Acknowledgements

This work was supported by the European Commission under grant 863337 and NSF under Award OAC-1909900.

## References

- [1] 2017. NVDLA Deep Learning Accelerator. http://nvdla.org.
- [2] S. Abadal, R. Guirado, H. Taghvaee, et al. 2020. Graphene-based Wireless Agile Interconnects for Massive Heterogeneous Multi-chip Processors. arXiv:2011.04107
- [3] M. M. Ahmed, N. Mansoor, and A. Ganguly. 2018. An Asymmetric, Energy Efficient One-to-Many Traffic-Aware Wireless Network-in-Package Interconnection Architecture for Multichip Systems. In IGSC.
- [4] G. Ascia, V. Catania, A. Mineo, et al. 2020. Improving Inference Latency and Energy of DNNs through Wireless Enabled Multi-Chip-Module-based Architectures and Model Parameters Compression. In NOCS.
- [5] H. M. Cheema and A. Shamim. 2013. The last barrier: On-chip antennas. IEEE Microwave Magazine 14, 1 (2013), 79–91.
- [6] Y. Chen, T. Yang, J. Emer, et al. 2019. Eyeriss v2: A Flexible Accelerator for Emerging Deep Neural Networks on Mobile Devices. *IEEE Trans. Emerg. Sel. Topics Circuits Syst.* 9, 2 (2019), 292–308.
- [7] W. Choi, K. Duraisamy, R. G. Kim, et al. 2018. On-Chip Communication Network for Efficient Training of Deep Convolutional Networks on Heterogeneous Manycore Systems. *IEEE Trans. Comput.* 67, 5 (2018).
- [8] T. O. Dickson, Y. Liu, S. V. Rylov, et al. 2012. An 8x 10-Gb/s source-synchronous I/O system based on high-density silicon carrier interconnects. *IEEE J. Solid-State Circuits* 47, 4 (2012), 884–896.
- [9] Z. Du, R. Fasthuber, T. Chen, et al. 2015. ShiDianNao: Shifting vision processing closer to the sensor. In ISCA.

- [10] V. Fernando, A. Franques, S. Abadal, et al. 2019. Replica: A Wireless Manycore for Communication Intensive and Approximate Data. In ASPLOS.
- [11] M. Gao, J. Pu, X. Yang, et al. 2017. TETRIS: Scalable and Efficient Neural Network Acceleration with 3D Memory. In ASPLOS.
- [12] R. Guirado, H. Kwon, E. Alarcón, et al. 2019. Understanding the Impact of On-chip Communication on DNN Accelerator Performance. In ICECS. 85–88.
- [13] K. He, X. Zhang, S. Ren, et al. 2016. Deep residual learning for image recognition. In CVPR.
- [14] D. Kehlet. 2017. Accelerating innovation through a standard chiplet interface: The advanced interface bus (AIB). In *Intel WP-01285-1.1*.
- [15] J. Kim, G. Murali, P. Gauthaman, et al. 2019. Architecture, Chip, and Package Co-design Flow for 2.5D IC Design Enabling Heterogeneous IP Reuse. In DAC.
- [16] H. Kwon, P. Chatarasi, M. Pellauer, et al. 2019. Understanding Reuse, Performance, and Hardware Cost of DNN Dataflows: A Data-Centric Approach. In MICRO.
- [17] H. Kwon, A. Samajdar, and T. Krishna. 2018. MAERI: Enabling Flexible Dataflow Mapping over DNN Accelerators via Reconfigurable Interconnects. In ASPLOS.
- [18] W. Lu, G. Yan, J. Li, et al. 2017. FlexFlow: A Flexible Dataflow Accelerator Architecture for Convolutional Neural Networks. In HPCA.
- [19] V. Pano, I. Tekin, I. Yilmaz, et al. 2020. TSV Antennas for Multi-Band Wireless Communication. *IEEE Trans. Emerg. Sel. Topics Circuits Syst.* 10, 1 (2020).
- [20] O. Ronneberger, P. Fischer, and T. Brox. 2015. U-net: Convolutional networks for biomedical image segmentation. In MICCAI. 234–241.
- [21] M. S. Shamim, N. Mansoor, R. S. Narde, et al. 2017. A Wireless Interconnection Framework for Seamless Inter and Intra-chip Communication in Multichip Systems. *IEEE Trans. Comput.* (2017).
- [22] Y. Shao, J. Clemons, R. Rangharajan, et al. 2019. Simba: Scaling Deep-Learning Inference with Multi-Chip-Module-Based Architecture. In MICRO.
- [23] M. Sinha, S. H. Gade, W. Singh, and S. Deb. 2018. Data-flow Aware CNN Accelerator with Hybrid Wireless Interconnection. In ASAP.
- [24] A. C. Tasolamprou, A. Pitilakis, S. Abadal, et al. 2019. Exploration of Intercell Wireless Millimeter-Wave Communication in the Landscape of Intelligent Metasurfaces. *IEEE Access* 7 (2019).
- [25] X. Timoneda, S. Abadal, A. Franques, et al. 2020. Engineer the Channel and Adapt to it: Enabling Wireless Intra-Chip Communication. *IEEE Trans. Commun.* (2020).
- [26] K. K. Tokgoz, S. Maki, J. Pang, et al. 2018. A 120Gb/s 16QAM CMOS millimeter-wave wireless transceiver. In ISSCC.
- [27] X. Yu, J. Baylon, P. Wettin, et al. 2014. Architecture and Design of Multi-Channel Millimeter-Wave Wireless Network-on-Chip. *IEEE Des. Test* 31, 6 (2014), 19–28.
- [28] Z. Zhao, H. Kwon, S. Kuhar, et al. 2019. mRNA: Enabling efficient mapping space exploration for a reconfiguration neural accelerator. In ISPASS, 282–292.
- [29] M. Zia, C. Zhang, H. Yang, et al. 2016. Chip-to-chip interconnect integration technologies. *IEICE Electronics Express* 13, 6 (2016).
- [30] B. Zimmer, R. Venkatesan, Y. S. Shao, et al. 2019. A 0.11 pJ/Op, 0.32-128 TOPS, Scalable Multi-Chip-Module-based Deep Neural Network Accelerator with Ground-Reference Signaling in 16nm. In VLSIC.