# Bridging the Gap: FPGAs as Programmable Switches

Thomas Luinaud, Thibaut Stimpfling, Jeferson Santiago da Silva, Yvon Savaria, and J.M. Pierre Langlois, Polytechnique Montréal, Canada {firstname.lastname}@polymtl.ca

*Abstract*—The emergence of P4, a domain-specific language, coupled with PISA, a domain-specific architecture, is revolutionizing the networking field. P4 describes how packets are processed by a programmable data plane implementing PISA, spanning ASICs and CPUs. Because processing flexibility can be limited on ASICs, while CPU performance for networking tasks lags behind, recent works have proposed implementing PISA on FPGAs. However, little effort has been dedicated to analyzing whether FPGAs are good candidates for implementing PISA.

In this work, we take a step back and evaluate the microarchitecture efficiency of various PISA blocks. We demonstrate, supported by a theoretical and experimental analysis, that the performance of a few PISA blocks is severely limited by current FPGA architectures. Specifically, we show that match tables and programmable packet schedulers represent the main performance bottlenecks for FPGA-based programmable switches. Thus, we explore two avenues to alleviate these shortcomings. First, we identify network applications well tailored to current FPGAs. Second, to support a wider range of networking applications, we propose modifications to FPGA architectures that can also be of interest outside the networking field.

Index Terms—FPGA, PISA, P4 language, in-network computing.

#### I. INTRODUCTION

The P4 language [1] and the Protocol Independent Switch Architecture (PISA) [2] have paved the way for high-performance configurable data planes. P4 is a domain-specific language (DSL) designed to describe how packets are forwarded on a data plane. PISA, a domain-specific architecture associated with P4, provides a common abstraction for configurable data planes. Figure 1 presents PISA, a pipeline of configurable blocks (§II-B).

However, current implementations of PISA switches limit the innovation potential of network architects. Several network applications are difficult to implement in PISA switches, due to hardware constraints (stateful applications often require large shared embedded memory) or architectural rigidity (fixed-function packet scheduler).

Recently, FPGAs have emerged as a platform to accelerate network and server applications in data centers. As an example, Microsoft has deployed FPGAs in its data centers [3]. In-network computing has also attracted the attention of FPGA enthusiasts [4].

Following this trend, recent research [5], [6] has also proposed mapping PISA to FPGAs. The goal is to exploit the inherent reconfigurability of FPGAs to implement network applications, as new P4 programs can be deployed by simply reconfiguring the FPGA bitstream. Yet, the performance of FPGA-based PISA switches is, at best, one order of magnitude lower than that of their ASIC counterparts.

In this work, we take a step back and thoroughly analyze *how* PISA blocks are implemented in FPGAs to identify the strengths and weaknesses of such a mapping. These theoretical and experimental analyses have led to two major conclusions. First, some network applications are intrinsically good matches for FPGA implementation. Second, FPGA devices need to be reengineered to better support networking applications.

In summary, our contributions are as follows:

- 1) We analyze the mapping of PISA blocks to FPGAs, highlighting the pros and cons (§III);
- 2) We evaluate, analytically and experimentally, the performance of FPGA-based PISA switches (§IV);
- 3) We identify which applications are well suited for current FPGAs (§V); and
- 4) We propose the specialization of the FPGA architecture for the network domain (§VI).

# II. BACKGROUND

A generic FPGA architecture is first presented. Then, the configurable blocks of PISA are introduced.

#### A. FPGA architecture

As shown in Figure 2, FPGAs are structured as an array of blocks interconnected by a routing fabric, which routes signals between the FPGA blocks.

To communicate off-chip, FPGAs integrate configurable Input/Output (I/O) blocks and dedicated high-speed transceiver pins. Hard-wired PCIe blocks and MAC blocks, connected to the high-speed transceiver pins, are also integrated in modern FPGAs.

The first three authors contributed equally to this work.

This research was partially funded by Mitacs/Canada and CNPq/Brazil.



Figure 1: Reference PISA switch model

Logic operations are implemented by Configurable Logic Blocks (CLBs), distributed over slices. A slice comprises lookup tables (LUTs), which implement logic functions, and flip-flops (FFs) used to synchronize signals between logic functions.

In addition, specialized blocks are also hard-wired into FPGAs, such as Digital Signal Processing (DSP) blocks and Block RAM (BRAM). DSPs perform arithmetic operations, while BRAMs are width-configurable SRAMs. Hard-wired blocks avoid wasting CLBs and allow higher clock frequencies.

#### B. Protocol Independent Switch Architecture

Historically, network switches have been built upon fixed-function ASICs, as packets undergo similar and straightforward processing in switch devices. However, the advent of Software-Defined Networking (SDN) has changed this state of affairs, as it mandates network programmability.

In this context, the PISA architecture [2] was the first proposal to support programmable protocol-agnostic packet forwarding. PISA comprises a programmable parser, a programmable pipeline of match-action tables, a fixed function packet scheduler, and a configurable deparser. PISA is illustrated in Figure 1, where blue rectangles are match tables and yellow trapeziums are Arithmetic and Logical Units (ALUs) implementing actions.

A parser extracts header fields that are used to build lookup keys for the match-action stages. A key is matched against rules stored in a match table. The lookup result, an action and its associated data, is executed by an action stage. A packet scheduler reorders packets according to a scheduling algorithm. Finally, a deparser reassembles the updated packet headers and emits the packet.

### **III. MAPPING PISA TO FPGAS**

A typical implementation of PISA on FPGA follows a dataflow architecture, where the processing is laid out spatially on a pipeline. That is, the processing is divided into a sequence of operations, where each operation is mapped onto a portion of the FPGA. Packets are streamed on a data bus throughout the PISA components. Because a packet size can be larger than the bus width, a packet can be segmented over the data bus, and can require multiple clock cycles to traverse a PISA component.

Figure 2: Considered FPGA architecture

In the next sections, we describe the key components of the PISA architecture and we discuss their micro-architecture efficiency when implemented on FPGAs.

#### A. Programmable Parser

The packet parser extracts header fields to be matched in the match-action stages, and determines the stack of valid protocols in a switch [7]. The common approach to implement packet parsers is through an abstract state machine. The state machine can be represented as a Directed Acyclic Graph (DAG) in which nodes are protocols and edges indicate protocol transitions.
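The parse graph described above can be illustrated with a small software model. The following is only a sketch under stated assumptions: the `PARSE_GRAPH` table, its field offsets, and the fixed header lengths (no IPv4 options, no VLAN tags) are hypothetical simplifications chosen for the example, not the structure used by any particular P4 compiler.

```python
# Hypothetical parse graph: each node is a protocol, each edge maps a selector
# value to the next protocol. Header lengths are fixed here for brevity.
PARSE_GRAPH = {
    "ethernet": {"length": 14, "select": (12, 2), "next": {0x0800: "ipv4"}},
    "ipv4": {"length": 20, "select": (9, 1), "next": {6: "tcp", 17: "udp"}},
    "tcp": {"length": 20, "select": None, "next": {}},
    "udp": {"length": 8, "select": None, "next": {}},
}


def parse(packet: bytes, start: str = "ethernet") -> list[tuple[str, bytes]]:
    """Walk the parse graph and return the stack of (protocol, header bytes)."""
    headers = []
    offset = 0
    state = start
    while state is not None:
        node = PARSE_GRAPH[state]
        header = packet[offset:offset + node["length"]]
        if len(header) < node["length"]:
            break  # truncated packet: stop at the last complete header
        headers.append((state, header))
        next_state = None
        if node["select"] is not None:
            sel_offset, sel_len = node["select"]
            key = int.from_bytes(header[sel_offset:sel_offset + sel_len], "big")
            next_state = node["next"].get(key)  # unknown value: stop parsing
        offset += node["length"]
        state = next_state
    return headers


# Ethernet (EtherType 0x0800) + IPv4 (protocol 17) + UDP, other bytes zeroed.
eth = bytes(12) + (0x0800).to_bytes(2, "big")
ipv4 = bytes(9) + bytes([17]) + bytes(10)
udp = bytes(8)
stack = [name for name, _ in parse(eth + ipv4 + udp)]
print(stack)
```

Each loop iteration corresponds to one node of the DAG. In a hardware implementation the nodes would typically be laid out as pipeline stages rather than visited sequentially.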

DAGs are a good fit for FPGAs because they are straightforward to pipeline. Also, dataflow architectures benefit from the structure of the FPGA fabric, where logic elements are tightly coupled to registers, which improves pipeline efficiency. In addition, nodes of a packet parser DAG are compact data structures that easily fit in on-chip memories or registers. As a result, multiple high-performance programmable packet parsers have been implemented on FPGAs [8]-[10].

### B. Match Stage

A match table is an abstract container, holding a collection of keys, actions and data, addressed by a lookup key. In P4, a match table can be configured for exact match, ternary match or longest prefix match.

**Exact Match (EM).** EM is traditionally implemented with content addressable memory (CAM). A CAM is an associative array used to determine whether an exact key value is stored in a table. CAMs are no longer integrated as hard blocks in FPGAs and must be emulated.

One approach to CAM emulation is transposed memories, where the lookup key is used as an address. The memory stores for each lookup key a bitmap of the matched keys. The transposed approach typically yields a memory efficiency less than 10% [11], [12].
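To make the transposed memory idea concrete, the following sketch models a small CAM in software. The class name `TransposedCam` and its methods are hypothetical and chosen for illustration. The memory efficiency is computed here as useful key bits divided by RAM bits, which is one plausible reading of the metric and an assumption of this example.

```python
class TransposedCam:
    """Exact-match CAM emulated with a RAM addressed by the lookup key."""

    def __init__(self, key_bits: int, entries: int):
        self.key_bits = key_bits
        self.entries = entries
        # One word per possible key value; bit i is set when entry i holds that key.
        self.ram = [0] * (1 << key_bits)
        self.stored = [None] * entries

    def write(self, index: int, key: int) -> None:
        old = self.stored[index]
        if old is not None:
            self.ram[old] &= ~(1 << index)  # clear the previous key of this entry
        self.ram[key] |= 1 << index
        self.stored[index] = key

    def lookup(self, key: int) -> list[int]:
        bitmap = self.ram[key]  # a single memory read returns all match lines
        return [i for i in range(self.entries) if (bitmap >> i) & 1]

    def memory_efficiency(self) -> float:
        useful_bits = self.entries * self.key_bits
        ram_bits = len(self.ram) * self.entries
        return useful_bits / ram_bits


cam = TransposedCam(key_bits=8, entries=4)
cam.write(0, 0x2A)
cam.write(3, 0x2A)
cam.write(1, 0x07)
print(cam.lookup(0x2A), cam.memory_efficiency())
```

With 8-bit keys, the RAM holds 256 words for only 8 useful bits per entry, so the efficiency in this model is 8/256, about 3%. The RAM depth grows as 2 to the power of the key width, which is why wide keys are usually split into smaller chunks.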

An efficient approach to implement exact match operations exploits a hash table combined with Cuckoo hashing for collision resolution, yielding a memory efficiency higher than 80% [2], [13]. Thus, EM implementation on FPGAs can be efficient using on-chip memories.
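As an illustration of collision resolution with Cuckoo hashing, here is a minimal two-way sketch. Several details are assumptions made for the example rather than taken from the cited designs: the class `CuckooTable`, the use of salted BLAKE2b as the hash function, the `max_kicks` bound, and the dictionary used as an overflow stash.

```python
import hashlib


class CuckooTable:
    """Two-way cuckoo hash table: each key lives in one of two candidate slots."""

    def __init__(self, slots_per_way: int = 8, max_kicks: int = 32):
        self.slots_per_way = slots_per_way
        self.max_kicks = max_kicks
        self.ways = [[None] * slots_per_way for _ in range(2)]
        self.stash = {}  # overflow storage for entries that could not be placed

    def _slot(self, way: int, key: bytes) -> int:
        # Hypothetical hash choice: BLAKE2b salted with the way index.
        digest = hashlib.blake2b(key, digest_size=8, salt=bytes([way])).digest()
        return int.from_bytes(digest, "big") % self.slots_per_way

    def _find(self, key: bytes):
        for way in range(2):
            slot = self._slot(way, key)
            entry = self.ways[way][slot]
            if entry is not None and entry[0] == key:
                return way, slot
        return None

    def lookup(self, key: bytes):
        hit = self._find(key)
        if hit is not None:
            way, slot = hit
            return self.ways[way][slot][1]
        return self.stash.get(key)

    def insert(self, key: bytes, value) -> None:
        hit = self._find(key)
        if hit is not None:  # key already present: update in place
            self.ways[hit[0]][hit[1]] = (key, value)
            return
        if key in self.stash:
            self.stash[key] = value
            return
        item = (key, value)
        way = 0
        for _ in range(self.max_kicks):
            slot = self._slot(way, item[0])
            # Place the item and pick up whatever occupied the slot before.
            item, self.ways[way][slot] = self.ways[way][slot], item
            if item is None:
                return
            way ^= 1  # the evicted entry moves to its slot in the other way
        self.stash[item[0]] = item[1]  # give up: keep the last evicted entry


table = CuckooTable()
for i in range(12):
    table.insert(f"flow{i}".encode(), i)
print(table.lookup(b"flow7"))
```

A lookup reads at most two slots (plus the stash), which is what makes the scheme attractive for on-chip memories: both ways can be read in parallel in a single cycle.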

**Ternary Match.** TCAMs are traditionally used for ternary matches. A hardware TCAM comprises a memory with match circuitry, which stores ternary rules and matches them against a lookup key, and a priority encoder, which returns the index of the matched rule with the highest priority. Similarly to CAMs, TCAMs must be emulated on FPGAs.

A memory-efficient approach to emulate a TCAM of width $W$ and depth $N$ is to build $P$ smaller TCAMs using the transposed memory approach. Still, the memory overhead is $2^w/w$, with $w = W/P$. Hence, the overhead is minimal for a TCAM where $w = 1$ or $w = 2$, which translates to a depth of 2 or 4. In practice, this overhead ranges from $8.4\times$ to $65\times$, since current FPGA memories have a minimum depth of 32 (LUT RAMs) and 512 (BRAMs).

Reviriego, Ullah, and Pontarelli propose to represent ternary rules as logic functions [14], which are synthesized in LUTs (§ II-A). However, this approach yields a memory efficiency similar to that of the transposed memory approach, because a LUT is a small SRAM that records one result for each possible input. Hence, the implementation of efficient TCAMs on FPGAs remains an open question.

**Longest Prefix Match (LPM).** LPM is a sub-case of ternary match with two differences: 1) the mask applies to a contiguous segment of bits, starting from the least significant bit, and 2) when multiple entries match a lookup key, only the one with the longest prefix is returned. Thus, LPM implementations based on TCAMs [15] achieve a low memory efficiency on FPGAs (§ III-B).

Alternatively, LPM can be emulated with data structures such as binary trees, as in the Xilinx LPM IP. Not only are both the memory efficiency and the frequency improved by $2\times$, but the resource consumption is also reduced by an order of magnitude compared to transposed memory. However, the update latency grows linearly with the number of keys.
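As a generic illustration of tree-based LPM, the sketch below uses a plain binary trie over 32-bit keys. It is not a model of the Xilinx LPM IP, whose internal structure is not described here. The class `BinaryTrieLpm` and the helper `ipv4` are hypothetical names introduced for the example.

```python
class BinaryTrieLpm:
    """Longest-prefix match over fixed-width integer keys using a binary trie."""

    def __init__(self, key_bits: int = 32):
        self.key_bits = key_bits
        self.root = {}  # node layout: {0: child, 1: child, "value": result}

    def _bit(self, key: int, position: int) -> int:
        # Position 0 is the most significant bit of the key.
        return (key >> (self.key_bits - 1 - position)) & 1

    def insert(self, prefix: int, length: int, value) -> None:
        node = self.root
        for position in range(length):
            node = node.setdefault(self._bit(prefix, position), {})
        node["value"] = value

    def lookup(self, key: int):
        node = self.root
        best = node.get("value")  # zero-length prefix, if any
        for position in range(self.key_bits):
            node = node.get(self._bit(key, position))
            if node is None:
                break
            if "value" in node:
                best = node["value"]  # a longer matching prefix was found
        return best


def ipv4(a: int, b: int, c: int, d: int) -> int:
    return (a << 24) | (b << 16) | (c << 8) | d


routes = BinaryTrieLpm()
routes.insert(ipv4(0, 0, 0, 0), 0, "default")
routes.insert(ipv4(10, 0, 0, 0), 8, "A")
routes.insert(ipv4(10, 1, 0, 0), 16, "B")
print(routes.lookup(ipv4(10, 1, 2, 3)))
```

The lookup remembers the last node carrying a value while descending, so the deepest match on the path wins, which is the longest-prefix rule stated above.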

Several other data structures proposed in the literature [16], [17] exploit specific characteristics of the stored keys to improve memory efficiency. However, these methods can hardly be used in systems configured with P4, since the characteristics of the match table contents are unknown a priori.

## C. Action Stage

*Primitive actions* are operations natively supported in P4, comprising arithmetic, logic, bit shifts, and conditional operations.

**Arithmetic and logic operations.** These operations are, except for division, efficiently mapped to the FPGA fabric. Indeed, the ALU found in a CPU is efficiently implemented on FPGAs [11]. In addition, modern FPGAs support fast multiplications and multiply-and-accumulate operations, as well as adders, with DSPs and hard-wired carry chains. Division operations are costly in terms of logic resources on FPGAs, but they are quite uncommon in packet processing. Complex actions combining multiple primitives are well supported when laid out in a pipeline.

In addition, the reconfiguration capability of FPGAs allows almost any combination of actions to be supported.

**Conditional operations.** Conditions described in P4 are translated at the architectural level into multiplexers that select one operation, or result, out of multiple inputs. While an FPGA slice can be configured as a 16-to-1 multiplexer, very wide multiplexers must be distributed over multiple slices, so performance degrades as the number of conditions grows.

**Bit-shift operations.** Common fixed bit shifts incur no hardware cost when implemented in FPGAs because the shift values are known at compile time and are therefore hard-wired into the FPGA fabric. In contrast, barrel shifters map poorly to FPGAs, as they are commonly implemented using a chain of multiplexers. However, variable shifts are uncommon in packet processing and, when needed, barrel shifters can be implemented using hard DSP blocks.

# D. Packet Scheduler

A packet scheduler decides when and in what order packets are sent.

The push-in-first-out (PIFO) queue [18] was recently proposed as an abstraction upon which programmable packet schedulers can be built. However, the PIFO architecture maps poorly to an FPGA, because a range-search CAM is required, which can only be emulated either with flip-flops or with the transposed memory approach (§III-B).
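The PIFO abstraction itself is simple to express in software, which helps separate the abstraction from the hardware cost discussed above. The sketch below is a behavioral model only: the class `Pifo` is a hypothetical name, and the sorted list stands in for the hardware structure, so it says nothing about how a range-search CAM would be built.

```python
import bisect
import itertools


class Pifo:
    """Behavioral model: push at a rank-determined position, pop from the head."""

    def __init__(self):
        self._queue = []  # kept sorted as (rank, arrival sequence, packet)
        self._seq = itertools.count()

    def push(self, rank: int, packet) -> None:
        # The sequence number keeps arrival order among packets of equal rank.
        bisect.insort(self._queue, (rank, next(self._seq), packet))

    def pop(self):
        _rank, _seq, packet = self._queue.pop(0)
        return packet

    def __len__(self) -> int:
        return len(self._queue)


pifo = Pifo()
for rank, name in [(5, "a"), (1, "b"), (5, "c"), (3, "d")]:
    pifo.push(rank, name)
order = [pifo.pop() for _ in range(len(pifo))]
print(order)
```

The scheduling policy is entirely captured by how ranks are computed before `push`, which is what makes the abstraction programmable.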

Alternatively, Benacer, Boyer, and Savaria have proposed a priority queue that better fits the FPGA architecture [19]. It exploits the intrinsic parallelism of FPGAs to sort the packet ranks using a systolic priority queue. However, this work does not provide the programmability supported by the PIFO.

In addition, a packet buffer is required to store packets while they are scheduled. Assuming a typical data center RTT of $100\,\mu\text{s}$ and one $100\,\text{Gb/s}$ interface, a $1.2\,\text{MB}$ buffer is required. Hence, assuming an FPGA with $12 \times 100\,\text{Gb/s}$ interfaces, a $12\,\text{MB}$ buffer would be required, which would use one fourth of the on-chip memories available on the largest FPGAs and would stress the FPGA internal routing fabric (§IV).

## E. Deparser

The deparser performs the inverse operation of the parser: it reassembles headers in the correct order before sending the packet. In principle, a deparser is thus well supported on FPGAs, and the methods used to implement a high-performance parser can be applied [9]. In P4<sub>16</sub>, the deparser is described as a sequence of header emission statements. The packet header is then recomposed by respecting the order of valid headers in the sequence. However, header validity can be modified during the match-action stages, and current compilers do not perform header lifetime analysis to find the smallest set of possible valid header combinations. As a result, the deparser is more difficult to implement in practice, and it tends to have the largest resource consumption in the P4 pipeline (§ IV-B).
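The emission semantics described above can be sketched as follows. The emit sequence `EMIT_ORDER`, the function `deparse`, and the header contents are hypothetical and only illustrate that headers are emitted in a fixed order and skipped when invalid.

```python
# Hypothetical emit sequence, in the order the headers appear on the wire.
EMIT_ORDER = ["ethernet", "vlan", "ipv4", "udp"]


def deparse(headers: dict[str, bytes], valid: set[str], payload: bytes) -> bytes:
    """Emit every valid header in EMIT_ORDER, then append the payload."""
    out = bytearray()
    for name in EMIT_ORDER:
        if name in valid:  # validity may have changed in the match-action stages
            out += headers[name]
    out += payload
    return bytes(out)


headers = {"ethernet": b"E" * 14, "vlan": b"V" * 4, "ipv4": b"I" * 20, "udp": b"U" * 8}
untagged = deparse(headers, {"ethernet", "ipv4", "udp"}, b"data")
tagged = deparse(headers, {"ethernet", "vlan", "ipv4", "udp"}, b"data")
print(len(untagged), len(tagged))
```

Because any subset of the four headers may be valid, the byte offset of each emitted header depends on the validity of all preceding ones. In hardware this translates into wide multiplexing, which is consistent with the resource cost noted above.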

#### IV. SCALING THE PACKET THROUGHPUT

High-end FPGAs come with multiple hard-wired 100 G Ethernet MACs, offering a total packet throughput at the I/O level exceeding the Tb/s barrier. This raises the question whether an FPGA can *in practice* process packets at a Tb/s.

Because PISA is implemented on FPGAs using a dataflow architecture, the supported packet throughput is given directly by the product $\text{width}_{\text{bus}} \times \text{frequency}_{\text{bus}}$. Hence, increasing either the bus width or the bus frequency directly translates to a higher packet throughput. Another option to scale the packet throughput is to use parallelism, i.e., to replicate a PISA pipeline. We first discuss each of the three approaches. Then, based on experiments, we present the scaling limitations observed *in practice* on FPGAs.
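The throughput relation above lends itself to a quick calculation. In the sketch below, the function names and the numeric values (a 512-bit bus at 200 MHz, a 400 Gb/s target) are illustrative assumptions, not measurements.

```python
def throughput_gbps(bus_width_bits: int, frequency_mhz: float) -> float:
    """Throughput of a streaming data path: bus width times clock frequency."""
    return bus_width_bits * frequency_mhz / 1e3  # bits * MHz = Mb/s, then Gb/s


def required_frequency_mhz(bus_width_bits: int, target_gbps: float) -> float:
    """Clock frequency needed to sustain a target throughput on a given bus."""
    return target_gbps * 1e3 / bus_width_bits


print(throughput_gbps(512, 200.0))         # 512-bit bus clocked at 200 MHz
print(required_frequency_mhz(512, 400.0))  # clock needed for 400 Gb/s
print(required_frequency_mhz(2048, 400.0)) # same target on a 4x wider bus
```

Quadrupling the bus width divides the required clock frequency by four, provided the wider bus can still be routed at that frequency.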

#### A. Methods

**Scaling the Bus Width.** To simplify the discussion, we assume that the minimum packet size is greater than or equal to the bus width. Increasing the bus width comes at the cost of a higher resource consumption and limits the maximum bus frequency. Because the bus is wider, more bits are synchronized and processed, which directly increases the resource consumption. Moreover, wide buses (>512 bits) increase the routing congestion in FPGAs, leading to longer wire delays, which directly limits the frequency.

**Scaling the Bus Frequency.** One method consists in increasing the depth of a pipeline to reduce the logic delay and wire delay between two flip-flops. The latency can increase, but shorter clock periods can be obtained, which increases the frequency. In theory, the parser, match-action tables and deparser can be heavily pipelined, and thus the frequency can scale. However, experimentally, the FPGA architecture limits the frequency scaling (§ IV-B).

**Pipeline Replication.** Our experiments (§ IV-B) show that the maximum practical throughput in a single pipeline is around 800 Gb/s. However, state-of-the-art FPGAs can support almost twice this throughput at the I/O level through their hard 100 Gb/s MAC blocks.

The packet throughput can be increased linearly with the number of PISA pipelines implemented, at the cost of a linear resource consumption growth.



Figure 3: FPGA results for test cases from Table I.

To exploit the benefits of pipeline replication, a packet dispatcher is required at both the input and output of the pipelines to distribute the packet traffic among the pipelines. In addition, the packet dispatcher integrates buffers in each port to prevent packet drops. The resulting buffer requirements can be expressed as:

$$\texttt{BufferSize} = \frac{\texttt{Ports}}{\texttt{Pipes}} \times \texttt{MaxPktSize}$$
$$\texttt{TotalBufferSize} = 2 \times \texttt{Ports} \times \texttt{BufferSize}$$

where Pipes is the number of pipelines, Ports the number of Ethernet ports, BufferSize the input/output buffer size per port, MaxPktSize the maximum supported packet size, and TotalBufferSize the total required buffer size.
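The two expressions above can be evaluated directly. The configuration used below (12 ports, 2 pipelines, 1500-byte maximum packet size) is a hypothetical example chosen for illustration. The function name `dispatcher_buffers` is likewise an assumption.

```python
def dispatcher_buffers(ports: int, pipes: int, max_pkt_size: int) -> tuple[float, float]:
    """Return (BufferSize, TotalBufferSize) in the unit of max_pkt_size."""
    buffer_size = ports / pipes * max_pkt_size  # per-port input or output buffer
    total_buffer_size = 2 * ports * buffer_size  # input and output side, all ports
    return buffer_size, total_buffer_size


# Hypothetical configuration: 12 ports, 2 pipelines, 1500-byte packets.
per_port, total = dispatcher_buffers(ports=12, pipes=2, max_pkt_size=1500)
print(per_port, total)
```

In this model the total grows quadratically with the number of ports for a fixed number of pipelines, and adding pipelines reduces the buffering required per port.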

#### B. Experimental Evaluation

To characterize some FPGA performance limitations, we first evaluate the impact of the data bus width on the clock frequency. Then, we characterize the performance of multiple PISA blocks in terms of resource usage and clock frequencies. The PISA block stressed by each test and the description of each test are presented in Table I. All the experiments were described in P4<sup>1</sup>. The designs were implemented with Xilinx SDNet 2017.4 combined with Vivado 2018.2 on a Xilinx Virtex UltraScale+ FPGA (XCVU9P-flga2577-3-e) with a targeted clock frequency of 500 MHz.

**Bus width vs frequency.** To characterize the bus width impact on the clock period, the test T0 was implemented with a bus width ranging from 64 to 2048 bits.

As shown in Figure 4, the resource consumption is linear with the bus width, while the maximum clock frequency decreases for buses wider than 1280 bits. This frequency deterioration relates to routing congestion and a large net fan-out. When a routing block is congested, signals are routed to farther routing blocks, which increases wire delays. In addition, we observe a large fan-out in our tests, which limits the maximum clock frequency, even with hyper-pipelined data paths.

<sup>1</sup>The source code is available at https://github.com/luinaudt/Unleashing\_FPGA

Table I: Test cases to stress the FPGA architecture.

| Test | Description                      | Stress       |
|------|----------------------------------|--------------|
| T0   | 3 headers, 74 bytes, no tables   | -            |
| T1   | T0 + 8 headers, 116 bytes        | (De)Parser   |
| T2   | T1 + IPv4 checksum               | Actions      |
| T3   | T2 + EM $64\,\text{k} \times 128$ bits   | Match tables |
| T4   | T2 + TCAM $4\,\text{k} \times 128$ bits  | Match tables |
| T5   | T2 + 4 chained conditionals      | Actions      |

We observe a maximum throughput around 786 Gb/s per implemented pipeline for a bus width of 2048 bits. As a result, two pipelines are required to process the I/O bandwidth available on high-end FPGAs.

**Stressing PISA.** To stress the different PISA blocks, we used a bus width of 2048 bits. The implementation results are presented in Figure 3. The dotted line shows the clock period required for a throughput of 600 Gb/s per pipeline.

We observe that T1, T2 and T5 have little impact on the clock period. Thus, increasing the protocol stack and adding more actions does not affect throughput. By contrast, in T3 and T4, the clock period is increased by more than 60%. The performance deterioration for T3 does not relate to the CAM emulation method (§ III-B), but to the distribution of BRAMs over the FPGA fabric. To implement larger memories, several BRAM blocks are combined. Because BRAMs are organized in columns (§ II-A), large memories are spread over multiple BRAM columns, incurring long wire delays, which increase the clock period. For T4, the reduced performance also relates to the use of large distributed RAMs spanning the FPGA fabric.

Finally, resource consumption increases between T0 and T1, because the deparser uses more than 80% of all resources. The LUT and FF usage is similar for T1, T2, T3 and T5, which demonstrates the efficiency of actions performed on FPGAs. Compared to T1, more BRAMs are used for T3 and T4 to implement the match tables. T4 uses a high number of LUTs and FFs, because of an inefficient TCAM implementation.

## V. FPGAS AND IN-NETWORK COMPUTING

In-network computing is attracting growing interest in the literature. Our thesis is that FPGAs are a key component to unlock the potential of this paradigm. Indeed, FPGAs can support a very high packet throughput in several cases (§ IV-B). This raises three fundamental questions:

- 1) Which applications can be implemented efficiently on current FPGAs?
- 2) Which programming model should be used?
- 3) Is PISA sufficient for in-network computing?

Figure 4: Resource consumption and clock period as a function of the bus size for the T0 benchmark.

**Applications tailored to FPGAs.** Ports and Nelson classify in-network computing applications according to three criteria: the number of operations per packet, the number of states per packet and the packet gain [20]. Based on our experiments, applications tailored to FPGAs need to perform a high number of operations per packet and have a limited number of states, while avoiding match tables.

An example of an application meeting these criteria is the reduction operation in distributed deep neural network (DDNN) training [4]. DDNN training consists of computing gradients, which are subsequently collected by a parameter server (PS). The PS aggregates the received values with arithmetic operations, then updates the model and forwards it back. Because the PS performs a high number of operations per packet, and packets are mainly parsed and forwarded, this application could support a higher packet throughput than the previously reported results [4].

**Programming model.** To enforce the abstraction layers, we argue that a network DSL, such as P4, should be used only to describe packet manipulation, while the remaining application processing should be expressed with a general-purpose language. In addition, current high-level synthesis tools, which synthesize a general-purpose language for FPGAs, have been shown to achieve performance close to hand-written RTL code [10].

**Abusing PISA externs.** On FPGAs, an in-network computing application can consist of PISA connected to an application module. P4 *externs* allow external hardware modules to be connected to PISA. Hence, a generic in-network computing architecture would integrate standardized interfaces to extern processing modules.

# VI. SPECIALIZING FPGAS

PISA efficiency and performance are currently limited by the existing FPGA architecture. Notably, the match stages and packet scheduler are the main performance bottlenecks.

We propose to specialize the FPGA architecture to better support PISA, while preserving the FPGA's flexibility for applications outside of the networking realm.

**Hard-wired TCAMs.** This would increase resource efficiency and performance of a ternary match stage.

To evaluate the benefits of hard-wired TCAMs, we analyze the number of transistors required for a $48 \times 128$ TCAM, as older Lattice FPGAs had hard-wired TCAMs using this configuration. It is well established that a single-bit TCAM requires 16 transistors. Thus, 98 k transistors are needed for the memory and match circuitry in a $48 \times 128$ TCAM block. The cost associated with the priority encoder is ignored as it is negligible.

In contrast, a 1-bit SRAM cell requires 6 transistors. Since soft-TCAMs use SRAMs, and have a  $10 \times$  memory overhead (§ III-B), a  $48 \times 128$  soft-TCAM needs 368 k transistors. Hence, ignoring the reconfigurability cost, a hard-wired TCAM could reduce the silicon usage by  $3.8 \times$ .
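The transistor estimate above can be reproduced with a few lines of arithmetic. All constants come from the text. Only the variable names are introduced here for illustration.

```python
TCAM_DEPTH = 48
TCAM_WIDTH = 128
TCAM_TRANSISTORS_PER_BIT = 16   # single-bit TCAM cell
SRAM_TRANSISTORS_PER_BIT = 6    # 1-bit SRAM cell
SOFT_TCAM_MEMORY_OVERHEAD = 10  # memory overhead of SRAM-based emulation

bits = TCAM_DEPTH * TCAM_WIDTH
hard_tcam = bits * TCAM_TRANSISTORS_PER_BIT
soft_tcam = bits * SRAM_TRANSISTORS_PER_BIT * SOFT_TCAM_MEMORY_OVERHEAD
ratio = soft_tcam / hard_tcam

print(hard_tcam, soft_tcam, ratio)
```

This yields 98,304 transistors for the hard-wired block and 368,640 for the soft TCAM, a ratio of 3.75, in line with the rounded figures of 98 k, 368 k and $3.8\times$ quoted above.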

Determining the number of TCAM primitives to hard-wire on FPGA is left for future work, as it relates to the FPGA thermal budget and the TCAM cells used [2].

More flexible on-chip memory primitives can serve as an alternative to hard-wired TCAMs. The transposed memory approach would directly benefit from shallower and wider memories. However, wider memories imply large data buses, which have a direct impact on performance (§IV).

**Hard-wired CAM.** The motivation to hard-wire CAMs is to better support packet schedulers, but *not* for lookup operations. Indeed, the PIFO micro-architecture [18] requires single cycle CAMs, which are poorly emulated on FPGAs [12]. In addition, PIFO uses range-search CAMs, where a lookup key must be enclosed in a range. However, a hard range-search CAM is too specialized to be included as a generic module.

Hence, we propose to hard-wire configurable CAM primitives supporting =, < and > match operations, which brings two benefits. First, multiple CAM primitives can be combined with programmable logic to construct range-search CAMs. Second, because multiple match operations are supported, other applications can use them. For instance, soft CPUs can use the proposed CAM for cache units and out-of-order scheduling, and applications using sparse matrices, such as machine learning, can use it to reduce their memory footprint.

We now evaluate the CAM size needed for the PIFO architecture. The CAM size derives from the number of flows and the rank size supported by a packet scheduler. To support 1024 flows with a rank size of 16 bits, a range-search CAM of $1024 \times (16 \times 2)$ is required [18]. Sivaraman *et al.* [21] report an overhead of almost $20\times$ between a range-search CAM block and an SRAM of equivalent size. However, since the analyzed CAMs are built using flip-flops, this overhead is overestimated by a factor of 3. In addition, because high-end FPGAs pack hundreds of Mb of SRAM, using hundreds of kb of SRAM for CAMs has a very limited impact.
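The composition of a range-search CAM from comparison primitives can be illustrated with a behavioral sketch. The class `CompareCam` and the function `range_search` are hypothetical names. The bounds are exclusive because only the `<` and `>` primitives are used, which is a simplification of this example: inclusive bounds would additionally combine the `=` primitive.

```python
import operator


class CompareCam:
    """CAM primitive whose entries all match with one configurable operator."""

    OPS = {"=": operator.eq, "<": operator.lt, ">": operator.gt}

    def __init__(self, op: str, stored: list[int]):
        self.compare = self.OPS[op]
        self.stored = list(stored)

    def match(self, key: int) -> list[bool]:
        # Match line i is asserted when `key <op> stored[i]` holds.
        return [self.compare(key, value) for value in self.stored]


def range_search(lows: list[int], highs: list[int], key: int) -> list[int]:
    """Indices i such that lows[i] < key < highs[i], from two CAM primitives."""
    above_low = CompareCam(">", lows).match(key)
    below_high = CompareCam("<", highs).match(key)
    # The AND of the two match vectors would live in programmable logic.
    return [i for i, (a, b) in enumerate(zip(above_low, below_high)) if a and b]


lows, highs = [0, 10, 20], [10, 20, 30]
print(range_search(lows, highs, 15))
```

Each range entry stores two bounds, which matches the $16 \times 2$ bits per entry in the sizing discussed above.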

**Dedicated Network on Chip.** One solution to limit the signal congestion inside the routing fabric is to integrate a Network on Chip (NoC). A NoC can be seen as a grid interconnecting different blocks of the FPGA. Because a NoC is hard-wired, it supports a high clock frequency, which in turn allows a high packet throughput, as presented by Achronix<sup>2</sup>. Also, a NoC simplifies routing by reducing the routing scope.

**Hard-wired wide buses.** We propose to integrate hard-wired wide buses into the routing fabric in order to support a high packet throughput. The idea consists in routing an n-bit bus as a single instance instead of routing each bit of the bus individually.

The benefits are multiple. First, it would reduce the signal congestion observed with wide buses, which limits the frequency. Second, both the interconnection cost and the configuration memory footprint would be reduced because a single configuration bit would control an n-bit bus. Hard-wired wide buses would also ease the routing process for the FPGA tools.

While many applications using byte based structures would benefit from a hard-wired wide bus, single bit routing level is still required to keep the flexibility of today's FPGAs.

## VII. RELATED WORK

**FPGA acceleration.** FPGAs can drastically improve the CPU usage efficiency, reduce the computation time, and increase the energy efficiency. Caulfield *et al.* have demonstrated the benefits of FPGAs on applications such as machine learning or page ranking [3]. For pure network acceleration, the NetFPGA platform has been introduced. However, it lacked support for programmable data planes. Recently, Ibanez *et al.* demonstrated how to extend the NetFPGA platform to implement programmable data planes using the Xilinx SDNet compiler [6]. Moreover, many works have proposed to use FPGAs as SmartNICs [3], [22]. However, these works connect FPGAs to 40G/50G Ethernet links, which do not stress the FPGA architecture.

**FPGA architecture for networking.** While Bosshart *et al.* highlighted that FPGAs cannot beat a programmable ASIC because of their inefficient TCAM support and the limited I/O bandwidth provided, the authors did not study the roots of these inefficiencies [2]. Caulfield, Costa, and Ghobadi suggested augmenting the FPGA architecture with hard-wired CAMs, but their motivation lies in supporting efficient lookup operations, which differs from our proposal [23]. Xilinx's Versal ACAP architecture [24] integrates a NoC used mainly to route data out of the programmable logic, but its performance when routing data within the programmable logic is not yet known.

<sup>2</sup>https://www.achronix.com/product/speedster7t/

**Compiling network applications to FPGAs.** Available commercial and open-source P4 compilers compile P4 directly to RTL [5], [6]. Another avenue is to exploit high-level synthesis tools as an intermediate step [25]. However, none of these works evaluated the shortcomings of the FPGA architecture for networking applications.

## VIII. CONCLUSION

P4 is the language and PISA the architecture that made programmable data planes a reality. FPGAs, in turn, have recently played an important role in server and network offloading in data centers. Some works have also proposed to implement PISA on FPGAs. However, little effort has been devoted to analyzing whether FPGAs can efficiently implement PISA.

In this paper, we have studied how each PISA block is mapped to the existing FPGA architecture. Our analysis, supported by experiments, showed that a few PISA blocks are inefficiently implemented on FPGAs. The root of this inefficiency lies in FPGA architectures. Still, we identified a set of networking applications and in-network computing applications that are excellent matches for current FPGA devices. We also proposed to integrate network-specialized hard-wired blocks, which would significantly improve the performance of FPGA-based PISA switches without sacrificing the flexibility of FPGAs.

#### REFERENCES

- [1] P. Bosshart, D. Daly, G. Gibb, M. Izzard, N. McKeown, J. Rexford, C. Schlesinger, D. Talayco, A. Vahdat, G. Varghese, and D. Walker, "P4: Programming Protocol-independent Packet Processors", *SIGCOMM Comput. Commun. Rev.*, 2014.
- [2] P. Bosshart, G. Gibb, H.-S. Kim, G. Varghese, N. McKeown, M. Izzard, F. Mujica, and M. Horowitz, "Forwarding Metamorphosis: Fast Programmable Match-action Processing in Hardware for SDN", *SIGCOMM Comput. Commun. Rev.*, 2013.
- [3] A. M. Caulfield, E. S. Chung, A. Putnam, H. Angepat, J. Fowers, M. Haselman, S. Heil, M. Humphrey, P. Kaur, J.-Y. Kim, *et al.*, "A cloud-scale acceleration architecture", in *The 49th Annual IEEE/ACM International Symposium on Microarchitecture*, 2016.

- [4] Y. Li, I.-J. Liu, Y. Yuan, D. Chen, A. Schwing, and J. Huang, "Accelerating distributed reinforcement learning with in-switch computing", in *Proceedings of the 46th International Symposium on Computer Architecture*, 2019.
- [5] H. Wang, R. Soulé, H. T. Dang, K. S. Lee, V. Shrivastav, N. Foster, and H. Weatherspoon, "P4FPGA: A Rapid Prototyping Framework for P4", in *Proceedings of the Symposium on SDN Research*, 2017.
- [6] S. Ibanez, G. Brebner, N. McKeown, and N. Zilberman, "The p4-netfpga workflow for line-rate packet processing", in *Proceedings of the 2019 ACM/SIGDA International Symposium on Field-Programmable Gate Arrays*, ser. FPGA '19, 2019.
- [7] G. Gibb, G. Varghese, M. Horowitz, and N. McKeown, "Design Principles for Packet Parsers", in *Proceedings of the Ninth ACM/IEEE Symposium on Architectures for Networking and Communications Systems*, 2013.
- [8] M. Attig and G. Brebner, "400 Gb/s Programmable Packet Parsing on a Single FPGA", in *Proceedings of the 2011 ACM/IEEE Seventh Symposium on Architectures for Networking and Communications Systems*, 2011.
- [9] P. Benacek, V. Pus, H. Kubatova, and T. Cejka, "P4-to-VHDL: Automatic generation of high-speed input and output network blocks", *Microprocessors and Microsystems*, 2018.
- [10] J. Santiago da Silva, F.-R. Boyer, and J. P. Langlois, "P4-Compatible High-Level Synthesis of Low Latency 100 Gb/s Streaming Packet Parsers in FPGAs", in *Proceedings of the 2018 ACM/SIGDA International Symposium on Field-Programmable Gate Arrays*, 2018.
- [11] H. Wong, V. Betz, and J. Rose, "Quantifying the gap between FPGA and custom CMOS to aid microarchitectural design", *IEEE Transactions on Very Large Scale Integration (VLSI) Systems*, 2013.
- [12] A. M. S. Abdelhadi and G. G. F. Lemieux, "Modular SRAM-based binary content-addressable memories", in *IEEE 23rd Annual International Symposium on Field-Programmable Custom Computing Machines (FCCM)*, 2015.
- [13] A. Kirsch and M. Mitzenmacher, "The power of one move: Hashing schemes for hardware", *ACM Transactions on Networking*, vol. 18, no. 6, 2010.
- [14] P. Reviriego, A. Ullah, and S. Pontarelli, "PR-TCAM: Efficient TCAM Emulation on Xilinx FPGAs Using Partial Reconfiguration", *IEEE Transactions on Very Large Scale Integration (VLSI) Systems*, 2019.
- [15] A. M. Abdelhadi, G. G. Lemieux, and L. Shannon, "Modular block-RAM-based longest-prefix match ternary content-addressable memories", in *2018 28th International Conference on Field Programmable Logic and Applications (FPL)*, 2018.
- [16] H. Le and V. K. Prasanna, "Scalable tree-based architectures for IPv4/v6 lookup using prefix partitioning", *IEEE Transactions on Computers*, 2011.

- [17] G. Retvari, J. Tapolcai, A. Korosi, A. Majdan, and Z. Heszberger, "Compressing IP Forwarding Tables: Towards Entropy Bounds and Beyond", *IEEE/ACM Transactions on Networking*, 2016.
- [18] A. Sivaraman, S. Subramanian, M. Alizadeh, S. Chole, S.-T. Chuang, A. Agrawal, H. Balakrishnan, T. Edsall, S. Katti, and N. McKeown, "Programmable Packet Scheduling at Line Rate", in *Proceedings of the 2016 ACM SIGCOMM Conference*, 2016.
- [19] I. Benacer, F. Boyer, and Y. Savaria, "A fast, single-instruction-multiple-data, scalable priority queue", *IEEE Transactions on Very Large Scale Integration (VLSI) Systems*, 2018.
- [20] D. R. K. Ports and J. Nelson, "When should the network be the computer?", in *Proceedings of the Workshop on Hot Topics in Operating Systems*, ser. HotOS '19, New York, NY, USA: ACM, 2019, pp. 209–215.
- [21] A. Sivaraman, S. Subramanian, A. Agrawal, S. Chole, S.-T. Chuang, T. Edsall, M. Alizadeh, S. Katti, N. McKeown, and H. Balakrishnan, "Towards programmable packet scheduling", in *Proceedings of the 14th ACM Workshop on Hot Topics in Networks*, ser. HotNets-XIV, New York, NY, USA: ACM, 2015.
- [22] D. Firestone, A. Putnam, S. Mundkur, D. Chiou, A. Dabagh, M. Andrewartha, H. Angepat, V. Bhanu, A. Caulfield, E. Chung, H. K. Chandrappa, S. Chaturmohta, M. Humphrey, J. Lavier, N. Lam, F. Liu, K. Ovtcharov, J. Padhye, G. Popuri, S. Raindel, T. Sapre, M. Shaw, G. Silva, M. Sivakumar, N. Srivastava, A. Verma, Q. Zuhair, D. Bansal, D. Burger, K. Vaid, D. A. Maltz, and A. Greenberg, "Azure accelerated networking: Smartnics in the public cloud", in *15th USENIX Symposium on Networked Systems Design and Implementation (NSDI 18)*, 2018, pp. 51–66.

- [23] A. Caulfield, P. Costa, and M. Ghobadi, "Beyond smartnics: Towards a fully programmable cloud", 2018. [Online]. Available: https://www.microsoft.com/en-us/research/uploads/prod/2018/05/beyond\_smart\_nics.pdf.
- [24] I. Swarbrick, D. Gaitonde, S. Ahmad, B. Gaide, and Y. Arbel, "Network-on-chip programmable platform in Versal ACAP architecture", in *Proceedings of the 2019 ACM/SIGDA International Symposium on Field-Programmable Gate Arrays*, ACM, 2019, pp. 212–221.
- [25] N. Sultana, S. Galea, D. Greaves, M. Wójcik, J. Shipton, R. Clegg, L. Mai, P. Bressana, R. Soulé, R. Mortier, *et al.*, "Emu: Rapid prototyping of networking services", in *2017 USENIX Annual Technical Conference (USENIX ATC 17)*, 2017, pp. 459–471.