# LEAPS: Topological-<u>L</u>ayout-Adaptable Multi-Die FPG<u>A</u> Placement for <u>Super Long Line Minimization</u>

Zhixiong Di, Member, IEEE, Runzhe Tao, Jing Mai, Lin Chen, Yibo Lin, Member, IEEE

Abstract—Multi-die FPGAs are crucial components in modern computing systems, particularly for high-performance applications such as artificial intelligence and data centers. Super long lines (SLLs) provide interconnections between super logic regions (SLRs) for a multi-die FPGA on a silicon interposer. Because SLLs have significantly higher delay than regular interconnects, their number needs to be minimized. As design complexity increases, the growing number of SLLs makes timing and power closure more challenging. Existing placement algorithms focus on optimizing the number of SLLs but are often limited to specific SLR topologies. Furthermore, they fall short of achieving continuous optimization of SLLs throughout the entire placement process. This highlights the necessity for more advanced and adaptable solutions.

In this paper, we propose LEAPS, a comprehensive, systematic, and adaptable multi-die FPGA placement algorithm for SLL minimization. Our contributions are threefold: 1) proposing a high-performance global placement algorithm for multi-die FPGAs that optimizes the number of SLLs while addressing other essential design constraints such as wirelength, routability, and clock routing; 2) introducing a versatile method for more complex SLR topologies of multi-die FPGAs, surpassing the limitations of existing approaches; and 3) executing continuous optimization of SLL counts across the whole placement stages, including global placement (GP), legalization (LG), and detailed placement (DP). Experimental results demonstrate the effectiveness of LEAPS in reducing SLLs and enhancing circuit performance. Compared with the most recent state-of-the-art (SOTA) method, LEAPS achieves an average reduction of 43.08% in SLL counts and 9.99% in HPWL while exhibiting a notable  $34.34 \times$  improvement in runtime.

*Index Terms*—Multi-die FPGA, super long line (SLL), placement, nonlinear optimization, GPU acceleration

#### I. INTRODUCTION

MULTI-DIE FPGAs are essential for modern computing systems, especially for high-performance applications such as artificial intelligence and data centers. A multi-die FPGA comprises several SLRs on a silicon interposer, interconnected by SLLs that facilitate communication between these regions, as depicted in Fig. 1(a). In the multi-die FPGA

This work was supported by National Natural Science Foundation of China (62374138, 62034007). Corresponding author: Zhixiong Di (dizhixiong2@126.com).

Zhixiong Di, Runzhe Tao, and Lin Chen are with the School of Information Science and Technology, Southwest Jiaotong University, Chengdu, China. (email:dizhixiong2@126.com, 825140517@qq.com, mix\_lc@qq.com).

Jing Mai is with the School of Computer Science and the School of Integrated Circuits, Peking University, Beijing, China. (email: jing-mai@pku.edu.cn)

Yibo Lin is with the School of Integrated Circuits, Peking University, Beijing, China, Institute of Electronic Design Automation, Peking University, Wuxi, China, and Beijing Advanced Innovation Center for Integrated Circuits, Beijing, China. (email: yibolin@pku.edu.cn)



Fig. 1: (a) Architectural illustration of Xilinx multi-die FPGA Alveo U250: Demonstrating a  $1 \times 4$  SLR topology with central I/O banks and DDR controller IPs, and a right-side Vitis platform for CPU communication. (b) Detailed view of SLR architecture: Partitioned into  $2 \times 3$  clock regions and further segmented into multiple half columns. (c) Schematic of a CLB slice: Distinguishing between SLICEL and SLICEM types to highlight asymmetric compatibility.

design flow, cells within each SLR are interconnected by routing resources (i.e., regular interconnects), while SLLs enable interconnections between SLRs. However, SLLs exhibit substantially greater latency than regular interconnects and can therefore severely degrade timing performance. As design complexity grows, the number of SLLs increases, which degrades performance and raises power consumption. Therefore, minimizing the SLL count is a crucial and challenging task in multi-die FPGA placement.

Existing works [1]–[5], [24], [25] have endeavored to optimize the number of SLLs during partitioning before placement. The works in [1] and [2] address the SLL issue by applying distinct optimization techniques to the pin assignment problem. Specifically, the former utilizes integer linear programming (ILP), while the latter combines a clustering approach with minimum cost flow (MCF) optimization. Moreover, modular placement approaches have been investigated in [3] and [4], where optimal SLL resource utilization is achieved by mapping partitioned modules to appropriate dies. However, these approaches cannot simultaneously consider

various physical constraints like clock routing, as they are applied at a separate partitioning stage before placement. A recent state-of-the-art (SOTA) approach [5] proposes an analytical placement method for multi-die FPGAs, which optimizes both the number of SLLs and critical clock routing constraints based on a 3D Poisson density formulation with proximal alternating direction method of multipliers (ADMM) as the solver. However, this method is only applicable to a specific multi-die FPGA architecture (e.g., four dies arranged vertically on an interposer) and cannot accommodate more complex topologies.

Additionally, these established placement methods typically concentrate on SLL optimization during the global placement (GP) stage, often neglecting the need to address SLLs in the subsequent legalization (LG) and detailed placement (DP) stages. This oversight may inadvertently degrade circuit quality. As with other placement metrics such as wirelength and routability [7]–[10], [22], SLL minimization should be an ongoing and holistic process throughout the entire placement. Further, during LG and DP, the movement of placeable instances can significantly affect placement metrics [11]–[15]. This movement may increase the number of SLLs, so active measures are needed to maintain circuit performance.

Accordingly, our proposed method aims to address three major challenges: 1) simultaneous optimization of various design constraints and objectives, such as wirelength, routability, and clock routing; 2) capability of adapting to more complex topologies; and 3) holistic optimization of the number of SLLs throughout all placement stages. To tackle the above challenges, this paper presents LEAPS, a comprehensive, systematic, and adaptable multi-die FPGA placement algorithm for SLL minimization. Our contributions can be summarized as follows.

- We propose a high-performance nested optimization hierarchy for global placement of multi-die FPGAs, which aims to reduce wirelength and the number of SLLs while satisfying routability and clock routing constraints.
- We introduce an adaptive wirelength-weighting-factor adjusting technique, primarily aimed at balancing the trade-off between HPWL and SLL counts. This approach is pivotal in achieving a more finely tuned placement solution, addressing the wirelength handling challenges in multi-die FPGA design.
- We design a flexible method that adapts to multi-die FPGAs with arbitrary SLR topologies. It converts SLR indexes into vector representations of instances' coordinates and uses a soft floor technique, thus enabling a seamless transition from global to local optimization.
- We propose a simple but effective optimization technique for SLL minimization at the LG and DP stages.

In summary, this paper presents a novel approach to address the challenges in multi-die FPGA placement, specifically for SLL minimization, while maintaining a focus on other essential design constraints. Our proposed LEAPS framework demonstrates adaptability to complex topologies and ensures continuous optimization throughout the entire placement process. The experimental results show that our method greatly outperforms the SOTA algorithm [5]. It achieves significant reductions of 43.08% and 9.99% in SLL and HPWL, respectively, while exhibiting a substantial improvement of  $34.34 \times$  in runtime.

The rest of the paper is structured as follows. Section II provides essential background information and formulates the problem of multi-die FPGA placement addressed by the proposed framework. Section III presents a comprehensive overview of the LEAPS framework, highlighting its key features and technical innovations. Section IV delves into the technical details of the core placement algorithms employed in LEAPS. Section V presents the experimental results, which validate the efficacy and superiority of our approach. Finally, Section VI concludes the paper by summarizing the key contributions and highlighting avenues for future research.

## **II. PRELIMINARIES**

In this section, we provide the background and concepts related to the multi-die FPGA placement problem addressed in this paper. First, we introduce the multi-die FPGA architecture and its various topologies, as well as the calculation of the number of SLLs and clocking constraints within SLRs. Then, we discuss the multi-electrostatic approach used to optimize the placement, emphasizing the advancement of the underlying methods on which our framework depends. Finally, we formally state the problem of multi-die FPGA placement, highlighting the key objectives and constraints to be considered in the proposed placement algorithm.

#### A. Multi-Die FPGA Architecture

The multi-die FPGA architecture utilizes stacking technology to interconnect multiple FPGA cores, known as SLRs, via SLLs on an interposer, as depicted in Fig. 1(a). It is worth noting that SLLs have significantly higher delay compared to regular interconnects, which can greatly impact the design's timing performance and circuit quality. Fig. 1(b) illustrates that each SLR contains multiple distinct clock regions. This arrangement facilitates more flexible and efficient clock signal management and routing.

Additionally, each SLR comprises millions of logic gates, including heterogeneous blocks such as look-up tables (LUTs), flip-flops (FFs), digital signal processors (DSPs), random access memories (RAMs), and other intellectual property (IP) blocks. LUTs and FFs are ultimately clustered in configurable logic blocks (CLBs) for placement. Fig. 1(c) illustrates the representation of CLBs, which are classified into two types: SLICEL and SLICEM, showing asymmetric compatibility. SLICEL allows LUT blocks to be configured as LUTs, while SLICEM can be configured in one of the following modes: LUT, distributed RAMs, and SHIFTs within a CLB.

An important aspect of multi-die FPGAs is the arrangement of SLRs, which we refer to as SLR topology, with examples including $1 \times 4$ (shown in Fig. 1(a)) and $2 \times 2$ (shown in Fig. 2) configurations. An $m \times n$ SLR topology implies an arrangement of $m$ columns and $n$ rows of SLRs. We present two industrial examples, the Xilinx Alveo U250 FPGA and the Xilinx Alveo U280 FPGA, to illustrate different SLR topologies and some basic configurations.

- The Xilinx Alveo U250 FPGA features a  $1 \times 4$  SLR topology, with I/O banks and DDR controller IPs located in the middle column, and a Vitis platform region on the right side for communication with the host CPU.
- The Xilinx Alveo U280 FPGA, which integrates High-Bandwidth Memory (HBM), has a $2 \times 2$ SLR topology, with I/O banks in the middle columns and a gap region devoid of programmable logic in the center of the chip.

Two representative multi-die FPGA architectures are presented above, highlighting the diversity of SLR topologies. This diversity emphasizes the need for a placement algorithm that can handle different topologies and constraints while ensuring flexibility and efficiency. However, due to the limitations in academic datasets, our framework design and testing focus on the *Xilinx UltraScale* architecture.

#### B. SLL Calculation Method

Calculating the number of SLLs is crucial for the effective placement optimization of multi-die FPGAs. The number of SLLs is determined by the number of times a net has to cross between different SLRs.

Given a hypergraph-based placement result, we define the set of placeable instances as $V = \{v_1, v_2, ..., v_n\}$, and the set of nets as $E = \{e_1, e_2, ..., e_n\}$. For the multi-die FPGA placement problem, we define the coordinates of instance $v_i$ as $(x_i, y_i, z_i)$, where $x_i$ and $y_i$ represent the physical location of the instance on the layout, and $z_i$ denotes the index of the SLR in which the instance is located. With the above definitions, we can calculate the total number of SLLs as follows:

$$S_{SLL} = \sum_{e \in E} f(\{z_i | v_i \in e\}, \mathcal{T}_{SLR}).$$
(1)

Here, $\{z_i | v_i \in e\}$ denotes the SLR index set of the instances associated with net $e$, and $\mathcal{T}_{SLR}$ denotes the SLR topology. The function $f(\cdot)$ denotes the mapping between the specified index set and the SLR topology. For the $1 \times 4$ SLR topology in the previous discussion, this mapping can be found in the existing work [5]. For complex SLR topologies like $2 \times 2$ or $3 \times 3$, we use a minimum spanning tree (MST) for mapping, which is efficient due to the typically small size of these SLR topologies (with fewer than 5 rows and columns). A mapping table in our function further improves computational speed. Notably, while our method can adapt to any SLR topology, choosing the best one involves balancing the benefits of multi-die architectures against practical constraints. These include clock region division, timing closure, and manufacturing factors related to cost and limitations.
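As a rough illustration of Eq. (1) on a grid-like topology, the sketch below counts the SLLs of a single net as the weight of an MST over the distinct SLRs that the net touches. The edge weight (the Manhattan distance between SLR grid indices, i.e., the number of die boundaries crossed) and the names `net_sll_count` and `total_sll` are assumptions made for illustration; they are not the exact mapping $f(\cdot)$ or the mapping table used in LEAPS.

```python
def net_sll_count(slr_indices):
    """MST weight over the distinct SLRs touched by one net (Prim's algorithm).

    slr_indices: iterable of (col, row) SLR grid indices, one per pin.
    Assumption: the cost between two SLRs is their Manhattan grid distance.
    """
    nodes = sorted(set(slr_indices))
    if len(nodes) <= 1:
        return 0  # the net stays inside one SLR, so no SLL is needed
    in_tree = {nodes[0]}
    total = 0
    while len(in_tree) < len(nodes):
        # Pick the cheapest edge that leaves the current tree.
        cost, nxt = min(
            (abs(a[0] - b[0]) + abs(a[1] - b[1]), b)
            for a in in_tree
            for b in nodes
            if b not in in_tree
        )
        total += cost
        in_tree.add(nxt)
    return total


def total_sll(nets):
    """Sum the per-net counts, mirroring the summation in Eq. (1)."""
    return sum(net_sll_count(net) for net in nets)
```

Under this assumed weight, a net confined to one column of a $1 \times 4$ topology costs the difference between its highest and lowest SLR index, so the sketch degenerates to a simple span in that case.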

To determine the SLR index  $z_i$  for each instance, we use the Manhattan metric considering both  $x_i$  and  $y_i$ . We define the distance thresholds  $\delta_x$  and  $\delta_y$  as the width and height of each SLR, respectively. The width and height of an SLR are computed by dividing the total width and height of the



Fig. 2: Schematic example of a multi-die FPGA featuring a  $2 \times 2$  SLR topology with an illustrative SLL calculation for a 3-pin net *n*.

FPGA's layout by the number of columns and rows in the SLR topology. In our method, the reference point  $(x_{ref}, y_{ref})$  is set to the bottom-left corner of the layout (0, 0). With these definitions, the SLR index  $z_i$  can be calculated as follows:

$$\boldsymbol{z}_{i} = \lfloor \frac{|x_{i} - x_{ref}|}{\delta_{x}} \rfloor \cdot \hat{\mathbf{x}} + \lfloor \frac{|y_{i} - y_{ref}|}{\delta_{y}} \rfloor \cdot \hat{\mathbf{y}}, \qquad (2)$$

where  $\hat{\mathbf{x}}$  and  $\hat{\mathbf{y}}$  represent the unit vectors in the x and y dimensions, respectively. Note that the SLR index  $\mathbf{z}_i$  is represented as a two-dimensional vector. This vector representation will be consistently used in subsequent sections for clarity and uniformity.
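A minimal sketch of Eq. (2) follows, assuming a rectangular layout that is split evenly into `n_cols` by `n_rows` SLRs. The function name `slr_index`, its parameters, and the clamping of points that lie exactly on the far boundary are illustrative assumptions rather than details taken from the paper.

```python
import math


def slr_index(x, y, layout_w, layout_h, n_cols, n_rows, x_ref=0.0, y_ref=0.0):
    """Return the SLR index of a point as a (col, row) pair, following Eq. (2)."""
    delta_x = layout_w / n_cols  # SLR width
    delta_y = layout_h / n_rows  # SLR height
    col = math.floor(abs(x - x_ref) / delta_x)
    row = math.floor(abs(y - y_ref) / delta_y)
    # Assumption: a point on the far layout boundary belongs to the last SLR.
    return (min(col, n_cols - 1), min(row, n_rows - 1))
```

For example, on a hypothetical 100-by-200 layout with a $2 \times 2$ topology, the point (60, 30) falls in the SLR with index (1, 0).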

The above method for computing SLLs is an improved version of the method in [5]. It is proposed to accommodate SLR topologies with multiple rows and columns, such as the $2 \times 2$ SLR topology. The motivation is that precise quantification of the number of SLLs is indispensable for efficient optimization: it allows the placer to adjust the arrangement of logic instances and thereby minimize the total number of SLLs.

#### C. Clocking Constraints in SLRs

Clocking constraints are crucial for both performance optimization and timing closure in a multi-die FPGA design flow. The target device has rectangular-shaped clock regions (CRs) arranged in a  $5 \times 8$  grid, each consisting of columns of site resources. The CRs can be further subdivided horizontally into upper and lower half-columns (HCs), with a maximum of 12 clock nets per HC and a maximum of 24 clock nets per CR. These constraints are referred to as the half-column constraint and the clock region constraint, respectively.

To mathematically model these constraints, we first define $C_k^z$ as the set of blocks connected to clock k on SLR z. We then denote the x-coordinates of the right and left boundaries of clock region o on SLR z by $r_o^z$ and $l_o^z$, and the y-coordinates of its top and bottom boundaries by $u_o^z$ and $d_o^z$, respectively. Accordingly, we can calculate the horizontal and vertical clocking resource usages for clock k in clock region o on SLR z as:

$$H(k, o, z) = \min \{ \max \{ x_i \mid i \in C_k^z \}, r_o^z \} - \max \{ \min \{ x_i \mid i \in C_k^z \}, l_o^z \},$$

$$V(k, o, z) = \min \{ \max \{ y_i \mid i \in C_k^z \}, u_o^z \} - \max \{ \min \{ y_i \mid i \in C_k^z \}, d_o^z \}.$$
(3)

The total clock usage P(k, o, z) for clock k in clock region o on SLR z can be computed as:

$$P(k, o, z) = \begin{cases} 1, & \text{if } H(k, o, z) > 0 \text{ and } V(k, o, z) > 0; \\ 0, & \text{otherwise.} \end{cases}$$
(4)

By assuming that clock region o on SLR z is covered by at most  $M_{o,z}$  clock net bounding boxes, we can define the clocking constraints of a multi-die FPGA as:

$$\sum_{k} P(k, o, z) \le M_{o, z}, \quad \forall o, z.$$
(5)

With the above definitions, we model the clocking constraints in multi-die FPGA architectures. The model captures the clocking resources available in each clock region and the limits on their use, so that a placement algorithm can check and enforce clock feasibility for every clock region on every SLR.
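The check in Eqs. (3)-(5) can be sketched for a single SLR as below. The data layout (dictionaries keyed by clock and region names, regions given as `(left, right, bottom, top)` tuples) and the function names are hypothetical choices for illustration only.

```python
def clock_region_usage(xs, ys, region):
    """P(k, o, z) of Eq. (4): 1 if H > 0 and V > 0 as defined in Eq. (3), else 0."""
    left, right, bottom, top = region
    h = min(max(xs), right) - max(min(xs), left)   # H(k, o, z)
    v = min(max(ys), top) - max(min(ys), bottom)   # V(k, o, z)
    return 1 if h > 0 and v > 0 else 0


def violated_regions(clock_pins, regions, capacity):
    """Return the ids of clock regions that break Eq. (5) on one SLR.

    clock_pins: {clock: [(x, y), ...]} instance locations per clock net.
    regions:    {region_id: (left, right, bottom, top)}.
    capacity:   {region_id: M_o}, the allowed number of clock nets.
    """
    violations = []
    for rid, region in regions.items():
        used = sum(
            clock_region_usage([p[0] for p in pins], [p[1] for p in pins], region)
            for pins in clock_pins.values()
            if pins
        )
        if used > capacity[rid]:
            violations.append(rid)
    return violations
```

Note that, exactly as in Eq. (4), a clock whose instances share a single x or y coordinate yields a zero-width bounding box and is not counted.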

#### D. Multi-Electrostatic FPGA Placement

State-of-the-art placement algorithms [16]–[20], grounded in electrostatics, conceptualize each instance as a positive charge within an electrostatic system. This approach was originally introduced in ASIC placement to mitigate the density overflow problem, and leverages the physical principle that a balanced charge distribution leads to low potential energy in an electrostatic system. The work in [16] expanded this method to include multiple electrostatic fields, thereby facilitating the management of diverse resource types in FPGA placement, such as LUTs, FFs, DSPs, and BRAMs. Building upon these advancements, recent work [17], [18] has further refined the multi-electrostatic approach by incorporating considerations of SLICEL-SLICEM heterogeneity and multiple constraints, including timing, clock routing, and carry chain alignment. This algorithm improves both the quality and efficiency of FPGA placement over its predecessors. It seeks to minimize the total potential energy of multiple fields, effectively reducing density overflow. Given this capacity for managing resource distribution across multiple dies and their respective clocking domains, the multi-electrostatic approach is especially well-suited to multi-die FPGA placement. The primary objective of this approach is to optimize placement by achieving a balanced resource distribution in the layout. This problem can be mathematically formulated as follows:

$$\min_{\boldsymbol{x},\boldsymbol{y}} \widetilde{W}(\boldsymbol{x},\boldsymbol{y}) \quad \text{ s.t. } \Phi_s(\boldsymbol{x},\boldsymbol{y}) = 0, \forall s \in S$$
 (6)

Here, $\boldsymbol{x}, \boldsymbol{y}$ represent the instances' locations, $\widetilde{W}(\cdot)$ denotes the wirelength objective, S is the field type set, and $\Phi_s(\cdot)$ signifies the electric potential energy of the field for field type $s \in S$. We formally constrain the target energy $\Phi_s(\boldsymbol{x}, \boldsymbol{y})$ to 0, as the energy is typically nonnegative. The constraints can be relaxed into the objective and solved using the gradient descent method. In practice, optimization stops when the energy reaches a sufficiently low level, or equivalently when the density overflow reaches an acceptable threshold.

#### E. Problem Statement for Multi-Die FPGA Placement

In this paper, we aim to solve the problem of multi-die FPGA placement. The optimization objective is to minimize the half-perimeter wirelength (HPWL) and the number of SLLs crossing SLRs while satisfying various architectural constraints. Formally, the problem can be formulated as follows:

**Problem 1** (Multi-Die FPGA Placement). Given a circuit netlist $\mathcal{N}$, a placement region $\mathcal{R}$, and architecture constraints $\mathcal{C}$, determine the optimal legal position $(x_i, y_i)$ of each logic block i on an SLR to minimize the HPWL $W_{\mathcal{H}}$ and the number of SLLs $W_{\mathcal{S}}$ with a weighting factor $\psi$, such that all architecture constraints in $\mathcal{C}$ are satisfied. Mathematically, the problem can be formulated as:

$$\begin{aligned}
\min_{\boldsymbol{x},\boldsymbol{y}} \quad & W_{\mathcal{H}}(\boldsymbol{x},\boldsymbol{y}) + \psi W_{\mathcal{S}}(\boldsymbol{x},\boldsymbol{y}) \\
\text{s.t.} \quad & \boldsymbol{x}, \boldsymbol{y} \in \mathcal{R}, \\
& \nexists\, i, j \in \mathcal{N} \text{ with } i \neq j \text{ such that } \mathcal{O}verlap(i, j) > 0, \\
& \text{all architecture constraints in } \mathcal{C} \text{ are satisfied.}
\end{aligned} \tag{7}$$
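To make the objective of Eq. (7) concrete, the following sketch evaluates HPWL with its standard bounding-box definition and adds the SLL count scaled by $\psi$. The function names, the dictionary-based netlist, and the idea of passing in a precomputed SLL count are assumptions for illustration; legality and architecture constraints are not checked here.

```python
def hpwl(nets, pos):
    """Half-perimeter wirelength: sum of bounding-box half-perimeters over all nets.

    nets: list of nets, each a list of instance names.
    pos:  {instance name: (x, y)}.
    """
    total = 0.0
    for net in nets:
        xs = [pos[v][0] for v in net]
        ys = [pos[v][1] for v in net]
        total += (max(xs) - min(xs)) + (max(ys) - min(ys))
    return total


def placement_cost(nets, pos, sll_count, psi):
    """Objective of Eq. (7): W_H + psi * W_S, with W_S supplied by the caller."""
    return hpwl(nets, pos) + psi * sll_count
```

A larger `psi` makes each SLL more expensive relative to a unit of wirelength, which is the trade-off the weighting factor is meant to control.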

#### **III. LEAPS OVERVIEW**

Building upon the foundational contributions of [17], [18], we introduce the proposed LEAPS for multi-die FPGA placement, consisting of three main stages: global placement, legalization, and detailed placement. A summary of the core techniques employed in the proposed LEAPS is provided in Fig. 3. Furthermore, to ensure a coherent explanation of key terminologies used throughout this paper, a detailed glossary is presented in TABLE I.

The proposed LEAPS considers a set of field types denoted as  $S = \{LUTL, LUTM-AL, FF, CARRY, DSP, BRAM\}$ . Notably, the LUTL and LUTM-AL field types were introduced in [17]. The LUTL field type represents the LUT resources provided by both SLICEM and SLICEL, while the LUTM-AL field type models the additional logic resources offered solely by SLICEM and not by SLICEL.

#### A. Global Placement

Global placement acts as the backbone of the entire placement algorithm, harmonizing multiple design objectives while satisfying complex constraints.

| Term | Definition | Motivations |
|------|------------|-------------|
| Augmented Lagrangian Method (ALM) | Transforms constrained problems into unconstrained ones by adding equality constraints and a quadratic penalty term. | Simplifies Eq. (8)'s complex constrained problem into Eq. (9)'s unconstrained problem. |
| Clock Penalty Multiplier $\eta$ | Dynamically penalizes clock routing violations in the optimization process. | Ensures clock routing constraints are considered with other design objectives. |
| Density Penalty $\mathcal{D}_s$ | Penalizes denser areas in the FPGA placement to even out logic element distribution. | Prevents hotspots and ensures routability in FPGA placement. |
| Wirelength Weighting Factor $\psi$ | Balances the minimization of HPWL and the reduction of SLL counts. | Manages HPWL and SLL trade-offs in multi-die FPGA design. |
| Instance-to-Clock-Region Mapping Generation | Assigns instances to clock regions to minimize SLLs. | Optimizes clock network efficiency and FPGA performance. |

TABLE I: Glossary Table of Key Terminology in the proposed LEAPS



Fig. 3: Core Techniques in the proposed LEAPS: (1) Nested Optimization Hierarchy: Enhances [17], [18] for multiobjective optimization, focusing on SLL minimization. See Section III-A3. (2) Soft Floor Method: Transforms discrete SLR coordinates into continuous models, optimizing wirelength and SLR constraints. Refer to Section IV-A3. (3) Wirelength-weighting Optimization: Dynamically adjusts HPWL and SLL trade-offs for improved FPGA placement. Details in Section IV-C. (4) SLL-aware Legalization: Adapts [21] to prioritize SLL reduction with concurrent clock constraint management. Further information in Section IV-E. (5) SLL-aware Detailed Placement: Builds on [8], focusing on SLL minimization, clock-awareness, and wirelength optimization. See Section IV-E.

1) Problem Definition: Considering wirelength minimization objective, clock constraints, and carry chain alignment feasibility, we present the multi-die global placement problem as a constrained minimization problem:

$$\begin{aligned}
\min_{\boldsymbol{x},\boldsymbol{y}} \quad & \widetilde{W}_{\psi}(\boldsymbol{x},\boldsymbol{y}), && \text{(8a)} \\
\text{s.t.} \quad & \Phi_s(\boldsymbol{x}, \boldsymbol{y}; \mathcal{A}^s) = 0, \quad \forall s \in S, && \text{(8b)} \\
& \Gamma(\boldsymbol{x}, \boldsymbol{y}) = 0. && \text{(8c)}
\end{aligned}$$

Here, $\widetilde{W}_{\psi}(\boldsymbol{x}, \boldsymbol{y})$ denotes the total wirelength, which accounts for both the HPWL and SLL counts; $\mathcal{A}^s$ represents all the instance areas in field s; and $\Gamma(\cdot)$ signifies the clock penalty term. For brevity, we simplify $\Phi_s(\boldsymbol{x}, \boldsymbol{y}; \mathcal{A}^s)$ to $\Phi_s$ for all $s \in S$. The potential energy vector, with elements $\Phi_s \ (\forall s \in S)$, is denoted by $\Phi$ in subsequent discussions.

2) Problem Reformulation with ALM: To facilitate solving the original problem (8), we employ the *augmented Lagrangian method (ALM)* [27] to formulate an unconstrained subproblem:

$$\begin{aligned}
\min_{\boldsymbol{x},\boldsymbol{y}} \quad \mathcal{L}(\boldsymbol{x},\boldsymbol{y};\boldsymbol{\lambda},\psi,\boldsymbol{\mathcal{A}},\eta) &= \widetilde{W}_{\psi}(\boldsymbol{x},\boldsymbol{y}) + \sum_{s \in S} \lambda_{s} \mathcal{D}_{s} + \eta \Gamma(\boldsymbol{x},\boldsymbol{y}), && \text{(9a)} \\
\mathcal{D}_{s} &= \Phi_{s} + \frac{1}{2} \mathcal{W}_{s} \Phi_{s}^{2}, \quad \forall s \in S. && \text{(9b)}
\end{aligned}$$

Here, the density multiplier vector is  $\lambda \in \mathbb{R}^{|S|}$ , and the clock penalty multiplier is  $\eta \in \mathbb{R}$ . The density-weighting coefficient vector  $\mathcal{W} \in \mathbb{R}^{|S|}$  is employed to balance the first-order and second-order terms for density penalty. We adopt the setup for  $\lambda$  and  $\mathcal{W}$  from [16].
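The structure of Eq. (9) can be checked numerically with a small sketch such as the one below. It only evaluates the scalar value of $\mathcal{L}$ from already-computed terms; the function names and the dictionary layout keyed by field type are assumptions, and the computation of $\widetilde{W}_{\psi}$, $\Phi_s$, and $\Gamma$ themselves is outside its scope.

```python
def density_penalty(phi, w):
    """D_s of Eq. (9b): first-order term plus a weighted quadratic term."""
    return phi + 0.5 * w * phi ** 2


def augmented_lagrangian(wirelength, phi, lam, w, eta, gamma):
    """Value of Eq. (9a) for given wirelength, energies, and multipliers.

    phi, lam, w: {field type: value} for Phi_s, lambda_s, and W_s.
    eta, gamma:  clock penalty multiplier and clock penalty term.
    """
    penalty = sum(lam[s] * density_penalty(phi[s], w[s]) for s in phi)
    return wirelength + penalty + eta * gamma
```

Because each $\Phi_s$ is nonnegative, raising $\lambda_s$ or $\eta$ can only increase the weight of the corresponding constraint in the objective, which is what the outer optimizers exploit.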

3) Nested Optimization Hierarchy: To handle multiple constraints, we solve the problem (9) using the ALM in a nested manner:

$$\begin{aligned}
\text{Clock Opt.:} \quad & \mathcal{L}_1 = \max_{\eta} \mathcal{L}_2(\eta), && \text{(10a)} \\
\text{Routability Opt.:} \quad & \mathcal{L}_2(\eta) = \max_{\mathcal{A}} \mathcal{L}_3(\mathcal{A}, \eta), && \text{(10b)} \\
\text{WLW Opt.:} \quad & \mathcal{L}_3(\mathcal{A}, \eta) = \max_{\psi} \mathcal{L}_4(\psi, \mathcal{A}, \eta), && \text{(10c)} \\
\text{WL Opt.:} \quad & \mathcal{L}_4(\psi, \mathcal{A}, \eta) = \max_{\boldsymbol{\lambda}} \mathcal{L}_5(\boldsymbol{\lambda}, \psi, \mathcal{A}, \eta), && \text{(10d)} \\
\text{Subproblem:} \quad & \mathcal{L}_5(\boldsymbol{\lambda}, \psi, \mathcal{A}, \eta) = \min_{\boldsymbol{x}, \boldsymbol{y}} \mathcal{L}(\boldsymbol{x}, \boldsymbol{y}; \boldsymbol{\lambda}, \psi, \mathcal{A}, \eta). && \text{(10e)}
\end{aligned}$$

Here,  $\mathcal{L}_5$  denotes Eq. (9); "Opt." stands for "Optimization", while "WL" and "WLW" are abbreviations for "Wirelength" and "Wirelength-weighting", respectively.

Within this nested structure, each term, ranging from  $\mathcal{L}_1$  to  $\mathcal{L}_5$ , addresses a unique aspect of the placement challenge. Each term passes its variables to the subsequent subproblem, considering them as fixed hyperparameters. This systematic approach is vividly portrayed in the global placement phase as shown in Fig. 4. We highlight two vital aspects:

a) Effective Range for SLL Optimization: The SLL minimization is integrated into the optimization objective only when the density overflow lies between 0.15 and 0.9. Within these bounds,  $\mathcal{L}_4$  becomes operative. When the density overflow drops below 0.15, the algorithm adjusts the instance area to mitigate routing congestion.
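The overflow-based gating described above can be summarized by a small helper like the one below. The function name `active_steps`, the returned labels, and the treatment of the two bounds as inclusive are assumptions for illustration; the text only states the 0.15 and 0.9 thresholds.

```python
def active_steps(overflow, sll_lo=0.15, sll_hi=0.9):
    """List which parts of the global placement are active at a given density overflow."""
    steps = ["wirelength"]  # the wirelength objective is always part of the subproblem
    if sll_lo <= overflow <= sll_hi:
        steps.append("sll")  # the SLL term joins the objective only inside this range
    if overflow < sll_lo:
        steps.append("area_adjustment")  # instance areas are adjusted against congestion
    return steps
```

For instance, `active_steps(0.5)` includes the SLL term, while `active_steps(0.95)` and `active_steps(0.1)` do not.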



Fig. 4: Overview of the proposed LEAPS framework. The framework continuously optimizes the number of SLLs while handling other design objectives during the global placement, legalization, and detailed placement stages. The global placement employs a nested optimization technique to progressively converge and optimize each design objective. Subsequent legalization and detailed placement consider SLL minimization and clock routing constraints while refining the initial placement. Note that "Instance Area Adjustment" and "Carry Chain Alignment Correction" are referenced in [16] and [17], respectively, and are not repeated in this paper. The rest of the contents are described in this work.

b) Distinct Roles of Each Optimizer: To illustrate the overall approach, we consider the example of optimizer  $\mathcal{L}_5$ , particularly in scenarios where density overflow exceeds 0.9. Upon reaching its optimal state,  $\mathcal{L}_5$  maintains the parameters  $\lambda$ ,  $\psi$ ,  $\mathcal{A}$ , and  $\eta$  as fixed values. Subsequently, the  $\mathcal{L}_4$  optimizer amplifies the density term by incrementing  $\lambda$ , progressively satisfying the density constraints. This approach is similarly employed for other optimizers. Each optimizer has a specific role:

- $\mathcal{L}_1$  analytically mitigates clock violations, with its termination criteria based on the fulfillment of clock constraints.
- $\mathcal{L}_2$  enhances routability via an area inflation-based technique, with its termination criteria determined by routing congestion estimation and pin density.
- $\mathcal{L}_3$  accounts for the growth of SLLs during the iterative process to balance the trade-off between the HPWL and SLL counts in the total wirelength objective.
- $\mathcal{L}_4$  tackles the core wirelength-driven placement problem, with its termination criteria determined by the density overflow of all instances.

In the end,  $\mathcal{L}_5$  is consistently solved with a pre-defined number of iterations, such as one iteration in our experiments.

#### B. Legalization & Detailed Placement

Legalization (LG) and detailed placement (DP) serve as the refining stages of the placement, fine-tuning the initial solution and ensuring compliance with design constraints and objectives. Given the multi-die FPGA architecture, both LG and DP need to achieve the aim of minimizing SLLs under various design constraints. To achieve this, we adapt our LG and DP to specifically target the reduction of SLLs while maintaining other design objectives.

Our LG approach, inspired by the Direct Legalize (DL) algorithm [21], employs a binary optimization strategy to map instances to specific clock regions. This method has been adapted and expanded from its original single-die application to multi-die FPGAs, incorporating a refined cost function that includes SLL optimization. This enhancement makes it better suited to address the unique challenges presented in multi-die scenarios, going beyond the traditional DL method by considering factors like HPWL, SLL counts, and packing metrics. This approach not only ensures a pronounced reduction in the number of SLLs but also maintains essential clock feasibility constraints.

Complementing this, our DP method builds on the multistage independent set matching (ISM) technique presented in [8], [11]. Like the legalization process, it emphasizes reducing the number of SLLs, and it adds clock-awareness to the ISM method, thereby refining placement for improved wirelength and routability. This refinement ensures the viability of the clock network and simultaneously optimizes SLL counts, a crucial aspect frequently neglected in conventional single-die FPGA placement strategies.

These approaches, building upon the foundational frameworks of [8], [11], [21], introduce novel enhancements specifically designed for the complexities of multi-die architectures. These advancements are essential to meet the evolving demands of modern FPGA design.

#### C. Comparative Features of LEAPS and Other Placers

TABLE II summarizes the characteristics of the published SOTA FPGA placers. They mainly resort to quadratic programming-based approaches [8], [9], [11]–[13], [23] and nonlinear optimization-based approaches [5], [16], [18], [20] for the best trade-off between quality and efficiency.

TABLE II: Features of the published state-of-the-art FPGA placers

| Placer             | UTPlaceF [8] | GPlace 3.0 [9] | elfPlace [16] | DREAMPlaceFPGA [19] | AMF-Placer [23] | UTPlaceF 2.0&2.X [11], [13] | ICCAD'17 [12] | ICCAD'19 [5] | OpenPARF [18] | Ours         |
|--------------------|--------------|----------------|---------------|---------------------|-----------------|-----------------------------|---------------|--------------|---------------|--------------|
| Clock Constraints  | ×            | ×              | ×             | ×                   | ×               | $\checkmark$                | $\checkmark$  | $\checkmark$ | $\checkmark$  | $\checkmark$ |
| Multi-die Support  | ×            | ×              | ×             | ×                   | ×               | ×                           | ×             | $\checkmark$ | ×             | $\checkmark$ |
| GPU-Acceleration   | ×            | ×              | $\checkmark$  | $\checkmark$        | ×               | ×                           | ×             | ×            | $\checkmark$  | $\checkmark$ |
| Algorithm Category | Quadratic    | Quadratic      | Nonlinear     | Nonlinear           | Quadratic       | Quadratic                   | Quadratic     | Nonlinear    | Nonlinear     | Nonlinear    |

To provide a comprehensive assessment, we categorize our evaluation based on three pivotal features:

- Handling Clock Constraints: An important issue in FPGA placement is to effectively address clock constraints to optimize performance. Among these SOTA placers, UTPlaceF 2.0&2.X [11], [13], ICCAD'17 [12], ICCAD'19 [5], OpenPARF [18], and the proposed LEAPS handle clock constraints.
- Supporting Multi-Die Architecture: The capability to facilitate designs across multiple dies is only present in ICCAD'19 [5] and the proposed LEAPS. This gives them an advantage in modern FPGA designs, where multi-die configurations are sought for enhanced performance and modularity.
- Leveraging GPU Acceleration: Speeding up the placement process is crucial in FPGA design. Among the placers, only elfPlace [16], DREAMPlaceFPGA [20], OpenPARF [18], and the proposed LEAPS capitalize on GPU acceleration, making them well suited to rapid design iterations.

In conclusion, LEAPS offers a feature set aligned with modern FPGA design demands, covering clock constraints, multi-die support, and GPU acceleration, which distinguishes it from the other placers in TABLE II.

## **IV. CORE PLACEMENT ALGORITHMS**

In this section, we will explicate the core algorithms of the proposed framework.

#### A. Wirelength Objective Handling

The wirelength is the most fundamental objective in placement algorithms. In traditional FPGA placement algorithms, wirelength is typically measured in the x and y dimensions. However, considering the multi-die FPGA architectures in this work, it is necessary to minimize the number of SLLs by incorporating the z dimension, which represents the SLR index of an instance and is determined by its x and y coordinates.

1) Wirelength Objective Formulation: The wirelength objective is formulated as:

$$\begin{aligned} W_{\psi}(\boldsymbol{x}, \boldsymbol{y}) &= W_{\mathcal{H}}(\boldsymbol{x}, \boldsymbol{y}) + \psi \cdot W_{\mathcal{S}}(\boldsymbol{x}, \boldsymbol{y}) \\ &= \sum_{e \in E} \left( \max_{i, j \in e} |x_i - x_j| + \max_{i, j \in e} |y_i - y_j| + \psi \max_{i, j \in e} \|\boldsymbol{z}_i - \boldsymbol{z}_j\|_1 \right) \end{aligned} \tag{11}$$

Here, x and y denote the locations of the instances in a layout, while z represents the SLR index of instances, determined by x and y. The term  $\|\cdot\|_1$  denotes the L1 norm, applicable here as the SLR indexes  $z_i$  and  $z_j$  are two-dimensional vectors. The weighting factor  $\psi$  adjusts the weighting of the SLL term  $W_{\mathcal{S}}(\cdot)$  in the wirelength objective function  $W_{\psi}(\cdot)$ .

2) Smooth and Differentiable Wirelength Model: To enable the utilization of gradient-based optimization methods, we adopt a smooth and differentiable wirelength model using the *weighted-average (WA) approach* [26] for the max term. Specifically, the wirelength model for the z-dimension is defined as:

$$\widetilde{W}_{e_{z}}(z) = \frac{\sum_{i \in e} \|\boldsymbol{z}_{i}\|_{1} \exp(\|\boldsymbol{z}_{i}\|_{1}/\gamma_{\mathcal{S}})}{\sum_{i \in e} \exp(\|\boldsymbol{z}_{i}\|_{1}/\gamma_{\mathcal{S}})} - \frac{\sum_{i \in e} \|\boldsymbol{z}_{i}\|_{1} \exp(-\|\boldsymbol{z}_{i}\|_{1}/\gamma_{\mathcal{S}})}{\sum_{i \in e} \exp(-\|\boldsymbol{z}_{i}\|_{1}/\gamma_{\mathcal{S}})},$$
(12)

Here,  $\gamma_S > 0$  is a parameter controlling the accuracy of the approximation. As  $\gamma_S$  increases, the approximation becomes more accurate, but the objective function becomes less smooth. In this work, we utilize the parameter  $\gamma_S$  to estimate the wirelength in the z-direction, while also introducing a parameter  $\gamma_H$  for the wirelength approximation in the x and y directions. Employing this smooth approximation allows the wirelength model to be differentiable, facilitating the use of gradient-based optimization methods.
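As an illustration of Eq. (12), the following minimal Python sketch evaluates the weighted-average estimate for a single net. The function name `wa_span` and the sample values are hypothetical and not part of LEAPS. The sketch divides by `gamma` exactly as Eq. (12) is written, and it subtracts the maximum (or minimum) before exponentiating, a standard stabilization that leaves each ratio unchanged.

```python
import math

def wa_span(values, gamma):
    """Weighted-average estimate of max(values) - min(values), following Eq. (12)."""
    hi, lo = max(values), min(values)
    # Shifting by hi / lo avoids overflow; the shift cancels in each ratio.
    pos = [math.exp((v - hi) / gamma) for v in values]
    neg = [math.exp(-(v - lo) / gamma) for v in values]
    soft_max = sum(v * w for v, w in zip(values, pos)) / sum(pos)
    soft_min = sum(v * w for v, w in zip(values, neg)) / sum(neg)
    return soft_max - soft_min

# Hypothetical ||z_i||_1 values for the instances of one net.
slr_norms = [0.0, 1.0, 3.0]
exact = max(slr_norms) - min(slr_norms)   # exact span of the net
tight = wa_span(slr_norms, gamma=0.05)    # sharp setting: close to the exact span
loose = wa_span(slr_norms, gamma=5.0)     # smooth setting: a looser estimate
```

Because both terms are weighted averages of the inputs, the estimate in this form never exceeds the exact span; the two settings above show the trade-off between accuracy and smoothness.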

By substituting the smooth approximations for the max function in x, y, and z directions into the original wirelength objective shown as Eq. (11), we obtain a smooth and differentiable wirelength objective:

$$\begin{aligned} \widetilde{W}_{\psi}(\boldsymbol{x}, \boldsymbol{y}) &= \widetilde{W}_{\mathcal{H}}(\boldsymbol{x}, \boldsymbol{y}) + \psi \cdot \widetilde{W}_{\mathcal{S}}(\boldsymbol{x}, \boldsymbol{y}) \\ &= \sum_{e \in E} \left( \widetilde{W}_{e_{x}}(x) + \widetilde{W}_{e_{y}}(y) + \psi \cdot \widetilde{W}_{e_{z}}(z) \right). \end{aligned} \tag{13}$$

This wirelength formulation greatly expands the utility of gradient-based optimization methods within our placement algorithm. By minimizing the wirelength objective, the algorithm aims to achieve improved placement results in terms of wirelength while considering constraints related to SLLs.

3) Soft Floor Method for Discrete Coordinates z: The discrete nature of z presents challenges when optimizing the wirelength objective, as discrete variables can impede the convergence of the optimization algorithm. To overcome this issue, we aim to transform the discrete z into a continuous and smooth variable.

Specifically, we propose a soft floor method that enables the smoothing and continuous representation of z. This approach utilizes a sigmoid-like function, defined as follows:

$$\sigma(x) = \frac{1}{1 + exp(-\gamma_{\mathcal{S}} \cdot x)} \tag{14}$$

Fig. 5: Visualization of the soft floor method applied to a multi-die FPGA with a  $2 \times 2$  SLR topology, demonstrating variations with different  $\gamma_S$  values.

Here,  $exp(\cdot)$  denotes the exponential function, while  $\gamma_S$  is an adaptive parameter used in the wirelength objective for the z-dimension, as illustrated in Eq. (12). By employing the sigmoid function  $\sigma(\cdot)$ , a continuous and smooth transformation of  $z_i$  can be formulated as:

$$\begin{aligned} z_{i} &= z_{i}^{x} \cdot \hat{\mathbf{x}} + z_{i}^{y} \cdot \hat{\mathbf{y}} \\ &= \sum_{k=0}^{|z^{x}|-1} \sigma\left(\frac{x_{i}}{\sigma_{x}} - k\right) \cdot \hat{\mathbf{x}} + \sum_{k=0}^{|z^{y}|-1} \sigma\left(\frac{y_{i}}{\sigma_{y}} - k\right) \cdot \hat{\mathbf{y}}. \end{aligned} \tag{15}$$

Eq. (15) illustrates how the discrete vector z is transformed into a continuous, smoothly varying two-dimensional vector. This transformation is related to the normalized coordinates of the instances  $\left(\frac{x_i}{\sigma_x}, \frac{y_i}{\sigma_y}\right)$ , where  $\hat{\mathbf{x}}$  and  $\hat{\mathbf{y}}$  denote the unit vectors along the x and y axes, respectively.
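The transformation in Eqs. (14) and (15) can be sketched for a single axis as follows. This is a minimal illustration, not the LEAPS implementation: the names `sigmoid` and `soft_floor`, the SLR width of 100 units, and the sample coordinates are assumptions made for the example, and `num_slrs` stands in for $|z^{x}|$.

```python
import math

def sigmoid(t, gamma):
    """Eq. (14): 1 / (1 + exp(-gamma * t)), evaluated without overflow."""
    a = gamma * t
    if a >= 0:
        return 1.0 / (1.0 + math.exp(-a))
    e = math.exp(a)
    return e / (1.0 + e)

def soft_floor(coord, slr_size, num_slrs, gamma):
    """One axis of Eq. (15): sum of shifted sigmoids for k = 0 .. num_slrs - 1."""
    u = coord / slr_size  # normalized coordinate, e.g. x_i / sigma_x
    return sum(sigmoid(u - k, gamma) for k in range(num_slrs))

# Hypothetical axis with two SLRs, each 100 units wide.
z_left = soft_floor(50.0, 100.0, 2, gamma=20.0)    # middle of the first SLR
z_right = soft_floor(150.0, 100.0, 2, gamma=20.0)  # middle of the second SLR
z_smooth = soft_floor(100.0, 100.0, 2, gamma=1.0)  # on the boundary, small gamma
```

With the summation starting at $k = 0$ as written in Eq. (15), the value away from SLR boundaries sits about one above the integer SLR index; this constant cancels when two indexes are subtracted, so `z_right - z_left` is close to one for a sharp `gamma`.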

Next, we delve into this methodology in terms of two key questions:

a) Operational Principles of the Soft Floor Method: As the number of optimization iterations increases, the value of  $\gamma_S$  rises, sharpening the barrier between SLRs. Initially, when  $\gamma_S = 1$ , as illustrated in Fig. 5, instances move easily between dies for global optimization. However, with  $\gamma_S$  increasing to 20, traversing between dies becomes more challenging and costly, directing the optimization towards refining local solutions. Only instances located near the die edges are then likely to move between dies to obtain a better solution. By modulating the value of  $\gamma_S$ , the algorithm strikes a balance between global and local searches, leading to better solutions in fewer iterations.

b) The Advancement of Our Method Over the Lifting Dimension Technique by [5]: While our soft floor method draws inspiration from the lifting dimension technique, it offers distinct improvements:

- As shown in Fig. 6(a), the lifting dimension technique in [5] is specifically designed for the 1 × 4 SLR topology. It introduces an electric field dimension z with discrete SLR indexes, aiming to minimize wirelength in the z-direction using a 3D Poisson equation and an ADMM solver. However, this discrete approach leads to suboptimal results.
- In contrast, our soft floor method treats z as a continuous variable influenced by x and y coordinates, as depicted in Fig. 6(b). It can represent SLR indexes for any SLR topology. This continuous approach allows for smooth adjustments of instance coordinates, facilitating SLL minimization. Moreover, by employing the 2D Poisson equation, the placer reduces computational demands and enhances design integration.

Fig. 6: Comparison of the electric field modeling between the SOTA method [5] and the proposed LEAPS.  $Inst_{index}^{SLR}$  represents the SLR index of placeable instances. (a) The SOTA method utilizes the lifting dimension technique. (b) The proposed LEAPS utilizes a smooth and continuous function  $f_{smooth}$  (i.e., the soft floor method in Section IV-A3).

In essence, the soft floor method provides a more adaptive and efficient approach to multi-die FPGA placement, promising optimal results. By transforming z into a continuous variable, it leverages gradient-based optimization, ensuring a differentiable wirelength model and improved placement results.

#### B. Density Multiplier Updating for Multi-Die FPGA

In FPGA placement, the density multiplier  $\lambda$  is pivotal for wirelength optimization, guiding the spreading rate of various resource types. While the method for updating  $\lambda$  has been extensively discussed in [16], our work introduces modifications tailored for multi-die FPGA, particularly considering SLL counts.

Our method initializes the density multiplier  $\lambda^{(0)}$  as follows:

$$\boldsymbol{\lambda}^{(0)} = \eta \frac{\|\nabla \widetilde{W}_{\psi}\left(\boldsymbol{x}^{(0)}, \boldsymbol{y}^{(0)}\right)\|_{1}}{\sum_{i \in V} q_{i} \|\boldsymbol{\xi}_{i}^{(0)}\|_{1}} (1, 1, \cdots, 1)^{T}. \tag{16}$$

The formula  $\|\nabla \widetilde{W}_{\psi}(\cdot)\|_1 = \|\nabla \widetilde{W}_{\mathcal{H}}(\cdot) + \psi \cdot \nabla \widetilde{W}_{\mathcal{S}}(\cdot)\|_1$ , distinct from [16], incorporates SLL counts into the initialization process. The initial placement location  $(\boldsymbol{x}^{(0)}, \boldsymbol{y}^{(0)})$  and the initial electric field  $\boldsymbol{\xi}_i^{(0)}$  of each instance are considered. The weight parameter  $\eta$  and the L1 norm are calibrated to prioritize wirelength minimization in early iterations. We set  $\eta$  to  $10^{-4}$ , applying uniform spreading weights across all resource types.
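The initialization in Eq. (16) amounts to a ratio of two L1 norms. The sketch below is only illustrative: `init_density_multiplier` and its arguments are hypothetical names, the gradients and fields are passed in as plain lists rather than computed, and `charges` is assumed to hold the per-instance quantity $q_i$.

```python
def init_density_multiplier(grad_hpwl, grad_sll, psi, charges, fields,
                            num_types, eta=1e-4):
    """Eq. (16): eta * ||grad W_psi||_1 / sum_i q_i * ||xi_i||_1, per resource type."""
    # Combined gradient: grad W_H + psi * grad W_S, as stated in the text.
    combined = [gh + psi * gs for gh, gs in zip(grad_hpwl, grad_sll)]
    numerator = sum(abs(g) for g in combined)
    # Each field is an (x, y) pair; its L1 norm is |fx| + |fy|.
    denominator = sum(q * (abs(fx) + abs(fy))
                      for q, (fx, fy) in zip(charges, fields))
    value = eta * numerator / denominator
    return [value] * num_types  # uniform weight for every resource type

# Hypothetical toy numbers.
lam = init_density_multiplier(
    grad_hpwl=[1.0, -2.0], grad_sll=[0.5, 0.5], psi=2.0,
    charges=[1.0, 2.0], fields=[(1.0, -1.0), (0.5, 0.5)], num_types=3)
```

Here the combined gradient is `[2.0, -1.0]` with L1 norm 3, the denominator is 4, and every entry of `lam` equals `7.5e-5`.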

For the subsequent updating mechanism of  $\lambda$ , we largely follow the subgradient update technique described in [16]. This approach has been proven to enhance convergence efficiency and circuit quality. Detailed technical aspects of this method are available in the cited work.

In conclusion, while our density multiplier updating mechanism builds upon the foundation set by [16], it introduces critical modifications to cater to the unique challenges posed by multi-die FPGA architecture, ensuring optimal placement results.

#### C. Adaptive Wirelength-Weighting-Factor Adjusting

To further improve the performance in solving the overall wirelength minimization problem, we also adaptively update the wirelength-weighting factor  $\psi$  to balance the trade-off between HPWL minimization and SLL minimization. We apply an exponential moving average (EMA) and the Adam optimization algorithm, which has two main advantages: 1) improved convergence speed and 2) better trade-off between different objectives.

Let S(x, y) denote the number of SLLs. We first derive the growth of SLL counts, denoted by  $\delta_{S}^{(k+1)}$ , in the (k+1)-th iteration. The calculation is performed as follows:

$$\delta_{S}^{(k+1)} = S(\boldsymbol{x}^{(k+1)}, \boldsymbol{y}^{(k+1)}) - S(\boldsymbol{x}^{(k)}, \boldsymbol{y}^{(k)}).$$
(17)

This equation computes the change in the SLL counts from the k-th iteration to the (k + 1)-th iteration, providing a quantitative measure of the SLL growth for the optimization process.

Next, we calculate the EMA of  $\delta_{S}^{(k+1)}$  using a weight parameter  $\rho$ , set to 0.9 for smooth convergence:

$$E_{S}^{(k+1)} = \rho \cdot \delta_{S}^{(k+1)} + (1-\rho) \cdot E_{S}^{(k)}. \tag{18}$$

We employ the Adam optimization algorithm to update  $\psi$  based on the EMA value  $E_S^{(k+1)}$ . The algorithm dynamically adjusts the learning rate using the first- and second-moment estimates of the gradient, with  $E_S^{(k+1)}$  serving as the gradient in this context. We compute the first-moment estimate  $\psi_m$  and the second-moment estimate  $\psi_v$ :

$$\psi_m = \beta_1 \cdot \psi_m + (1 - \beta_1) \cdot E_S^{(k+1)}, \tag{19}$$

$$\psi_v = \beta_2 \cdot \psi_v + (1 - \beta_2) \cdot (E_S^{(k+1)})^2. \tag{20}$$

Here, we set  $\beta_1 = 0.9$  and  $\beta_2 = 0.999$  as the exponential decay rates for the first- and second-moment estimates. Then, we compute the bias-corrected first and second-moment estimates:

$$\hat{\psi}_m = \frac{\psi_m}{1 - \beta_1},\tag{21}$$

$$\hat{\psi}_v = \frac{\psi_v}{1 - \beta_2},\tag{22}$$

Lastly, we update  $\psi^{(k)}$  to  $\psi^{(k+1)}$  using the bias-corrected estimates:

$$\psi^{(k+1)} = \psi^{(k)} + t_{\psi} \cdot \frac{\hat{\psi}_m}{\sqrt{\hat{\psi}_v} + \epsilon_{\psi}},\tag{23}$$

where  $t_{\psi}$  is the step size, and  $\epsilon_{\psi} = 10^{-8}$  is a small constant to prevent division by zero.
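The update sequence in Eqs. (17) to (23) can be traced with a small Python sketch. The class name `PsiUpdater`, the initial value of $\psi$, and the step size are hypothetical choices for illustration only. The bias correction follows Eqs. (21) and (22) as written above; note that the standard Adam formulation instead divides by $1 - \beta^{t}$ at step $t$.

```python
import math

class PsiUpdater:
    """Tracks SLL growth and updates psi following Eqs. (17)-(23)."""

    def __init__(self, psi=1.0, step=0.1, rho=0.9,
                 beta1=0.9, beta2=0.999, eps=1e-8):
        self.psi, self.step, self.rho = psi, step, rho
        self.beta1, self.beta2, self.eps = beta1, beta2, eps
        self.ema = 0.0       # E_S
        self.m = 0.0         # first-moment estimate psi_m
        self.v = 0.0         # second-moment estimate psi_v
        self.prev_sll = None

    def update(self, sll_count):
        # Eq. (17): growth of the SLL count (zero on the first call).
        delta = 0.0 if self.prev_sll is None else sll_count - self.prev_sll
        self.prev_sll = sll_count
        # Eq. (18): exponential moving average of the growth.
        self.ema = self.rho * delta + (1.0 - self.rho) * self.ema
        # Eqs. (19)-(20): moment estimates with the EMA as the gradient.
        self.m = self.beta1 * self.m + (1.0 - self.beta1) * self.ema
        self.v = self.beta2 * self.v + (1.0 - self.beta2) * self.ema ** 2
        # Eqs. (21)-(22): bias correction as written in the text.
        m_hat = self.m / (1.0 - self.beta1)
        v_hat = self.v / (1.0 - self.beta2)
        # Eq. (23): psi grows when the SLL count has been growing.
        self.psi += self.step * m_hat / (math.sqrt(v_hat) + self.eps)
        return self.psi

updater = PsiUpdater(psi=1.0, step=0.1)
psi_first = updater.update(100)   # no growth information yet
psi_second = updater.update(120)  # SLL count grew by 20
```

In this run the first call leaves $\psi$ unchanged, and the second call raises it by roughly one step size, since the normalized ratio is close to one when the SLL count has just grown.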

In essence, the use of EMA helps to reduce noise and maintain a good balance between HPWL and SLL counts. In addition, the Adam optimization algorithm accelerates the convergence process and allows for an efficient trade-off between these objectives. This combined strategy is robust to noisy gradients and provides effective bias correction. It balances HPWL and SLL counts throughout the FPGA placement, improving overall performance.

#### D. Improved Clock Network Planning Algorithm for Multi-Die FPGAs

We introduce an advanced algorithm for clock network planning in multi-die FPGAs, aiming to satisfy clock constraints while effectively minimizing SLLs. This algorithm is structured into two key stages:

- Instance-to-Clock-Region Mapping Generation with SLL Minimization: This stage focuses on assigning instances to specific clock regions. Our approach, inspired by the methods in [21], extends beyond just adhering to clock routing constraints. It integrates a novel optimization objective that concurrently addresses clock constraints and actively reduces SLL counts. See Section IV-D1.
- *The Advanced Clock Penalty*: In this stage, we incorporate a smooth, differentiable penalty function into the overall placement optimization. This function is designed to subtly guide instances towards their specified clock regions, aligning with the clock network's layout and constraints. See Section IV-D2.

In our proposed algorithm, the synergy of these two stages results in a more balanced and efficient clock network planning for multi-die FPGAs. It not only meets the critical clock constraints but also minimizes SLL counts, leading to an optimized placement and routing of the clock network.

1) Instance-to-Clock-Region Mapping Generation with SLL Minimization: In this stage, our objective is to generate mappings from instances to clock regions. It requires satisfying clock constraints while minimizing the number of SLLs. Initially, we introduce symbols and notations to clarify the problem, as shown in TABLE III. Then, the instance-to-clock-region mapping process is formulated as a binary optimization problem, shown in Formulation (24).

TABLE III: Symbols and Notations Used in Clock Network Planning.

| Symbol                | Description                                                                     |
|-----------------------|---------------------------------------------------------------------------------|
| $V$                   | The set of instances.                                                           |
| $S$                   | The set of resource types.                                                      |
| $V^{(s)}$             | The set of instances of resource type $s \in S$ .                               |
| $\mathcal{A}_v^{(s)}$ | The instance v's demand for resource type $s \in S$ .                           |
| $\mathcal{R}$         | The set of clock regions.                                                       |
| $C_r^{(s)}$           | The clock region r's capacity for resource type $s \in S$ .                     |
| $D_{v,r}$             | The physical distance between instance $v$ and clock region $r$ .               |
| $I_{v,r}$             | The increase in the number of SLLs if moving instance $v$ to clock region $r$ . |
| $\mathcal{E}$         | The set of clock nets.                                                          |

$$\underset{\boldsymbol{x}}{\text{minimize}} \quad \sum_{v \in V} \sum_{r \in \mathcal{R}} \left( D_{v,r} + \alpha I_{v,r} \right) \cdot \boldsymbol{x}_{v,r}, \tag{24a}$$

$$\text{s.t.} \quad \boldsymbol{x}_{v,r} \in \{0,1\}, \quad \forall v \in V, \forall r \in \mathcal{R}, \tag{24b}$$

$$\sum_{r \in \mathcal{R}} \boldsymbol{x}_{v,r} = 1, \quad \forall v \in V, \tag{24c}$$

$$\sum_{v \in V} \mathcal{A}_{v}^{(s)} \cdot \boldsymbol{x}_{v,r} \leq C_{r}^{(s)}, \quad \forall r \in \mathcal{R}, \forall s \in S, \tag{24d}$$

$$\text{there exists a legal clock routing w.r.t. } \boldsymbol{x}. \tag{24e}$$

In the above formulation, the overall cost function (Eq. (24a)) is designed by summing 1) the physical distance  $D_{v,r}$  between instances and clock regions and 2) the increase in SLL counts  $I_{v,r}$ . The two are weighted against each other by a factor  $\alpha$ . The binary decision variable  $x_{v,r}$  (Eq. (24b)) denotes the mapping status of instance v to clock region r. The constraint (Eq. (24c)) specifies that each instance v is mapped to exactly one clock region r. The upper limit of total resource demand per region is enforced by the constraint (Eq. (24d)), ensuring no clock region is overburdened. Finally, the constraint (Eq. (24e)) ensures that a legal clock routing exists under the constraints of the multi-die FPGA architecture. As such, the proposed formulation balances the physical distance and the potential increase in SLL counts for clock network planning.

To solve the optimization problem in Eq. (24), we employ a *branch-and-bound* based method proposed by [13], which advances the performance of clock-driven placement algorithms. This method utilizes a tree traversal-based heuristic to search the huge solution space of possible variable assignments.
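To make Formulation (24) concrete, the sketch below solves a toy instance by exhaustive enumeration. This is not the branch-and-bound method of [13]; it only illustrates the objective (24a) and the constraints (24b) to (24d) under simplifying assumptions: a single resource type, no clock routing check (24e), and hypothetical toy data and function names.

```python
from itertools import product

def solve_toy_mapping(dist, sll_inc, demand, capacity, alpha):
    """Exhaustively minimize sum (D + alpha * I) * x subject to (24b)-(24d)."""
    num_inst, num_regions = len(dist), len(capacity)
    best_cost, best_map = float("inf"), None
    # Picking exactly one region per instance enforces (24b) and (24c).
    for mapping in product(range(num_regions), repeat=num_inst):
        load = [0] * num_regions
        for v, r in enumerate(mapping):
            load[r] += demand[v]
        if any(load[r] > capacity[r] for r in range(num_regions)):
            continue  # violates the capacity constraint (24d)
        cost = sum(dist[v][r] + alpha * sll_inc[v][r]
                   for v, r in enumerate(mapping))
        if cost < best_cost:
            best_cost, best_map = cost, mapping
    return best_cost, best_map

# Three instances, two clock regions (all numbers are made up).
dist = [[1, 4], [3, 3], [5, 1]]      # D[v][r]
sll_inc = [[0, 1], [0, 2], [1, 0]]   # I[v][r]
cost, mapping = solve_toy_mapping(dist, sll_inc, demand=[1, 1, 1],
                                  capacity=[1, 2], alpha=2)
```

Without the capacity limit, instances 0 and 1 would both prefer region 0; the limit of one instance in region 0 forces instance 1 into region 1, which is exactly the kind of trade-off the formulation encodes. Exhaustive search grows exponentially with the number of instances, which is why a bounded tree search is needed in practice.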

A critical distinction of our proposed method compared to previous works lies in its dual focus: not only does it minimize the total distance between instances and their designated clock regions, but it also crucially aims to reduce the overall increase in SLL counts. Addressing the SLL issue represents a notable advancement in this field. We introduce Algorithm 1 to address this challenge, detailing below its workflow for calculating the increase in SLL counts.

Algorithm 1 Calculation of the Increase in SLL Counts

```
Input: The set of candidate mapping nodes N; the target mapping clock region cr
       with its central coordinates (cr_x, cr_y); the SLR's width δ_x and the
       SLR's height δ_y.
Output: Total increase in SLL counts I_S

I_S ← 0
Compute cr's SLR index cr_z using Eq. (2)
for all n ∈ N do
    Get node n's coordinate (n_x, n_y)
    Get node n's SLR index n_z using Eq. (2)
    if n_z = cr_z then
        continue
    else
        Get pins of node n, denoted as P_n
        for all p_n in P_n do
            Get the net e_p that pin p_n belongs to
            if e_p is not eligible then
                continue
            end if
            Compute net e_p's bounding box, denoted as B_e
            Compute a partial increase in the number of SLLs
                Δ_S ← ‖B_e − B'_e‖_1   (B'_e: updated bounding box after assigning n to cr)
            I_S ← I_S + Δ_S
        end for
    end if
end for
return I_S
```

a) Streamlined Analysis of the Algorithm: Initially, the increase in SLL counts  $I_S$  is set to zero, and the SLR index  $cr_z$  for the clock region is computed, defining its location within the grid. Then, the algorithm checks each node in the candidate set. For each node, its SLR index is calculated and compared to the clock region's index. If they match, the algorithm considers that there is no potential to increase the number of SLLs and proceeds to the next node. However, if the indexes differ, the algorithm further explores the node's pins to identify eligible nets to calculate the increase in the number of SLLs. For each eligible net, the bounding box is computed, an updated bounding box is generated considering the instance-to-clock region assignment, and a partial increase in the number of SLLs is computed and added to  $I_{\mathcal{S}}$ . After evaluating all nodes, the algorithm concludes by returning the final total SLLs' increase  $I_S$ . The algorithm is able to efficiently evaluate the potential increase in the number of SLLs while considering the spatial relationships and interconnections of the nodes.
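The control flow described above can be sketched in Python as follows. Several details are assumptions made only for this example: `slr_index` is a stand-in for Eq. (2) (integer SLR column and row), every net with at least two members is treated as eligible, nodes are treated as points so that pins share the node location, and the per-net term is taken as the change in the net's SLR span, which is one possible reading of the bounding-box difference in Algorithm 1.

```python
def slr_index(x, y, slr_w, slr_h):
    # Assumed stand-in for Eq. (2): integer SLR column and row of a point.
    return (int(x // slr_w), int(y // slr_h))

def slr_span(points, slr_w, slr_h):
    # L1 extent of a net's bounding box, measured in SLR indexes.
    idx = [slr_index(x, y, slr_w, slr_h) for x, y in points]
    cols = [c for c, _ in idx]
    rows = [r for _, r in idx]
    return (max(cols) - min(cols)) + (max(rows) - min(rows))

def sll_increase(nodes, region_center, pos, nets_of, members_of, slr_w, slr_h):
    total = 0
    cr_z = slr_index(region_center[0], region_center[1], slr_w, slr_h)
    for n in nodes:
        if slr_index(pos[n][0], pos[n][1], slr_w, slr_h) == cr_z:
            continue  # same SLR as the clock region: no potential increase
        for e in nets_of[n]:
            members = members_of[e]
            if len(members) < 2:  # hypothetical eligibility rule
                continue
            before = slr_span([pos[m] for m in members], slr_w, slr_h)
            after = slr_span([region_center if m == n else pos[m]
                              for m in members], slr_w, slr_h)
            total += after - before
    return total

# Toy layout: two SLRs side by side, each 100 x 100 (made-up data).
pos = {"a": (50.0, 50.0), "b": (60.0, 50.0), "c": (150.0, 50.0)}
members_of = {"e0": ["a", "b"], "e1": ["a", "c"]}
nets_of = {"a": ["e0", "e1"], "b": ["e0"], "c": ["e1"]}
center = (150.0, 50.0)
move_a = sll_increase(["a"], center, pos, nets_of, members_of, 100.0, 100.0)
move_b = sll_increase(["b"], center, pos, nets_of, members_of, 100.0, 100.0)
```

In this toy layout, moving `a` stretches net `e0` across the SLR boundary but pulls net `e1` into a single SLR, so the net change is zero, whereas moving `b` adds one crossing.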

b) Complexity Analysis of the Algorithm: As for the algorithm's complexity, the time complexity is mainly determined by the nested loops that iterate over the nodes and pins. Let N represent the number of nodes, and P denote the maximum number of pins per node. In the worst case, the algorithm needs to check all pins of all nodes, resulting in a time complexity of O(NP). However, since only a small fraction of nets need to perform the calculation of the increased number of SLLs, the actual runtime is usually much less than the worst-case complexity. The space complexity is determined by the storage required for the data associated with nodes, pins, and various auxiliary data structures for intermediate computations. In general, the space complexity is proportional to the number of nodes, pins, and nets, making it O(N + P + E), where E represents the number of eligible nets.

2) The Advanced Clock Penalty: In the second stage, we implement an advanced clock penalty term to the placement objective for better adapting to the multi-die FPGA architecture while minimizing the number of SLLs and meeting clock constraints.

Unlike previous works [11], [14], which enforce a direct shift of instances to their clock regions, we adopt a gravitational attraction concept [15], resembling a bowl-like pull, to guide instances toward their mapped clock regions. The clock penalty function is expressed as:

$$\Gamma_{i}\left(\boldsymbol{x}_{i},\boldsymbol{y}_{i}\right)=\Gamma_{i}\left(\boldsymbol{x}_{i}\right)^{x}+\Gamma_{i}\left(\boldsymbol{y}_{i}\right)^{y}.$$
(25)

The penalty terms  $\Gamma_i(\boldsymbol{x}_i)^x$  and  $\Gamma_i(\boldsymbol{y}_i)^y$  correspond to the x and y directions, respectively. Let  $lo_i^x$ ,  $hi_i^x$ ,  $lo_i^y$ , and  $hi_i^y$  denote the left, right, bottom, and top boundary coordinates of the generated mapping result for instance i. We define  $\Gamma_i(\boldsymbol{x}_i)^x$  as,

$$\Gamma_{i}(\boldsymbol{x}_{i})^{x} = \begin{cases} (\boldsymbol{x}_{i} - lo_{i}^{x})^{2}, \ \boldsymbol{x}_{i} < lo_{i}^{x}, \\ 0, \ lo_{i}^{x} \leq \boldsymbol{x}_{i} \leq hi_{i}^{x}, \\ (\boldsymbol{x}_{i} - hi_{i}^{x})^{2}, \ hi_{i}^{x} < \boldsymbol{x}_{i}. \end{cases}$$
(26)

The term  $\Gamma_i(\boldsymbol{y}_i)^y$  is defined analogously using  $lo_i^y$  and  $hi_i^y$ . Here,  $\Gamma(\boldsymbol{x}, \boldsymbol{y})$  denotes the sum of the clock penalty of all instances, i.e.,  $\Gamma(\boldsymbol{x}, \boldsymbol{y}) = \sum_{i \in V} \Gamma_i(\boldsymbol{x}_i, \boldsymbol{y}_i)$ .

The clock penalty multiplier  $\eta$  is initially set to 0. Upon resetting the clock penalty function  $\Gamma(\cdot)$ ,  $\eta$  is updated with the relative ratio between the gradient norms of the wirelength and the clock penalty to maintain the clock penalty function's stability.

$$\eta = \frac{\iota \|\nabla W_{\psi}\|_2}{\|\nabla \Gamma\|_2 + \varepsilon}.$$
(27)

As the placement optimization proceeds, the clock penalty multiplier  $\eta$  is dynamically adjusted to balance the influence of wirelength and clock penalty terms in the objective function. This adaptation ensures that the optimization algorithm maintains an appropriate focus on wirelength minimization and compliance with clock region constraints.

After the instances are assigned to their respective clock regions, only about 1% of instances are found outside their designated clock regions. Therefore, most of the instances do not incur any clock penalty. Empirically, we set the parameters  $\iota$  and  $\varepsilon$  to  $10^{-4}$  and  $10^{-2}$ , respectively. This setting achieves an appropriate balance between the gradient norm ratio of wirelength and clock penalty terms.
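A minimal sketch of the penalty in Eqs. (25) to (27) is given below. The function names and the sample bounding box are hypothetical, and the gradients passed to `penalty_multiplier` are plain lists supplied by the caller rather than values computed by a placer.

```python
import math

def axis_penalty(coord, lo, hi):
    """Eq. (26): zero inside [lo, hi], quadratic outside."""
    if coord < lo:
        return (coord - lo) ** 2
    if coord > hi:
        return (coord - hi) ** 2
    return 0.0

def axis_gradient(coord, lo, hi):
    """Derivative of Eq. (26); it vanishes inside the region."""
    if coord < lo:
        return 2.0 * (coord - lo)
    if coord > hi:
        return 2.0 * (coord - hi)
    return 0.0

def clock_penalty(x, y, box):
    """Eq. (25): sum of the x and y penalties for one instance."""
    lo_x, hi_x, lo_y, hi_y = box
    return axis_penalty(x, lo_x, hi_x) + axis_penalty(y, lo_y, hi_y)

def penalty_multiplier(wl_grad, penalty_grad, iota=1e-4, eps=1e-2):
    """Eq. (27): eta = iota * ||grad W||_2 / (||grad Gamma||_2 + eps)."""
    wl_norm = math.sqrt(sum(g * g for g in wl_grad))
    pen_norm = math.sqrt(sum(g * g for g in penalty_grad))
    return iota * wl_norm / (pen_norm + eps)

box = (0.0, 10.0, 0.0, 10.0)            # lo_x, hi_x, lo_y, hi_y of a mapped region
inside = clock_penalty(5.0, 5.0, box)   # no penalty inside the region
above = clock_penalty(5.0, 12.0, box)   # 2 units above the region
eta = penalty_multiplier([3.0, 4.0], [0.0, 0.0])
```

The penalty and its gradient are both zero inside the mapped region and continuous at its boundary, which is what gives the bowl-like pull; the constant `eps` keeps the multiplier finite when no instance is outside its region.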

As the optimization proceeds, instances that are still outside of their designated clock regions will be subjected to an increasing clock penalty. The size of the penalty will grow with the distance of the instance from its specified region, prompting the instance to move toward its specified clock region. This approach facilitates the smooth convergence of the optimization process while satisfying the clocking constraints imposed by the multi-die FPGA architecture.

In summary, this clock penalty method can dynamically adjust the clock penalty multiplier. This provides an efficient way to place instances in a multi-die FPGA architecture while minimizing the wirelength and satisfying clock region constraints. This approach enables improved placement quality and performance in comparison to existing methods, proving its applicability and effectiveness for modern FPGA designs.

#### E. Clock- and SLL-aware Legalization & Detailed Placement

Two critical constraints that need to be carefully considered during the legalization (LG) and detailed placement (DP) stages are clock feasibility and minimizing SLL counts. We delve into these constraints in the following discussion.

In the LG stage, we leverage the Direct Legalize (DL) algorithm [21] to manage clock constraints. Due to the complexity of clock networks, modern FPGAs often introduce "clock region constraints" at this stage. To address this, we establish a legal clock-to-clock-region assignment that specifies which cell can be positioned in which slice. Then, our DL algorithm performs an additional check to discard cell-to-slice assignments that violate this assignment, thereby ensuring adherence to the clock region constraint.

In the DP stage, we leverage a clock-aware multi-stage ISM approach, drawing inspiration from the UTPlaceF series [8], [11]. The approach utilizes an iterative minimum-cost-flow-based cell assignment technique to optimize wirelength and routability while adhering to complex clock constraints, resulting in clock-legal and high-quality placement solutions.

Addressing the SLL minimization challenge involves estimating the potential increase in SLL counts due to instance relocations during legalization and detailed placement. This estimation is incorporated into the optimization objectives, mirroring Algorithm 1 applied in clock network planning. The goal is to ensure that the overall optimization objective is minimized.

To concretely demonstrate our methodology, we consider the SLL optimization in the LG to illustrate the practical details. Because the DL algorithm concurrently explores the solution spaces of placement and packing, it requires a scoring function that encapsulates both placement- and packing-related metrics. Given a slice s and a cluster c, the score of c in s is defined as follows:

$$\text{SCORE}(c, s) = \sum_{e \in \mathcal{E}(c)} \frac{\text{InternalPins}(e, c) - 1}{\text{TotalPins}(e) - 1} - \varphi \left( \Delta \text{HPWL}(c, s) + \alpha_{\text{LG}} \Delta \text{SLL}(c, s) \right) \tag{28}$$

Here,  $\mathcal{E}(c)$  denotes the set of nets with at least one cell in c, TotalPins(e) represents the total pin count of net e, InternalPins(e, c) indicates the number of pins of net e in c, and  $\Delta HPWL(c, s)$  and  $\Delta SLL(c, s)$  denote the increase in HPWL and the number of SLLs when moving cells in c from their flat initial placement (FIP) locations to s. The positive weighting parameters  $\varphi$  and  $\alpha_{LG}$  are empirically set to 0.02 and 4.0, respectively. The first term defines the clustering score, granting higher scores to clusters that convert more external nets into internal ones, effectively reducing routing demands and enhancing routability. The second term favors candidates that significantly reduce the wirelength and the number of SLLs.
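The scoring function in Eq. (28) can be written compactly as below. This is an illustrative sketch: `cluster_score` and its argument layout are hypothetical, the HPWL and SLL increases are supplied by the caller, and the guard that skips single-pin nets is an assumption added to avoid division by zero.

```python
def cluster_score(nets, delta_hpwl, delta_sll, phi=0.02, alpha_lg=4.0):
    """Eq. (28). `nets` holds (internal_pins, total_pins) for each net in E(c)."""
    # Clustering term: rewards nets whose pins are mostly inside the cluster.
    clustering = sum((inner - 1) / (total - 1)
                     for inner, total in nets if total > 1)
    # Placement term: penalizes the increase in HPWL and in the number of SLLs.
    return clustering - phi * (delta_hpwl + alpha_lg * delta_sll)

# Two nets touching the cluster: one fully absorbed, one with a single internal pin.
nets = [(2, 2), (1, 3)]
same_slr = cluster_score(nets, delta_hpwl=10.0, delta_sll=0.0)
cross_slr = cluster_score(nets, delta_hpwl=10.0, delta_sll=1.0)
```

With the parameter values stated above, one additional SLL costs as much as four units of HPWL increase, so the candidate slice that keeps the cluster within its SLR scores higher (0.8 versus 0.72 in this example).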

The SLL optimization in the DP is consistent with the approach in the LG above, while also taking clock constraints into account. In this way, the overall optimization goal covers both SLL minimization and clock routing constraints. We omit the details here, since they follow the LG description above.

## **V. EXPERIMENTAL RESULTS**

#### A. Comparison with the SOTA Methods

We implemented our GPU-accelerated placer in C++ and Python along with the open-source machine learning framework PyTorch for fast gradient back-propagation. We conduct experiments on an Ubuntu 22.04 LTS platform that consists of an Intel(R) Xeon(R) Gold 6248 CPU @ 3.00GHz (24 cores), an NVIDIA RTX3090 GPU, and 128GB memory.

To comprehensively compare our LEAPS with other SOTA placers, we evaluated its performance using the *ISPD 2017 benchmarks*, specifically targeting multi-die FPGA with a 1×4 SLR topology. These evaluations focus on three key metrics: minimization of super long lines (SLL), optimization of half-perimeter wirelength (HPWL), and overall runtime efficiency. Notably, we further dissect the runtime into CPU runtime (CRT) and GPU runtime (GRT) to highlight the GPU acceleration capabilities of our placement method.

The characteristics and comparative analysis of various FPGA placement algorithms, including ICCAD'17 [12], Min-cut + ICCAD'17 [12], ICCAD'19 [5], and our proposed LEAPS, are detailed in Table IV, focusing on the *ISPD 2017 contest benchmark*. The rationale behind selecting these specific algorithms for comparison is as follows:

- ICCAD'17 [12] and Min-cut + ICCAD'17 [12]: although ICCAD'17 [12] is not a multi-die FPGA placer, it is a representative clock-aware placement algorithm. The Min-cut + ICCAD'17 setup combines the Min-cut method with ICCAD'17 [12]: it divides the blocks into four subsets and places each subset within one die. It provides a more balanced comparison and also serves as a baseline method in the ICCAD'19 [5] analysis.
- ICCAD'19 [5] is included as the most recent state-of-the-art (SOTA) method; it addresses SLL challenges with clock- and SLL-aware techniques.

Recent heterogeneous FPGA placement algorithms such as elfPlace [16], DREAMPlaceFPGA [20], and AMF-Placer [23] are excluded from this comparison because they omit clock constraints, a detail underscored in Section III-C. This distinction is crucial: overlooking clock constraints can significantly affect post-placement wirelength, which would lead to an inaccurate comparison of performance metrics. The assessment in TABLE II is based on three key features: handling clock constraints, supporting multi-die architectures, and leveraging GPU acceleration. LEAPS supports all three, which distinguishes it among existing FPGA placers. Additionally, OpenPARF [18], our preliminary work, is not compared directly. The advantages of the LEAPS framework are nevertheless evident from the results presented in Tables IV, V, and VI, where the best results are highlighted in bold.

Table IV shows that LEAPS surpasses the other algorithms on all evaluated metrics, achieving lower SLL counts and lower HPWL on all benchmark designs. LEAPS also holds a substantial advantage in runtime, completing placement faster than its counterparts on every design. This improvement is largely due to the method's efficient optimization techniques and the integration of GPU acceleration. Even when GPU acceleration is set aside, the CPU-based implementation of LEAPS is approximately $2.62 \times$ faster than the current state-of-the-art, ICCAD'19 [5]. Compared with ICCAD'19 [5], LEAPS with GPU acceleration reduces SLL counts by 43.08% and HPWL by 9.99%, with a $34.335 \times$ speedup in runtime. These results underscore the ability of LEAPS to achieve better placements with lower computational demands.
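The relation between the Norm. row of Table IV and the percentages quoted above can be checked with a short calculation. The sketch below uses the ICCAD'19 [5] and LEAPS columns of Table IV and assumes that each Norm. entry is the ratio of column sums; this is inferred from the numbers rather than stated in the text, and the function names are illustrative.

```python
# Per-design values copied from Table IV (CLK-FPGA01 .. CLK-FPGA13).
ICCAD19_SLL = [14817, 14470, 22500, 17123, 21238, 20988, 11215,
               12565, 10485, 17233, 19567, 18559, 24999]
LEAPS_SLL = [4873, 7192, 14285, 10852, 11777, 15959, 6813,
             4849, 6508, 13816, 11052, 10533, 10003]
ICCAD19_CRT = [3227, 3225, 7251, 5419, 7275, 5686, 3561,
               3509, 4512, 7675, 6359, 5702, 5956]   # CPU runtime (s)
LEAPS_GRT = [123, 121, 202, 148, 189, 214, 127,
             109, 131, 180, 165, 151, 160]           # GPU runtime (s)


def normalized_ratio(baseline, ours):
    """Relative improvement: baseline total divided by our total (ASSUMED form)."""
    return sum(baseline) / sum(ours)


def relative_reduction(baseline, ours):
    """Relative reduction: fraction of the baseline total that is removed."""
    return 1.0 - sum(ours) / sum(baseline)


sll_norm = normalized_ratio(ICCAD19_SLL, LEAPS_SLL)
sll_reduction = relative_reduction(ICCAD19_SLL, LEAPS_SLL)
gpu_speedup = normalized_ratio(ICCAD19_CRT, LEAPS_GRT)
# CPU-only speedup from the two Norm. entries of Table IV (34.335 and 13.084).
cpu_speedup = 34.335 / 13.084

if __name__ == "__main__":
    print(f"SLL Norm.:       {sll_norm:.3f}")
    print(f"SLL reduction:   {100 * sll_reduction:.2f}%")
    print(f"GPU speedup:     {gpu_speedup:.3f}x")
    print(f"CPU-only speedup: {cpu_speedup:.2f}x")
```

With these inputs the calculation reproduces the SLL Norm. entry of 1.757, the 43.08% SLL reduction, and the $34.335 \times$ speedup, which illustrates why the Norm. row (a ratio) and the percentages in the text (a relative reduction, $1 - 1/\text{ratio}$) differ numerically while describing the same data.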

In conclusion, LEAPS demonstrates clear superiority over other algorithms in SLL, HPWL, and runtime metrics for the ISPD 2017 benchmarks. Its combination of efficient optimization and GPU acceleration not only minimizes HPWL and SLL counts but also reduces computational overhead, making it an effective solution for multi-die FPGA placement.

## B. Effectiveness Validation of Optimization Techniques

In this section, we conduct a thorough validation of the techniques presented in the LEAPS framework, specifically tailored for multi-die FPGA placement. To achieve this, we design two sets of experiments: the first evaluates the impact of optimizing SLL at different stages of the placement process, while the second assesses the adaptive wirelength-weighting-factor adjusting method (hereafter referred to as the WLW method) in the GP, which enables trade-offs between HPWL and SLL counts. These experiments aim to provide a comprehensive understanding of how each technique within LEAPS contributes to the overall placement efficacy.

TABLE IV: Comparison of Super Long Lines ( $\times 10^{0}$ ), Half-Perimeter Wirelength ( $\times 10^{3}$ ), and Runtime (Seconds) for Multi-Die FPGA with  $1 \times 4$  SLR Topology on ISPD 2017 Benchmarks.

| Design | #LUT/#FF/#BRAM/#DSP | #Clock | ICCAD'17 [12] | | | Min-cut + ICCAD'17 [12] | | | ICCAD'19 [5] | | | The Proposed LEAPS | | | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| | | | SLL | HPWL | CRT (s) | SLL | HPWL | CRT (s) | SLL | HPWL | CRT (s) | SLL | HPWL | CRT (s) | GRT (s) |
| CLK-FPGA01 | 211K/324K/164/75 | 32 | 19707 | 1933691 | 2939 | 15039 | 2126497 | 7963 | 14817 | 1916227 | 3227 | 4873 | 1658361 | 1697 | 123 |
| CLK-FPGA02 | 230K/280K/236/112 | 35 | 19245 | 1949266 | 3356 | 14937 | 2138430 | 7772 | 14470 | 1927038 | 3225 | 7192 | 1777341 | 1552 | 121 |
| CLK-FPGA03 | 410K/481K/850/395 | 57 | 33915 | 4760837 | 7410 | 24310 | 5702452 | 17545 | 22500 | 4688170 | 7251 | 14285 | 4487928 | 2377 | 202 |
| CLK-FPGA04 | 309K/372K/467/224 | 44 | 22774 | 3388240 | 6015 | 17317 | 4163495 | 13060 | 17123 | 3389653 | 5419 | 10852 | 3094173 | 2148 | 148 |
| CLK-FPGA05 | 393K/469K/798/150 | 56 | 28246 | 4147683 | 7460 | 21745 | 5112935 | 17533 | 21238 | 4066860 | 7275 | 11777 | 3821386 | 2269 | 189 |
| CLK-FPGA06 | 425K/511K/872/420 | 58 | 30526 | 5007798 | 8261 | 21260 | 6128113 | 12708 | 20988 | 5152846 | 5686 | 15959 | 4625645 | 2355 | 214 |
| CLK-FPGA07 | 254K/309K/313/149 | 38 | 14916 | 2096178 | 3747 | 11079 | 2271849 | 8582 | 11215 | 2047259 | 3561 | 6813 | 1905011 | 1688 | 127 |
| CLK-FPGA08 | 212K/257K/161/75 | 32 | 16711 | 1673570 | 2812 | 13457 | 2143600 | 8401 | 12565 | 1661350 | 3509 | 4849 | 1545018 | 1554 | 109 |
| CLK-FPGA09 | 231K/358K/236/112 | 35 | 16275 | 2162916 | 3994 | 10282 | 2836349 | 10879 | 10485 | 2177478 | 4512 | 6508 | 1891086 | 1816 | 131 |
| CLK-FPGA10 | 327K/506K/542/255 | 47 | 22584 | 3886385 | 6396 | 17793 | 4716132 | 18464 | 17233 | 3970566 | 7675 | 13816 | 3301351 | 2273 | 180 |
| CLK-FPGA11 | 300K/468K/454/224 | 44 | 26024 | 3676642 | 6339 | 19356 | 4573412 | 15325 | 19567 | 3697769 | 6359 | 11052 | 3138093 | 2731 | 165 |
| CLK-FPGA12 | 277K/430K/389/187 | 41 | 25683 | 2814733 | 4703 | 19275 | 3109834 | 12768 | 18559 | 2811424 | 5702 | 10533 | 2453370 | 1916 | 151 |
| CLK-FPGA13 | 339K/405K/570/262 | 47 | 32248 | 3464495 | 4750 | 24774 | 4297976 | 14354 | 24999 | 3422521 | 5956 | 10003 | 3172093 | 2057 | 160 |
| Norm.<sup>1</sup> | | | 2.403 | 1.111 | 33.753 | 1.795 | 1.338 | 81.858 | 1.757 | 1.110 | 34.335 | 1.000 | 1.000 | 13.084 | 1.000 |

<sup>1</sup>The Norm. values in this table are calculated using the relative improvement method. This differs from the relative reduction percentage used in the main text, leading to variations in the reported values.

TABLE V: HPWL and SLL Evaluations With Different Stages Optimizations on ISPD 2017 Benchmarks.

| Design | $1 \times 4$ SLR Topology | | | | | | $2 \times 2$ SLR Topology | | | | | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| | LEAPS (GP) | | LEAPS (GP+LG+DP) | | LEAPS (GP+LG+DP+CNP) | | LEAPS (GP) | | LEAPS (GP+LG+DP) | | LEAPS (GP+LG+DP+CNP) | |
| | SLL | HPWL | SLL | HPWL | SLL | HPWL | SLL | HPWL | SLL | HPWL | SLL | HPWL |
| CLK-FPGA01 | 5180 | 1658842 | 4916 | 1659305 | 4873 | 1658361 | 10598 | 1629348 | 10173 | 1631056 | 10026 | 1631274 |
| CLK-FPGA02 | 7523 | 1800437 | 7278 | 1799139 | 7192 | 1777341 | 14480 | 1788592 | 13928 | 1773191 | 13984 | 1791406 |
| CLK-FPGA03 | 14833 | 4495993 | 14430 | 4494080 | 14285 | 4487928 | 24088 | 4471987 | 23626 | 4475201 | 23371 | 4474117 |
| CLK-FPGA04 | 11034 | 3090311 | 10845 | 3093255 | 10852 | 3094173 | 19781 | 3068101 | 19014 | 3070280 | 19131 | 3068247 |
| CLK-FPGA05 | 12079 | 3832208 | 11773 | 3832323 | 11777 | 3821386 | 21617 | 3843644 | 20799 | 3843667 | 20735 | 3844829 |
| CLK-FPGA06 | 16351 | 4622926 | 15983 | 4624631 | 15959 | 4625645 | 26867 | 4642177 | 26124 | 4648913 | 25970 | 4642732 |
| CLK-FPGA07 | 7234 | 1912074 | 6936 | 1912016 | 6813 | 1905011 | 12150 | 1912148 | 11494 | 1915191 | 11584 | 1913722 |
| CLK-FPGA08 | 4915 | 1545356 | 4588 | 1544583 | 4849 | 1545018 | 10182 | 1530010 | 9595 | 1531410 | 9706 | 1524861 |
| CLK-FPGA09 | 6864 | 1889271 | 6526 | 1890708 | 6508 | 1891086 | 10922 | 1904842 | 10388 | 1904557 | 10339 | 1904389 |
| CLK-FPGA10 | 14285 | 3299965 | 13941 | 3304380 | 13816 | 3301351 | 21470 | 3304337 | 20890 | 3305586 | 20717 | 3306488 |
| CLK-FPGA11 | 11042 | 3136026 | 10690 | 3136950 | 11052 | 3138093 | 16921 | 3130795 | 16227 | 3134114 | 16087 | 3131357 |
| CLK-FPGA12 | 10985 | 2451674 | 10825 | 2452170 | 10533 | 2453370 | 15244 | 2460003 | 14723 | 2461136 | 14593 | 2460000 |
| CLK-FPGA13 | 10352 | 3177396 | 10172 | 3181802 | 10003 | 3172093 | 18742 | 3168715 | 18196 | 3170607 | 18179 | 3172810 |
| Norm. | 1.0324 | 1.0011 | 1.0030 | 1.0015 | 1.0000 | 1.0000 | 1.0403 | 0.9997 | 1.0035 | 1.0000 | 1.0000 | 1.0000 |

1) Necessity of Full-flow Optimization in LEAPS: Our primary focus is on the full-flow optimization of the number of SLLs, driven by the premise that SLL minimization should be a continuous effort throughout the entire placement process, not limited to the GP stage alone. We conducted comparative experiments, as detailed in TABLE V, evaluating HPWL and SLL across three scenarios: 1) optimization solely during the GP stage (abbreviated as LEAPS (GP)), 2) optimization across the GP, LG, and DP stages (abbreviated as LEAPS (GP+LG+DP)), and 3) optimization extending into the clock network planning (CNP) stage (abbreviated as LEAPS (GP+LG+DP+CNP)).

Results on the ISPD 2017 benchmarks with both $1 \times 4$ and $2 \times 2$ SLR topologies show that applying optimization at the LG, DP, and CNP stages progressively reduces the normalized SLL count (in the $1 \times 4$ topology, from 1.0324 for LEAPS (GP) to 1.0030 for LEAPS (GP+LG+DP) and 1.0000 for LEAPS (GP+LG+DP+CNP)), confirming the effectiveness of these stages in refining SLL optimization in multi-die FPGA designs. The trend holds in aggregate rather than for every design: for a few designs, such as CLK-FPGA08 and CLK-FPGA11 in the $1 \times 4$ topology, the SLL count after CNP is higher than after DP. Optimization in the LG, DP, and CNP stages typically causes a minor increase in HPWL within acceptable limits, but occasionally decreases it. Such decreases may be attributed to more refined clock network planning and the wirelength objective, and are likely also influenced by coupling mechanisms inherent to these topologies. Our current understanding of these effects is incomplete, and they are a focus of our ongoing research. Moreover, our analysis suggests that, compared to the $2 \times 2$ SLR topology, the $1 \times 4$ configuration results in fewer SLLs while maintaining comparable HPWL. This suggests a potential preference for the $1 \times 4$ topology in multi-die FPGAs with four SLRs. However, a deeper study of other design performance aspects, such as timing and routed wirelength, is needed before drawing a definitive conclusion.

Based on the normalized data, the complete LEAPS flow (encompassing GP, LG, DP, and CNP) is the most effective, achieving the lowest SLL counts while keeping HPWL essentially unchanged in the evaluated designs. These results not only demonstrate the efficacy of the LEAPS framework but also validate our strategy of optimizing SLLs across the entire placement flow in multi-die FPGA design.

2) Effectiveness of Adaptive Wirelength-weighting-factor Adjusting Method: By integrating SLL counts into the conventional wirelength objective, the LEAPS framework innovates with the WLW method. This method aims to strike a balance between HPWL and SLL counts during the GP, addressing one of the key challenges in LEAPS. TABLE VI presents a comparative analysis of LEAPS with the WLW method (LEAPS (with WLW)) and without the WLW method (LEAPS (without WLW)), illustrating the method's effectiveness in reducing SLL counts with a minimal impact on HPWL. Specifically, in the 1×4 SLR topology, the WLW method achieved a notable 4.58% reduction in SLLs with only a marginal 0.1% increase in HPWL. These results validate the WLW method's efficacy in achieving a delicate balance between minimizing SLL counts and maintaining HPWL, underscoring its importance in the LEAPS framework for multi-die FPGA placement.

| Design | LEAPS (without WLW) | | LEAPS (with WLW) | |
|---|---|---|---|---|
| | SLL | HPWL | SLL | HPWL |
| CLK-FPGA01 | 4992 | 1654111 | 4873 | 1658361 |
| CLK-FPGA02 | 7425 | 1777650 | 7192 | 1777341 |
| CLK-FPGA03 | 14971 | 4485711 | 14285 | 4487928 |
| CLK-FPGA04 | 11989 | 3089971 | 10852 | 3094173 |
| CLK-FPGA05 | 12942 | 3826220 | 11777 | 3821386 |
| CLK-FPGA06 | 16465 | 4627091 | 15959 | 4625645 |
| CLK-FPGA07 | 6942 | 1902767 | 6813 | 1905011 |
| CLK-FPGA08 | 5532 | 1546489 | 4849 | 1545018 |
| CLK-FPGA09 | 6679 | 1889111 | 6508 | 1891086 |
| CLK-FPGA10 | 14017 | 3301502 | 13816 | 3301351 |
| CLK-FPGA11 | 11711 | 3126103 | 11052 | 3138093 |
| CLK-FPGA12 | 10846 | 2447167 | 10533 | 2453370 |
| CLK-FPGA13 | 10215 | 3173383 | 10003 | 3172093 |
| Norm. | 1.048 | 0.999 | 1.000 | 1.000 |

TABLE VI: Comparative Performance Analysis of the LEAPS Framework Utilizing Versus Omitting the WLW Method in a 1×4 SLR Topology.

# VI. CONCLUSION

In this paper, we have introduced LEAPS, a comprehensive and adaptable multi-die FPGA placement algorithm that addresses the challenges of minimizing SLL counts while optimizing essential design constraints, such as wirelength, routability, and clock routing. Our key contributions include a high-performance nested optimization algorithm with adaptive wirelength-weighting-factor adjusting, a soft floor method for handling any multi-die FPGA SLR topology, and the continuous optimization of SLLs throughout the entire placement process, including LG and DP stages. Experimental results demonstrate that our method significantly outperforms the SOTA algorithm, achieving an average reduction of 43.08% and 9.99% in SLL counts and HPWL, respectively, and a  $34.34 \times$  speedup in execution efficiency.

Future research may involve refining the LEAPS framework by developing advanced optimization techniques or employing machine learning to learn from previous placement experiences. Additionally, integrating our algorithm with other placement and routing tools could enhance seamless interoperability and collaboration between different stages of the FPGA design flow. In conclusion, LEAPS offers a promising foundation for addressing challenges in multi-die FPGA placement, setting the stage for future advancements in this field and contributing to the ongoing development of high-performance computing systems.

#### References

- [1] W. S. Kuo, S. H. Zhang, W. K. Mak, R. Sun, and Y. K. Leow, "Pin assignment optimization for multi-2.5D FPGA-based systems," in *Proc. ISPD*, 2018, pp. 106-113.
- [2] Y. C. Liao and W. K. Mak, "Pin assignment optimization for multi-2.5D FPGA-based systems with time-multiplexed I/Os," *IEEE TCAD*, vol. 40, no. 3, pp. 494-506, Mar. 2021.
- [3] F. Mao, W. Zhang, B. Feng, B. He, and Y. Ma, "Modular placement for interposer based multi-FPGA systems," in *Proc. GLS-VLSI*, 2016, pp. 93-98.
- [4] C. Ravishankar, D. Gaitonde, and T. Bauer, "Placement strategies for 2.5D FPGA fabric architectures," in *Proc. FPGA*, 2018, pp. 16-20.
- [5] J. Chen, W. Zhu, J. Yu, L. He, and Y.-W. Chang, "Analytical placement with 3D Poisson's equation and ADMM based optimization for large-scale 2.5D heterogeneous FPGAs," in *Proc. ICCAD*, 2019, pp. 1-8.

- [6] R. Raikar and D. Stroobandt, "Multi-die heterogeneous FPGAs: How balanced should netlist partitioning be?" in *Proc. SLIP*, 2022, pp. 1-7.
- [7] C. Pui, G. Chen, W. Chow, K. Lam, J. Kuang, P. Tu, H. Zhang, E. F. Y. Young, and B. Yu, "RippleFPGA: A routability-driven placement for large-scale heterogeneous FPGAs," in *Proc. ICCAD*, 2016, p. 67.
- [8] W. Li, S. Dhar, and D. Z. Pan, "UTPlaceF: A routability-driven FPGA placer with physical and congestion aware packing," *IEEE TCAD*, vol. 37, no. 4, pp. 869-882, 2018.
- [9] Z. Abuowaimer, D. Maarouf, T. Martin, J. Foxcroft, G. Gréwal, S. Areibi, and A. Vannelli, "GPlace3.0: Routability-driven analytic placer for UltraScale FPGA architectures," *ACM TODAES*, vol. 23, no. 5, pp. 66:1–66:33, 2018.
- [10] X. He, T. Huang, W.-K. Chow, J. Kuang, K.-C. Lam, W. Cai, and E. F. Y. Young, "Ripple 2.0: High quality routability-driven placement via global router integration," in *DAC*, 2019, pp. 1-6.
- [11] W. Li, Y. Lin, M. Li, S. Dhar, and D. Z. Pan, "UTPlaceF 2.0: A high-performance clock-aware FPGA placement engine," *ACM TODAES*, vol. 23, no. 4, pp. 42:1-42:23, 2018.
- [12] Y.-C. Kuo, C.-C. Huang, S.-C. Chen, C.-H. Chiang, Y.-W. Chang, and S.-Y. Kuo, "Clock-aware placement for large-scale heterogeneous FPGAs," in *Proc. ICCAD*, 2017, pp. 519-526.
- [13] W. Li, M. E. Dehkordi, S. Yang, and D. Z. Pan, "Simultaneous placement and clock tree construction for modern FPGAs," in *Proc. FPGA*, Feb. 2019, pp. 132-141.
- [14] C. Pui, G. Chen, Y. Ma, E. F. Y. Young, and B. Yu, "Clock-aware ultrascale FPGA placement with machine learning routability prediction: (invited paper)," in *Proc. ICCAD*, IEEE, 2017, pp. 929-936.
- [15] J. Chen, Z. Lin, Y. Kuo, C. Huang, Y. Chang, S. Chen, C. Chiang, and S. Kuo, "Clock-aware placement for large-scale heterogeneous FPGAs," *IEEE TCAD*, vol. 39, no. 12, pp. 5042-5055, 2020.
- [16] Y. Meng, W. Li, Y. Lin, and D. Z. Pan, "elfPlace: Electrostatics-based placement for large-scale heterogeneous FPGAs," *IEEE TCAD*, vol. 41, no. 1, pp. 365-378, Jan. 2022.
- [17] J. Mai, Y. Meng, Z. Di, and Y. Lin, "Multi-electrostatic FPGA placement considering SLICEL-SLICEM heterogeneity and clock feasibility," in *Proc. DAC*, Jul. 2022, pp. 649-654.
- [18] J. Mai, J. Wang, Z. Di, G. Luo, Y. Liang, and Y. Lin, "OpenPARF: An open-source placement and routing framework for large-scale heterogeneous FPGAs with deep learning toolkit," in *Proc. ASICON*, 2023.
- [19] Y. Lin, Z. Jiang, J. Gu, W. Li, S. Dhar, H. Ren, B. Khailany, and D. Z. Pan, "DREAMPlace: Deep learning toolkit-enabled GPU acceleration for modern VLSI placement," *IEEE TCAD*, June 2020.
- [20] R. S. Rajarathnam, M. B. Alawieh, Z. Jiang, M. Iyer, and D. Z. Pan, "DREAMPlaceFPGA: An open-source analytical placer for large scale heterogeneous FPGAs using deep-learning toolkit," in *ASP-DAC*, 2022, pp. 300-306.
- [21] W. Li and D. Z. Pan, "A new paradigm for FPGA placement without explicit packing," *IEEE TCAD*, vol. 38, no. 11, pp. 2113-2126, 2019.
- [22] S. Chen and Y. Chang, "Routing-architecture-aware analytical placement for heterogeneous FPGAs," in *Proc. DAC*, ACM, 2015, pp. 27:1-27:6.
- [23] T. Liang, G. Chen, J. Zhao, L. Feng, S. Sinha, and W. Zhang, "AMF-Placer: High-performance analytical mixed-size placer for FPGA," in *Proc. ICCAD*, 2021, pp. 1-6.
- [24] S.-J. Lee and K. Raahemifar, "FPGA placement optimization methodology survey," in *CCECE*, IEEE, 2008, pp. 1981-1986.
- [25] I. L. Markov, J. Hu, and M.-C. Kim, "Progress and challenges in VLSI placement research," *Proceedings of the IEEE*, vol. 103, no. 11, pp. 1985-2003, 2015.
- [26] M.-K. Hsu, V. Balabanov, and Y.-W. Chang, "TSV-aware analytical placement for 3-D IC designs based on a novel weighted-average wirelength model," *IEEE TCAD*, vol. 32, no. 4, pp. 497-509, 2013.
- [27] R. Andreani, E. G. Birgin, J. M. Martínez, and M. L. Schuverdt, "On augmented lagrangian methods with general lower-level constraints," *SIAM Journal on Optimization*, vol. 18, no. 4, pp. 1286–1309, 2008.