[DOI: 10.2197/ipsjtsldm.12.38]

## **Short Paper**

# An FPGA Implementation Method based on Distributed-register Architectures

Koichi Fujiwara<sup>1,a)</sup> Kazushi Kawamura<sup>1,b)</sup> Masao Yanagisawa<sup>1</sup> Nozomu Togawa<sup>1,c)</sup>

Received: May 29, 2018, Revised: September 1, 2018, Accepted: October 22, 2018

**Abstract:** In order to reduce the effects of interconnection delays in recent FPGA chips, circuit designs based on distributed-register architectures (which we call DR-based circuit designs) are important. Several methods for DR-based circuit designs have been proposed, where high-level synthesis techniques are effectively utilized. However, no methods have been proposed yet to *practically implement* DR-based circuits on FPGA chips. In this paper, we propose an FPGA implementation method based on DR architectures and apply it to a DR-based circuit. The implementation result shows that it operates on an FPGA chip 21% faster than the circuit based on a traditional architecture.

Keywords: FPGA implementation, interconnection delay, distributed-register architecture, high-level synthesis

## 1. Introduction

In order to reduce the effects of interconnection delays in recent FPGA chips, circuit designs based on distributed-register architectures (which we call DR-based circuit designs) are important. Since a DR-based circuit has local registers for each functional unit (FU), it can achieve smaller interconnection delays between FUs and registers than a traditional shared-register-based (SR-based) circuit.

Several methods for DR-based circuit designs have been proposed [1], [2], where high-level synthesis (HLS) techniques are effectively utilized. These methods introduce floorplanning of modules, each consisting of FUs and local registers, into the HLS stage, which enables interconnection delays to be estimated and handled during HLS. On the other hand, when we utilize these methods for FPGA implementation, we have to appropriately incorporate the floorplanning information decided in the HLS stage into the FPGA implementation stage, unlike in SR-based circuit designs. However, no methods have been proposed yet to *practically implement* DR-based circuits on FPGA chips.

For this reason, we propose in this paper an FPGA implementation method based on DR architectures and apply it to a DR-based circuit designed by the approach in Ref. [2]. The contributions of this paper are summarized as follows:

- (1) A method to practically implement DR-based circuits on FPGA chips is presented.
- (2) The operation of an FPGA-implemented DR-based circuit is verified; it operates 21% faster than the SR-based circuit.


## 2. FPGA-implemented DR-based Circuit

For FPGA implementation, "Xilinx Virtex-7 (XC7V2000T-2FLG1925)" [5] on the evaluation board "TB-7V-2000T-LSI" is utilized in this paper. The board has off-chip memories realized by DDR3 SDRAM. In our experiments, data communication between a PC and the board is performed via USB. The actual evaluation environment with this board is shown in **Fig. 1**.

**Figure 2** illustrates the diagram of an FPGA-implemented DR-based circuit. It consists of a DR-based circuit, interface circuits, and Off-chip memories. The interface circuits include a USB transmission interface, a Write and read memory interface, and a Clock/reset generator \*<sup>1</sup>.

Data transmission between the PC and the Off-chip memories is realized via the USB transmission interface. According to a start instruction from the PC, input data are read into the DR-based circuit from the Off-chip memories via the Write and read memory interface. Similarly, according to a termination instruction from the DR-based circuit, output data are written back to the Off-chip memories.



Fig. 1 Evaluation environment.

<sup>1</sup> Department of Computer Science and Communications Engineering, Waseda University, Shinjuku, Tokyo 169–8555, Japan

a) kouichi.fujiwara@togawa.cs.waseda.ac.jp
b) kazushi.kawamura@togawa.cs.waseda.ac.jp
c) togawa@togawa.cs.waseda.ac.jp

<sup>\*1</sup> In this paper, we have utilized interface circuits whose configurations are similar to those in Ref. [4].



Fig. 2 Diagram of FPGA-implemented DR-based circuit.

Table 1 Implementation results.

| Architecture | Clock period constraint in HLS [ns] | Parameters $(\alpha, \beta)$ | Operation clock period $CLK_{op}$ [ns] | #Steps | Latency [ns] | #Slices | #LUTs | #FFs | Total impl. time [s] | Operation time on Xilinx Virtex-7 [ns] |
|------------------|------|---------|-------|----|-------|------|------|------|------|-------|
| SR-based circuit | 6.4 | (8, 25) | 6.378 | 10 | 63.78 | 1268 | 2943 | 3394 | 1179 | 65.87 |
| DR-based circuit | 5.0 | (5, 20) | 4.983 | 10 | 49.83 | 1303 | 2779 | 3233 | 710 | 51.92 |

In the Clock/reset generator, the input clock period $CLK_{in}$ provided by the board (= 19.933 ns) is converted to the operation clock period $CLK_{op}$, which is calculated by:

$$CLK_{op} = CLK_{in} \times \frac{\alpha}{\beta} \text{ [ns]},\tag{1}$$

where the parameters $\alpha$ and $\beta$ satisfy the following conditions:

**Condition 1** $\alpha \in \mathbb{N}$ and $\beta \in \mathbb{N}$.

**Condition 2** $\beta \le 64$ \*<sup>2</sup>.
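As a worked instance of Eq. (1), the parameter pairs listed in Table 1 can be substituted directly. The superscripts SR and DR are labels introduced here only to distinguish the two circuits. The results agree with the tabulated 6.378 ns and 4.983 ns if the last digit is truncated rather than rounded, which is an observation about the numbers and not something the text states.

$$CLK_{op}^{SR} = 19.933 \times \frac{8}{25} = 6.37856 \text{ [ns]}, \qquad CLK_{op}^{DR} = 19.933 \times \frac{5}{20} = 4.98325 \text{ [ns]}.$$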

## 3. FPGA Implementation Method Based on DR Architectures

### 3.1 Implementation Flow

We propose an FPGA implementation flow based on DR architectures in **Fig. 3**. In this paper, we use the method [2] for DR-based circuit designs where HLS techniques are effectively utilized. We also use Xilinx Vivado Design Suite 2014.2 [6] for FPGA implementation.

In this implementation flow, in Step 1, we first design a DR-based circuit for a target application. Step 1 is performed based on a control-data flow graph (CDFG) which represents the application's behavior, an FU constraint, a clock period constraint, and FPGA information obtained from Ref. [3]. In this step, we obtain *not only* an RTL description of a DR-based circuit *but also* a result of module floorplanning. Next, in Step 2, we perform logic synthesis of the RTL description together with the three interface circuits and obtain a gate-level circuit. In this step, the operation clock period $CLK_{op}$ is given so that $CLK_{op}$ (calculated by Eq. (1)) is almost equal to the clock period constraint in Step 1. We set the parameters $\alpha$ and $\beta$ in Eq. (1) as follows:

We first calculate the value of $\frac{\alpha}{\beta}$ so that the operation clock period $CLK_{op}$ is equal to the clock period constraint given in the HLS stage by:

$$\frac{\alpha}{\beta} = CLK_{op} \div CLK_{in},\tag{2}$$

where $CLK_{in}$ is approximated as 20 ns. Then the parameters $\alpha$ and $\beta$ are determined so as to satisfy Conditions 1 and 2 \*<sup>3,\*4</sup>.

Fig. 3 FPGA implementation flow based on DR architectures.
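The parameter selection around Eq. (2) can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function names are hypothetical, the search range of 20 to 30 for $\beta$ is taken from the footnote on the experimental setting, and the rule "pick the closest ratio, preferring the smallest $\beta$ on ties" is an assumption, since the text does not say how candidates are ranked.

```python
CLK_IN_NS = 19.933        # input clock period supplied by the board (from the text)
CLK_IN_APPROX_NS = 20.0   # approximation of CLK_in used when applying Eq. (2)
BETA_MAX = 64             # Condition 2


def choose_alpha_beta(target_ns, beta_range=range(20, 31)):
    """Return natural numbers (alpha, beta) whose ratio best matches Eq. (2).

    Assumption: the closest ratio wins and ties go to the smallest beta.
    """
    ratio = target_ns / CLK_IN_APPROX_NS
    best = None
    for beta in beta_range:
        if beta > BETA_MAX:
            continue  # would violate Condition 2
        alpha = max(1, round(ratio * beta))  # Condition 1: alpha is a natural number
        err = abs(alpha / beta - ratio)
        # Strict improvement only, so an earlier (smaller) beta wins ties.
        if best is None or err < best[0] - 1e-12:
            best = (err, alpha, beta)
    return best[1], best[2]


def clk_op_ns(alpha, beta):
    """Operation clock period according to Eq. (1)."""
    return CLK_IN_NS * alpha / beta


if __name__ == "__main__":
    for constraint in (6.4, 5.0):
        a, b = choose_alpha_beta(constraint)
        print(constraint, (a, b), clk_op_ns(a, b))
```

With the clock period constraints 6.4 ns and 5.0 ns, this particular rule happens to select the same pairs that the paper reports, (8, 25) and (5, 20); other tie-breaking rules would also satisfy Conditions 1 and 2.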

Then, in Step 3, we perform place and route for the gate-level circuit. It should be noted that, in Step 3, the placement of the DR-based circuit is performed based on the result of module floorplanning in Step 1. Finally, a bitstream file is generated in Step 4 and we obtain, through configuration, the FPGA-implemented DR-based circuit.
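The text does not state how the floorplanning result of Step 1 is handed to the place-and-route tool. One plausible mechanism, shown here purely as an assumption, is to express each module's region as a Vivado Pblock constraint. The sketch below renders such constraints from a hypothetical floorplan dictionary; the cell names and slice coordinates are invented for illustration, and only the Tcl commands `create_pblock`, `add_cells_to_pblock`, and `resize_pblock` are real Vivado commands.

```python
def pblock_xdc(floorplan):
    """Render Pblock constraints, one region per floorplanned module.

    floorplan maps a hierarchical cell name to a slice rectangle
    (x0, y0, x1, y1). Both the names and the rectangles are hypothetical
    inputs, not data from the paper.
    """
    lines = []
    for cell, (x0, y0, x1, y1) in floorplan.items():
        name = "pblock_" + cell.replace("/", "_")
        lines.append(f"create_pblock {name}")
        lines.append(
            f"add_cells_to_pblock [get_pblocks {name}] [get_cells {cell}]"
        )
        lines.append(
            f"resize_pblock [get_pblocks {name}] "
            f"-add {{SLICE_X{x0}Y{y0}:SLICE_X{x1}Y{y1}}}"
        )
    return "\n".join(lines)


if __name__ == "__main__":
    # Hypothetical module placement for two FU-plus-local-register modules.
    example = {
        "dr_core/module0": (0, 0, 19, 49),
        "dr_core/module1": (20, 0, 39, 49),
    }
    print(pblock_xdc(example))
```

The generated lines could be saved in a constraints file and read before placement; whether the authors used Pblocks, location constraints, or another mechanism is not specified in the paper.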

### 3.2 Implementation Results and Conclusions

We have implemented a DR-based circuit on the FPGA chip by applying our proposed implementation flow in Fig. 3 to a benchmark application, DCT (48 operation nodes, not including conditional branches). In this implementation, we have assumed 4 adders and 4 multipliers as the FU constraint [2]. In order to show the validity of DR-based circuits, we have also implemented the corresponding SR-based circuit on the FPGA chip \*<sup>5</sup>.

<sup>\*2</sup> We have found experimentally that place and route by Vivado fails when the parameter $\beta$ is set larger than 64.

<sup>\*3</sup> The parameter $\beta$ was set within a range of 20–30 in our experiment.

<sup>\*4</sup> The parameters $\alpha$ and $\beta$ are explicitly given in the RTL description of the Clock/reset generator.

**Table 1** shows the implementation results. The 1st column shows the target architecture of the implemented circuit and the 2nd column shows the clock period constraint given in the HLS stage. The 3rd column shows the parameters $\alpha$ and $\beta$ in Eq. (1), which are obtained according to Section 3.1. The 4th column shows the operation clock period $CLK_{op}$ calculated by Eq. (1). The 5th column shows the number of control steps obtained in the HLS stage and the 6th column shows the circuit latency calculated by multiplying the 4th and 5th columns. The 7th to 9th columns show the numbers of slices, look-up tables (LUTs), and flip-flops (FFs) of the implemented circuit, respectively. The 10th column shows the total implementation time and the 11th column shows the measured operation time of the implemented circuit on the FPGA chip \*<sup>6</sup>.

Both DR-based and SR-based circuits have been designed so that they can operate with minimum latency. We have verified the operation of each FPGA-implemented circuit according to the following procedure:

For the FPGA-implemented circuit, we have provided the operation clock period $CLK_{op}$ shown in the 4th column of Table 1 and operated it with a set of input values. We have then obtained a set of output values from the circuit and compared each value with the theoretical one.

We have verified the operations of the FPGA-implemented DR-based and SR-based circuits according to the above procedure. For each circuit, the 11th column of Table 1 is 2.09 ns larger than the 6th column. This is because the 11th column includes the data transmission delays from the PC to the FPGA chip. As demonstrated in the 11th column of Table 1, the DR-based circuit can operate on the FPGA chip 21% faster than the SR-based circuit.
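The arithmetic behind these statements can be re-derived from the Table 1 values with a short Python sketch. The dictionary layout and function names are illustrative choices; the numbers are copied from the table. Reading "21% faster" as the relative reduction in measured operation time is an interpretation that matches the figures.

```python
# Values copied from Table 1 (clock period, control steps, measured time).
ROWS = {
    "SR": {"clk_op_ns": 6.378, "steps": 10, "measured_ns": 65.87},
    "DR": {"clk_op_ns": 4.983, "steps": 10, "measured_ns": 51.92},
}


def latency_ns(row):
    """6th column: operation clock period times the number of control steps."""
    return row["clk_op_ns"] * row["steps"]


def overhead_ns(row):
    """Gap between the measured operation time and the computed latency."""
    return row["measured_ns"] - latency_ns(row)


def time_reduction_percent(slow_ns, fast_ns):
    """Relative reduction in operation time of the faster circuit."""
    return (1.0 - fast_ns / slow_ns) * 100.0


if __name__ == "__main__":
    for name, row in ROWS.items():
        print(name, round(latency_ns(row), 2), round(overhead_ns(row), 2))
    print(round(time_reduction_percent(ROWS["SR"]["measured_ns"],
                                       ROWS["DR"]["measured_ns"]), 1))
```

Both circuits show the same gap of about 2.09 ns between latency and measured time, and the reduction in operation time comes out at roughly 21%.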

## 4. Conclusion

In this paper, we have proposed an FPGA implementation method utilizing HLS techniques for DR architectures and applied it to a DR-based circuit of a DCT application designed by the approach in Ref. [2]. As the contributions of this paper, we have presented a method to practically implement DR-based circuits, including interface circuits, on FPGA chips. We have then verified the operation of an FPGA-implemented DR-based circuit, which operates 21% faster than the SR-based circuit on the FPGA chip.

In the future, we will apply our method to other benchmark applications and compare them with the circuits which are implemented by commercial HLS tools.

## References

- [1] Cong, J., Fan, Y., Han, G., Yang, X. and Zhang, Z.: Architecture and synthesis for on-chip multicycle communication, *IEEE Trans. Computer Aided Design of Integrated Circuits and Systems*, Vol.23, No.4, pp.550–564 (2004).
- [2] Fujiwara, K., Kawamura, K., Yanagisawa, M. and Togawa, N.: A high-level synthesis algorithm for FPGA designs optimizing critical path with interconnection-delay and clock-skew consideration, *Proc. 2016 International Symposium on VLSI Design, Automation and Test* (2016).
- [3] Fujiwara, K., Kawamura, K., Abe, S., Yanagisawa, M. and Togawa, N.: Interconnection-delay and clock-skew estimate modelings for floorplan-driven high-level synthesis targeting FPGA designs, *IEICE Trans. on Fundamentals of Electronics, Communications and Computer Sciences*, Vol.E99-A, No.7, pp.1294–1310 (2016).
- [4] Igarashi, K., Yanagisawa, M. and Togawa, N.: Image synthesis circuit design using selector-logic-based alpha blending and its FPGA implementation, *Proc. IEEE 11th International Conference on ASIC* (2015).
- [5] Xilinx User Guide, 7 Series FPGAs configuration (UG470) (2015).
- [6] Vivado Design Suite, available from (http://www.xilinx.com/products/design-tools/vivado/index.htm).



**Koichi Fujiwara** received his B. Eng. and M. Eng. degrees from Waseda University in 2014 and 2016, respectively, all in computer science. He is presently working toward a Dr. Eng. degree there. His research interest is high-level synthesis targeting FPGA designs.



**Kazushi Kawamura** received his B. Eng., M. Eng. and Dr. Eng. degrees from Waseda University in 2012, 2013 and 2016, respectively, all in computer science. He is presently an Assistant Professor in the Department of Computer Science and Engineering, Waseda University. His research interests are high-level synthesis, thermal-aware design, and reliable LSI design.



**Masao Yanagisawa** received his B. Eng., M. Eng. and Dr. Eng. degrees from Waseda University in 1981, 1983, and 1986, respectively, all in electrical engineering. He was with University of California, Berkeley from 1986 through 1987. In 1987, he joined Takushoku University. In 1991, he left Takushoku University and joined Waseda University, where he is presently a Professor in the Department of Computer Science and Engineering. His research interests are combinatorics and graph theory, computational geometry, VLSI design and verification, and network analysis and design. He is a member of IEEE, ACM, and the Institute of Electronics, Information and Communication Engineers.

<sup>\*5</sup> Unlike DR-based circuit designs, SR-based circuit designs do not consider floorplanning of modules in HLS stage.

<sup>\*6</sup> We have performed the operation 1,000,000,000 times and calculated the operation time based on the total time.



**Nozomu Togawa** received his B. Eng., M. Eng. and Dr. Eng. degrees from Waseda University in 1992, 1994, and 1997, respectively, all in electrical engineering. He is presently a Professor in the Department of Computer Science and Engineering, Waseda University. His research interests are VLSI design, graph theory, and computational geometry. He is a member of IEEE and the Institute of Electronics, Information and Communication Engineers.

(Recommended by Associate Editor: Ittetsu Taniguchi)