1 Introduction

Large functional blocks and nodes of a digital system, as well as the digital system itself, usually include a control device (controller). The speed of a digital system and of its functional blocks depends directly on the speed of their control devices. The mathematical model of most control devices and controllers is a finite state machine (FSM). Synthesis methods for high-speed FSMs are therefore necessary for designing high-performance digital systems. Note that implementation cost can be ignored in the synthesis of high-speed FSMs, because the FSM occupies a small area compared with other system components (for example, memory or transceivers).

Programmable logic devices (PLDs) are now widely used for designing digital systems. Two types of PLD architecture are common: one based on two programmable matrices (AND and OR), and one based on function generators implemented as look-up tables (LUTs). Devices of the first type are called Complex Programmable Logic Devices (CPLDs), and devices of the second type are called Field Programmable Gate Arrays (FPGAs). An FPGA can be viewed as a large number of LUTs joined by interconnections. Every LUT can realize any Boolean function of a small number of arguments (as a rule, from 4 to 6). Methods of FSM synthesis on CPLDs have been considered in [1].

Many authors have considered the synthesis of high-speed FSMs on PLDs, using a wide variety of approaches. In [2], a technique is presented for improving the performance of a synchronous circuit configured as an FPGA-based look-up table without changing the initial circuit configuration. Only the register location is altered, which improves clock speed and data throughput at the expense of latency. In [3], methods and tools for state encoding and combinational synthesis of sequential circuits based on new criteria of information flow optimization are considered. In [4], a timing optimization technique is proposed for a complex FSM that consists not only of random logic but also of data operators. The technique, based on the concept of a catalyst, adds a functionally redundant block (which includes a piece of combinational logic and several other registers) to the circuit under consideration so that the timing-critical paths are divided into stages. In [5, 6], styles of FSM description in VHDL and known state assignment methods for FSM implementation are studied. In [7], evolutionary methods are applied to the synthesis of FSMs: first, state assignment is solved by means of genetic algorithms; then evolutionary algorithms are applied to minimize the chip area and the time delay of the FSM output signals. In [8], state assignment and optimization of the combinational circuit in the implementation of high-speed FSMs on CPLDs are considered. In [9], a novel architecture specifically optimized for implementing reconfigurable FSMs, the Transition-based Reconfigurable FSM (TR-FSM), is presented; it shows a considerable reduction in area, delay, and power consumption compared to FPGA architectures. In [10], a new automaton model named the virtual finite state machine (Finite Virtual State Machine - FVSM) is proposed. For the implementation of the FVSM, a memory-based architecture and a technique for generating an FVSM from a traditional FSM are proposed. An FVSM implemented on the new architecture is faster than a traditional RAM-based FSM implementation. In [11], the implementation of FSMs in FPGAs using embedded ROM blocks is considered. Two FSM architectures with multiplexers on the inputs of the ROM blocks are proposed, which reduce the area and increase FSM performance. In [12], the reduction of the number of arguments of transition functions by state splitting is considered; this reduces the area and time delay of FSMs implemented on FPGAs.

This paper also uses splitting of FSM states, but the purpose of the splitting is to increase FSM performance in LUT-based FPGAs. State splitting is an equivalent transformation of an FSM and does not change the algorithm of its functioning. During state splitting, the machine type (Mealy or Moore) is preserved, the general structure of the FSM does not change, and the embedded memory blocks of the FPGA are not used. The hierarchy of state names is also preserved, which simplifies the analysis and debugging of the project. Because of this, the proposed synthesis method for high-speed FSMs in FPGAs is aimed at practical usage and can be easily included in the general flow of digital system design.

This paper is organized as follows. Section 2 gives estimations of the number of LUT levels in the implementation of FSM transition functions in the case of sequential and parallel decomposition. Section 3 presents the synthesis method of high-speed FSMs, which includes two algorithms (a general algorithm and an algorithm for splitting a particular state), and illustrates the method with a detailed example. The experimental results are reported in Sect. 4. The paper concludes with a summary in Sect. 5.

2 Estimations for the Number of LUT Levels for Transition Functions

Let A = {a1, …, aM} be the set of internal states, X = {x1, …, xL} be the set of input variables, Y = {y1, …, yN} be the set of output variables, and D = {d1, …, dR} be the set of transition functions of an FSM.

A one-hot state assignment is traditionally used for the synthesis of high-speed FSMs in FPGAs. Each internal state ai (ai ∈ A) then corresponds to a separate flip-flop of the FSM memory, and setting this flip-flop to 1 signifies that the FSM is in the given state. The data input of each flip-flop is controlled by the transition function di (di ∈ D), i.e. every internal state ai (ai ∈ A) of the FSM has its own transition function \( d_{i} ,i = \overline{1,M} \).

Let X(am,ai) be the set of FSM input variables whose values initiate the transition from state am to state ai (am, ai ∈ A). To implement a transition from state am to state ai, it is necessary to check the output of the flip-flop of the active state am (one bit) and the values of the input variables of the set X(am,ai) that initiate the given transition. To implement the transition function di, it is necessary to check the flip-flop outputs of all states from which transitions lead to state ai, i.e. |B(ai)| values, where B(ai) is the set of states from which transitions terminate in state ai (here and below, |A| denotes the cardinality of a set A). In addition, it is necessary to check the values of all input variables that initiate transitions to state ai, i.e. |X(ai)| values, where X(ai) is the set of input variables whose values initiate transitions to state ai, \( X(a_{i} ) = \bigcup\limits_{{a_{m} \in B(a_{i} )}} {X(a_{m} ,a_{i} )} \).

Let ri be a rank of the transition function di, where

$$ r_{i} = \left| {B(a_{i} )} \right| + \left| {X(a_{i} )} \right|. $$
(1)
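As a minimal sketch of (1), assume the FSM is given as a list of (source, target, inputs) triples. This representation, the function name `transition_ranks`, and the three-transition sample are hypothetical and are not taken from the paper. The sets B(ai) and X(ai) and the ranks ri can then be collected in a single pass:

```python
from collections import defaultdict


def transition_ranks(transitions):
    """Return {state: rank} following r_i = |B(a_i)| + |X(a_i)|.

    transitions: iterable of (source, target, inputs), where inputs is the
    set of input variables checked on that transition (assumed format).
    """
    preds = defaultdict(set)   # B(a_i): states with a transition into a_i
    inputs = defaultdict(set)  # X(a_i): inputs that initiate transitions into a_i
    for src, dst, xs in transitions:
        preds[dst].add(src)
        inputs[dst].update(xs)
    # States without incoming transitions do not appear in the result.
    return {s: len(preds[s]) + len(inputs[s]) for s in preds}


# Hypothetical three-transition fragment, not the FSM of the paper.
sample = [
    ("a1", "a2", {"x1", "x2"}),
    ("a3", "a2", {"x2", "x3"}),
    ("a2", "a3", set()),  # unconditional transition
]
ranks = transition_ranks(sample)
```

In this fragment state a2 has two predecessor states and three distinct input variables, so its rank is 5; the shared variable x2 is counted once because X(ai) is a union.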

Let n be the number of LUT inputs. If the rank ri of the transition function \( d_{i} (i = \overline{1,M} ) \) exceeds n, the function di must be decomposed and implemented on several LUTs.

Note that by splitting internal states it is impossible to lower the rank of the transition functions below the value

$$ r^{*} = \max_{m,s = \overline{1,M}} \left| X(a_{m} ,a_{s} ) \right| + 1. $$
(2)

In this method, the value r* is used as the upper bound on the ranks of the transition functions when splitting the FSM states.
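The bound (2) reflects the fact that a state receiving even a single transition still has to check one flip-flop output plus all input variables of that transition. A minimal sketch of the computation follows, again assuming a hypothetical list of (source, target, inputs) triples; the name `rank_lower_bound` and the sample data are illustrative only.

```python
def rank_lower_bound(transitions):
    """Return r* = max |X(a_m, a_s)| + 1 over all state pairs (a_m, a_s)."""
    pair_inputs = {}
    for src, dst, xs in transitions:
        # X(a_m, a_s): inputs over all transitions between the same pair
        pair_inputs.setdefault((src, dst), set()).update(xs)
    return max(len(xs) for xs in pair_inputs.values()) + 1


# Hypothetical fragment: the widest transition checks three inputs.
sample = [
    ("a1", "a2", {"x1", "x2", "x3"}),
    ("a3", "a2", {"x4"}),
    ("a2", "a3", set()),
]
r_star = rank_lower_bound(sample)
```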

It is well-known that there are two basic approaches to the decomposition of Boolean functions: sequential and parallel. In the case of sequential decomposition, all the LUTs are sequentially connected in a chain (Fig. 1).

Fig. 1. Sequential decomposition of Boolean function

The first LUT receives n arguments of the function di, and each of the remaining LUTs receives (n − 1) arguments, since one of its inputs is taken by the output of the previous LUT. So the number \( l_{i}^{s} \) of LUT levels (in the case of sequential decomposition of the transition function di having the rank ri) is defined by the expression:

$$ l_{i}^{s} = \text{int} \left( {\frac{{r_{i} - n}}{n - 1}} \right) + 1, $$
(3)

where int(A) is the least integer greater than or equal to A.

In the case of parallel decomposition, the LUTs are combined into a hierarchical tree structure (Fig. 2).

Fig. 2. Parallel decomposition of Boolean function

The function arguments arrive at the inputs of the first-level LUTs, and the values of the intermediate functions arrive at the inputs of the LUTs of all subsequent levels. So the number \( l_{i}^{p} \) of LUT levels (in the case of parallel decomposition of the transition function di having the rank ri) is defined by the following expression:

$$ l_{i}^{p} = \text{int} \left( {\log_{n} r_{i} } \right). $$
(4)

It is difficult to predict which type of decomposition (sequential or parallel) a particular synthesizer uses. Preliminary research showed that, for example, the Quartus II design tool from Altera uses both sequential and parallel decomposition simultaneously. The number li of LUT levels in the FPGA implementation of the transition function di with the rank ri can therefore lie between the values \( l_{i}^{s} \) and \( l_{i}^{p} \), \( i = \overline{1,M} \).

Let k be an integer coefficient (k ∈ [0,10]) that adapts the proposed algorithm to a specific synthesizer when defining the number of LUT levels. The number li of LUT levels for the implementation of the transition function di having the rank ri is then defined by the following expression:

$$ l_{i} = \text{int} \left( {\frac{10 - k}{10}l_{i}^{p} + \frac{k}{10}l_{i}^{s} } \right). $$
(5)

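For k = 0, expression (5) reduces to the parallel estimate \( l_{i}^{p} \), and for k = 10 to the sequential estimate \( l_{i}^{s} \). The specific value of the coefficient k depends on the FPGA architecture and the synthesizer used.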
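Expressions (3), (4), and (5) can be sketched in a few lines of integer arithmetic. The function names are hypothetical. Two choices below are assumptions rather than part of the paper: results are clamped to at least one level (a function with ri ≤ n still occupies one LUT), and the logarithm in (4) is evaluated by repeated multiplication to avoid floating-point rounding at exact powers of n.

```python
def sequential_levels(r, n):
    """Expression (3): ceil((r - n) / (n - 1)) + 1, clamped to at least 1."""
    return max(1, -(-(r - n) // (n - 1)) + 1)


def parallel_levels(r, n):
    """Expression (4): ceil(log_n r), computed with integers, clamped to at least 1."""
    levels, capacity = 0, 1
    while capacity < r:
        capacity *= n   # one more tree level multiplies the reachable inputs by n
        levels += 1
    return max(1, levels)


def estimated_levels(r, n, k):
    """Expression (5): ceil(((10 - k) * l_p + k * l_s) / 10), integer k in [0, 10]."""
    if not 0 <= k <= 10:
        raise ValueError("k must be in [0, 10]")
    weighted = (10 - k) * parallel_levels(r, n) + k * sequential_levels(r, n)
    return -(-weighted // 10)


# Illustrative values only: rank 20 on 4-input LUTs for three choices of k.
levels = [estimated_levels(20, 4, k) for k in (0, 5, 10)]
```

For this illustrative input the estimate moves from the parallel value (k = 0) to the sequential value (k = 10), with intermediate k giving a rounded-up weighted mean.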

The next question is when to stop splitting the FSM states. The point is that splitting state \( a_{i} (i = \overline{1,M} ) \) increases not only the number M of FSM states but also the number of transitions into the states of set A(ai), where A(ai) is the set of states in which the transitions from state ai terminate. When state ai is split, the cardinalities of the sets B(am) (am ∈ A(ai)) increase for the states of set A(ai). Therefore, according to (1), the ranks of the transition functions grow for the states of set A(ai), which can lead to an increase of the values \( l_{i}^{s} \), \( l_{i}^{p} \), and li.

In this algorithm, the process of state splitting is finished when the following condition is met:

$$ l_{\hbox{max} } \le \text{int} \left( {l_{mid} } \right) , $$
(6)

where lmax is the number of LUT levels necessary for the implementation of the worst transition function, i.e. the one with the maximum rank, and lmid is the arithmetic mean of the number of LUT levels over all transition functions. Note that in the process of splitting the FSM internal states, the value lmid will increase and the value lmax will decrease, therefore the algorithm execution always comes to an end.
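Condition (6) can be checked directly from the list of per-function level estimates. The sketch below assumes the levels li have already been computed; the name `splitting_finished` and the two level lists are illustrative and are not data from the paper's tables.

```python
def splitting_finished(levels):
    """Condition (6): l_max <= ceil(l_mid), where l_mid is the arithmetic mean."""
    l_max = max(levels)
    ceil_mid = -(-sum(levels) // len(levels))  # integer ceiling of the mean
    return l_max <= ceil_mid


# Illustrative level lists (assumed values).
before = splitting_finished([1, 3, 1, 1, 1, 1])    # mean 8/6, ceiling 2, max 3
after = splitting_finished([1, 1, 1, 1, 1, 1, 1])  # mean 1, max 1
```

In the first list one function needs three levels while the rounded-up mean is two, so splitting would continue; in the second list every function fits in one level and the condition holds.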

3 Method for High-Speed FSM Synthesis

According to the above discussion, the algorithm of state splitting for high-speed FSM synthesis is described as follows.

Further synthesis of the FSM is performed using traditional techniques, for example, automatically by the synthesizer of a design tool. For this purpose, it is enough to describe the FSM obtained after splitting the internal states in one of the hardware description languages (Verilog or VHDL). The value of the coefficient k (step 1 of Algorithm 1) is determined empirically by synthesizing test examples in the design tool used.

To split a state ai, \( i = \overline{1,M} \), which is done in step 6 of Algorithm 1, a Boolean matrix W is constructed as follows. Let C(ai) be the set of transitions to state ai. The rows of matrix W correspond to the elements of set C(ai). The columns of matrix W are divided into two parts according to the types of arguments of the transition function di. The columns of the first part correspond to the set B(ai) of FSM states from which transitions terminate in state ai, and the columns of the second part correspond to the set X(ai) of input variables whose values initiate the transitions to state ai. A one is put at the intersection of row t (t = \( \overline{1,T} \), T = |C(ai)|) and column j of the first part of matrix W if the transition ct (ct ∈ C(ai)) is executed from state aj (aj ∈ B(ai)). A one is put at the intersection of row t and column j of the second part of matrix W if the input variable xj (xj ∈ X(ai)) takes a significant value (0 or 1) on transition ct (ct ∈ C(ai)). The task is now reduced to partitioning matrix W into a minimum number H of row minors \( W_{ 1} , \ldots ,W_{H} \) so that the number of columns that contain ones in each minor Wh (h = \( \overline{1,H} \)) does not exceed the value r* defined according to (2). The rows of each minor Wh define the transitions to state ai_h (h = \( \overline{1,H} \)).
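The construction of matrix W and one possible way of grouping its rows can be sketched as follows. This is not the paper's Algorithm 2: the grouping is a plain first-fit heuristic chosen here for illustration, and it does not guarantee a minimum number H of minors. The function names, the (source, inputs) representation of transitions, and the input sets in the sample are all assumptions.

```python
def build_w(incoming):
    """Build matrix W for one state.

    incoming: list of (source_state, inputs) pairs, one per transition
    into the state being split (assumed representation).
    Returns (state_columns, input_columns, rows).
    """
    state_cols = sorted({src for src, _ in incoming})
    input_cols = sorted({x for _, xs in incoming for x in xs})
    rows = [
        [int(s == src) for s in state_cols] + [int(x in xs) for x in input_cols]
        for src, xs in incoming
    ]
    return state_cols, input_cols, rows


def first_fit_partition(rows, limit):
    """Group row indices so that each group has ones in at most `limit` columns.

    First-fit heuristic: the number of groups is not guaranteed to be minimal.
    """
    groups = []  # pairs (set of columns containing ones, list of row indices)
    for t, row in enumerate(rows):
        cols = {j for j, bit in enumerate(row) if bit}
        for covered, members in groups:
            if len(covered | cols) <= limit:
                covered |= cols  # in-place update of the group's column set
                members.append(t)
                break
        else:
            groups.append((cols, [t]))
    return [members for _, members in groups]


# Two transitions into one state from a1 and a6; the input sets are assumed.
incoming_a2 = [
    ("a1", {"x1", "x2", "x3", "x4", "x5"}),
    ("a6", {"x6", "x7", "x8", "x9", "x10"}),
]
state_cols, input_cols, w = build_w(incoming_a2)
parts = first_fit_partition(w, limit=6)  # limit plays the role of r*
```

With these assumed inputs W has 12 columns, each row has ones in 6 of them, and the two rows cannot share a minor under the limit of 6, so each row forms its own group.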

Let wt be some row of matrix W. To find the partition of the rows of matrix W into a minimum number H of row minors \( W_{ 1} , \ldots ,W_{H} \), the following algorithm can be used.

We illustrate the proposed synthesis method with an example. It is necessary to synthesize a high-speed FSM whose state diagram is shown in Fig. 3.

Fig. 3. State diagram of the initial FSM

This FSM is a Moore machine with 6 states \( a_{ 1} , \ldots ,a_{ 6} \), 10 input variables \( x_{ 1} , \ldots ,x_{ 10} \), and one output variable y. The transitions from states a3, a4, and a5 are unconditional, therefore the logical value 1 is written on these transitions as the transition condition. The sets B(ai) and X(ai) and the ranks ri of the transition functions for the initial FSM are presented in Table 1, where Ø is the empty set. Since max(|X(am,as)|) = 5 for this example, (2) gives r* = 6. The FSM is to be implemented on an FPGA with 6-input LUTs, i.e. n = 6.

Table 1. Values of B(ai), X(ai), ri, \( l_{i}^{s} \), and \( l_{i}^{p} \) for the initial FSM

According to (3) and (4), the values \( l_{i}^{s} \) and \( l_{i}^{p} \) are defined for each state (they are presented in the appropriate columns of Table 1). We do not know how the compiler decomposes Boolean functions, so we assume sequential decomposition (the worst case) and set the coefficient k in expression (5) to 10. As a result, the number of LUT levels necessary for the implementation of each transition function is \( l_{i} = l_{i}^{s} \). Thus, for our example we have int(lmid) = int(8/6) = 2. In other words, splitting of the FSM internal states stops as soon as each transition function can be implemented in two levels of LUTs.

For this example, we have \( l_{max} = l_{2}^{s} = 3 \), i.e. condition (6) is not met for state a2, since \( l_{max} = l_{2}^{s} = 3 > \text{int}\left( l_{mid} \right) = 2 \). For this reason, state a2 is split by means of Algorithm 2. Matrix W is constructed for splitting state a2 (Fig. 4).

Fig. 4. Matrix W for splitting state a2

Matrix W has two rows. Row w1 corresponds to the transition from state a1 to state a2, and row w2 corresponds to the transition from state a6 to state a2. The execution of Algorithm 2 leads to a partition of rows of matrix W into two subsets: W1 = {w1} and W2 = {w2}. So, state a2 is split into two states a2_1 and a2_2, as shown in Fig. 5.

Fig. 5. State diagram of the FSM after splitting state a2

The new values of B(ai), X(ai), ri, \( l_{i}^{s} \), and \( l_{i}^{p} \) are presented in Table 2. Now we have lmax = lmid = 1, so condition (6) is met and the execution of Algorithm 1 is complete.

Table 2. Values of B(ai), X(ai), ri, \( l_{i}^{s} \), and \( l_{i}^{p} \) after splitting state a2

Thus, for the given FSM by splitting state a2 we reduced the number of LUT levels from 3 to 1, in the case of sequential decomposition, and from 2 to 1, in the case of parallel decomposition.
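These figures can be checked against (3) and (4) with n = 6, using the ranks of the transition function of state a2 before and after splitting (12 and 6, as stated in the conclusions). Before splitting:

$$ l^{s} = \text{int} \left( {\frac{12 - 6}{6 - 1}} \right) + 1 = 2 + 1 = 3,\quad l^{p} = \text{int} \left( {\log_{6} 12} \right) = 2. $$

After splitting:

$$ l^{s} = \text{int} \left( {\frac{6 - 6}{6 - 1}} \right) + 1 = 0 + 1 = 1,\quad l^{p} = \text{int} \left( {\log_{6} 6} \right) = 1. $$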

4 Experimental Results

The efficiency of the proposed synthesis method was checked by implementing the initial FSM (Fig. 3) and the FSM after splitting state a2 (Fig. 5) on FPGAs from Altera using the Quartus II design tool, version 15.0. "Speed" was selected as the main optimization criterion. The "one-hot" method of state assignment was selected for the initial FSM, and the "user" method of state assignment was selected for the FSM after synthesis (the state codes are defined in the FSM description).

Table 3 presents the experimental results of the proposed method for various FPGA families, where nLUT1 and nLUT2 are the numbers of LUTs used in the implementation of the initial and the synthesized FSM, respectively; F1 and F2 are the clock frequencies (in MHz) of the initial and the synthesized FSM, respectively; F1/F2 is the ratio of the corresponding parameters.

Table 3. Results of the experimental research

Analysis of Table 3 shows that the proposed method increased the performance of the FSM for 5 of the 7 FPGA families. For example, performance increased by a factor of 1.52 for the MAX II family and by a factor of 1.35 for the Cyclone V family. In addition, the number of LUTs used decreased for the following families: Arria II GX, MAX V, and MAX II.

5 Conclusions

The experimental results show the following. In the considered example the rank of the transition function was reduced from 12 to 6, which reduced the number of LUT levels from 3 to 1 in the case of sequential decomposition and from 2 to 1 in the case of parallel decomposition; nevertheless, the performance of the FSM did not increase for all FPGA families. This indicates the complexity of synthesizing high-speed FSMs: FSM performance depends not only on the results of logic synthesis but also on the results of placement and routing. The reduction in the number of LUTs used for some FPGA families (as a result of applying the proposed method) has a simple explanation: as the number of LUT levels decreases, the number of LUTs also decreases.

The present study was supported by grant S/WI/1/2013 from Bialystok University of Technology and funded from the resources for research of the Ministry of Science and Higher Education.