Dynamically reconfigurable dataflow architecture for high-performance digital signal processing

https://doi.org/10.1016/j.sysarc.2010.07.010

Abstract

In this paper a dataflow architecture is introduced that maps efficiently onto multi-FPGA platforms and is composed of communication channels that can be dynamically adapted to the dataflow of the algorithm. The topology can be reconfigured within a single clock cycle while DSP operations are in progress. Finally, the programmability and scalability of the proposed architecture are demonstrated by a high-performance parallel FFT implementation.

Introduction

In comparison to ASICs, FPGAs incur a 10- to 100-fold logic overhead in chip area because they are reconfigurable. Moreover, the regularly arranged configurable logic cells have to be interconnected via programmable routing switches, and the overall performance depends heavily on the routing results delivered by the design tools. In complex designs in particular, where routing resources become scarce, good routing solutions are hard to find, and the interconnection delay dominates the delay within the configurable logic cells. This results in clock rates that are usually about 20 times lower than those of general-purpose processors [1]. To overcome this problem, FPGA designs must extract massive amounts of parallelism. Additionally, today's FPGA vendors integrate highly optimized embedded multipliers, fast carry chains, large amounts of on-chip RAM, and dedicated arithmetic routing, all of which facilitate DSP operations.

When these features are coupled with the massive parallelism provided by FPGAs, the resulting systems can outperform the fastest DSP processors by one to two orders of magnitude. While this is easily achieved for matrix multiplication, for example, it can be difficult in other cases, particularly when more data dependencies exist, e.g., in computing parallel fast Fourier transforms (FFTs) [2], [3], [4]. Furthermore, the dedicated DSP resources in FPGAs are strongly limited. Thus, the maximum achievable computational performance depends heavily on how efficiently the system architecture scales to multi-FPGA platforms and is bounded by the total communication bandwidth between embedded DSP units. Therefore, modern FPGA devices are usually equipped with a large number of high-speed serial transceivers, which are characterized by high noise tolerance, clock data recovery, and error detection, all of which enable reliable transfer rates of several gigabits per second. Moreover, this allows arbitrary network topologies to be set up easily.

Another advantage of FPGAs is their ability to be reconfigured. For this reason, dynamic reconfiguration of FPGA architectures has become increasingly attractive [5], [6]. The idea is to map DSP algorithms efficiently onto hardware [7] and to modify parts in real time to switch from one function to another, e.g., by loading different filters in multimedia applications or a coprocessor on demand. However, one major drawback is that partially reconfiguring an FPGA architecture can take as long as milliseconds.

In this paper a dataflow architecture is introduced that can be efficiently mapped onto modern FPGAs. In this architecture, the topology of the interconnection between computational units can be dynamically reconfigured. In contrast to the concept of partially reconfiguring FPGAs, our approach is to connect DSP resources via a dynamically variable topology, so that the reconfiguration can be achieved within a single clock cycle and is done while arithmetic operations are in progress. Hence, the proposed dataflow architecture combines the basic idea of reconfiguration with the performance of scalable parallel processing.
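The idea of switching the interconnect from one clock cycle to the next, without pausing the data stream, can be illustrated with a toy cycle-level model. This is only a sketch under stated assumptions: the class `ReconfigurableInterconnect`, its methods, and the permutation-based routing are hypothetical names chosen for illustration and are not taken from the paper.

```python
from typing import List, Sequence


class ReconfigurableInterconnect:
    """Toy model of M units joined by a switchable topology.

    Hypothetical illustration only; this is not the authors' design.
    """

    def __init__(self, num_units: int) -> None:
        self.num_units = num_units
        # select[i] is the index of the unit whose output feeds unit i.
        self.select: List[int] = list(range(num_units))

    def set_topology(self, select: Sequence[int]) -> None:
        """Load a new routing; it is used from the next clock() call on."""
        if sorted(select) != list(range(self.num_units)):
            raise ValueError("select must be a permutation of unit indices")
        self.select = list(select)

    def clock(self, outputs: Sequence[int]) -> List[int]:
        """Advance one cycle: deliver each unit's output to its consumer."""
        return [outputs[src] for src in self.select]


def run_demo() -> List[List[int]]:
    """Route data through two different topologies on consecutive cycles."""
    net = ReconfigurableInterconnect(4)
    data = [10, 11, 12, 13]
    trace = []
    # Cycle 0: neighbour exchange (0<->1, 2<->3).
    net.set_topology([1, 0, 3, 2])
    data = net.clock(data)
    trace.append(data)
    # Cycle 1: a different topology (0<->2, 1<->3), with no idle cycle between.
    net.set_topology([2, 3, 0, 1])
    data = net.clock(data)
    trace.append(data)
    return trace
```

In this model the routing is just a vector of multiplexer selects, so changing it costs no more than writing a register, which is the contrast with millisecond-scale partial reconfiguration that the paragraph draws.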

Section snippets

Background

In the following, the typical characteristics of parallel algorithms are pointed out. Moreover, we explain how the topology of the architecture can be adapted to the dataflow of the algorithm so that direct inter-processor communication is available at all times and the computational throughput is maximized.

Dynamically reconfigurable dataflow architecture

The new contribution of this paper is the efficient mapping of a dynamically variable topology on modern FPGA architectures and, in particular, the development of a scalable extension of this concept for multi-FPGA platforms [12].

Parallel extension

In the previous sections the dataflow channels of the DRDA were defined to be serial only, i.e., W=1. However, a parallel extension of the transfer width W is motivated by the logarithmic increase of the propagation delay mentioned in Section 3.2.2. While a serial dataflow can be compensated by high operating frequencies for small values of M, it is no longer acceptable for M>=16, i.e., when the maximum operating frequency falls below 200 MHz (see Fig. 8). In particular, when operands with single
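The trade-off between transfer width W and operating frequency can be made concrete with a little arithmetic. The following is a minimal sketch, not a model from the paper: the function names are hypothetical, and the 32-bit operand size used in the usage note is an assumption chosen for illustration.

```python
def cycles_per_operand(operand_bits: int, width: int) -> int:
    """Clock cycles needed to move one operand over a channel W bits wide."""
    if operand_bits <= 0 or width <= 0:
        raise ValueError("operand_bits and width must be positive")
    # Ceiling division: a partially filled last transfer still costs a cycle.
    return -(-operand_bits // width)


def operands_per_second(clock_hz: float, operand_bits: int, width: int) -> float:
    """Operand rate of one channel at the given clock frequency."""
    return clock_hz / cycles_per_operand(operand_bits, width)
```

For example, at 200 MHz an assumed 32-bit operand needs 32 cycles on a serial channel (W=1), giving 6.25 million operands per second, whereas W=8 needs 4 cycles and gives 50 million operands per second at the same clock.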

Multi-FPGA hardware design

To prove the scalability of the proposed dataflow architecture, we developed a hardware board [17] that comprises two Xilinx Virtex-II Pro FPGAs connected at board level via six high-speed RocketIO™ transceivers. Moreover, multi-board computing platforms can be easily composed via four optical links, each operating at up to 3.125 Gb/s, and have been successfully tested.
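As a rough feel for the inter-board bandwidth, the quoted line rate can be turned into an aggregate figure. This is a back-of-the-envelope sketch only: it assumes that the 3.125 Gb/s figure applies to each of the four optical links and that the links use 8b/10b line coding (8 payload bits per 10 transmitted bits); neither the coding nor the function name `aggregate_payload_gbps` is stated in the text above.

```python
def aggregate_payload_gbps(line_rate_gbps: float, links: int,
                           data_bits: int = 8, code_bits: int = 10) -> float:
    """Total payload bandwidth of several serial links with line coding.

    data_bits/code_bits describe the assumed line code (8b/10b by default).
    """
    if links < 0 or data_bits <= 0 or code_bits < data_bits:
        raise ValueError("invalid link count or line-code parameters")
    return line_rate_gbps * links * data_bits / code_bits
```

Under these assumptions, `aggregate_payload_gbps(3.125, 4)` gives 10.0 Gb/s of payload across the four optical links, out of 12.5 Gb/s raw.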

Fig. 13 shows the block diagram of our printed circuit board (PCB) hardware design. We chose the Xilinx XC2VP30

Programming model

In recent years, significant increases in the silicon and algorithmic complexity of today's highly integrated embedded hardware and software systems have triggered a rise in design and verification costs. For this reason, the need for powerful development approaches has emerged, and a new paradigm known as electronic system level (ESL) design promises to usher in a new era in FPGA design. The term ESL refers to tools and methodologies that raise design abstraction to levels above the current

Application

Generally, the dynamically reconfigurable dataflow architecture (DRDA) is suitable for any kind of distributed high-performance digital signal processing (DSP). To demonstrate its effectiveness, the operational principle is illustrated using a high-performance parallel fast Fourier transform (FFT).
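Why an FFT benefits from a topology that changes at run time can be seen from its communication pattern: in a radix-2 FFT every element exchanges data with a different partner in each stage. The following is a minimal software sketch of a textbook radix-2 decimation-in-frequency FFT, not the paper's hardware mapping; the function names `butterfly_partner` and `fft_dif` are hypothetical.

```python
import cmath
from typing import List


def butterfly_partner(index: int, stage: int, n: int) -> int:
    """Partner of `index` at 0-based `stage` of an n-point radix-2 DIF FFT."""
    return index ^ (n >> (stage + 1))


def fft_dif(x: List[complex]) -> List[complex]:
    """Radix-2 decimation-in-frequency FFT; len(x) must be a power of two."""
    n = len(x)
    if n == 0 or n & (n - 1):
        raise ValueError("length must be a power of two")
    a = list(x)
    stages = n.bit_length() - 1
    for stage in range(stages):
        half = n >> (stage + 1)          # distance between butterfly partners
        nxt = a[:]
        for i in range(n):
            j = butterfly_partner(i, stage, n)
            if i < j:                    # handle each pair once
                k = i % half             # position inside the current block
                w = cmath.exp(-2j * cmath.pi * k / (2 * half))
                nxt[i] = a[i] + a[j]
                nxt[j] = (a[i] - a[j]) * w
        a = nxt
    # DIF leaves the spectrum in bit-reversed order; undo that here.
    out = [0j] * n
    for i in range(n):
        rev = 0
        for bit in range(stages):
            if i & (1 << bit):
                rev |= 1 << (stages - 1 - bit)
        out[rev] = a[i]
    return out
```

For n = 8, element 0 pairs with elements 4, 2, and 1 in stages 0, 1, and 2 respectively, so a fixed point-to-point wiring cannot serve every stage directly, while an interconnect that is re-routed between stages can.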

Conclusion

In this paper we presented an FPGA-based dataflow architecture that is composed of communication channels which can be dynamically adapted to the dataflow of the algorithm and maps efficiently onto multi-FPGA hardware platforms. The topology can be reconfigured within a single clock cycle while DSP operations are in progress. Moreover, only the computational unit of the DRDA components is application-specific and must be implemented according to the functional blocks of the DSP algorithm. The

References (26)

  • M. Silva et al., Support for partial run-time reconfiguration of platform FPGAs, JSA (2006)
  • J. McAllister et al., Rapid implementation and optimisation of DSP systems on FPGA-centric heterogeneous platforms, JSA (2007)
  • Z. Guo, W. Najjar, F. Vahid, K. Vissers, A quantitative analysis of the speedup factors of FPGAs over processors, in:...
  • J. Palmer, B. Nelson, A parallel FFT architecture for FPGAs, in: Proc. of the Int. Conference on Field Programmable...
  • W. Gentleman, G. Sande, Fast Fourier transforms – for fun and profit, in: Proc. of the AFIPS Joint Computer Conference,...
  • J. Cooley et al., An algorithm for machine calculation of complex Fourier series, Mathematics of Computation (1965)
  • B. Blodget, C. Bobda, M. Huebner, A. Niyonkuru, Partial and dynamically reconfiguration of Xilinx Virtex-II FPGAs, in:...
  • D. Heller, A survey of parallel algorithms in numerical linear algebra, SIAM Review (1978)
  • H. Richter, Multiprocessor with dynamically variable topology, Computer System Science and Engineering (1990)
  • V. Benes, Mathematical Theory of Connecting Networks and Telephone Traffic (1965)
  • K. Lee, On the rearrangeability of a (2logN-1) stage permutation network, IEEE Transactions on Computers (1985)
  • S. Voigt, T. Teufel, Dynamically reconfigurable dataflow for high-performance digital signal processing on Multi-FPGA...
  • CoreConnect Bus Architecture – An Open, 32-, 64-, 128-Bit Core on-Chip Bus Standard, IBM Microelectronics,...

    Sven-Ole Voigt received both the MSc degree and the Ph.D. degree in computer engineering from the Hamburg University of Technology, Germany. He joined NEC Electronics, Singapore, in 2003 and was responsible for embedded multimedia architectures. Since 2004 he has been a researcher at the Institute for Reliable Computing, Hamburg University of Technology, Germany, and was recently promoted to assistant professor. His research interests include high-performance dataflow architectures, reconfigurable application-specific instruction-set processors, embedded systems, and rapid prototyping.

    Malte Baesler received the MSc degree in electrical engineering from the Hamburg University of Technology, Germany. Since 2007 he has been a researcher at the Institute for Reliable Computing, Hamburg University of Technology, Germany. His research interests include computer arithmetic, embedded systems, and computer architecture.

    Thomas Teufel received the MSc degree in electrical engineering from the University of Bremen, Germany, and the Ph.D. degree in computer science, under the direction of Dr. Ulrich Kulisch, from the Karlsruhe Institute of Technology, Germany. Since 1991 he has been an associate professor of computer engineering at the Institute for Reliable Computing, Hamburg University of Technology, Germany. His research interests include implementation of algorithms in hardware, chip design, embedded systems for automation and control engineering, rapid prototyping, computer arithmetic, and real-time operating systems. He is a member of the IEEE.