A configurable multiplex data transfer model for asynchronous and heterogeneous FPGA accelerators on single DMA device

https://doi.org/10.1016/j.micpro.2020.103174

Abstract

To reduce DMA utilization for multiple algorithm IPs on FPGA, a channel-configurable and multiplexed DMA device (CMDMA) is proposed for asynchronous and heterogeneous algorithm IPs. Firstly, we abstract the entities and data-flow in the CMDMA system with a formal description for function definition and work-flow analysis. Then, based on the functions and work-flow, we design and implement a prototype of CMDMA, which includes the CMDMA software driver (SW) and hardware circuits (HW) of one DMA IP, a configurable input switch (CISwitch), algorithm IPs, and an asynchronous output switch (AOSwitch). The configurable function of CMDMA is implemented by CISwitch through a configuration port at the HW level, and a configurable Round-Robin (CRR) algorithm is proposed to schedule channels and input data at the SW level. For output, a channel-distinguishable output buffer (ChnDistBuf) is proposed, which is able to deliver the channel ID and data size to SW earlier than the end time of an algorithm IP. With a double-interrupt coordination method between ChnDistBuf and the algorithm IPs, CMDMA is able to successively store complete output data from different algorithm IPs. Experiments based on 4 heterogeneous matrix multiplication algorithm IPs on the Xilinx Zynq platform show that CMDMA improves average algorithm acceleration rates on a single algorithm IP by about 8%–29% compared to the exclusive method in which one DMA works for only one algorithm IP, and that it increases DMA input and output data throughput by about 10–40 MB/s and 5–15 MB/s, respectively, with multiple algorithm IPs running in parallel. Moreover, the extra LUT and FF resources consumed by CMDMA are 756 and 1219, both of which are about 1% of the Zynq platform.
Besides, in a test with two CNN algorithm IPs on the MNIST application, the enhanced data-broadcasting function of CMDMA shortens the total running time by about 4 s compared with a system of 4 exclusive DMAs running in parallel, while saving 3 DMA devices and 0.03 W of power consumption.

Introduction

The key challenge of data movement in a System-on-a-Chip (SoC) is transferring data from a peripheral or accelerator IP to a memory subsystem [1]. While small on-chip data movements can be accomplished with software instructions, larger transfers are usually handled by dedicated data transfer resources, e.g. DMA (Direct Memory Access). DMA is an efficient data transfer device in both computer and embedded systems, able to load/store large amounts of data without involving the CPU. Generally, in an FPGA-based embedded system, DMA is integrated as a soft IP core implemented with FPGA logic resources, and one DMA device provides only one pair of data input and output channels. As FPGA resources are precious and limited, designers usually reserve DMA for larger and more time-critical data transfers in accelerator IPs or interfaces. Kidav et al. [2] proposed a cycle-stealing DMA to load large precomputed delay values from external flash for an array signal processor. Hossein et al. [3] use one DMA device for high-performance data transfers in a PCIe interface design on FPGA. Rota et al. [4] employ two DMA devices to transmit and receive data through a pair of circular in and out buffers for a PCIe core on FPGA. Kim et al. [5] use multiple DMA devices independently to move video stream data in different stages of object and event tracking algorithms on FPGA. For compute-intensive algorithm IPs on FPGA, e.g. CNNs, the DMA device is usually exclusively owned by one algorithm IP for large data transfers from input and output buffers, to decrease the total running time of the algorithm [6], [7], [8], [9], [10]. In all of the above, DMA is used in an exclusive manner: one DMA IP works for only one host IP. Consequently, under the exclusive method, an embedded system with multiple algorithm IPs needs multiple DMA IPs for their data transfers.
Moreover, the exclusive method may be uneconomical, as the DMA can sit idle for a long time whenever its single host IP has no input or output data to transfer.

Multiplexing is a common and effective method to extend a single channel to multiple channels in various system designs [11], [12], [13], [14], [15]. In a multiplexed design, the main difficulties lie in the asynchronous scheduling mechanism [16], and it is hard to share transferred data between heterogeneous targets [17]. Generally, to control multiple data transfers, an arbitrator is integrated inside the multiplexer switch for multiplex channel arbitration [18]; the arbitration algorithm can be static, e.g. first-come-first-serve (FCFS) [19] or Round-Robin (RR) [20,21], or dynamic with data map tables [22,23]. However, a static arbitrator lacks flexibility in channel scheduling, while in a dynamic arbitrator the data map tables inside the multiplexer cost extra logic and memory resources, which may be uneconomical on resource-limited FPGA platforms.
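To make the static arbitration baseline concrete, the following is a minimal software model of a round-robin arbiter, the kind of static scheme cited above: among the channels currently requesting, the grant rotates starting just after the previously granted channel. This is an illustrative sketch, not circuitry or code from the paper.

```python
def round_robin_grant(requests, last_grant):
    """Return the index of the next granted channel under static round-robin.

    `requests` is a list of booleans (True = channel is requesting);
    `last_grant` is the index granted in the previous round. The search
    starts just after `last_grant`, so grants rotate fairly. Returns None
    when no channel is requesting.
    """
    n = len(requests)
    for offset in range(1, n + 1):
        idx = (last_grant + offset) % n
        if requests[idx]:
            return idx
    return None

# Channels 0 and 2 request; channel 0 was granted last, so channel 2 wins next.
print(round_robin_grant([True, False, True, False], last_grant=0))  # → 2
```

The fixed rotation order is exactly what the paper identifies as the static arbiter's weakness: channel priorities and the set of active channels cannot be reconfigured by software at run time.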

Another method is to enlarge the cache buffer sizes of a DMA device [24], [25], [26], [27], [28]. In this method, to implement multi-channel data transfers, a parallel buffer, or a buffer with a cache algorithm [25], is usually needed to keep data coherent between DDR and FPGA. The buffer grows larger, yet the effective data throughput is lower than the raw transfer rate, because additional routing information has to be packed into the transferred data packets, e.g. a target node ID for the sending direction, a timestamp for the sending sequence, etc. [27]. More importantly, storing multiple temporary data packets outside the multiplexer switch consumes significant memory resources, which may be unnecessary for a multiplex DMA design on FPGA, as the DMA device can access DDR and send data from DDR to an algorithm IP directly.

Unlike the above methods, in this paper a configurable multiplex data transfer model for asynchronous and heterogeneous FPGA accelerators on a single DMA device (CMDMA) is proposed, which extends a single DMA device into a multiplexed one to reduce DMA utilization in a system with multiple algorithm IPs. CMDMA is designed as a multiplexed data transfer device with a software/hardware co-design method: the input data are configured by the software application and scheduled by the CMDMA software driver, while the output is scheduled in hardware using the channel information carried in the output data sent by each algorithm IP. The contributions of this paper are listed as follows.

  • A method of multiplex data transfers on a single DMA device is proposed, i.e., CMDMA, which is configurable for multiplex data transfers among asynchronous and heterogeneous algorithm IPs. A distinctive feature of CMDMA is its support for configurable data broadcasting, which shares the transferred data with multiple channels at the same time.

  • A configurable multiplex data transfer model, i.e. CMmodel, is proposed, which abstracts the entities and data-flow in a CMDMA system as vectors of components, configuration parameters, and time events. CMmodel helps to define the function of each element in CMDMA and to analyze the data-flow between the entities of the CMDMA system.

  • A configurable input switch (CISwitch) is proposed for multiplexed input data transfers in CMDMA. CISwitch is a multiplexed switch whose data transfer channels are configurable through a configuration port, making the input scheduling method of CMDMA customizable by the user in software.

  • A configurable Round-Robin algorithm (CRR) is proposed for input data scheduling in the CMDMA software driver. CRR is a two-dimensional scheduling algorithm for multiple algorithm IPs with multiple data queues. For the algorithm IPs, the IP number and priority are configurable; for the data queues of each algorithm IP, the DMA parameters and data transfer channels are configurable through CISwitch.

  • A channel-distinguishable buffer (ChnDistBuf) is proposed in CMDMA. The innovation in ChnDistBuf is that it delivers the channel ID and output data size to the CMDMA software driver earlier than the real end time of an algorithm IP, which allows CMDMA to successively store complete output data from different algorithm IPs while improving single-IP performance and DMA data throughput.

  • A prototype of CMDMA with four multiplex channels is implemented. To test the prototype and measure the resource consumption and performance of CMDMA, one system with four heterogeneous matrix multiplication algorithm IPs and one system with two CNN IPs for the MNIST application are implemented on the Zynq platform.
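The snippet above does not give CRR's pseudocode, so as a rough illustration only, the two-dimensional structure it describes (an outer configurable round-robin over algorithm IPs, an inner round-robin over each IP's data queues) can be sketched as follows. The class name, the priority-as-visit-count weighting, and the method names are all assumptions for illustration, not the paper's exact CRR definition.

```python
from collections import deque

class CRRScheduler:
    """Illustrative two-level round-robin schedule (an assumption, not the
    paper's exact CRR): the outer level visits each algorithm IP `priority`
    times per cycle, and the inner level pops transfers from that IP's data
    queue in FIFO (round-robin) order."""

    def __init__(self, ip_priorities):
        # ip_priorities: {ip_id: priority weight}, both configurable at run time.
        self.priorities = ip_priorities
        self.queues = {ip: deque() for ip in ip_priorities}

    def enqueue(self, ip, transfer):
        """Queue a pending input data transfer for one algorithm IP."""
        self.queues[ip].append(transfer)

    def schedule_cycle(self):
        """Yield (ip, transfer) pairs for one full scheduling cycle.

        A higher-priority IP appears more often in the visit order, so it
        gets proportionally more DMA input slots per cycle.
        """
        order = []
        for ip, prio in self.priorities.items():
            order.extend([ip] * prio)
        for ip in order:
            if self.queues[ip]:
                yield ip, self.queues[ip].popleft()

sched = CRRScheduler({"ipA": 2, "ipB": 1})   # ipA is twice as "urgent" as ipB
sched.enqueue("ipA", "a0")
sched.enqueue("ipA", "a1")
sched.enqueue("ipB", "b0")
print(list(sched.schedule_cycle()))  # [('ipA', 'a0'), ('ipA', 'a1'), ('ipB', 'b0')]
```

In the real driver each transfer entry would also carry the DMA parameters and the CISwitch channel configuration, which is what makes the schedule "configurable" rather than a fixed hardware rotation.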

The rest of this paper is organized as follows. Section II introduces the base DMA core and some of its extensions. Section III presents the function definition and work flow of each element in CMDMA with a formal description of the CMDMA system. Section IV describes the implementation of CMDMA. Section V shows the experimental results, and Section VI concludes the paper.

Section snippets

Related work

This section begins with an introduction of the base DMA core from Xilinx, then investigates some of its extensions, and finally, presents the differences and design objectives of CMDMA.

Multiplex data model of CMDMA

In embedded systems, it is often impossible to run applications with complex algorithms in software due to the limited computation ability of embedded CPU cores. One possible solution is to move the complex algorithm to the FPGA part as an algorithm IP, which can then be called by the application. In this case, DMA is often employed to load and store data for these algorithm IPs directly from DDR to reduce data transfer time. However, for much more complex applications,

Implementation

Based on the functions and work-flow analysis of CMmodel in the last section, we implement a prototype of CMDMA on the Zynq platform, as shown in Fig. 7.

The hardware part (HW) includes one DMA IP, a configurable input switch (CISwitch), algorithm IPs (AlgmIPs), and an asynchronous output switch (AOSwitch). The software part (SW) includes the CMDMA driver and the IPSTtable. The CMDMA software driver is made up of CMDMAinit, an input controller, and two interrupt handlers for ChnDistBuf and the algorithm IPs. CRR schedule
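The two interrupt handlers mentioned above cooperate in the double-interrupt scheme described in the abstract: ChnDistBuf's interrupt delivers the channel ID and output size before the algorithm IP finishes, so the driver can stage the receive side in advance. The following is a hypothetical behavioral sketch of that coordination; the class and handler names are assumptions, and real driver code would program DMA descriptors rather than Python dictionaries.

```python
class OutputCoordinator:
    """Hypothetical model of the double-interrupt coordination: the
    ChnDistBuf interrupt arrives first with (channel_id, size), letting the
    driver stage a correctly sized receive buffer before the algorithm IP's
    completion interrupt triggers the actual output DMA store."""

    def __init__(self):
        self.pending = {}   # channel_id -> staged output size (early info)
        self.stored = []    # (channel_id, size) pairs stored in completion order

    def on_chndistbuf_irq(self, channel_id, size):
        # Early notification: record the size so a receive buffer can be
        # prepared while the algorithm IP is still computing.
        self.pending[channel_id] = size

    def on_algorithm_irq(self, channel_id):
        # IP completion: the buffer is already staged, so the output can be
        # stored immediately without a size query round-trip.
        size = self.pending.pop(channel_id)
        self.stored.append((channel_id, size))

coord = OutputCoordinator()
coord.on_chndistbuf_irq(2, 4096)   # channel 2 announces 4 KB of output early
coord.on_algorithm_irq(2)          # completion interrupt: stored without delay
print(coord.stored)  # [(2, 4096)]
```

Because each ChnDistBuf notification carries its channel ID, completions from different algorithm IPs can interleave in any order and the driver still matches each output block to the right channel, which is how CMDMA successively stores complete output data from asynchronous IPs.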

Experiments

To measure the performance and resource consumption of CMDMA, we implement a prototype of CMDMA on the Xilinx Zynq-7000 FPGA platform [39]. There are two parts in the Zynq platform. One is the processing subsystem (PS), which integrates a dual-core ARM processor for system control; we run the host application of CMDMA and the CMDMA driver on the PS. The other is the programmable logic (PL) subsystem, i.e. the FPGA part for user logic implementation, where we implement the hardware IPs of CMDMA, which

Conclusion

In this paper, a configurable multiplex data transfer model for asynchronous and heterogeneous FPGA accelerators on a single DMA device (CMDMA) is proposed to improve the flexibility and efficiency of a single DMA device on FPGA. Firstly, a data model for multiplex data transfers is proposed to define the functions and analyze the work-flow of CMDMA in a system. Based on the data model, a prototype of CMDMA is implemented, which is made up of the CMDMA software driver and hardware circuits of DMA,

Future work

The future work on CMDMA includes software and hardware parts. In software, the scenario addressed in this paper focuses on scheduling multiple algorithm IPs for one host application. In the future, CMDMA needs to be deployed in scenarios where multiple algorithm IPs are called by multiple applications, and a notification method for swapping tasks needs to be added to the CMDMA software driver. For the hardware of CMDMA, the work in this paper can be combined with Network-on-Chip (NoC)

Declaration of Competing Interest

The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.

Funding

This work was supported in part by the National Natural Science Foundation of China under Grant no. 61602016.


References (41)

  • L. Rota et al., "A PCIe DMA architecture for multi-GB per second data transmission," IEEE Trans. Nucl. Sci., 2015.

  • S. Kim et al., "Multi-object tracking coprocessor for multi-channel embedded DVR systems," IEEE Trans. Consum. Electron., 2012.

  • J. Qiu, S. Song, Y. Wang, H. Yang, J. Wang, S. Yao, K. Guo, B. Li, E. Zhou, J. Yu, T. Tang, and N. Xu, "Going Deeper...

  • G. Natale et al., "On how to design dataflow FPGA-based accelerators for convolutional neural networks."

  • D.T. Nguyen et al., "A high-throughput and power-efficient FPGA implementation of YOLO CNN for object detection," IEEE Trans. Very Large Scale Integr. Syst., 2019.

  • S. Zhang et al., "HACO-F: an accelerating HLS-based floating-point ant colony optimization algorithm on FPGA," Int. J. Perform. Eng., 2017.

  • Z. Huang et al., "An efficient method of parallel multiplication on a single DSP slice for embedded FPGAs," IEEE Access, 2019.

  • M. Oveis-Gharan et al., "Efficient dynamic virtual channel organization and architecture for NoC systems," IEEE Trans. Very Large Scale Integr. Syst., 2016.

  • I. Seitanidis et al., "ElastiStore: flexible elastic buffering for virtual-channel-based networks on chip," IEEE Trans. Very Large Scale Integr. Syst., 2015.

  • W. Sun et al., "A ripple control dual-mode single-inductor dual-output buck converter with fast transient response," IEEE Trans. Very Large Scale Integr. Syst., 2015.

    Zhangqin Huang received the B.S., M.S., and Ph.D. degrees in computer science from Xi'an Jiaotong University, Xi'an, China, in 1986, 1989, and 2000, respectively. From 2001 to 2003, he was a Postdoctoral Researcher in the Technische Universiteit Eindhoven (TU/e), Eindhoven, the Netherlands. He is currently a professor, doctoral supervisor of the Faculty of Information Technology, Beijing University of Technology. His current research interests include Internet of Things, Co-design for Embedded Software and Hardware, Human-Computer Interaction based on Internet, and mass data storage.

    Shuo Zhang was born in Pinggu, Beijing, China, in 1991. He received the B.S. degree in Software Engineering from the School of Software, Beijing University of Technology, Beijing, China, in 2013, and the MBA-DBA in Software Engineering from the same school in 2014. He is currently pursuing the Ph.D. degree at the Faculty of Information Technology, Beijing University of Technology, Beijing, China. His current research interests include co-design for embedded software and hardware, embedded system architecture, and FPGA hardware acceleration.

    Han Gao received the B.S. degree from Tianjin Normal University, Tianjin, China, in 2012 and the M.S. degree from Beijing University of Technology, Beijing, China, in 2017. She is currently pursuing the Ph.D. degree in software engineering at Beijing University of Technology, Beijing, China. Her research interests include network communication, co-design for embedded software and hardware, multiprocessing systems, and Internet of Things.

    Xiaobo Zhang received the B.S. degree in software engineering from North University of China, Shanxi, China, in 2016 and the M.S. degree in software engineering of Beijing University of Technology, Beijing, China, in 2019. He is currently pursuing the Ph.D. degree in software engineering at Beijing University of Technology, Beijing, China. From 2017 to 2019, he was a Research Member with the Beijing Engineering Research Center for IoT Software and Systems. He participated in the design and development of the LoRa gateway of the National Forest Development and Reform Commission's Smart Forest Project.

    Shengqi Yang received the B.S. degree from the Department of Mechanics and Engineering, Peking University, in 2000, and a double B.S. from the China Center for Economic Research, Peking University. He received the Ph.D. degree from Princeton University in 2006. He serves as an adjunct professor at Beijing University of Technology.
