A configurable multiplex data transfer model for asynchronous and heterogeneous FPGA accelerators on single DMA device
Introduction
The key challenge of data movement in a System-on-a-Chip (SoC) is transferring data from a peripheral or accelerator IP to the memory subsystem [1]. While small on-chip data movements can be accomplished with software instructions, larger transfers are usually handled by dedicated data transfer resources such as DMA (Direct Memory Access). DMA is an efficient data transfer device in both computer and embedded systems, able to load and store large blocks of data without involving the CPU. In an FPGA-based embedded system, DMA is typically integrated as a soft IP core implemented with FPGA logic resources, and one DMA device provides only a single pair of input and output data channels. Because FPGA resources are precious and limited, designers usually reserve DMA for large, real-time data transfers to accelerated algorithm IPs or interfaces. Kidav et al. [2] proposed a cycle-stealing DMA that loads large precomputed delay values from external flash for an array signal processor. Hossein et al. [3] used one DMA device for high-performance data transfers in a PCIe interface design on FPGA. Rota et al. [4] employed two DMA devices to transmit and receive data through a pair of circular input and output buffers for a PCIe core on FPGA. Kim et al. [5] used multiple DMA devices independently to move video stream data between stages of object and event tracking algorithms on FPGA. For compute-intensive algorithm IPs on FPGA, e.g. CNNs, the DMA device is usually owned exclusively by one algorithm IP for large transfers between input and output buffers, in order to reduce the algorithm's total running time [6], [7], [8], [9], [10]. In all of these designs, DMA is used in an exclusive fashion: one DMA IP serves only one host IP. Consequently, with the exclusive method, an embedded system with multiple algorithm IPs requires multiple DMA IPs for their data transfers.
Moreover, the exclusive method may be uneconomical, as the DMA can sit idle for long periods when its single host IP has no input or output data to transfer.
Multiplexing is a common and effective method for extending a single channel to multiple channels in various system designs [11], [12], [13], [14], [15]. In a multiplexed design, the main difficulties lie in the asynchronous scheduling mechanism [16] and in sharing transferred data between heterogeneous targets [17]. Generally, to control multiple data transfers, an arbitrator is integrated inside the multiplexer switch for channel arbitration [18]. The arbitration algorithm can be static, e.g. first-come-first-serve (FCFS) [19] or Round-Robin (RR) [20], [21], or dynamic with data map tables [22], [23]. However, a static arbitrator lacks flexibility in channel scheduling, while in a dynamic arbitrator the data map tables inside the multiplexer cost additional logic and memory resources, which may be uneconomical on resource-limited FPGA platforms.
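To make the static arbitration concrete, the following is a minimal sketch of a Round-Robin arbiter of the kind typically embedded in a multiplexer switch: each cycle it grants the first requesting channel after the previously granted one. The function name and interface here are illustrative, not taken from the paper's implementation.

```python
def round_robin_grant(requests, last_grant):
    """Grant the first requesting channel after `last_grant`, wrapping around.

    requests   -- list of booleans, one per channel (True = channel requests)
    last_grant -- index of the channel granted in the previous cycle
    Returns the granted channel index, or None if no channel requests.
    """
    n = len(requests)
    for offset in range(1, n + 1):
        ch = (last_grant + offset) % n
        if requests[ch]:
            return ch
    return None

# Example: channel 0 was granted last, channels 0 and 2 are requesting;
# the search starts at channel 1, so channel 2 is granted next.
grant = round_robin_grant([True, False, True, False], 0)  # -> 2
```

The scheme is static in exactly the sense criticized above: the visit order is fixed in hardware, so a channel's share of the bandwidth cannot be reconfigured at run time.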
Another method is to enlarge the cache buffers of a DMA device [24], [25], [26], [27], [28]. To implement multi-channel data transfers with this method, a parallel buffer, or a buffer with a cache algorithm [25], is usually needed to keep data coherent between DDR and the FPGA. The buffer therefore grows larger, yet the effective data throughput is lower than the raw transfer rate, because additional routing information must be packaged into each data packet, e.g. a target node ID for the sending direction, a timestamp for the sending sequence, etc. [27]. More importantly, storing multiple temporary data packets outside the multiplexer switch consumes significant memory resources, which is unnecessary for a multiplexed DMA design on FPGA, since the DMA device can access DDR and send data from DDR to an algorithm IP directly.
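The throughput loss caused by packaging routing metadata with each payload can be quantified with a simple overhead calculation. The field sizes below (4-byte node ID, 8-byte timestamp, 64-byte payload) are assumptions chosen only to illustrate the effect, not figures from the cited designs.

```python
def effective_throughput(link_bps, payload_bytes, header_bytes):
    """Usable payload throughput on a link after per-packet header overhead."""
    return link_bps * payload_bytes / (payload_bytes + header_bytes)

# Assumed example: a 4-byte node ID plus an 8-byte timestamp attached to
# every 64-byte payload consumes 12/76 ~ 15.8% of the raw link bandwidth.
rate = effective_throughput(1_000_000_000, 64, 12)
```

The smaller the payload per packet, the larger this relative loss, which is why header-per-packet schemes are costly for fine-grained multi-channel transfers.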
Unlike the above methods, this paper proposes a configurable multiplex data transfer model for asynchronous and heterogeneous FPGA accelerators on a single DMA device (CMDMA), which extends a single DMA device to multiplexed operation and thereby reduces DMA resource usage in a system with multiple algorithm IPs. CMDMA is designed as a multiplexed data transmission device using a software-hardware co-design method: input data are configured by the software application and scheduled by the CMDMA software driver, while output is scheduled by hardware using the channel information carried in the output data sent by each algorithm IP. The contributions of this paper are as follows.
- A method of multiplex data transfers on a single DMA device, i.e. CMDMA, which is configurable for multiplexed data transfers among asynchronous and heterogeneous algorithm IPs. A distinctive feature of CMDMA is its support for configurable data broadcasting, which shares transferred data with multiple channels at the same time.
- A configurable multiplex data transfer model, i.e. CMmodel, which abstracts the entities and data flow of a CMDMA system as vectors of components, configuration parameters, and time events. CMmodel helps define the function of each element in CMDMA and analyze the data flow between entities in a CMDMA system.
- A configurable input switch (CISwitch) for multiplexed input data transfers in CMDMA. CISwitch is a multiplexed switch whose data transfer channels are configurable through a configuration port, making the input scheduling method of CMDMA customizable by the user in software.
- A configurable Round-Robin algorithm (CRR) for input data scheduling in the CMDMA software driver. CRR is a two-dimensional scheduling algorithm for multiple algorithm IPs with multiple data queues: the number and priority of algorithm IPs are configurable, and for the data queues of each algorithm IP, the DMA parameters and data transfer channels are configurable via CISwitch.
- A channel distinguishable buffer (ChnDistBuf). The innovation in ChnDistBuf is that it delivers the channel ID and output data size to the CMDMA software driver earlier than the actual end time of an algorithm IP, which enables CMDMA to successively store complete output data from different algorithm IPs while improving single-IP performance and DMA data throughput.
- A prototype of CMDMA with four multiplexed channels. To test the prototype and measure the resource consumption and performance of CMDMA, a system with four heterogeneous matrix multiplication algorithm IPs and a system with two CNN IPs for an MNIST application are implemented on the Zynq platform.
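The two-dimensional character of the CRR schedule described above, an outer round over algorithm IPs weighted by a configurable priority and an inner round over each IP's data queues, can be sketched as follows. The class name, the priority semantics (priority = number of transfers an IP may issue per outer round), and the data layout are all assumptions for illustration; the paper's driver-level details may differ.

```python
from collections import deque

class CRRScheduler:
    """Sketch of a two-dimensional configurable Round-Robin (CRR) schedule:
    outer RR over algorithm IPs (weighted by priority), inner RR over each
    IP's data queues."""

    def __init__(self, ip_priorities, queues_per_ip):
        self.priorities = dict(ip_priorities)   # configurable IP priorities
        self.ips = list(ip_priorities)          # outer round-robin order
        self.queues = {ip: [deque() for _ in range(queues_per_ip)]
                       for ip in self.ips}
        self.inner = {ip: 0 for ip in self.ips}  # inner RR pointer per IP

    def submit(self, ip, queue_idx, descriptor):
        """Enqueue a DMA transfer descriptor for one of an IP's queues."""
        self.queues[ip][queue_idx].append(descriptor)

    def schedule_round(self):
        """One outer round: each IP may issue up to `priority` transfers,
        drawn round-robin from its non-empty queues."""
        issued = []
        for ip in self.ips:
            for _ in range(self.priorities[ip]):
                desc = self._next_from(ip)
                if desc is None:
                    break
                issued.append((ip, desc))
        return issued

    def _next_from(self, ip):
        qs = self.queues[ip]
        for _ in range(len(qs)):
            i = self.inner[ip]
            self.inner[ip] = (i + 1) % len(qs)
            if qs[i]:
                return qs[i].popleft()
        return None
```

For example, with priorities `{"A": 2, "B": 1}`, IP A is offered two transfer slots per round and IP B one, so reconfiguring the priority table changes the bandwidth split without touching the hardware switch.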
The rest of this paper is organized as follows. Section II introduces the base DMA core and some of its extensions. Section III presents the function definition and workflow of each element in CMDMA with a formal description of the CMDMA system. Section IV describes the implementation of CMDMA. Section V shows the experimental results, and Section VI concludes the paper.
Related work
This section begins with an introduction of the base DMA core from Xilinx, then investigates some of its extensions, and finally presents the differences and design objectives of CMDMA.
Multiplex data model of CMDMA
In embedded systems, it is often impossible to run applications with complex algorithms in software due to the limited computational ability of embedded CPU cores. One possible solution for complex applications is to move the complex algorithm into the FPGA fabric as an algorithm IP, which can then be called by the application. In this case, DMA is often employed to load and store data for these algorithm IPs directly from DDR to reduce data transfer time. However, for much more complex applications,
Implementation
Based on the function and workflow analysis of CMmodel in the last section, we implement a prototype of CMDMA on the Zynq platform, as shown in Fig. 7.
The hardware part (HW) includes one DMA IP, a configurable input switch (CISwitch), algorithm IPs (AlgmIPs), and an asynchronous output switch (AOSwitch). The software part (SW) includes the CMDMA driver and the IPSTtable. The CMDMA software driver is made up of CMDMAinit, the input controller, and two interrupt handlers for ChnDistBuf and the algorithm IPs. CRR schedule
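Since the output path in this design is scheduled by hardware using channel information carried with the output data, the driver's role on that path can be modeled as a simple demultiplexer: each completed transfer announces a (channel ID, size) pair ahead of its payload, and the payload is routed to the matching destination buffer. The flat word-stream layout below is an assumption made only to illustrate the idea, not the actual ChnDistBuf/AOSwitch interface.

```python
def demux_outputs(stream, destinations):
    """Route channel-tagged output data to per-channel destination buffers.

    stream       -- flat list of words: [chn, size, w0, ..., w(size-1), chn, ...]
    destinations -- dict mapping channel ID -> list collecting that channel's data
    Returns `destinations` after consuming the whole stream.
    """
    i = 0
    while i < len(stream):
        chn, size = stream[i], stream[i + 1]          # header: channel ID, payload size
        destinations[chn].extend(stream[i + 2:i + 2 + size])  # payload words
        i += 2 + size
    return destinations
```

Because the (channel ID, size) header arrives before the payload is complete, the driver can prepare the destination buffer for the next transfer while the current one is still draining, which is the effect the ChnDistBuf design aims for.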
Experiments
To measure the performance and resource consumption of CMDMA, we implement a prototype of CMDMA on the Xilinx Zynq-7000 FPGA platform [39]. The Zynq platform has two parts. One is the processing subsystem (PS), integrated with a dual-core ARM processor for system control, on which we run the host application of CMDMA and the CMDMA driver. The other is the programmable logic (PL) subsystem, the FPGA part for user logic implementation, in which we implement the hardware IPs of CMDMA, which
Conclusion
In this paper, a configurable multiplex data transfer model for asynchronous and heterogeneous FPGA accelerators on a single DMA device (CMDMA) is proposed to improve the flexibility and efficiency of a single DMA device on FPGA. First, a data model for multiplex data transfers is proposed to define the functions and analyze the workflow of CMDMA in a system. Based on the data model, a prototype of CMDMA is implemented, which is made up of the CMDMA software driver and hardware circuits of DMA,
Future work
The future work on CMDMA includes software and hardware parts. In software, the scenario addressed in this paper focuses on scheduling multiple algorithm IPs for one host application. In the future, CMDMA needs to be deployed in the scenario where multiple algorithm IPs are called by multiple applications, and a notification method for swapping tasks needs to be added to the CMDMA software driver. For the hardware of CMDMA, the work in this paper can be combined with Network-on-Chip (NoC)
Declaration of Competing Interest
The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.
Funding
This work was supported in part by the National Natural Science Foundation of China under Grant no. 61602016.
References (41)
- et al., Architecture and FPGA prototype of cycle stealing DMA array signal processor for ultrasound sector imaging systems, Microprocess. Microsyst. (2019)
- et al., Optimizing FPGA-based hard networks-on-chip by minimizing and sharing resources, Integration (2018)
- et al., EMVs: embedded multi vector-core system, J. Syst. Archit. (2018)
- Simple method of asynchronous circuits implementation in commercial FPGAs, Integr. VLSI J. (2017)
- et al., Extended overlay architectures for heterogeneous FPGA cluster management, J. Syst. Archit. (2017)
- et al., Evaluation of queue designs for true fully adaptive routers, J. Parallel Distrib. Comput. (2004)
- et al., On Round-Robin routing with FCFS and LCFS scheduling, Perform. Eval. (2016)
- et al., Hardware transactional memory architecture with adaptive version management for multi-processor FPGA platforms, J. Syst. Archit. (2017)
- S. Erusalagandi, "Leveraging data-mover IPs for data movement in Zynq-7000 AP SoC systems", 2015. [Online]....
- et al., High performance FPGA-based DMA interface for PCIe, IEEE Trans. Nucl. Sci. (2014)
- A PCIe DMA architecture for multi-GB per second data transmission, IEEE Trans. Nucl. Sci.
- Multi-object tracking coprocessor for multi-channel embedded DVR systems, IEEE Trans. Consum. Electron.
- On how to design dataflow FPGA-based accelerators for convolutional neural networks
- A high-throughput and power-efficient FPGA implementation of YOLO CNN for object detection, IEEE Trans. Very Large Scale Integr. Syst.
- HACO-F: an accelerating HLS-based floating-point ant colony optimization algorithm on FPGA, Int. J. Perform. Eng.
- An efficient method of parallel multiplication on a single DSP slice for embedded FPGAs, IEEE Access
- Efficient dynamic virtual channel organization and architecture for NoC systems, IEEE Trans. Very Large Scale Integr. Syst.
- ElastiStore: flexible elastic buffering for virtual-channel-based networks on chip, IEEE Trans. Very Large Scale Integr. Syst.
- A ripple control dual-mode single-inductor dual-output buck converter with fast transient response, IEEE Trans. Very Large Scale Integr. Syst.
Zhangqin Huang received the B.S., M.S., and Ph.D. degrees in computer science from Xi'an Jiaotong University, Xi'an, China, in 1986, 1989, and 2000, respectively. From 2001 to 2003, he was a Postdoctoral Researcher at the Technische Universiteit Eindhoven (TU/e), Eindhoven, the Netherlands. He is currently a professor and doctoral supervisor in the Faculty of Information Technology, Beijing University of Technology. His current research interests include the Internet of Things, co-design of embedded software and hardware, Internet-based human-computer interaction, and mass data storage.
Shuo Zhang was born in Pinggu, Beijing, China, in 1991. He received the B.S. degree in Software Engineering from the School of Software, Beijing University of Technology, Beijing, China, in 2013, and the MBA-DBA in Software Engineering from the same school in 2014. He is currently pursuing the Ph.D. degree in the Faculty of Information Technology at Beijing University of Technology, Beijing, China. His current research interests include co-design of embedded software and hardware, embedded system architecture, and FPGA hardware acceleration.
Han Gao received the B.S. degree from Tianjin Normal University, Tianjin, China, in 2012, and the M.S. degree from Beijing University of Technology, Beijing, China, in 2017. She is currently pursuing the Ph.D. degree in software engineering at Beijing University of Technology, Beijing, China. Her research interests include network communication, co-design of embedded software and hardware, multiprocessing systems, and the Internet of Things.
Xiaobo Zhang received the B.S. degree in software engineering from North University of China, Shanxi, China, in 2016 and the M.S. degree in software engineering from Beijing University of Technology, Beijing, China, in 2019. He is currently pursuing the Ph.D. degree in software engineering at Beijing University of Technology, Beijing, China. From 2017 to 2019, he was a Research Member with the Beijing Engineering Research Center for IoT Software and Systems. He participated in the design and development of the LoRa gateway of the National Forest Development and Reform Commission's Smart Forest Project.
Shengqi Yang received the B.S. degree from the Department of Mechanics and Engineering, Peking University, in August 2000, and a double B.S. degree from the China Center for Economic Research, Peking University. He received his Ph.D. from Princeton University in 2006. He serves as an adjunct professor at Beijing University of Technology.