ABSTRACT
Cooperation between a CPU and hardware accelerators on computationally intensive tasks provides significant advantages in run-time speed and energy. However, efficient management of data sharing among multiple computational kernels can rapidly turn into a complicated problem. The Accelerator Coherency Port (ACP) emerges as a possible solution by enabling hardware accelerators to issue coherent accesses to the memory space. In this paper, we quantify the advantages of using the ACP over the traditional method of sharing data through DRAM. We select the Xilinx ZYNQ as the target and develop an infrastructure to stress the ACP and high-performance (HP) AXI interfaces of the ZYNQ device. Hardware accelerators on both the HP and ACP AXI interfaces reach a full-duplex data processing bandwidth of over 1.6 GBytes/s running at 125 MHz on an XC7Z020-1C device. The effect of background DRAM and cache traffic on the performance of the accelerators is analyzed. For a sample image filtering task, the cooperative operation of CPU and ACP accelerator (CPU-ACP) gains a speed-up of 1.2X over CPU and HP acceleration (CPU-HP). In terms of energy efficiency, an improvement of 2.5 nJ (> 20%) is shown for each byte of processed data. This is the first work to present detailed practical comparisons of the speed and energy efficiency of various processor-accelerator memory sharing techniques in a configurable heterogeneous platform.
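As a rough sanity check on the figures quoted above, the following sketch relates the reported bandwidth to an ideal bus limit and derives the bound on baseline energy implied by the "> 20%" claim. It assumes a 64-bit (8-byte) AXI data path and reads "GBytes" as 10^9 bytes; neither is stated in the abstract, and the function names are purely illustrative.

```python
# Back-of-envelope checks on the numbers quoted in the abstract.
# ASSUMPTIONS (not from the abstract): 64-bit AXI data path, 1 GByte = 1e9 bytes.

def peak_full_duplex_bandwidth(clock_hz, bus_bytes=8):
    """Ideal read + write bandwidth in bytes/s, one data beat per cycle per direction."""
    return 2 * clock_hz * bus_bytes

def bus_utilization(measured_bytes_per_s, clock_hz, bus_bytes=8):
    """Fraction of the ideal full-duplex bandwidth that a measured rate represents."""
    return measured_bytes_per_s / peak_full_duplex_bandwidth(clock_hz, bus_bytes)

def baseline_energy_upper_bound(saving_nj_per_byte, min_fraction):
    """If a saving s exceeds fraction f of the baseline, the baseline is below s / f."""
    return saving_nj_per_byte / min_fraction

if __name__ == "__main__":
    peak = peak_full_duplex_bandwidth(125e6)
    print(f"ideal full-duplex peak: {peak / 1e9:.2f} GB/s")
    print(f"utilization at 1.6 GB/s: {bus_utilization(1.6e9, 125e6):.0%}")
    print(f"baseline energy bound: {baseline_energy_upper_bound(2.5, 0.20):.1f} nJ/byte")
```

Under these assumptions, the reported 1.6 GBytes/s would correspond to about 80% of the ideal limit, and a 2.5 nJ/byte saving that exceeds 20% would imply a CPU-HP baseline below 12.5 nJ/byte. These are inferences from the abstract's numbers, not figures reported by the authors.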
Index Terms
- Energy and performance exploration of accelerator coherency port using Xilinx ZYNQ