DOI: 10.1145/2513683.2513688
research-article

Energy and performance exploration of accelerator coherency port using Xilinx ZYNQ

Published: 10 September 2013

ABSTRACT

Cooperation between a CPU and hardware accelerators on computationally intensive tasks provides significant advantages in run-time speed and energy. However, efficient management of data sharing among multiple computational kernels can rapidly turn into a complicated problem. The accelerator coherency port (ACP) emerges as a possible solution by enabling hardware accelerators to issue coherent accesses to the memory space. In this paper, we quantify the advantages of using the ACP over the traditional method of sharing data through DRAM. We select the Xilinx ZYNQ as the target and develop an infrastructure to stress the ACP and high-performance (HP) AXI interfaces of the ZYNQ device. Hardware accelerators on both the HP and ACP AXI interfaces reach a full-duplex data processing bandwidth of over 1.6 GBytes/s running at 125 MHz on an XC7Z020-1C device. We analyze the effect of background DRAM and cache traffic on accelerator performance. For a sample image filtering task, the cooperative operation of the CPU and an ACP accelerator (CPU-ACP) gains a speed-up of 1.2X over CPU and HP acceleration (CPU-HP). In terms of energy efficiency, an improvement of 2.5 nJ (> 20%) is shown for each byte of processed data. This is the first work to present detailed practical comparisons of the speed and energy efficiency of various processor-accelerator memory sharing techniques in a configurable heterogeneous platform.
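The figures quoted in the abstract can be put in context with a short back-of-envelope calculation. The sketch below is not taken from the paper: it assumes a 64-bit (8-byte) AXI data path, which the abstract does not state, and the function and constant names are hypothetical. Under that assumption it estimates the ideal full-duplex bandwidth at 125 MHz, the fraction of it that 1.6 GBytes/s represents, and the upper bound on the baseline energy per byte implied by "2.5 nJ (> 20%)".

```python
# Back-of-envelope check of the figures quoted in the abstract.
# ASSUMPTION (not stated in the abstract): the AXI data path is 64 bits wide.
AXI_WIDTH_BYTES = 8          # assumed 64-bit data path
CLOCK_HZ = 125e6             # accelerator clock, from the abstract
MEASURED_BPS = 1.6e9         # "over 1.6 GBytes/s", full duplex, from the abstract
SAVING_NJ_PER_BYTE = 2.5     # energy improvement per byte, from the abstract
MIN_SAVING_FRACTION = 0.20   # the improvement is stated as "> 20%"


def peak_full_duplex_bps(width_bytes, clock_hz):
    """Ideal bandwidth with one data beat per cycle on both read and write channels."""
    return 2 * width_bytes * clock_hz


def baseline_upper_bound(saving, min_fraction):
    """If `saving` exceeds `min_fraction` of the baseline, the baseline is below saving / min_fraction."""
    return saving / min_fraction


peak = peak_full_duplex_bps(AXI_WIDTH_BYTES, CLOCK_HZ)
utilization = MEASURED_BPS / peak
bound = baseline_upper_bound(SAVING_NJ_PER_BYTE, MIN_SAVING_FRACTION)

print(f"ideal full-duplex bandwidth: {peak / 1e9:.1f} GBytes/s")
print(f"reported bandwidth as a fraction of ideal: at least {utilization:.0%}")
print(f"implied CPU-HP energy: below {bound:.1f} nJ per byte")
```

If the actual port width differs from the assumed 64 bits, only the first two results change; the energy bound follows from the abstract's numbers alone.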


        Published in

        FPGAworld '13: Proceedings of the 10th FPGAworld Conference
        September 2013
        75 pages
        ISBN: 9781450324960
        DOI: 10.1145/2513683

        Copyright © 2013 ACM

        Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee.

        Publisher

        Association for Computing Machinery

        New York, NY, United States


        Acceptance Rates

        Overall acceptance rate: 4 of 6 submissions, 67%
