Automatic cache partitioning method for high-level synthesis

https://doi.org/10.1016/j.micpro.2019.02.013

Abstract

Existing algorithms can be automatically translated from software to hardware using High-Level Synthesis (HLS), allowing for quick prototyping or deployment of embedded designs. High-level software is written with a single main memory in mind, whereas hardware designs can take advantage of many parallel memories. The translation and optimization of memory usage, and the generation of the resulting architectures, are important for high-performance designs. Tools provide optimizations on memory structures targeting data reuse and partitioning, but generally these are applied separately for a given object in memory. Memory access that cannot be effectively optimized is serialized to the memory, hindering any further parallelization of the surrounding generated hardware.

In this work, we present an automated optimization method for creating custom cache memory architectures for HLS-generated designs. Our optimization uses runtime profiling data and is performed at a localized scope. The method combines data reuse savings and memory partitioning to further increase the potential parallelism and alleviate serialized memory access, improving performance. Comparisons are made against architectures without this optimization and against other HLS caching approaches. Results show that designs produced with this method require 72% of the execution cycles of a single-cache design, and 31% of the cycles of designs with no caches.

Introduction

In recent years, the performance requirements for embedded systems have grown significantly. A standard microprocessor-based embedded system often cannot keep up with the demands of high-performance applications, which require many parallel tasks, higher data throughput, and greater computational power. Many developers are therefore turning to reconfigurable hardware to speed up their designs. An architecture created specifically for the task at hand will generally outperform a general-purpose architecture.

Field-Programmable Gate Arrays (FPGAs), which allow for the creation of special-purpose digital circuits, are becoming more popular in embedded systems. They can be used in a wide range of applications and have been shown to be beneficial for high-performance systems while also reducing power consumption.

Designing a high-performance embedded system on an FPGA can be difficult, whether it is created from scratch or migrated/hybridized from an existing system. HLS tools allow for the automatic translation of existing software algorithms into a Register-Transfer Level (RTL) hardware description, which can then be synthesized to FPGAs or Application-Specific Integrated Circuits (ASICs). Hardware design time can be significantly shortened by using this tool flow. This gives developers the ability to rapidly prototype new systems, or to easily reuse legacy code written in a high-level language, with C/C++ being the most common input languages.

In the automatic generation of FPGA hardware from software, the memory architecture is a key component and a prime focus for optimization. Software is written under the assumption of a single monolithic memory. FPGAs, on the other hand, are built around a distributed network of on-chip memory resources, Block Random-Access Memory (BRAM). HLS tools must be intelligent when translating software into hardware in order to best utilize these distributed memory resources. In most designs, memory is the bottleneck, and the ability of tools to effectively distribute and optimize the memory architecture is critical to achieving high performance.

HLS tools offer a wide range of optimizations that can be applied when translating a design from software to hardware. Typically, extensive hardware knowledge is not needed, but some is required to understand how a specific tool works and what capabilities it exposes to the end user. There are usually opportunities for the user to gain automatic benefits through various optimizations. Tools do a relatively good job of this on their own; however, user directives are available and are typically required to perform further optimizations.

In general, maximizing the throughput and speed of a design can be achieved by increasing parallelism. With regard to memory access, throughput can be increased by reducing the access time to resources and by parallelizing multiple operations to allow simultaneous access. Memory operations accessing the same memory port must wait for one another, and their execution becomes serialized. We refer to this as the memory access serialization problem, where further parallelization of a design is hindered by the serial access requirement of a memory resource. Reducing individual access times or executing multiple instructions in parallel allows for an overall reduction in computation time.
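As a hedged illustration of this serialization (a made-up kernel, not one of the paper's benchmarks), consider the Java loop below. Each iteration reads a[i] and a[i+1] from the same array; if that array is mapped to a single-ported memory, the two reads cannot be issued in the same cycle, so the memory port, rather than the adder, bounds the iteration latency.

    // Hypothetical kernel: two reads of the same array per iteration.
    // With 'a' in a single-ported memory, the hardware schedule must issue
    // the reads one after the other, serializing the loop body.
    static void pairSum(int[] a, int[] out, int n) {
        for (int i = 0; i + 1 < n; i++) {
            out[i] = a[i] + a[i + 1];   // both reads contend for one port
        }
    }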

Common HLS optimizations such as loop unrolling and loop pipelining further accentuate the access serialization problem. Loop unrolling duplicates the body of the loop in exchange for a reduction in the number of loop iterations. By copying the loop body, the number of memory operations per iteration doubles, and typically the accesses are incremented versions of those in the previous iteration (e.g. a[i], a[i+1]). Loop iterations can be parallelized, but access to memory must be serialized. Loop pipelining overlaps the execution of multiple iterations of a loop. With both of these optimizations, the potential parallelism is limited by the serial requirement for access to memory.
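Continuing the illustrative kernel above, the sketch below unrolls the loop by a factor of two: the reads of a per iteration double, and on a single-ported memory all of them are still issued one at a time. Splitting a into hypothetical even- and odd-indexed banks, as a partitioning optimization might, places a[i] and a[i+1] in different memories so that pairs of reads can again proceed in parallel.

    // Unrolled by 2: four reads of 'a' per iteration now compete for one
    // memory port (remainder iterations omitted for brevity).
    static void pairSumUnrolled(int[] a, int[] out, int n) {
        for (int i = 0; i + 2 < n; i += 2) {
            out[i]     = a[i]     + a[i + 1];
            out[i + 1] = a[i + 1] + a[i + 2];
        }
    }

    // Hypothetical cyclic partitioning of 'a' into two banks holding the
    // even- and odd-indexed elements. a[i] and a[i+1] now reside in
    // separate memories, so both reads of each sum can occur in one cycle.
    static void pairSumBanked(int[] aEven, int[] aOdd, int[] out, int n) {
        for (int i = 0; i + 2 < n; i += 2) {
            out[i]     = aEven[i / 2] + aOdd[i / 2];          // a[i]   + a[i+1]
            out[i + 1] = aOdd[i / 2]  + aEven[i / 2 + 1];     // a[i+1] + a[i+2]
        }
    }

Loop pipelining benefits in the same way: once the per-iteration memory operations can overlap, the achievable initiation interval is no longer dictated by a single serialized memory port.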

This work aims to address this problem with respect to HLS designs. We present an automated approach for creating multiple localized and partitioned caches based on memory access patterns of the chosen application. Commercial tools, such as Vivado HLS [1], do not currently support caching architectures for HLS designs, and there is limited research on the topic. This method combines data reuse and partitioning techniques (explained in Section 2), supports both on-chip and off-chip memory, operates on both global and local scopes of an application, and utilizes many analytical techniques.

The remainder of this paper is organized as follows. Section 2 outlines some existing work. Section 3 describes the HLS toolchain used to implement this work. Section 4 presents the method for this work. Section 5 presents results showing a significant speedup in performance and a reduction in execution latency. Finally, a conclusion and future work are provided in Section 6.

Related work

There is a large body of research on the optimization of memory architectures in HLS-generated designs. In this section, we discuss some of these approaches with regard to data reuse and partitioning of memory resources. Specifically, we outline optimizations involving data reuse, partitioning, and caching.

Many researchers use the polyhedral model for analyzing memory access patterns [2], [3], [4], [5], [6], [7], [8], [9]. The polyhedral model is a method for modeling loop

Flowpaths high-level synthesis tool

In this work, for convenience, we utilize our own HLS tool called Flowpaths. Although most HLS tools use C/C++ as an input language, this tool uses Java, and outputs to VHDL. Previously, Flowpaths utilized the stack-based Java bytecode as its Intermediate Representation (IR) [18], [19], developing a simple, yet fast, architecture of chained operations that connected to a single global memory. In the latest version, a parallel execution model was desired that allowed for fine-grained control of

Overview

In this work, we present an automated multi-level cache architecture method for optimizing data reuse and memory partitioning in HLS generated architectures by creating caches that are localized to specific regions in a design. Fig. 1 outlines the flow of the method and how it fits into the Flowpaths compiler. Our method uses a profiling-based analysis for determining data dependence among memory read and write instructions. Dependence is determined by comparing the address space that each
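A minimal sketch of such a profiling-based dependence test is given below; the class and method names are assumptions for illustration only, not the Flowpaths API. Each memory instruction is summarized by the address range it touched during the profiled run, and two instructions are treated as independent only if their recorded ranges do not overlap.

    // Hypothetical profile record: the [min, max] addresses that a single
    // memory instruction was observed to access during a profiled run.
    final class AccessRange {
        final long min, max;

        AccessRange(long min, long max) {
            this.min = min;
            this.max = max;
        }

        // Two instructions are treated as potentially dependent (and so must
        // share a memory or cache) if their profiled address ranges overlap.
        boolean overlaps(AccessRange other) {
            return this.min <= other.max && other.min <= this.max;
        }
    }

Under this reading, instructions whose recorded ranges never overlap can be served by separate localized caches, exposing the additional memory-level parallelism the method targets.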

Results

In this section, we provide results of the optimization presented here using the Flowpaths HLS toolchain. Comparisons are made between architectures with and without the optimization. All other optimizations and compiler parameters are kept the same between compilations; the only difference is the activation of the optimization.

Results are presented from several benchmarks ranging from simple loop kernels to larger problems such as clustering. Each Java example was processed through the

Conclusion and future work

In this work, we presented an automated method for creating multi-level custom cache architectures for generated HLS designs. A profiling-based analysis is used to determine regions where caches can be placed based on the memory access patterns, creating a custom caching architecture tuned for that application. A search method was provided that tests many possible caching structures and compares them using a lower-bound cycle estimation based on the HLS hardware scheduling.
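As a rough sketch of what such a search could look like (the configuration fields and the cost formula below are placeholders, not the paper's published model), every candidate caching structure is scored with an estimated lower bound on execution cycles, and the cheapest candidate is kept.

    import java.util.List;

    // Hypothetical candidate caching structure: number of localized caches
    // and lines per cache. These fields are illustrative only.
    final class CacheConfig {
        final int numCaches;
        final int linesPerCache;

        CacheConfig(int numCaches, int linesPerCache) {
            this.numCaches = numCaches;
            this.linesPerCache = linesPerCache;
        }
    }

    final class CacheSearch {
        // Keep the candidate with the smallest estimated cycle count.
        static CacheConfig best(List<CacheConfig> candidates) {
            CacheConfig best = null;
            long bestCycles = Long.MAX_VALUE;
            for (CacheConfig c : candidates) {
                long cycles = estimateCycles(c);
                if (cycles < bestCycles) {
                    bestCycles = cycles;
                    best = c;
                }
            }
            return best;
        }

        // Stand-in for the lower-bound cycle estimate that the paper derives
        // from the HLS hardware schedule and profiling data; this formula is
        // a placeholder, not the published model.
        static long estimateCycles(CacheConfig c) {
            long serializedAccesses = 1_000_000L / Math.max(1, c.numCaches);
            long missPenalty = 10_000L / Math.max(1, c.linesPerCache);
            return serializedAccesses + missPenalty;
        }
    }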

Results were provided

References (22)

  • Xilinx, Vivado design suite user guide: high-level synthesis, 2016,...
  • Q. Liu et al., Automatic on-chip memory minimization for data reuse, Proceedings of the 2007 IEEE Symposium on Field-Programmable Custom Computing Machines (FCCM), 2007.
  • J. Cong et al., Combined loop transformation and hierarchy allocation for data reuse optimization, Proceedings of the International Conference on Computer-Aided Design, 2011.
  • L.-N. Pouchet et al., Polyhedral-based data reuse optimization for configurable computing, Proceedings of the ACM/SIGDA International Symposium on Field Programmable Gate Arrays, 2013.
  • I. Issenin et al., Data reuse analysis technique for software-controlled memory hierarchies, Proceedings of the Design, Automation and Test in Europe Conference and Exhibition, 2004.
  • J. Cong et al., Optimizing memory hierarchy allocation with loop transformations for high-level synthesis, Proceedings of the 49th Annual Design Automation Conference, 2012.
  • L. Gallo et al., Area implications of memory partitioning for high-level synthesis on FPGAs, Proceedings of the 24th International Conference on Field Programmable Logic and Applications (FPL), 2014.
  • Y. Wang et al., Theory and algorithm for generalized memory partitioning in high-level synthesis, Proceedings of the 2014 ACM/SIGDA International Symposium on Field-Programmable Gate Arrays, 2014.
  • J. Cong et al., Automatic memory partitioning and scheduling for throughput and power optimization, Proceedings of the 2009 IEEE/ACM International Conference on Computer-Aided Design, 2009.
  • Y.T. Chen et al., Automated generation of banked memory architectures in the high-level synthesis of multi-threaded software, Proceedings of the 27th International Conference on Field Programmable Logic and Applications (FPL), 2017.
  • Y. Zhou et al., A new approach to automatic memory banking using trace-based address mining, Proceedings of the 2017 ACM/SIGDA International Symposium on Field-Programmable Gate Arrays (FPGA ’17), 2017.

Bryant Jones received his B.S. degree from Oakland University in 2011 with honors in both computer engineering and electrical engineering, his M.S. degree in embedded systems in 2012, and his Ph.D. in electrical and computer engineering in 2018, all from Oakland University. Dr. Jones completed his undergraduate and graduate research as a Research Assistant with the Nano Imaging Laboratory at Oakland University between 2010 and 2018. Since 2017 he has been a Senior Embedded Engineer with Intrepid Control Systems, Inc. His research interests include high-level synthesis, embedded microprocessor design, high-performance embedded systems, and optimization.

Darrin M. Hanna (M’99) received his B.S. in computer engineering and mathematics and M.S. degree in computer science and engineering from Oakland University, Rochester, Michigan, in 1999, and his Ph.D. in systems engineering from Oakland University in 2003. He is Professor of Engineering in the Department of Electrical and Computer Engineering and Director of the High-Speed Vehicle Networks Laboratory at Oakland University, supported by Intrepid Control Systems. Since 1999 he has served as a consultant to several companies for research, development, and commercialization. He is the author of 14 books, more than 50 articles, and 3 licensed technologies. His research interests include high-level synthesis, artificial intelligence, high-speed embedded systems, and nanoimaging. Professor Hanna is a Member of IEEE and ASEE. He was the recipient of the 2007 IEEE Computer Society Computer Science and Engineering Undergraduate Teaching Award.
