An optimal allocation of memory buffers for complex multicore platforms

https://doi.org/10.1016/j.sysarc.2016.05.002

Abstract

In deeply embedded heterogeneous multicores the allocation of data to memories is crucial for application performance. For applications with stringent throughput constraints, the allocation is often done manually by carefully assigning static memory locations to the logical buffers of the application. Today, designers are confronted with applications with thousands of buffers and architectures with hundreds of memories, rendering manual approaches impractical. In this paper we present an automatic approach for statically allocating logical buffers to physical memories, assuming a fixed task-to-processor mapping and respecting multiple throughput constraints.

In our approach, we model the application in a data-centric way, by explicitly defining buffers and associating computational tasks that access the buffers within well-specified time intervals. In addition, we use an architecture model that makes the allocation aware of the topology of the multicore and of the physical bandwidth constraints of the interconnect. We present a layered approach to describe and solve the buffer-allocation problem as well as related subproblems, using mixed-integer linear programming. We show that the buffer-allocation problem is NP-complete, and present a more scalable formulation as a semi-definite programming problem. We evaluate the proposed LP methods by allocating around 1000 buffers, corresponding to processing one frame in the Long-Term Evolution (LTE) standard, onto a multicore with 80 processing elements. We introduce a solution approach that finds an optimal allocation in around 2 hours, which is at least two orders of magnitude faster than a straightforward formulation.

Introduction

In the era of multi-cores and many-cores, several programming abstractions have been proposed to hide the complexity of concurrent execution and memory management. For deeply embedded applications, however, the placement of tasks on processors and of data in memories is still carefully done statically by hand. This is the case, for example, in digital baseband processing on base stations for today’s wireless communication standards. In these systems, tasks are allocated statically, often to customized processors that were specifically designed to implement a particular task. Once the task allocation is done, designers spend a considerable amount of time either placing the data in the right memory or designing the memory subsystem that best suits the application. Static data allocation is preferred over dynamic allocation to prevent fragmentation, non-deterministic allocation time and out-of-memory errors. Due to the increase in both software and hardware complexity, this manual allocation has become prohibitively complex, since the number of possibilities grows exponentially. The somewhat modest example of allocating 100 logical buffers to 40 physical memories already has 40^100, or roughly 10^160, possible allocations, vastly more than the estimated number of atoms in the visible universe (commonly put at around 10^80).
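The size of the search space in the example above can be checked directly. The following is a minimal sketch, assuming that each buffer may be placed in any memory independently of the others (capacity and bandwidth limits are ignored, so this is an upper bound on the number of valid allocations):

```python
# Count the ways to assign each logical buffer to exactly one physical memory,
# assuming every buffer may go to any memory (no capacity or bandwidth limits).
num_buffers = 100
num_memories = 40

# Python integers have arbitrary precision, so this value is exact.
num_allocations = num_memories ** num_buffers

# Order of magnitude: number of decimal digits minus one.
order_of_magnitude = len(str(num_allocations)) - 1

print(f"{num_memories}^{num_buffers} is about 10^{order_of_magnitude}")
```

Each additional buffer multiplies the count by the number of memories, which is the exponential growth referred to in the text.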

The complexity of embedded software has increased as a consequence of the development of new standards (e.g., for communication or video encoding). Applications have become more dynamic and irregular, with data and scenario-dependent execution paths. Examples of this are the different modes of the Long-Term Evolution (LTE) standard [1] or multicore engine control unit (ECU) applications in the automotive industry [2]. The hardware complexity has increased accordingly, showing a steady increase in processor counts and an even more dramatic evolution of system interconnect and memory interfaces. Today there are new possibilities to interface to Dynamic Random-Access Memory (DRAM) that provide high bandwidth over multiple channels. Examples of these emerging techniques include 2.5D (High Bandwidth Memory) or 3D (Wide I/O) DRAM integration [3]. These new storage capabilities are reflected in a wide variety of communication possibilities in modern multi-cores. In the future, we expect even more complex architectures, such as the HAEC box [4] or the Hybrid Memory Cube (HMC) [5], where multiple paths exist to route communication between two components, such that the decision of which path to take becomes non-trivial.

The problem of mapping logical data buffers to memory has been analyzed in the context of well-structured application models with explicit communication, such as directed acyclic task graphs [6], [7], [8] and dataflow programming models [9], [10], [11]. In those models there is a clear producer-consumer relationship among tasks, which makes it easier to reason about task interactions and the impact of buffer allocation on application performance (throughput and latency). Additionally, these models only take into account the relationship between tasks that arises from the computation itself, disregarding the architecture’s topology and its implications. In this paper we adopt a more general view in which logical buffers can be accessed and modified by different computational tasks at arbitrary instants of time, and which also takes into account specifics of the architecture on which the computation will be executed. In our approach, instead of looking for an allocation that maximizes application throughput, we find an allocation that satisfies the individual bandwidth demands needed to meet multiple throughput constraints in complex applications.

We call the binding of a computational task running on a processor and accessing a given buffer a flow. Intuitively, an application is represented as a collection of small computational tasks that operate on a single shared buffer in memory. Multiple tasks can work on the same buffer, but no task accesses more than one buffer (see upper part of Fig. 1).

This model allows system architects to reason about the memory subsystem while fixing all other aspects of the execution behavior of the application.

The model used in this paper can be seen as a generalization of variable lifetime ranges in traditional compilers used for register allocation. Dataflow models, commonly used in the embedded domain [12], can also be represented with our model. The flows can be obtained from profiling runs of dataflow applications and individual bandwidth demands can be obtained from global application throughput constraints.
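To make the flow model and its analogy to variable lifetime ranges more concrete, the following is a minimal sketch of one possible data representation. All names (`Flow`, `buffer_lifetimes`, the task, processor and buffer labels) are hypothetical, and taking a buffer's lifetime to be the span from its first to its last access is a simplifying assumption, not something stated in the paper:

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Flow:
    """A task on a processor accessing one buffer during [start, end)."""
    task: str
    processor: str
    buffer: str
    start: int        # first time step of the access interval
    end: int          # one past the last time step
    bandwidth: float  # bandwidth demand during the interval


def buffer_lifetimes(flows):
    """Return {buffer: (earliest start, latest end)} over all its flows."""
    lifetimes = {}
    for f in flows:
        lo, hi = lifetimes.get(f.buffer, (f.start, f.end))
        lifetimes[f.buffer] = (min(lo, f.start), max(hi, f.end))
    return lifetimes


def lifetimes_overlap(a, b):
    """True if two half-open intervals share at least one time step."""
    return a[0] < b[1] and b[0] < a[1]


# Several tasks may share a buffer, but each task touches exactly one buffer.
flows = [
    Flow("fft", "dsp0", "buf_a", 0, 4, 2.0),
    Flow("demap", "dsp1", "buf_a", 4, 6, 1.0),
    Flow("decode", "dsp2", "buf_b", 5, 9, 3.0),
    Flow("crc", "dsp2", "buf_c", 9, 10, 0.5),
]

lifetimes = buffer_lifetimes(flows)
```

As with live ranges in register allocation, two buffers whose lifetimes do not overlap (here `buf_b` and `buf_c`) could in principle reuse the same storage, whereas overlapping ones (`buf_a` and `buf_b`) cannot.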

The buffer-allocation problem is then to find a valid and optimal allocation of the logical buffers of the application onto the platform memories (see Fig. 1). A valid solution is one in which all buffers fit into the platform memories and can be accessed with the bandwidth required by the application. What counts as an optimal solution depends on the optimization criterion. Here it is a solution that either balances the bandwidth loads across all channels as evenly as possible or balances the use of memory. A weighted combination of both criteria can also be considered.
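The notions of validity and optimality can be illustrated on a toy instance. The sketch below is not the method of the paper: it enumerates all allocations by brute force, ignores time intervals and topology, treats each memory as having a single channel, and uses the maximum bandwidth utilisation as a stand-in for load balancing. All names and numbers are made up for illustration:

```python
from itertools import product

# Hypothetical toy instance.
buffers = {"b0": (4, 3.0), "b1": (2, 2.0), "b2": (3, 1.0)}  # name: (size, bandwidth demand)
memories = {"m0": (6, 4.0), "m1": (6, 4.0)}                 # name: (capacity, bandwidth limit)


def loads(allocation):
    """Return {memory: [used size, used bandwidth]} for a buffer->memory map."""
    used = {m: [0, 0.0] for m in memories}
    for b, m in allocation.items():
        size, bandwidth = buffers[b]
        used[m][0] += size
        used[m][1] += bandwidth
    return used


def is_valid(allocation):
    """Valid: every memory respects both its capacity and its bandwidth limit."""
    used = loads(allocation)
    return all(used[m][0] <= memories[m][0] and used[m][1] <= memories[m][1]
               for m in memories)


def max_bandwidth_utilisation(allocation):
    """Objective: the most heavily loaded memory, relative to its limit."""
    used = loads(allocation)
    return max(used[m][1] / memories[m][1] for m in memories)


def best_allocation():
    """Enumerate every assignment and keep the best valid one (or None)."""
    names = list(buffers)
    candidates = (dict(zip(names, choice))
                  for choice in product(memories, repeat=len(names)))
    valid = [a for a in candidates if is_valid(a)]
    return min(valid, key=max_bandwidth_utilisation) if valid else None


best = best_allocation()
print(best, max_bandwidth_utilisation(best))
```

In this instance only the allocations that isolate `b0` in one memory are valid, since pairing it with either other buffer violates a capacity or bandwidth limit. Exhaustive enumeration is only feasible for tiny instances, which is why a solver-based formulation is needed at realistic sizes.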

In this paper we introduce a mathematical model of the buffer-allocation problem and propose a solution using linear programming. The main contributions of this paper are:

  • A clear presentation and formulation of the problem and a structured, layered solution using mixed-integer linear programming (MILP). This allows the problem to be decomposed into sub-problems and raises the potential for sequential optimizations.

  • A topology-aware, optimal-allocation formulation that can deal with complex architectures and tackles memory fragmentation issues by directly generating buffer addresses.

  • A detailed complexity and scalability analysis of the problem and the presented solution. In particular, we show that the problem is NP-complete.

  • A more scalable formulation as a semi-definite programming problem.
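For orientation, a strongly simplified assignment-style MILP is sketched below. It is not the layered formulation of the paper: it ignores time intervals, topology and address generation, and every symbol is an assumption introduced here. $B$ is the set of buffers, $M$ the set of memories, $s_b$ and $d_b$ the size and bandwidth demand of buffer $b$, $C_m$ and $W_m$ the capacity and bandwidth limit of memory $m$, $x_{b,m}$ a binary variable that is 1 if buffer $b$ is placed in memory $m$, and $z$ the maximum relative bandwidth load:

```latex
\begin{align*}
\min_{x,\,z} \quad & z \\
\text{s.t.} \quad
  & \sum_{m \in M} x_{b,m} = 1            && \forall b \in B
      && \text{(each buffer in exactly one memory)} \\
  & \sum_{b \in B} s_b \, x_{b,m} \le C_m && \forall m \in M
      && \text{(capacity)} \\
  & \sum_{b \in B} d_b \, x_{b,m} \le z \, W_m && \forall m \in M
      && \text{(bandwidth, balanced via } z\text{)} \\
  & x_{b,m} \in \{0,1\}, \quad 0 \le z \le 1 .
\end{align*}
```

Bounding $z$ by 1 enforces the bandwidth limits, while minimizing $z$ pushes the solver toward an even distribution of bandwidth load across memories.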

The rest of this paper is organized as follows: Section 2 discusses related work. Section 3 gives a formal presentation of the problem and our proposed solution with an MILP formulation. It gradually introduces sub-problems and their solutions until the complete problem is addressed. Then, an analysis of the model—including the complexity of the problem and scalability of our approach—is presented in Section 4. Section 5 presents the results of a real-world use case, namely LTE, which shows both the benefits of our approach and its limitations. Finally, we conclude our work and give an outlook on potential directions of future work in Section 6.

Related work

The concept of formulating and solving an allocation problem as an MILP problem is certainly not new. In the field of hardware synthesis, similar models and ideas for memory allocation have been used for application-specific integrated circuit (ASIC) design [13], [14]. In more recent work, Meftali et al. [15] present a complete workflow for generating the memory subsystem, from a hardware-design perspective.

In the context of software synthesis, the problem of allocating logical memory units,

The buffer allocation problem

The purpose of this section is to give a formal treatment of the buffer-allocation problem. We start with a formalization of the problem itself, giving a precise mathematical definition of the given information and the required constraints. We then propose a layered solution using MILP, gradually introducing subproblems and their solutions that build up to the complete buffer-allocation problem as motivated in the introduction (cf. Fig. 1).

Model analysis

The formulation as an LP allows us to find optimal solutions within the pool of valid allocations, even under strict conditions on bandwidth and memory usage. This power comes with some drawbacks. While LP methods are well studied and a plethora of methods and heuristics for efficient solving exist, there are worst-case examples with exponential run-times (cf. Theorem 11.2 of [32]). In problems of a combinatorial nature, such as the buffer-allocation problem, the simplex algorithm tends to be slow or even to

Case study

In this section we briefly describe how the tool flow was implemented and present our experimental setup. We then evaluate the techniques described in Sections 3 and 4 on a case study from the wireless communication domain, namely an implementation of the LTE baseband receiver.

Conclusion

We have seen that optimally allocating buffers in complex multi- and many-core architectures is an NP-complete problem that needs to be solved for performance-critical applications in emerging technologies. In this paper we formalized this problem and presented a structured, layered solution using MILP, as well as an alternative SDP formulation. The MILP formulation is also useful as a mathematical formalization of the constraints imposed by the problem. It can be used for obtaining bounds to

Acknowledgments

This work is supported in part by the German Research Foundation (DFG) within the Cluster of Excellence Center for Advancing Electronics Dresden (cfaed).

Andres Goens received a Master of Science in Mathematics from the RWTH Aachen University in 2014. Since late 2014 he has been a full-time researcher and Ph.D. student at the Chair for Compiler Construction at the TU Dresden in Germany. His research interests lie in programming methodologies for heterogeneous multiprocessor systems.

References (39)

  • [1] C. Cox, An Introduction to LTE: LTE, LTE-Advanced, SAE, VoLTE and 4G Mobile Communications (2014)
  • [2] S. Fürst, Challenges in the design of automotive software, Proceedings of the Conference on Design, Automation and Test in Europe, DATE ’10, European Design and Automation Association, 3001 Leuven, Belgium (2010)
  • [3] JEDEC: Global Standards for the Microelectronics Industry, 3D-ICs, [Online] Available...
  • [4] G. Fettweis et al., Pathways to servers of the future: highly adaptive energy efficient computing (HAEC), Proceedings of the Conference on Design, Automation and Test in Europe, EDA Consortium (2012)
  • [5] 2014, The Hybrid Memory Cube Consortium, Hybrid Memory Cube Specification 2.0: Hybrid Memory Cube with HMC-30G-VSR...
  • [6] T.-Y. Yen et al., Communication synthesis for distributed embedded systems, 1995 IEEE/ACM International Conference on Computer-Aided Design, ICCAD-95, Digest of Technical Papers (1995)
  • [7] Y.-K. Kwok et al., Static scheduling algorithms for allocating directed task graphs to multiprocessors, ACM Comput. Surv. (1999)
  • [8] V. Suhendra et al., Integrated scratchpad memory optimization and task scheduling for MPSoC architectures, Proceedings of the 2006 International Conference on Compilers, Architecture and Synthesis for Embedded Systems, ACM (2006)
  • [9] R. Govindarajan et al., Buffer allocation in regular dataflow networks: An approach based on coloring circular-arc graphs, Third International Conference on High Performance Computing, Proceedings (1996)
  • [10] T. Bijlsma et al., Omphale: Streamlining the communication for jobs in a multi processor system on chip, Technical Report TR-CTIT-07-44 (July 2007)
  • [11] J. Castrillon et al., Communication-aware mapping of KPN applications onto heterogeneous MPSoCs, DAC ’12: Proceedings of the 49th Annual Conference on Design Automation (2012)
  • [12] S. Sriram et al., Embedded Multiprocessors: Scheduling and Synchronization (2009)
  • [13] K. Seo et al., Allocation of multiport memories in ASIC data path synthesis, International Symposium on Circuits and Systems (1994)
  • [14] J.-M. Jou et al., Multiport memory based data path allocation focusing on interconnection optimization, International Symposium on Circuits and Systems (1994)
  • [15] S. Meftali et al., An optimal memory allocation for application-specific multiprocessor system-on-chip, Proceedings of the 14th International Symposium on System Synthesis (2001)
  • [16] O. Jovanovic et al., ILP-based memory-aware mapping optimization for MPSoCs, CSE (2012)
  • [17] H. Salamy et al., An effective solution to task scheduling and memory partitioning for multiprocessor system-on-chip, IEEE Trans. Comput. Aided Des. Integrated Circuits Syst. (2012)
  • [18] S.h. Kang et al., Multi-objective mapping optimization via problem decomposition for many-core systems, 2012 IEEE 10th Symposium on Embedded Systems for Real-time Multimedia (ESTIMedia) (2012)
  • [19] M. Lorenz et al., Optimized address assignment for DSPs with SIMD memory accesses, Proceedings of the 2001 Asia and South Pacific Design Automation Conference (2001)
Jeronimo Castrillon received the Electronics Engineering degree with honors from the Pontificia Bolivariana University in Colombia in 2004, the master's degree from the ALaRI Institute in Switzerland in 2006 and the Ph.D. degree (Dr.-Ing.) in Electrical Engineering and Information Technology with honors from the RWTH Aachen University in Germany in 2013. From early 2009 to April 2013 he was the chief engineer of the chair for Software for Systems on Silicon at the RWTH Aachen University, where he had been enrolled as research staff since late 2006. From April 2013 to April 2014 he was senior scientific staff in the same institution. In June 2014, he joined the Department of Computer Science of the TU Dresden as professor for compiler construction in the context of the German excellence cluster “Center for Advancing Electronics Dresden” (cfaed). His research interests lie in methodologies, languages, tools and algorithms for programming complex computing systems.

Maximilian Odendahl received a Computer Engineering diploma from RWTH Aachen University in 2010 and afterward joined the Institute for Communication Technologies and Embedded Systems (ICE) as a full-time research assistant and Ph.D. student. His research interests include different aspects of Embedded System Design as well as Embedded Software Engineering.

Rainer Leupers received the M.Sc. and Ph.D. degrees in Computer Science with honors from the Technical University of Dortmund, Germany, in 1992 and 1997. From 1997 to 2001 he was the chief engineer at the Embedded Systems chair at TU Dortmund. During 1999–2001 he was also a team leader at ICD, where he headed industrial service projects. In 2002, he joined RWTH Aachen University as a professor for Software for Systems on Silicon. He is also a visiting faculty member at the ALaRI Institute in Lugano. His research and teaching activities comprise software development tools, processor architectures, and electronic design automation for embedded systems, with emphasis on multiprocessor system-on-chip design tools. He has published numerous books and technical papers, and he has served as a program committee member and topic chair of leading international conferences, including DAC, DATE, and ICCAD. He was a co-chair of the MPSoC Forum and SCOPES. He has received several scientific awards, including Best Paper Awards at DATE 2000, 2008 and DAC 2002. He is a co-founder of LISATek, an EDA tool provider for embedded processor design, now part of Synopsys Inc. He has served as consultant for various companies, as an expert for the European Commission, and in the management boards of compound research projects like UMIC, HiPEAC, and ARTIST.
