Journal of Systems Architecture

Volume 62, January 2016, Pages 24-37
Buffer allocation for real-time streaming applications running on heterogeneous multi-processors without back-pressure

https://doi.org/10.1016/j.sysarc.2015.09.001

Abstract

The goal of buffer allocation for real-time streaming applications is to minimize total memory consumption while reserving sufficient space for each data production, without overwriting any live data, and while guaranteeing satisfaction of real-time constraints. Previous research has mostly focused on buffer allocation for systems with back-pressure. This paper addresses buffer allocation for systems without back-pressure. Since systems without back-pressure lack blocking behavior on the producer side, buffer allocation requires both best-case and worst-case timing analysis.

Our contributions are (1) extension of the available dataflow techniques with best-case analysis; (2) the closest common dominator-based and closest common predecessor-based lifetime analysis techniques; (3) techniques to model the initialization behavior and enable token reuse.

Our benchmark set includes an MP3 decoder, a WLAN receiver, an LTE receiver and an LTE-Advanced receiver. We consider two key features of LTE-Advanced: (1) carrier aggregation and (2) EPDCCH processing. Through our experiments, we demonstrate that our techniques are effective in handling the complexities of real-world applications. For the LTE-Advanced receiver case study, our techniques enable us to compare the buffer allocation required for different scheduling policies, which directly impacts architectural decisions. A key insight from this comparison is that our improved techniques show a different scheduling policy to be superior in terms of buffer sizes than our previous technique did. This dramatically changes the trade-off among the scheduling policies for the LTE-Advanced receiver.

Introduction

In current embedded systems such as smartphones and tablets, power and on-chip memory resources are severely limited. Moreover, wireless streaming applications such as WLAN [1] and LTE receivers [2] running on these devices have strict real-time requirements, including minimum throughputs and maximum latencies. Such streaming applications run continuously, processing virtually infinite input sequences in an iterative and pipelined manner. These applications must therefore not only conform to their strict timing requirements, but also occupy as little on-chip memory as possible. They are often mapped onto heterogeneous multiprocessor platforms to meet their timing requirements by exploiting parallel execution.

Dataflow is an analysis and programming model well suited to streaming applications [3], [1]. In dataflow, an application is modeled as a graph, where nodes (referred to as actors) denote processing elements and edges denote data dependencies. Actors communicate data items (tokens) over edges. We use static dataflow variants, which allow rigorous timing analysis of streaming applications.

Tokens are communicated in First-In-First-Out (FIFO) manner on all the edges of a dataflow graph. Each such FIFO edge is mapped to a buffer in memory. The computation of the minimum memory required by the buffers, while reserving sufficient space for each produced token, without overwriting any live tokens, and while guaranteeing the satisfaction of real-time constraints, is called Buffer Sizing [4], [5], [6], [7]. Several buffer sizing solutions have been proposed for streaming applications modeled as dataflow graphs and scheduled on a hardware platform with back-pressure. Back-pressure is a mechanism, implemented in either hardware or software, that allows an actor to fire only if there is sufficient space available to produce tokens on all output edges. In contrast, on a platform without back-pressure, an actor fires as soon as it has sufficient tokens at its inputs, without checking for availability of space on its outputs. This can result in data corruption if the producer of a buffer produces a token at a location that still stores a token not yet processed by its consumer. Due to these differences, existing buffer sizing techniques for systems with back-pressure cannot be applied to compute buffer sizes for systems without back-pressure.
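The corruption hazard described above is easy to reproduce in a few lines. The following Python sketch is purely illustrative (not the paper's model): it streams produce/consume events through a circular buffer of B slots, and without a space check a production that laps an unconsumed token overwrites live data.

```python
def corrupts(events, B):
    """events: string of 'P' (produce) / 'C' (consume) in time order.
    Returns True iff a live token is ever overwritten."""
    live = 0  # tokens produced but not yet consumed
    for ev in events:
        if ev == "P":
            live += 1          # no back-pressure: the write always proceeds
            if live > B:       # more live tokens than slots -> overwrite
                return True
        elif ev == "C":
            live -= 1
    return False
```

With back-pressure, the producer would instead block when `live == B`; without it, buffer sizing must guarantee that the worst-case schedule never reaches `live > B`.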

There are several hardware platforms [8], [9], [10], [11] that do not support back-pressure for some or all of their processing elements. In general, their communication interface allows them to read/write from/to buffers in a memory, but does not handle buffer management: they are simply given a memory address from/to which to read/write, and will do so unconditionally. In fact, it is not uncommon to have hardware accelerators that do not support back-pressure. Although back-pressure mechanisms offer safer execution from a functional perspective, they come with several disadvantages, especially in the context of embedded real-time systems: back-pressure incurs extra processing and synchronization overhead, complicates the interface design between accelerators and interconnect, and consumes chip area and additional energy [12], [13]. Also, when dealing with sampled signals (an external source), back-pressure cannot be applied, since the producer cannot be stopped. Moreover, the cyclic dependencies introduced by back-pressure make timing analysis difficult for traditional real-time analysis techniques [14].

In this paper, we present a buffer allocation solution for streaming applications running on a heterogeneous multi-processor platform without back-pressure. This technique was originally proposed by us in [15]. Here, we extend the buffer allocation solution with several advanced techniques: recursive dominator computation, closest common predecessor-based lifetime analysis, token reuse and handling of initialization behavior. We use an MP3 decoder, a WLAN receiver, an LTE receiver and an LTE-Advanced receiver as our benchmark set. The LTE-Advanced receiver includes carrier aggregation and EPDCCH processing. Our techniques save up to 54% memory consumption compared to the technique of [15] for our benchmark set. We also show that our techniques and tool can handle complex practical applications, and allow us to explore different scheduling choices for a given platform. For the LTE-Advanced receiver, our improved techniques reveal a different buffer-size trade-off among the three scheduling policies than our previous technique, identifying a different scheduling policy as superior in terms of buffer sizes.


Motivation

Single-rate dataflow (SRDF) is a static dataflow variant that allows rigorous timing analysis of streaming applications: verification of deadlock freedom, execution in bounded memory, and computation of minimum throughput and maximum latency [3]. Each SRDF actor has a bound on its execution time. At each execution (firing), an actor consumes one token from each of its input edges and produces one token on each of its output edges. SRDF graph examples are shown in Fig. 1a and b. Initial tokens,
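The firing rule above admits a simple worst-case self-timed simulation: each firing starts as soon as all of its input tokens are available. The Python sketch below is illustrative only (the graph encoding and names are ours, not the paper's tool); by the usual SRDF convention, a self-edge with one initial token models non-overlapped firings of the same actor.

```python
from functools import lru_cache

def self_timed_start_times(edges, exec_time, iterations):
    """edges: consumer -> list of (producer, initial_tokens) pairs.
    exec_time: actor -> execution time (use WCETs for a worst-case schedule).
    Returns self-timed start times s(i, k) for k = 0 .. iterations-1."""
    @lru_cache(maxsize=None)
    def start(i, k):
        deps = []
        for j, d in edges.get(i, []):
            # The token consumed by firing k of actor i over edge (j, i)
            # was produced by firing k - d of j; for k < d it is an
            # initial token, available at time 0.
            if k - d >= 0:
                deps.append(start(j, k - d) + exec_time[j])  # producer finish
        return max(deps, default=0.0)
    return {(i, k): start(i, k) for i in exec_time for k in range(iterations)}
```

For example, a chain a → b (no initial tokens) with a self-edge on a carrying one token makes firings of a start every τ(a) time units, and each firing of b start as soon as the corresponding firing of a finishes.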

Time-bounded single-rate dataflow

We use a time-bounded SRDF (Tb-SRDF) [15], [16], an SRDF graph extended with a Best-Case Execution Time (BCET) and a Worst-Case Execution Time (WCET) per actor, to model real-time streaming applications. It is denoted by G = (V, E, d, τ̌, τ̂), where V is the set of vertices of G and E is the set of edges of G. d is a valuation d : E → ℕ₀ such that d(i, j) is the number of initial tokens (delays) on edge (i, j) ∈ E, where i, j ∈ V. The valuations τ̌, τ̂ : V → ℝ₀⁺ give the BCET and WCET of an actor, respectively. The timing function is a valuation τ : V
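The tuple G = (V, E, d, τ̌, τ̂) maps naturally onto a small data structure. The following Python sketch is illustrative only; the class and field names are our own, not the paper's tool.

```python
from dataclasses import dataclass, field

@dataclass
class TbSRDF:
    """Time-bounded SRDF graph G = (V, E, d, bcet, wcet), per the text."""
    vertices: set                                # V: actor names
    edges: set                                   # E: set of (i, j) pairs, i, j in V
    delays: dict = field(default_factory=dict)   # d: (i, j) -> initial tokens
    bcet: dict = field(default_factory=dict)     # best-case execution time per actor
    wcet: dict = field(default_factory=dict)     # worst-case execution time per actor

    def d(self, i, j):
        # d(i, j): number of initial tokens (delays) on edge (i, j);
        # edges without an entry carry no initial tokens.
        return self.delays.get((i, j), 0)
```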

Problem description

During execution, each live token occupies space in the shared memory shown in the architecture diagram (Fig. 4). A token is characterized by its starting memory address, SA : E × ℕ₀ → ℕ₀, and its size z : E → ℕ₀ in bytes. SA(e, k) gives the start address of the token produced on edge e in the (k+1)th iteration. The finish time of actor i in the (k+1)th iteration is defined as f(i, k, τ) = s(i, k, τ) + τ(i, k). A valuation Overlap : (E × ℕ₀)² → Bool indicates whether a pair of tokens have overlapping lifetimes:
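Given token lifetimes, the Overlap predicate reduces to interval intersection. Below is a hedged Python sketch; we assume each token's lifetime interval has already been computed by the best-/worst-case analysis, conservatively from its earliest (best-case) production start to its latest (worst-case) consumption finish, as the no-back-pressure setting requires.

```python
def overlap(lifetime_a, lifetime_b):
    """lifetime_* = (birth, death) times of a token, birth <= death.
    Two tokens may share a memory location only if this returns False."""
    (a0, a1), (b0, b1) = lifetime_a, lifetime_b
    # Half-open intervals: a token dying exactly when another is born
    # does not conflict with it.
    return a0 < b1 and b0 < a1
```

Non-overlapping tokens can be assigned the same start address SA, which is what lifetime analysis exploits to reduce total buffer memory.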

Buffer computation and boundedness

In this section, we first discuss the exact expression of the buffer sizes needed for streaming applications running on non-back-pressured hardware platforms. We will then study sufficient conditions for bounded buffer sizes.

Buffer allocation technique

In this section, we describe the buffer allocation technique to compute buffer sizes for an application scheduled on hardware without back-pressure.

Back-pressure vs. no back-pressure

Dataflow graphs modeling back-pressure are equipped with back edges that regulate token productions, delaying actor firings where required and thereby reducing buffer sizes. This implies that when back-pressure is supported, buffer sizes can be smaller than when it is not. The WC-STS of a graph is always rate-optimal, i.e. the WC-STS always realizes the maximal possible throughput [1]. We assume back-pressured systems that are rate-optimal and
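The back-edge construction mentioned above can be sketched in a few lines of Python (illustrative only, not the paper's tool): back-pressure on an edge (i, j) with a buffer capacity of b tokens is modeled by a back edge (j, i) whose initial tokens represent the initially free buffer slots, so the producer i must "consume" a free slot before it may fire.

```python
def add_back_pressure(edges, delays, capacities):
    """edges: set of (i, j) pairs; delays: d(i, j) initial tokens;
    capacities: buffer size in tokens per edge.
    Returns the edge set and delay map of the back-pressured graph."""
    bp_edges = set(edges)
    bp_delays = dict(delays)
    for (i, j) in edges:
        bp_edges.add((j, i))
        # Free slots initially available = capacity minus initial tokens.
        bp_delays[(j, i)] = capacities[(i, j)] - delays.get((i, j), 0)
    return bp_edges, bp_delays
```

The cycles this construction introduces are exactly the cyclic dependencies that, as noted in the introduction, complicate traditional real-time analysis.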

Experiments & results

In this section, we benchmark our techniques using an MP3 decoder, a WLAN receiver, an LTE receiver [2] and an LTE-Advanced receiver [25]. We specify these applications as CSDF graphs. Our tool then converts them into SRDF graphs to perform buffer allocation. All presented results are verified according to Section 6.6.

Related work

The topic of buffer allocation is thoroughly studied in the context of static dataflow [4], [5], [7]. Various approaches have been proposed for buffer sizing of dataflow graphs scheduled on a platform with back-pressure [5], [6], [24]. In [5], the authors provide optimal buffer sizing algorithms for back-pressured systems by altering self-timed schedules. Unlike these techniques, our technique computes buffer sizes for a dataflow graph executed without back-pressure.

In [27], relative

Conclusion

In this paper we provide a comprehensive study of buffer allocation in the absence of back-pressure mechanisms in heterogeneous multi-processor platforms. We propose a closest common predecessor-based lifetime analysis technique, an extension of the closest common dominator-based lifetime analysis proposed in our previous work, which reduces overall memory consumption. We also propose techniques to model initialization behavior and to implement token reuse in dataflow.

We show in

Acknowledgements

This work was funded by the CA104 COBRA-NL project, granted within the European Catrene program. In addition, this work was in part carried out in Ericsson DSP Innovation Center, Eindhoven, The Netherlands, during 2013 and 2014.

Hrishikesh Salunkhe received the M.Sc. degree in embedded systems from the Technische Universiteit Eindhoven, The Netherlands, in 2011. He has been working towards the Ph.D. degree in the System Architecture and Networking (SAN) Group, Department of Mathematics and Computer Science, Technische Universiteit Eindhoven, since 2011. His research focuses on modeling and timing analysis of real-time embedded streaming systems in the software-defined radio domain.

References (29)

  • O. Moreira, Scheduling Real-Time Streaming Applications onto an Embedded Multiprocessor (2014)
  • S. Sesia et al., LTE, The UMTS Long Term Evolution: From Theory to Practice (2009)
  • S. Sriram et al., Embedded Multiprocessors: Scheduling and Synchronization (2000)
  • Q. Ning, G.R. Gao, A novel framework of register allocation for software pipelining, ...
  • S. Geilen et al., Exploring trade-offs in buffer requirements and throughput constraints for synchronous dataflow graphs
  • M. Wiggers et al., Efficient computation of buffer capacities for multi-rate real-time systems with back-pressure
  • O. Moreira et al., Buffer sizing for rate-optimal single-rate data-flow scheduling revisited, IEEE Trans. Comput. (2010)
  • T. von Eicken, Active messages: an efficient communication architecture for multiprocessors (Ph.D. thesis), ETH ...
  • D. Nadezhkin et al., Realizing FIFO communication when mapping Kahn process networks onto the Cell
  • P.M. Phothilimthana et al., Portable performance on heterogeneous architectures, SIGARCH Comput. Archit. News (2013)
  • P.M. Mattheakis et al., Significantly reducing MPI intercommunication latency and power overhead in both embedded and HPC systems, ACM Trans. Archit. Code Optim. (2013)
  • J. Chen, A. Burns, A fully asynchronous reader/writer mechanism for multiprocessor real-time systems, Tech. rep., ...
  • Y. Zhang, Non-blocking synchronization: algorithms and performance evaluation (Ph.D. thesis), TU Chalmers, Sweden, ...
  • M. Jersak, Compositional performance analysis for complex embedded applications (Ph.D. thesis), TU Braunschweig, ...
    Alok Lele is a PhD Student in the Department of Mathematics and Computer Science at the Eindhoven University of Technology, The Netherlands. He holds an MSc in Computer Science Engineering from Eindhoven University of Technology, The Netherlands and an M.Tech in Software Engineering from Manipal Institute of Technology, India. His research interests include modeling, simulation and analysis of embedded streaming applications, real-time multi-processor scheduling and resource management.

    Orlando Moreira graduated in electronics engineering from the University of Aveiro. He received his Ph.D. degree from the Technische Universiteit Eindhoven, The Netherlands in 2012. He holds a position at Intel as Compiler Product Owner and Team Leader. Until 2014 he was a Principal DSP Systems Engineer at Ericsson, where he worked in the systemization of a Dual-Call Software-defined multi-standard modem. He has published work on reconfigurable computing, real-time multiprocessor scheduling, multiprocessor resource management and dataflow-based hard real-time analysis.

Kees van Berkel started his R&D career at Philips Research in 1980, after an MSc degree in EE from TU Delft. Since 2000 he has been a fellow at Philips, NXP, ST-Ericsson, and Ericsson. He obtained a PhD in CS from TU Eindhoven in 1992, where he has been a part-time full professor since 1996. Kees pioneered asynchronous VLSI from theory to mass production, and likewise embedded vector processing for software-defined radio. His research interests include multi-standard cellular modems, vector processors, multi-core architectures, resource management, and low power. His latest interest is (software-defined) radio astronomy.
