Efficient control generation for mapping nested loop programs onto processor arrays

doi:10.1016/j.sysarc.2006.10.009

Journal of Systems Architecture

Volume 53, Issues 5–6, May–June 2007, Pages 300–309

Abstract

Processor array architectures are optimal platforms for computationally intensive applications. Such architectures are characterized by hierarchies of parallelism and memory structures: in addition to several levels of cache, a processor array has a large number of processing elements (PEs), and each PE can itself provide sub-word parallelism. To handle large-scale problems, balance local memory requirements against I/O bandwidth, and exploit the different hierarchies of parallelism and memory, one needs a sophisticated transformation called hierarchical partitioning. The applications are inherently data-flow dominant and have almost no control flow, but applying hierarchical partitioning techniques has the disadvantage of introducing a more complex control flow. In a previous paper, the authors presented for the first time a methodology for automated control path synthesis when mapping partitioned algorithms onto processor arrays. However, the resulting control path contained costly multiplication and division operators. In this paper, we propose a significant extension to the methodology that reduces the hardware cost of the global controller and memory address generators by avoiding these operations.

Section snippets

Introduction and related work

In the last decade, there has been dramatic growth in the research and development of massively parallel processor arrays, both in academia and in industry. Examples of state-of-the-art reconfigurable processor array architectures are RAW [17], PACT-XPP64A [11] and WPPA [9]. Processor array architectures provide an optimal platform for the parallel execution of number-crunching loop programs from fields such as digital signal processing, image processing, and linear algebra. However, due to a lack of

Definitions, notations, and transformations

In this section, we first give a brief overview of our existing mapping methodology PARO (see the design flow of our approach in Fig. 1), which is based on the polytope model [10] and maps loop nests onto massively parallel architectures. The starting point is an algorithmic description given as a set of recurrence equations, called a piecewise linear algorithm (PLA; see Definition 2.1).

Definition 2.1 (PLA)

A piecewise linear algorithm consists of a set of $N$ quantified equations $S_1[I], \ldots, S_N[I]$, where each equation $S_i[I]$ is of the form $\forall I \in I$

Control generation

Partitioning not only increases the PLA code size but also introduces a more complex control flow in the program. The iteration-dependent if-conditionals occurring in a given PLA have to be replaced by control variables for efficient parallelization. Therefore, a methodology for control generation is needed that specifies the control units and signals of the processor array.
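
As a rough illustration only, the following Python sketch contrasts two ways of producing the indices and one control signal for a loop partitioned into tiles. It is not taken from the paper: the function names, the 1-D loop, the tile size, and the `first_in_tile` flag are all assumptions chosen for the example. The first generator derives everything from the global iteration index with division and modulo; the second yields the same sequence using counters that only increment, compare, and reset, which is the kind of logic a controller could realize without multipliers or dividers.

```python
from typing import Iterator, Tuple

# Each step is (tile index, offset inside the tile, "first iteration of tile" flag).
# The flag stands in for an iteration-dependent if-condition turned into a control signal.
Step = Tuple[int, int, bool]


def control_by_division(n: int, tile: int) -> Iterator[Step]:
    """Reference version: recompute indices from i with // and % in every iteration."""
    for i in range(n):
        yield i // tile, i % tile, i % tile == 0


def control_by_counters(n: int, tile: int) -> Iterator[Step]:
    """Hypothetical counter-based version: increment, compare, and reset only."""
    outer, inner = 0, 0
    for _ in range(n):
        yield outer, inner, inner == 0
        inner += 1
        if inner == tile:  # tile boundary reached: wrap inner, advance outer
            inner = 0
            outer += 1


if __name__ == "__main__":
    # Both generators describe the same control sequence.
    assert list(control_by_counters(10, 4)) == list(control_by_division(10, 4))
    for step in control_by_counters(5, 2):
        print(step)
```

In this sketch the two generators agree for any `n >= 0` and `tile >= 1`, including the case where `n` is not a multiple of `tile`. The counter version keeps its state between iterations, so a single shared instance could in principle serve many consumers of the same signal.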

Conclusions and future work

The processor array specification is interpreted from the PLA after control generation. In [3], the authors validated their methodology with a case study showing a reduction of up to 90% in control path area cost compared to earlier methodologies [1], [6]. This large reduction is attributed to the fact that the earlier approaches used counters local to every PE to update the iteration variables, so that the cost scaled in proportion to the number of PEs. Our scheduling methodology for partitioning techniques enables

References (19)

Alain Darte et al., Constructing and exploiting linear schedules with prescribed parallelism, ACM Transactions on Design Automation of Electronic Systems (2002)
Steven Derrien, Tanguy Risset, Interfacing Compiled FPGA Programs: The MMAlpha Approach, in: Proceedings of the...
Hritam Dutta, Frank Hannig, Jürgen Teich, Controller Synthesis for Mapping Partitioned Programs on Array Architectures,...
Hritam Dutta, Frank Hannig, Jürgen Teich, Hierarchical Partitioning for Piecewise Linear Algorithms, in: Proceedings of...
Uwe Eckhardt et al., Hierarchical algorithm partitioning at system level for an improved utilization of memory structures, IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems (1999)
Anne-Claire Guillou, Patrice Quinton, Tanguy Risset, Hardware Synthesis for Multi-Dimensional Time, in: Proceedings of...
Frank Hannig, Hritam Dutta, Jürgen Teich, Regular Mapping for Coarse-grained Reconfigurable Architectures, in:...
Frank Hannig, Jürgen Teich, Design Space Exploration for Massively Parallel Processor Arrays, in: Victor Malyshkin,...
Dmitrij Kissler, Frank Hannig, Alexey Kupriyanov, Jürgen Teich, Hardware Cost Analysis for Weakly Programmable...

There are more references available in the full text version of this article.

Cited by (3)

A direct method for optimal VLSI realization of deeply nested n-D loop problems
2013, Microprocessors and Microsystems
Citation excerpt:
A VLSI hardware architecture in a template form was used to implement a parsing algorithm incorporated into an automated synthesis tool. The tool generates a HDL synthesisable source code for the given specifications of a control flow graph that implements a global controller to obtain an improved control path leading to a less complex control flow [24,25]. The generated source is simulated for validation, synthesised and tested on a Xilinx field programmable gate array (FPGA) board [25].
Many computationally intensive algorithms are represented as n-dimensional (n-D) nested loop algorithms. Systolic-array-based projections and their modifications involving multidimensional vector space representations have been used to realise the optimal VLSI design of deeply nested loop problems. The approaches employed so far involve an extensive search of the feasible solution space through heuristic methods and yield near-optimal solutions. This paper presents a method of identifying the optimal solution directly through a logical procedure. The new allocation method is shown to evolve around the computational expression and the sub-space in which it lies. The array of processing elements, termed the PE array, is allocated to the identified computational sub-space, which is strictly of lower dimension than the n-D problem space. The proposed optimal allocation procedure is first explained using the 3-D matrix/matrix multiplication (MMM) problem. The effectiveness of the method for higher-dimensional problems is demonstrated through the illustrative example of the 6-D full search block motion (FSBM) algorithm. The various design possibilities of the above mapping procedure are explored analytically, and the cost constraints, termed the figure of merit (FoM) of the design, are evolved for the various design trade-offs for the MMM and 6-D FSBM problems.
An experimental methodology is developed using a hyper-graph model to represent the PE allocation to a particular sub-space of the n-D problem space. The advantage of our mapping procedure is illustrated by considering two cases: first, an allocation represented by a vertex cover that covers the nodes of the identified computational (n − x)-D sub-space, where x < n, and second, a random cover of a group of nodes in the n-D problem space, which models an allocation of the PE array to a random sub-space. The design space exploration (DSE) results are presented for the 6-D FSBM estimation algorithm using the high-level synthesis tool ‘GAUT’ to compare the allocation and utilisation of resources in our method with those of the random PE array allocation. It is found that our methodology leads to an optimal number of allocated resources and their optimal utilisation for the various design possibilities, using the timing constraint given as input to the HLS tool. The complexity of our approach is also compared with that of existing methods, showing that the complexity of our approach does not grow with the n-D problem size.
A holistic approach for tightly coupled reconfigurable parallel processors
2009, Microprocessors and Microsystems
New standards in signal, multimedia, and network processing for embedded electronics are characterized by computationally intensive algorithms and high flexibility due to swift changes in specifications. In order to meet the demanding challenges of increasing computational requirements and stringent constraints on area and power consumption in embedded engineering, there is a gradual trend towards coarse-grained parallel embedded processors. Furthermore, such processors are equipped with dynamic reconfiguration features for supporting time- and space-multiplexed execution of algorithms. However, the formidable problem of efficiently mapping applications (mostly loop algorithms) onto such architectures has been a hindrance to their mass acceptance. In this paper we present (a) a highly parameterizable, tightly coupled, and reconfigurable parallel processor architecture together with the corresponding power breakdown and reconfiguration time analysis of a case study application, (b) a retargetable methodology for mapping of loop algorithms, (c) a co-design framework for modeling, simulation, and programming of such architectures, and (d) loosely coupled communication with a host processor.
PARO: Synthesis of hardware accelerators for multi-dimensional dataflow-intensive applications
2008, Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics)

☆ Supported in part by the German Science Foundation (DFG) in project under contract TE 163/13-1.
