Timing optimization via nest-loop pipelining considering code size

doi:10.1016/j.micpro.2008.02.002

Microprocessors and Microsystems

Volume 32, Issue 7, October 2008, Pages 351-363

https://doi.org/10.1016/j.micpro.2008.02.002

Abstract

Embedded systems have strict timing and code size requirements. Software pipelining is one of the most important optimization techniques for improving the execution time of loops by increasing the parallelism among successive loop iterations. However, no effective techniques exist for solving the software pipelining problem on nested loops. The existing software pipelining techniques for single loops can only explore the parallelism of the innermost loop, so the final timing performance is inferior. While multi-dimensional (MD) retiming can explore the outer loop parallelism, it introduces large overheads in loop index generation and code size due to loop transformation. In this paper, we show how the computation time and code size of a pipelined nested loop are affected by execution sequence and retiming, assuming there is no loop unfolding. We present the theory of Software PIpelining for NEsted loops (SPINE) to reveal the relationship among the computation time of an iteration, the execution sequence, and the software pipelining degree of a nested loop using retiming concepts. Two SPINE algorithms are proposed based on the fundamental understanding of the properties of software pipelining for nested loops. The SPINE-FULL algorithm generates fully parallelized loops with minimal overheads. The SPINE-ROW-WISE algorithm achieves the maximal parallelism in an iteration with a fixed row-wise execution sequence, so the overheads due to loop transformation are minimal. Our technique can be directly applied to imperfect nested loops. The experimental results show that the average improvement in the execution time of the pipelined loop generated by SPINE is 71.7% compared with that generated by the standard software pipelining technique. The average code size is reduced by 69.5% compared with that generated by the MD retiming technique.

Introduction

Embedded systems usually have stringent requirements in timing and code size. With the advance of technology, embedded systems with multiple cores or VLIW-like architectures, such as TI’s TMS320C6x, Philips’ TriMedia, and IA64, have become necessary to achieve the high performance required by applications of growing complexity. To reduce the execution time of loops, software pipelining is widely used to explore the instruction-level parallelism in a loop by parallelizing the execution of successive iterations [8], [15]. However, software pipelining can dramatically expand the code size by adding code sections in the prologue and epilogue¹ [17], [21]. Code size is one of the most critical concerns for many embedded processors because the capacity of on-chip memory modules is still very limited due to chip size, cost, and power considerations. Designers try their best to fit the code into the small on-chip memory to avoid slow (external) memory accesses. If the software-pipelined code cannot fit into on-chip memory, a designer without proper techniques may have to give up software pipelining, resulting in a design with deteriorated timing performance. This awkward situation still exists at leading companies such as Texas Instruments [7] because there are no effective design tools that consider the code size issue along with timing optimization. Therefore, loop optimization under both timing and code size requirements is a great challenge for embedded system design.

Timing and code size issues are even greater concerns for software pipelining on nested loops. Unlike software pipelining of single loops, which has been extensively studied and implemented [8], [15], [3], [21], [13], very little work has been done on the software pipelining problem for nested loops. The few existing techniques that could be applied to nested loop optimization either cannot fully explore the parallelism in a nested loop or do not consider overheads such as loop index and loop bound computation and code size expansion due to transformation. To the authors’ knowledge, no existing technique can effectively solve the software pipelining problem for nested loops in embedded systems.

Software pipelining for single loops focuses on one-dimensional problems. When applied to nested loops, it only optimizes the innermost loop [8], [15], [3], [1], [2]. Nested loops, however, usually exhibit dependencies across loop dimensions, which provide abundant opportunities to increase the parallelism in an iteration. For example, the execution time of the Floyd-Steinberg algorithm generated by modulo scheduling [15] is 25,000 time units according to our experimental results, while it is only 2950 time units when generated by our Software PIpelining for NEsted loops (SPINE) technique. The improvement in loop execution time is 88.2%. This indicates that a lot of potential parallelism cannot be explored by software pipelining for single loops. Therefore, the performance improvement that can be obtained by the standard software pipelining techniques is very limited.

Another technique, called hyperplane scheduling [6], [14], tries to convert a nested loop into a single loop using loop unrolling and skewing to reduce the execution time. However, this technique makes code generation extremely difficult and results in large overheads in computation time and code size due to loop transformation. Data locality is also disrupted as a result of loop skewing. The best existing effort in industry on nested loop pipelining is to overlap the executions of the prologue and epilogue of the innermost loop, called outer loop pipelining [10], [19]. In this method, the dependencies among the outer loop iterations are still not exploited. Hence, the potential parallelism that can be explored is very limited.

The only existing method that can fully explore the potential parallelism in multi-dimensional problems is multi-dimensional (MD) retiming [12]. MD retiming can achieve full parallelism of a multi-dimensional problem with a polynomial-time algorithm. That is, all the computations in an iteration of an MD problem can be executed in parallel. MD retiming techniques can be effectively applied to high-level synthesis. However, MD retiming does not consider some critical issues for loop optimization, such as loop index generation and code size. We found that the regular row-wise execution sequence, which is implemented in most nested loops, can be altered unnecessarily by the MD retiming technique. Loop transformation then needs to be performed to compute new loop indexes and loop bounds for the skewed execution sequence [12], [20]. As a result, large code size and computation overheads are introduced, and data locality is also disrupted. According to our experimental results, the code size of the Floyd-Steinberg algorithm generated by MD retiming is 1646 instructions, while it is only 169 instructions using our SPINE technique. The code size is reduced by 89.7%. The execution time is also reduced by 31.6%. Although MD retiming can benefit high-level synthesis with specialized hardware support, it is not suitable for software pipelining on nested loops.
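The two Floyd-Steinberg percentages quoted so far (88.2% for execution time against modulo scheduling, 89.7% for code size against MD retiming) follow directly from the raw figures given in the text. The short check below reproduces that arithmetic; the function name `relative_reduction` is only an illustrative helper, not something defined in the paper.

```python
def relative_reduction(baseline, improved):
    """Return the reduction of `improved` relative to `baseline`, in percent."""
    return (baseline - improved) / baseline * 100.0

# Execution time of Floyd-Steinberg: modulo scheduling vs. SPINE (time units).
time_gain = round(relative_reduction(25000, 2950), 1)

# Code size of Floyd-Steinberg: MD retiming vs. SPINE (instructions).
size_gain = round(relative_reduction(1646, 169), 1)

print(time_gain, size_gain)  # 88.2 89.7
```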

The following example shows that a skewed execution sequence significantly affects the performance and code size of the generated code. Fig. 1a shows the original code of a nested loop. Fig. 1c shows the pipelined loop generated by the standard MD retiming technique [12]. It uses a diagonal execution sequence to achieve full parallelism. The code size grows dramatically, not only because prologue and epilogue sections are produced at both loop levels, but also because extra code is required to compute the new loop bounds and loop indexes. These extra computations degrade the performance and quality of the final code. Due to space limitations, we cannot show the whole program. Fig. 1b shows the software-pipelined code generated by our SPINE algorithm using row-wise execution. The loop is fully parallelized. Assuming that each computation in the loop body can be executed in one time unit, the execution time of one iteration of the pipelined loop is just one time unit. Only the innermost loop has a prologue and an epilogue. The code size can be further reduced by directly applying the code size reduction technique presented in [21].
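Fig. 1 is not reproduced in this excerpt, so the following is a minimal sketch of the same idea on a hypothetical two-statement loop nest of our own choosing, not the paper's example. Statement S1 is shifted one inner-loop iteration ahead while the row-wise execution order is kept, so only the innermost loop gains a prologue and an epilogue and no loop index or loop bound has to be recomputed. All names (`original`, `row_wise_pipelined`, `s1`, `s2`) are illustrative assumptions.

```python
def original(x):
    """Reference loop nest: S2 depends on S1 within the same iteration."""
    n, m = len(x), len(x[0])
    a = [[0] * m for _ in range(n)]
    b = [[0] * m for _ in range(n)]
    for i in range(n):
        for j in range(m):
            prev = b[i - 1][j] if i > 0 else 0
            a[i][j] = x[i][j] + prev   # S1: reads b from the previous row
            b[i][j] = a[i][j] * 2      # S2: reads a from this iteration
    return b


def row_wise_pipelined(x):
    """Same computation with S1 moved one inner iteration ahead (row-wise order kept)."""
    n, m = len(x), len(x[0])
    a = [[0] * m for _ in range(n)]
    b = [[0] * m for _ in range(n)]

    def s1(i, j):
        prev = b[i - 1][j] if i > 0 else 0
        a[i][j] = x[i][j] + prev

    def s2(i, j):
        b[i][j] = a[i][j] * 2

    for i in range(n):
        s1(i, 0)                   # prologue of the innermost loop
        for j in range(m - 1):
            s1(i, j + 1)           # S1 of iteration (i, j + 1)
            s2(i, j)               # S2 of iteration (i, j); independent of the line above
        s2(i, m - 1)               # epilogue of the innermost loop
    return b


x = [[1, 2, 3], [4, 5, 6]]
result = row_wise_pipelined(x)
print(result == original(x))
```

In the pipelined body the two statements no longer depend on each other, so on a machine with enough functional units they could issue in the same time unit. The outer loop is untouched, which is the property the row-wise approach is after.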

In our research, the multi-dimensional retiming framework is used to model software pipelining on nested loops. In the SPINE theory, we show how the schedule vector and retiming affect the computation time of an iteration for nested loops, assuming that loops are not unfolded [11], [5]. Based on the SPINE theory, we show how to achieve maximum parallelism in an iteration while keeping the overheads minimal by using a row-wise execution sequence whenever possible. Then, more efficient SPINE algorithms are developed to optimize nested loops under various design requirements. Our contributions are as follows:

1. We use the multi-dimensional retiming concept to effectively model the software pipelining problem on nested loops (Section 2.2).
2. We present the theory of software pipelining for nested loops to reveal the relationship among the computation time of a loop iteration, the execution sequence, and the software pipelining degree (Section 3).
3. We prove that the minimum computation time of an iteration for a nested loop with a given execution sequence can be achieved by using a retiming vector that is orthogonal to the execution sequence (Theorem 3.4).
4. We develop two algorithms for the Software PIpelining for NEsted loops (SPINE) technique (Section 4):
   - The SPINE-FULL algorithm fully parallelizes a nested loop with minimal overheads.
   - The SPINE-ROW-WISE algorithm achieves the maximal parallelism in an iteration with a fixed row-wise execution sequence. Therefore, the overheads due to loop transformation are minimal.
5. Our technique can be directly applied to loops with code between loop levels (imperfect nested loops) (Section 4).
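The orthogonality condition in the third contribution can be made concrete with a dot-product check on two-dimensional integer vectors. The sketch below is illustrative only: the formal definitions are not part of this excerpt, and writing the row-wise execution sequence as the vector (1, 0) is an assumption made here for the example, as are the function name and the candidate retiming vectors.

```python
def is_orthogonal(sequence_vector, retiming_vector):
    """Return True when the dot product of the two integer vectors is zero."""
    return sum(s * r for s, r in zip(sequence_vector, retiming_vector)) == 0


# Assumed encoding of a row-wise execution sequence (not taken from the paper).
row_wise = (1, 0)

# Hypothetical candidate retiming vectors.
candidates = [(0, 1), (1, 1), (0, 3), (1, -1)]

orthogonal = [r for r in candidates if is_orthogonal(row_wise, r)]
print(orthogonal)  # [(0, 1), (0, 3)]
```

Under this assumed encoding, the orthogonal candidates only shift computations along the inner loop dimension.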

We conduct experiments on a set of two-dimensional benchmarks to compare the quality of the code generated by SPINE with that generated by standard software pipelining and by MD retiming. Our experimental results show that SPINE outperforms or ties both of the other techniques on all of our benchmarks. The average improvement in the execution time of the pipelined loop generated by SPINE is 71.7% compared with that generated by a standard software pipelining technique such as modulo scheduling. The average code size is reduced by 69.5% compared with that generated by MD retiming. Based on the results of this paper, future research can be extended to other optimization objectives for nested loops, such as low-power scheduling and address register allocation.

The rest of the paper is organized as follows: Section 2 gives an overview of the graph representation of nested loops and the multi-dimensional retiming model. A brief discussion of the lower bound on the computation time of a nested loop iteration is also provided in that section. Section 3 presents the theory of Software PIpelining for NEsted loops (SPINE). The SPINE algorithms and an illustrative example are presented in Section 4, where we also show that our technique can be applied to imperfect nested loops. Section 5 shows our experimental results on a set of 2D benchmarks. Finally, we conclude the paper in Section 6.

Section snippets

Basic principles

In this section, we give an overview of the basic concepts and principles related to the software pipelining problem for nested loops. These include the multi-dimensional data flow graph, multi-dimensional retiming, and software pipelining. We demonstrate that retiming and software pipelining are essentially the same concept. A discussion of the limitations of the existing techniques for optimizing nested loops is provided in Section 2.4.

Theory of software pipelining for nested loops

In this section, we present the theoretical foundation of software pipelining of nested loops with two loop levels based on the retiming concept. We study the timing property of cycles in an MDFG considering both the schedule vector and retiming. Although the theorems are derived for two-dimensional, unit-time MDFGs, they can be generalized to multi-dimensional, general-time cases. First, we introduce the definitions and assumptions that are necessary for understanding the theorems.

Definition 3.1

Given a

SPINE algorithms

In this section, we present two algorithms of nest-loop software pipelining. The SPINE-FULL algorithm generates fully parallelized nested loops with computation and code size overheads as small as possible. It will be interesting to see that the chained MD retiming becomes a special case of the SPINE-FULL algorithm.

The SPINE-ROW-WISE algorithm fixes a row-wise execution sequence while generating a software-pipelined loop schedule with length equal to the schedule bound. Although the algorithms

Experiments

In our experiments, we compare software-pipelined loops generated by three different approaches: modulo scheduling (“modulo”), which is the standard software pipelining technique; chained MD retiming (“Chained”), which is the standard MD retiming technique; and the SPINE-FULL algorithm (“SPINE”). Our benchmarks include a set of 2D nested loops: wave digital filter (“WDF”), differential pulse-code modulation device (“DPCM”), two-dimensional filter (“2D”), Floyd-Steinberg algorithm (“Floyd”), a small

Conclusion

The existing techniques cannot optimize nested loops effectively for many embedded systems with strict timing and code size requirements. The standard software pipelining techniques only explore the parallelism in one dimension. Multi-dimensional retiming can fully parallelize a nested loop, but does not consider timing and code size overheads due to loop transformation. In this paper, we present the theory of Software PIpelining for NEsted loops (SPINE) based on the fundamental understanding

Acknowledgement

This work is partially supported by TI University Program, NSF EIA-0103709, Texas ARP 009741-0028-2001, NSF CCR-0309461, NSF IIS-0513669, and Microsoft, USA.

References (21)

R. Bailey, D. Defoe, R. Halverson, R. Simpson, N. Passos, A study of software pipelining of multi-dimensional problems,...
N. Chabini, W. Wolf, An approach for integrating basic retiming and software pipelining, in: Proceedings of the 4th...
L.-F. Chao et al., Rotation scheduling: a loop pipelining algorithm, IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems (1997)
L.-F. Chao et al., Static scheduling for synthesis of DSP algorithms on various models, Journal of VLSI Signal Processing (1995)
L.-F. Chao et al., Scheduling data-flow graphs via retiming and unfolding, IEEE Transactions on Parallel and Distributed Systems (1997)
A. Darte et al., Constructive methods for scheduling uniform loop nests, IEEE Transactions on Parallel and Distributed Systems (1994)
E. Granston, R. Scales, E. Stotzer, A. Ward, J. Zbiciak, Controlling code size of software-pipelined loops on the...
M. Lam, Software pipelining: an effective scheduling technique for VLIW machines, in: Proceedings of the SIGPLAN’88 ACM...
C.E. Leiserson et al., Retiming synchronous circuitry, Algorithmica (1991)
K. Muthukumar, G. Doshi, Software pipelining of nested loops, in: R. Wilhelm (Eds.), CC 2001, LNCS 2027,...

There are more references available in the full text version of this article.

