Abstract
The majority of scientific and Digital Signal Processing (DSP) applications are recursive or iterative, and transformation techniques are generally applied to increase the parallelism of their nested loops. Most existing loop transformation techniques either cannot achieve maximum parallelism, or achieve it only at the cost of complicated loop bound and loop index calculations. This paper proposes a new technique, loop striping, that maximizes parallelism while maintaining the original row-wise execution sequence with minimum overhead. Loop striping groups iterations into stripes, where all iterations in a stripe are independent and can be executed in parallel. Theorems and efficient algorithms are proposed for loop striping transformations. The experimental results show that loop striping always achieves a better iteration period than software pipelining and loop unfolding, improving the average iteration period by 50% and 54%, respectively.
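To make the notion of a stripe concrete, the following is a minimal sketch in Python under simplifying assumptions that are not taken from the paper: a two-level loop nest with uniform dependence vectors, and stripes formed from consecutive iterations of a single row. The function names (`max_row_stripe_width`, `stripes`, `is_independent`) and the sample dependence vectors are hypothetical. The sketch only illustrates the independence property stated in the abstract; it does not reproduce the paper's theorems or algorithms.

```python
from typing import Iterator, List, Tuple

# A dependence vector (di, dj): iteration (i + di, j + dj) depends on (i, j).
Vec = Tuple[int, int]
Cell = Tuple[int, int]


def max_row_stripe_width(deps: List[Vec], row_length: int) -> int:
    """Largest f such that any f consecutive iterations of one row are independent.

    Assumption of this sketch: a stripe lies entirely within one row, so only
    dependences with outer distance 0 can link two iterations of a stripe.
    """
    same_row = [abs(dj) for di, dj in deps if di == 0 and dj != 0]
    if not same_row:
        return row_length  # no same-row dependence: the whole row is one stripe
    return min(min(same_row), row_length)


def stripes(rows: int, cols: int, f: int) -> Iterator[List[Cell]]:
    """Yield stripes of width f in the original row-wise order."""
    for i in range(rows):
        for j0 in range(0, cols, f):
            yield [(i, j) for j in range(j0, min(j0 + f, cols))]


def is_independent(stripe: List[Cell], deps: List[Vec]) -> bool:
    """True if no iteration in the stripe depends on another one in it."""
    cells = set(stripe)
    return not any(
        (i + di, j + dj) in cells for (i, j) in stripe for (di, dj) in deps
    )


# Hypothetical dependence vectors, chosen only for illustration.
deps = [(1, 0), (0, 2), (1, -1)]
f = max_row_stripe_width(deps, row_length=5)
ok = all(is_independent(s, deps) for s in stripes(2, 5, f))
too_wide = all(is_independent(s, deps) for s in stripes(2, 5, f + 1))
```

With these sample vectors, the only same-row dependence has distance 2, so stripes of two consecutive iterations are independent, while stripes of three are not. Rows are still visited in their original order, which mirrors the row-wise execution sequence described above.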
Additional information
This work is partially supported by TI University Program, NSF EIA-0103709, Texas ARP 009741-0028-2001 and NSF CCR-0309461, USA.
About this article
Cite this article
Xue, C., Shao, Z. & Sha, E.HM. Maximize Parallelism Minimize Overhead for Nested Loops via Loop Striping. J VLSI Sign Process Syst Sign Image Video Technol 47, 153–167 (2007). https://doi.org/10.1007/s11265-006-0034-5
DOI: https://doi.org/10.1007/s11265-006-0034-5