Maximize Parallelism Minimize Overhead for Nested Loops via Loop Striping

Xue, Chun; Shao, Zili; Sha, Edwin H.-M.

doi:10.1007/s11265-006-0034-5

Abstract

The majority of scientific and Digital Signal Processing (DSP) applications are recursive or iterative, and their computation is dominated by nested loops. Transformation techniques are generally applied to increase the parallelism of these nested loops. Most existing loop transformation techniques either cannot achieve maximum parallelism, or achieve it only at the cost of complicated loop-bound and loop-index calculations. This paper proposes a new technique, loop striping, that maximizes parallelism while maintaining the original row-wise execution sequence with minimum overhead. Loop striping groups iterations into stripes, where all iterations in a stripe are independent and can be executed in parallel. Theorems and efficient algorithms are proposed for loop striping transformations. The experimental results show that loop striping always achieves a better iteration period than software pipelining and loop unfolding, improving the average iteration period by 50% and 54%, respectively.
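The idea of grouping iterations so that every member of a group is independent can be illustrated with a small sketch. It is not the paper's algorithm: it assumes a two-level loop with uniform, lexicographically positive dependence vectors `(di, dj)`, and it assigns iteration `(i, j)` to stripe `j + s * i`, a wavefront-style numbering chosen here only for illustration. The names `min_offset`, `stripes`, `reachable`, and `is_independent` are hypothetical helpers, and the brute-force check is meant for tiny iteration spaces only.

```python
from collections import defaultdict


def min_offset(deps):
    """Smallest integer s >= 0 such that dj + s * di > 0 for every
    dependence vector (di, dj). Assumes lexicographically positive vectors."""
    s = 0
    for di, dj in deps:
        if di < 0 or (di == 0 and dj <= 0):
            raise ValueError("dependence vectors must be lexicographically positive")
        if di > 0:
            # Smallest integer strictly greater than -dj / di.
            s = max(s, (-dj) // di + 1)
    return s


def stripes(n_rows, n_cols, s):
    """Group iterations (i, j) by the illustrative stripe id j + s * i."""
    groups = defaultdict(list)
    for i in range(n_rows):
        for j in range(n_cols):
            groups[j + s * i].append((i, j))
    return [groups[k] for k in sorted(groups)]


def reachable(start, deps, n_rows, n_cols):
    """All iterations reachable from `start` through chains of dependences
    that stay inside the iteration space."""
    seen, frontier = set(), [start]
    while frontier:
        i, j = frontier.pop()
        for di, dj in deps:
            nxt = (i + di, j + dj)
            if 0 <= nxt[0] < n_rows and 0 <= nxt[1] < n_cols and nxt not in seen:
                seen.add(nxt)
                frontier.append(nxt)
    return seen


def is_independent(stripe, deps, n_rows, n_cols):
    """True if no member of the stripe depends, directly or transitively,
    on another member of the same stripe."""
    members = set(stripe)
    return all(members.isdisjoint(reachable(it, deps, n_rows, n_cols))
               for it in stripe)


if __name__ == "__main__":
    deps = [(0, 1), (1, -1)]          # example dependence vectors (assumed)
    s = min_offset(deps)
    for k, group in enumerate(stripes(3, 4, s)):
        print(k, group, is_independent(group, deps, 3, 4))
```

With the example vectors `(0, 1)` and `(1, -1)`, plain columns (`s = 0`) are not independent, because the chain `(0, 0) -> (0, 1) -> (1, 0)` links two iterations of the same column, whereas shifting each row by the computed offset yields groups whose members can run in parallel.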

Author information

Authors and Affiliations

Department of Computer Science, University of Texas at Dallas, Richardson, TX, 75083, USA
Chun Xue & Edwin H.-M. Sha
Department of Computing, Hong Kong Polytechnic University, Hung Hom, Kowloon, Hong Kong
Zili Shao

Corresponding author

Correspondence to Chun Xue.

Additional information

This work is partially supported by the TI University Program, NSF EIA-0103709, Texas ARP 009741-0028-2001, and NSF CCR-0309461, USA.

About this article

Cite this article

Xue, C., Shao, Z. & Sha, E.HM. Maximize Parallelism Minimize Overhead for Nested Loops via Loop Striping. J VLSI Sign Process Syst Sign Image Video Technol 47, 153–167 (2007). https://doi.org/10.1007/s11265-006-0034-5

Received: 13 June 2006
Accepted: 12 December 2006
Published: 17 February 2007
Issue Date: May 2007
DOI: https://doi.org/10.1007/s11265-006-0034-5
