Optimizing modulo scheduling to achieve reuse and concurrency for stream processors

Wang, Li; Xue, Jingling; Yang, Xuejun

doi:10.1007/s11227-010-0522-z

Published: 11 December 2010

Volume 59, pages 1229–1251, (2012)

The Journal of Supercomputing

Li Wang¹, Jingling Xue² & Xuejun Yang¹

Abstract

Both reuse and concurrency are performance-critical for stream processors. When loop unrolling and software pipelining are applied separately to stream-level loops, reuse, concurrency, or both may be inadequately exploited. In this paper, we optimize modulo scheduling to maximize stream reuse and improve concurrency for stream-level loops. The key insight is that an unrolled and software-pipelined stream-level loop can be described by a set of reuse equations. Guided by these reuse equations, a reuse-aware modulo scheduling algorithm is developed to optimize the two performance objectives, reuse and concurrency, simultaneously for a loop in a unified framework. Moreover, we describe a code generation algorithm that automatically produces the optimized loop from a given loop. Experimental results obtained on FT64 and by simulation demonstrate the effectiveness of the proposed approach.
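
To make the two objectives named in the abstract concrete, the following is a minimal toy model in Python. It is not the paper's reuse equations or its scheduling algorithm; it only assumes a hypothetical stream-level loop in which iteration i reads stream blocks i and i+1, and a two-stage (load, compute) pipeline with one load unit and one compute unit. The function names, the block access pattern, and the cost model are all illustrative assumptions.

```python
def count_stream_loads(n_iters, reuse):
    """Count block loads for a loop whose iteration i reads blocks i and i+1.

    Hypothetical access pattern, chosen only to show cross-iteration reuse.
    """
    resident = set()  # blocks currently held in on-chip memory (toy model)
    loads = 0
    for i in range(n_iters):
        for block in (i, i + 1):
            if reuse and block in resident:
                continue  # block is still on chip, so no load is issued
            loads += 1
        # Only block i+1 is needed again, by the next iteration.
        resident = {i + 1} if reuse else set()
    return loads


def loop_time(n_iters, load_time, kernel_time, pipelined):
    """Total time of a two-stage (load, compute) loop under a toy cost model.

    load_time is the time spent loading per iteration; kernel_time is the
    time spent computing per iteration.
    """
    if n_iters == 0:
        return 0
    if not pipelined:
        # Each iteration loads, then computes, with no overlap.
        return n_iters * (load_time + kernel_time)
    # Software-pipelined: the load for iteration i+1 overlaps the kernel of
    # iteration i. Prologue load, steady state, then epilogue kernel.
    steady = (n_iters - 1) * max(load_time, kernel_time)
    return load_time + steady + kernel_time


if __name__ == "__main__":
    n = 8
    print("loads without reuse:", count_stream_loads(n, reuse=False))
    print("loads with reuse:   ", count_stream_loads(n, reuse=True))
    print("sequential time:    ", loop_time(n, 6, 4, pipelined=False))
    print("pipelined time:     ", loop_time(n, 6, 4, pipelined=True))
```

In this sketch, reuse lowers the number of loads while pipelining hides load latency behind computation. The two effects interact, since fewer loads per iteration shorten the steady-state stage, which is one informal way to see why the abstract argues for optimizing both in a single framework.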

Author information

Authors and Affiliations

School of Computer, National University of Defense Technology, Changsha, 410073, China
Li Wang & Xuejun Yang
School of Computer Science and Engineering, UNSW, Sydney, NSW, 2052, Australia
Jingling Xue

Corresponding author

Correspondence to Li Wang.

About this article

Cite this article

Wang, L., Xue, J. & Yang, X. Optimizing modulo scheduling to achieve reuse and concurrency for stream processors. J Supercomput 59, 1229–1251 (2012). https://doi.org/10.1007/s11227-010-0522-z

Published: 11 December 2010
Issue Date: March 2012
DOI: https://doi.org/10.1007/s11227-010-0522-z
