Abstract
Both reuse and concurrency are performance-critical for stream processors. When loop unrolling and software pipelining are applied separately to stream-level loops, reuse, concurrency, or both may be inadequately exploited. In this paper, we optimize modulo scheduling to maximize stream reuse and improve concurrency for stream-level loops. The key insight is that an unrolled and software-pipelined stream-level loop can be described by a set of reuse equations. Guided by these reuse equations, we develop a reuse-aware modulo scheduling algorithm that simultaneously optimizes the two performance objectives, reuse and concurrency, for a loop in a unified framework. Moreover, we describe a code generation algorithm that automatically produces the optimized loop from a given loop. Experimental results obtained on FT64 and by simulation demonstrate the effectiveness of the proposed approach.
Cite this article
Wang, L., Xue, J. & Yang, X. Optimizing modulo scheduling to achieve reuse and concurrency for stream processors. J Supercomput 59, 1229–1251 (2012). https://doi.org/10.1007/s11227-010-0522-z