Optimizing modulo scheduling to achieve reuse and concurrency for stream processors

The Journal of Supercomputing

Abstract

Both reuse and concurrency are performance-critical for stream processors. When loop unrolling and software pipelining are applied separately to stream-level loops, reuse, concurrency, or both may be inadequately exploited. In this paper, we optimize modulo scheduling to maximize stream reuse and improve concurrency for stream-level loops. The key insight is that an unrolled and software-pipelined stream-level loop can be described by a set of reuse equations. Guided by these reuse equations, we develop a reuse-aware modulo scheduling algorithm that optimizes both performance objectives, reuse and concurrency, for a loop within a unified framework. Moreover, we describe a code generation algorithm that automatically produces the optimized loop from a given loop. Experimental results obtained on FT64 and by simulation demonstrate the effectiveness of the proposed approach.
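The abstract does not reproduce the reuse equations or the scheduling algorithm, so the following is only a toy illustration of the two objectives it names, not the method of the paper. It assumes a hypothetical stream-level loop in which iteration i consumes two adjacent input strips, a[i] and a[i+1], and passes through three stages (load, kernel, store). All function names and the three-stage model are assumptions made for this sketch.

```python
def naive_loads(n):
    """Strips fetched from memory when every iteration reloads both operands."""
    loads = []
    for i in range(n):
        loads += [i, i + 1]  # a[i] and a[i+1] are both fetched again
    return loads


def reuse_loads(n):
    """Strips fetched when a[i+1] stays on chip and is reused by iteration i+1."""
    loads = [0] if n > 0 else []
    for i in range(n):
        loads.append(i + 1)  # only the new strip is fetched
    return loads


def pipelined_schedule(n, stages=("load", "kernel", "store")):
    """Software-pipelined schedule with an initiation interval of one step.

    At time step t, iteration i executes stage number t - i, so stages of
    different iterations overlap in the steady state.
    """
    steps = []
    for t in range(n + len(stages) - 1):
        active = [(stages[t - i], i) for i in range(n) if 0 <= t - i < len(stages)]
        steps.append(active)
    return steps


if __name__ == "__main__":
    n = 4
    print("loads without reuse:", len(naive_loads(n)))
    print("loads with reuse:   ", len(reuse_loads(n)))
    for t, active in enumerate(pipelined_schedule(n)):
        print(t, active)
```

In this toy model, keeping a strip on chip cuts memory loads from 2n to n + 1, and overlapping stages shortens the loop from 3n sequential steps to n + 2. A scheduler that treats these two effects separately can improve one while limiting the other, which is the tension the abstract describes.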

Author information

Corresponding author

Correspondence to Li Wang.

About this article

Cite this article

Wang, L., Xue, J. & Yang, X. Optimizing modulo scheduling to achieve reuse and concurrency for stream processors. J Supercomput 59, 1229–1251 (2012). https://doi.org/10.1007/s11227-010-0522-z

  • DOI: https://doi.org/10.1007/s11227-010-0522-z
