ABSTRACT
Software pipelining is a critical optimization for producing efficient code for VLIW/EPIC and superscalar processors in high-performance embedded applications such as digital signal processing. Software thread integration (STI) can often improve the performance of looping code in cases where software pipelining performs poorly or fails. This paper examines both situations, presenting methods to determine what and when to integrate. We evaluate our methods on C-language image and digital signal processing libraries and synthetic loop kernels, compiled for a very long instruction word (VLIW) digital signal processor (DSP), the Texas Instruments (TI) C64x architecture. Loops that benefit little from software pipelining (SWP-Poor) speed up by 26% (harmonic mean, HM). Loops for which software pipelining fails (SWP-Fail) due to conditionals and calls speed up by 16% (HM). Combining SWP-Good and SWP-Poor loops leads to a speedup of 55% (HM).
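As a rough illustration of what integrating two looping threads can look like at the source level, the sketch below jams two independent loops into one body so that a VLIW scheduler has more independent operations per iteration. It is a hand-written example under stated assumptions, not the paper's method or its tool output: the kernels `dot` and `scale`, the function `dot_scale_integrated`, and all parameter names are hypothetical, and the buffers are assumed not to overlap.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical thread A: dot product. The accumulator carries a
 * dependence from one iteration to the next. */
int32_t dot(const int16_t *x, const int16_t *y, size_t n)
{
    int32_t acc = 0;
    for (size_t i = 0; i < n; i++)
        acc += (int32_t)x[i] * y[i];
    return acc;
}

/* Hypothetical thread B: scale a vector. Iterations are independent. */
void scale(const int16_t *in, int32_t *out, size_t m, int16_t gain)
{
    for (size_t i = 0; i < m; i++)
        out[i] = (int32_t)in[i] * gain;
}

/* Integrated version (illustrative): both loop bodies share one loop
 * over the common trip count, then each thread finishes its leftover
 * iterations in its own clean-up loop. Assumes out does not alias
 * x, y, or in, which the restrict qualifiers state to the compiler. */
int32_t dot_scale_integrated(const int16_t *restrict x,
                             const int16_t *restrict y, size_t n,
                             const int16_t *restrict in,
                             int32_t *restrict out, size_t m,
                             int16_t gain)
{
    size_t common = (n < m) ? n : m;
    int32_t acc = 0;
    size_t i;

    /* Jammed loop: one iteration of thread A and one of thread B. */
    for (i = 0; i < common; i++) {
        acc += (int32_t)x[i] * y[i];
        out[i] = (int32_t)in[i] * gain;
    }
    /* Clean-up for thread A when n > m. */
    for (size_t j = i; j < n; j++)
        acc += (int32_t)x[j] * y[j];
    /* Clean-up for thread B when m > n. */
    for (size_t j = i; j < m; j++)
        out[j] = (int32_t)in[j] * gain;

    return acc;
}
```

Calling `dot_scale_integrated` should give the same result as calling `dot` and `scale` one after the other. Whether the integrated loop actually runs faster depends on the target and compiler, which is the question the abstract's "what and when to integrate" refers to.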
Complementing software pipelining with software thread integration
Proceedings of the 2005 ACM SIGPLAN/SIGBED conference on Languages, compilers, and tools for embedded systems