DOI: 10.1145/1065910.1065930
Article

Complementing software pipelining with software thread integration

Published: 15 June 2005

ABSTRACT

Software pipelining is a critical optimization for producing efficient code for VLIW/EPIC and superscalar processors in high-performance embedded applications such as digital signal processing. Software thread integration (STI) can often improve the performance of looping code in cases where software pipelining performs poorly or fails. This paper examines both situations, presenting methods to determine what and when to integrate. We evaluate our methods on C-language image and digital signal processing libraries and synthetic loop kernels. We compile them for a very long instruction word (VLIW) digital signal processor (DSP) -- the Texas Instruments (TI) C64x architecture. Loops which benefit little from software pipelining (SWP-Poor) speed up by 26% (harmonic mean, HM). Loops for which software pipelining fails (SWP-Fail) due to conditionals and calls speed up by 16% (HM). Combining SWP-Good and SWP-Poor loops leads to a speedup of 55% (HM).
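
As a rough illustration of what integrating two looping procedures can look like at the source level, the sketch below merges the bodies of two independent loops into one loop, with clean-up loops for unequal trip counts. It is not taken from the paper: the function names, the two kernels, and the assumption that none of the four buffers overlap are all hypothetical, chosen only to show the general shape of the transformation.

```c
#include <stddef.h>

/* Hypothetical procedure A: scale each sample by a gain. */
void scale(const int *in, int *out, size_t n, int gain) {
    for (size_t i = 0; i < n; i++)
        out[i] = in[i] * gain;
}

/* Hypothetical procedure B: clamp negative samples to zero. */
void clamp_neg(const int *in, int *out, size_t m) {
    for (size_t j = 0; j < m; j++)
        out[j] = in[j] < 0 ? 0 : in[j];
}

/* Integrated version (assumes the four buffers do not overlap).
 * Both bodies share one loop for the common trip count, giving the
 * instruction scheduler independent work from two procedures. */
void scale_clamp_integrated(const int *a_in, int *a_out, size_t n, int gain,
                            const int *b_in, int *b_out, size_t m) {
    size_t common = n < m ? n : m;

    for (size_t i = 0; i < common; i++) {
        a_out[i] = a_in[i] * gain;               /* body of procedure A */
        b_out[i] = b_in[i] < 0 ? 0 : b_in[i];    /* body of procedure B */
    }

    /* Clean-up loops: at most one of these runs any iterations. */
    for (size_t i = common; i < n; i++)
        a_out[i] = a_in[i] * gain;
    for (size_t j = common; j < m; j++)
        b_out[j] = b_in[j] < 0 ? 0 : b_in[j];
}
```

Calling `scale_clamp_integrated` should produce the same outputs as calling `scale` and `clamp_neg` one after the other. Whether the merged loop actually schedules better depends on the target and the compiler.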

    Published in

      LCTES '05: Proceedings of the 2005 ACM SIGPLAN/SIGBED conference on Languages, compilers, and tools for embedded systems
      June 2005
      248 pages
      ISBN:1595930183
      DOI:10.1145/1065910
      General Chair: Yunheung Paek
      Program Chair: Rajiv Gupta

        ACM SIGPLAN Notices, Volume 40, Issue 7
        Proceedings of the 2005 ACM SIGPLAN/SIGBED conference on Languages, compilers, and tools for embedded systems
        July 2005
        238 pages
        ISSN:0362-1340
        EISSN:1558-1160
        DOI:10.1145/1070891

      Copyright © 2005 ACM

      Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

      Publisher

      Association for Computing Machinery

      New York, NY, United States

      Qualifiers

      • Article

      Acceptance Rates

      Overall Acceptance Rate: 116 of 438 submissions, 26%
