Complementing software pipelining with software thread integration

Authors:
Won So, North Carolina State University, Raleigh, NC
Alexander G. Dean, North Carolina State University, Raleigh, NC

LCTES '05: Proceedings of the 2005 ACM SIGPLAN/SIGBED conference on Languages, compilers, and tools for embedded systems, June 2005, Pages 137–146, https://doi.org/10.1145/1065910.1065930

Published: 15 June 2005

ABSTRACT

Software pipelining is a critical optimization for producing efficient code for VLIW/EPIC and superscalar processors in high-performance embedded applications such as digital signal processing. Software thread integration (STI) can often improve the performance of looping code in cases where software pipelining performs poorly or fails. This paper examines both situations, presenting methods to determine what and when to integrate.

We evaluate our methods on C-language image and digital signal processing libraries and synthetic loop kernels. We compile them for a very long instruction word (VLIW) digital signal processor (DSP), the Texas Instruments (TI) C64x architecture. Loops which benefit little from software pipelining (SWP-Poor) speed up by 26% (harmonic mean, HM). Loops for which software pipelining fails (SWP-Fail) due to conditionals and calls speed up by 16% (HM). Combining SWP-Good and SWP-Poor loops leads to a speedup of 55% (HM).
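As a rough illustration of the idea of integrating two loops, and not the authors' tool or algorithm, the following C sketch merges two independent procedures by hand. All function names and both kernels are hypothetical. The first loop is assumed to pipeline well, and the second is assumed to pipeline poorly because of its loop-carried dependence. The integrated version jams both bodies into one loop over the shared iteration range so a VLIW scheduler has independent work available in each iteration, then finishes whichever loop is longer.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical thread A: iterations are independent of each other, so this
 * loop is assumed to software-pipeline well. */
void scale_u16(const uint16_t *in, uint16_t *out, size_t n, uint16_t gain)
{
    for (size_t i = 0; i < n; i++)
        out[i] = (uint16_t)(((uint32_t)in[i] * gain) >> 8);
}

/* Hypothetical thread B: every iteration needs the previous value of acc,
 * so the loop-carried dependence is assumed to limit software pipelining. */
uint32_t hash_u8(const uint8_t *buf, size_t n)
{
    uint32_t acc = 0;
    for (size_t i = 0; i < n; i++)
        acc = acc * 31u + buf[i];
    return acc;
}

/* Hand-integrated version of A and B (a sketch, not the paper's method).
 * Assumes the out array does not overlap buf, so the two bodies stay
 * independent and can be interleaved freely. */
uint32_t scale_and_hash(const uint16_t *in, uint16_t *out, size_t n_a,
                        uint16_t gain, const uint8_t *buf, size_t n_b)
{
    assert(in != NULL && out != NULL && buf != NULL);

    size_t common = (n_a < n_b) ? n_a : n_b;
    uint32_t acc = 0;
    size_t i;

    /* Integrated loop: one iteration of each thread per trip. */
    for (i = 0; i < common; i++) {
        out[i] = (uint16_t)(((uint32_t)in[i] * gain) >> 8);
        acc = acc * 31u + buf[i];
    }

    /* Clean-up loops: at most one of these runs any iterations. */
    for (size_t j = i; j < n_a; j++)
        out[j] = (uint16_t)(((uint32_t)in[j] * gain) >> 8);
    for (size_t j = i; j < n_b; j++)
        acc = acc * 31u + buf[j];

    return acc;
}
```

Calling `scale_and_hash(in, out, n_a, gain, buf, n_b)` is meant to produce the same `out` contents as `scale_u16` and the same return value as `hash_u8`. Whether the integrated loop actually runs faster depends on the target and compiler, which is the question the paper's evaluation addresses.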

Index Terms

Software and its engineering > Software notations and tools > Compilers > Source code generation

Published in
LCTES '05: Proceedings of the 2005 ACM SIGPLAN/SIGBED conference on Languages, compilers, and tools for embedded systems
June 2005, 248 pages
ISBN: 1595930183
DOI: 10.1145/1065910
General Chair: Yunheung Paek, Seoul National University, Seoul, Korea
Program Chair: Rajiv Gupta, University of Arizona, Tucson, USA
ACM SIGPLAN Notices, Volume 40, Issue 7
July 2005, 238 pages
ISSN: 0362-1340
EISSN: 1558-1160
DOI: 10.1145/1070891
Copyright © 2005 ACM
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]
Publisher: Association for Computing Machinery, New York, NY, United States
Author Tags
DSP
TI C6000
VLIW
coarse-grain parallelism
software pipelining
software thread integration
stream programming
Acceptance Rates
Overall Acceptance Rate: 116 of 438 submissions, 26%