Compiling for vector-thread architectures

Authors:

Krste Asanovic

CGO '08: Proceedings of the 6th annual IEEE/ACM international symposium on Code generation and optimization

Pages 205 - 215

https://doi.org/10.1145/1356058.1356085

Published: 06 April 2008

Abstract

Vector-thread (VT) architectures exploit multiple forms of parallelism simultaneously. This paper describes a compiler for the Scale VT architecture, which takes advantage of the VT features. We focus on compiling loops, and show how the compiler can transform code that poses difficulties for traditional vector or VLIW processors, such as loops with internal control flow or cross-iteration dependences, while still taking advantage of features not supported by multithreaded designs, such as vector memory instructions. We evaluate the compiler using several embedded benchmarks and show that we can obtain substantial speedups over a single-issue, in-order scalar machine.

Index Terms

Compiling for vector-thread architectures
1. Computer systems organization
  1. Architectures
    1. Parallel architectures
      1. Systolic arrays
2. Software and its engineering
  1. Software notations and tools
    1. Compilers
      1. Source code generation

Published In

CGO '08: Proceedings of the 6th annual IEEE/ACM international symposium on Code generation and optimization

April 2008

235 pages

ISBN: 9781595939784

DOI: 10.1145/1356058

General Chair: Mary Lou Soffa (University of Virginia, USA)

Program Chair: Evelyn Duesterwald (IBM Research, USA)

Copyright © 2008 ACM.

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

Publisher

Association for Computing Machinery

New York, NY, United States

Qualifiers

Research-article

Conference

CGO '08

CGO '08: 6th Annual IEEE/ACM International Symposium on Code Generation and Optimization

April 5-9, 2008

Boston, MA, USA

Acceptance Rates

CGO '08 paper acceptance rate: 21 of 66 submissions (32%)

Overall acceptance rate: 312 of 1,061 submissions (29%)

Article Metrics

Total Citations: 12
Total Downloads: 541
Downloads (last 12 months): 9
Downloads (last 6 weeks): 0

Reflects downloads up to 17 Feb 2025

Cited By

Chen T, Jia H, Zhang Y, Li K, Li Z, Zhao X, Yao J, Li C, Gallivan K, Nikolopoulos D, Beivide R, Gallopoulos E (2023). OpenFFT: An Adaptive Tuning Framework for 3D FFT on ARM Multicore CPUs. Proceedings of the 37th International Conference on Supercomputing, pages 398-409. DOI: 10.1145/3577193.3593735. Online publication date: 21-Jun-2023.
https://dl.acm.org/doi/10.1145/3577193.3593735
Yuan L, Cao H, Zhang Y, Li K, Lu P, Yue Y, de Supinski B, Hall M, Gamblin T (2021). Temporal vectorization for stencils. Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis, pages 1-13. DOI: 10.1145/3458817.3476149. Online publication date: 14-Nov-2021.
https://dl.acm.org/doi/10.1145/3458817.3476149
Jinyang Y, Rongcai Z, Qi W, Xiaohan T (2018). Loop-nest Auto-vectorization Method Based on Benefit Analysis. Proceedings of the 2nd International Conference on Advances in Image Processing, pages 240-244. DOI: 10.1145/3239576.3239620. Online publication date: 16-Jun-2018.
https://dl.acm.org/doi/10.1145/3239576.3239620
Baghsorkhi S, Vasudevan N, Wu Y (2016). FlexVec: auto-vectorization for irregular loops. ACM SIGPLAN Notices, 51:6, pages 697-710. DOI: 10.1145/2980983.2908111. Online publication date: 2-Jun-2016.
https://dl.acm.org/doi/10.1145/2980983.2908111
Baghsorkhi S, Vasudevan N, Wu Y, Krintz C, Berger E (2016). FlexVec: auto-vectorization for irregular loops. Proceedings of the 37th ACM SIGPLAN Conference on Programming Language Design and Implementation, pages 697-710. DOI: 10.1145/2908080.2908111. Online publication date: 2-Jun-2016.
https://dl.acm.org/doi/10.1145/2908080.2908111
Jinlong Xu, Huihui Sun, Rongcai Zhao (2015). SIMD vectorization of nested loop based on strip mining. 2015 IEEE/ACIS 16th International Conference on Software Engineering, Artificial Intelligence, Networking and Parallel/Distributed Computing (SNPD), pages 1-7. DOI: 10.1109/SNPD.2015.7176176. Online publication date: Jun-2015.
https://doi.org/10.1109/SNPD.2015.7176176
Lee Y, Avizienis R, Bishara A, Xia R, Lockhart D, Batten C, Asanović K (2013). Exploring the Tradeoffs between Programmability and Efficiency in Data-Parallel Accelerators. ACM Transactions on Computer Systems, 31:3, pages 1-38. DOI: 10.1145/2491464. Online publication date: 1-Aug-2013.
https://dl.acm.org/doi/10.1145/2491464
Lee Y, Avizienis R, Bishara A, Xia R, Lockhart D, Batten C, Asanović K (2011). Exploring the tradeoffs between programmability and efficiency in data-parallel accelerators. ACM SIGARCH Computer Architecture News, 39:3, pages 129-140. DOI: 10.1145/2024723.2000080. Online publication date: 4-Jun-2011.
https://dl.acm.org/doi/10.1145/2024723.2000080
Lee Y, Avizienis R, Bishara A, Xia R, Lockhart D, Batten C, Asanović K, Iyer R, Yang Q, González A (2011). Exploring the tradeoffs between programmability and efficiency in data-parallel accelerators. Proceedings of the 38th annual international symposium on Computer architecture, pages 129-140. DOI: 10.1145/2000064.2000080. Online publication date: 4-Jun-2011.
https://dl.acm.org/doi/10.1145/2000064.2000080
Choi Y, Lin Y, Chong N, Mahlke S, Mudge T (2009). Stream Compilation for Real-Time Embedded Multicore Systems. Proceedings of the 7th annual IEEE/ACM International Symposium on Code Generation and Optimization, pages 210-220. DOI: 10.1109/CGO.2009.27. Online publication date: 22-Mar-2009.
https://dl.acm.org/doi/10.1109/CGO.2009.27
