Impact of intercluster communication mechanisms on ILP in clustered VLIW architectures

Authors:

M. Balakrishnan,

Anshul Kumar

ACM Transactions on Design Automation of Electronic Systems (TODAES), Volume 12, Issue 1

Article No.: 1, Pages 1 - 29

https://doi.org/10.1145/1188275.1188276

Published: 02 February 2007

Abstract

VLIW processors have started gaining acceptance in the embedded systems domain. However, VLIW processors that pair a monolithic register file with a large number of functional units (FUs) are not viable: every FU needs its own register-file ports, and a register file with that many ports is expensive and extremely slow. A simple solution is to break the register file into several smaller register files, each connected to a subset of the FUs. These architectures are termed clustered VLIW processors.

In this article, we first build a case for clustered VLIW processors with four or more clusters by showing that the achievable ILP in most of the media applications for a 16 ALU and 8 LD/ST VLIW processor is around 20. We then provide a classification of the intercluster interconnection design space, and show that a large part of this design space is currently unexplored. Next, using our performance evaluation methodology, we evaluate a subset of this design space and show that the most commonly used type of interconnection, RF-to-RF, fails to meet achievable performance by a large factor, while certain other types of interconnections can lower this gap considerably. We also establish that this behavior is heavily application dependent, emphasizing the importance of application-specific architecture exploration. We also present results about the statistical behavior of these different architectures by varying the number of clusters in our framework from 4 to 16. These results clearly show the advantages of one specific architecture over others. Finally, based on our results, we propose a new interconnection network, which should lower this performance gap.
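The port-count argument in the first paragraph of the abstract can be illustrated with a rough sketch. It is not taken from the article: the function name `ports_per_register_file` is hypothetical, and the figures of two read ports and one write port per FU are assumptions chosen only for illustration. Ports needed for intercluster communication are deliberately ignored.

```python
def ports_per_register_file(num_fus, num_clusters=1,
                            reads_per_fu=2, writes_per_fu=1):
    """Estimate the ports each register file needs.

    Assumptions (not from the article): FUs are split evenly across
    clusters, every FU needs `reads_per_fu` read ports and
    `writes_per_fu` write ports, and intercluster ports are ignored.
    """
    if num_clusters < 1 or num_fus % num_clusters != 0:
        raise ValueError("FUs must divide evenly across clusters")
    fus_per_cluster = num_fus // num_clusters
    return fus_per_cluster * (reads_per_fu + writes_per_fu)


if __name__ == "__main__":
    total_fus = 16 + 8  # 16 ALUs plus 8 LD/ST units, as in the abstract
    for clusters in (1, 4, 8):
        ports = ports_per_register_file(total_fus, clusters)
        print(f"{clusters} cluster(s): {ports} ports per register file")
```

Under these assumptions, the monolithic case (one cluster) needs 72 ports on a single register file, while four clusters need 18 ports each and eight clusters need 9 each. A real design would add ports or buses for intercluster transfers, which is the trade-off the article examines.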

Index Terms

Impact of intercluster communication mechanisms on ILP in clustered VLIW architectures
1. Computer systems organization
  1. Architectures
    1. Parallel architectures
      1. Very long instruction word
    2. Serial architectures
      1. Complex instruction set computing
      2. Reduced instruction set computing

Published In

ACM Transactions on Design Automation of Electronic Systems Volume 12, Issue 1

January 2007

194 pages

ISSN:1084-4309

EISSN:1557-7309

DOI:10.1145/1188275

Copyright © 2007 ACM.

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

Publisher

Association for Computing Machinery

New York, NY, United States

Journal Family

ACM Journals for the Design of Smart and Connected Systems
