Impact of intercluster communication mechanisms on ILP in clustered VLIW architectures

Published: 02 February 2007

Abstract

VLIW processors have started gaining acceptance in the embedded systems domain. However, VLIW processors that pair a monolithic register file with a large number of functional units (FUs) are not viable: the register file needs a large number of ports to serve all the FUs, which makes it expensive and extremely slow. A simple solution is to break the register file into a number of smaller register files, each connected to a subset of the FUs. These architectures are termed clustered VLIW processors.
In this article, we first build a case for clustered VLIW processors with four or more clusters by showing that the achievable ILP in most media applications for a VLIW processor with 16 ALUs and 8 LD/ST units is around 20. We then provide a classification of the intercluster interconnection design space and show that a large part of this design space is currently unexplored. Next, using our performance evaluation methodology, we evaluate a subset of this design space and show that the most commonly used type of interconnection, RF-to-RF, falls short of the achievable performance by a large factor, while certain other types of interconnection can narrow this gap considerably. We also establish that this behavior is heavily application dependent, which emphasizes the importance of application-specific architecture exploration. In addition, we present results on the statistical behavior of these architectures as the number of clusters in our framework varies from 4 to 16. These results clearly show the advantages of one specific architecture over the others. Finally, based on our results, we propose a new interconnection network, which should narrow this performance gap.
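
The port-count argument in the abstract can be illustrated with a rough back-of-the-envelope sketch. The figures below are assumptions made for illustration, not numbers from the article: each FU is taken to need two read ports and one write port, FUs are split evenly across clusters, and any extra ports needed for intercluster communication are ignored. The function name `ports_per_register_file` is hypothetical.

```python
def ports_per_register_file(num_fus, num_clusters=1, reads_per_fu=2, writes_per_fu=1):
    """Ports each register file needs when FUs are split evenly across clusters.

    Assumption (not from the article): every FU needs `reads_per_fu` read
    ports and `writes_per_fu` write ports; intercluster ports are ignored.
    """
    if num_fus % num_clusters != 0:
        raise ValueError("FUs must divide evenly across clusters")
    fus_per_cluster = num_fus // num_clusters
    return fus_per_cluster * (reads_per_fu + writes_per_fu)


# 16 ALUs and 8 LD/ST units, the configuration named in the abstract.
TOTAL_FUS = 16 + 8

for clusters in (1, 4, 8):
    ports = ports_per_register_file(TOTAL_FUS, clusters)
    print(f"{clusters} cluster(s): {ports} ports per register file")
```

Under these assumptions the monolithic case needs 72 ports on a single register file, while four clusters need 18 ports on each of four smaller files, which is the kind of reduction that motivates clustering.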

Published In

ACM Transactions on Design Automation of Electronic Systems  Volume 12, Issue 1
January 2007
194 pages
ISSN:1084-4309
EISSN:1557-7309
DOI:10.1145/1188275

Publisher

Association for Computing Machinery

New York, NY, United States

Publication History

Published in TODAES Volume 12, Issue 1

Author Tags

  1. ASIP
  2. VLIW
  3. clustered VLIW processors
  4. performance evaluation

Qualifiers

  • Article

Cited By

  • (2018) Compiling for VLIW DSPs. Handbook of Signal Processing Systems, 979-1020. DOI: 10.1007/978-3-319-91734-4_27. Online publication date: 14-Oct-2018
  • (2013) Use of compiler optimization of software bypassing as a method to improve energy efficiency of exposed data path architectures. EURASIP Journal on Embedded Systems 2013:1. DOI: 10.1186/1687-3963-2013-9. Online publication date: 10-May-2013
  • (2013) Background and Related Work. Energy-Efficient Communication Processors, 25-68. DOI: 10.1007/978-1-4614-4992-8_2. Online publication date: 30-May-2013
  • (2012) Design and analysis of layered coarse-grained reconfigurable architecture. 2012 International Conference on Reconfigurable Computing and FPGAs, 1-6. DOI: 10.1109/ReConFig.2012.6416736. Online publication date: Dec-2012
  • (2011) Architecture design space exploration of run-time scalable issue-width processors. 2011 International Conference on Embedded Computer Systems: Architectures, Modeling and Simulation, 77-84. DOI: 10.1109/SAMOS.2011.6045447. Online publication date: Jul-2011
  • (2010) Overall Framework for Exploration. Ultra-Low Energy Domain-Specific Instruction-Set Processors, 83-113. DOI: 10.1007/978-90-481-9528-2_4. Online publication date: 3-Jul-2010
  • (2010) Global State-of-the-Art Overview. Ultra-Low Energy Domain-Specific Instruction-Set Processors, 17-32. DOI: 10.1007/978-90-481-9528-2_2. Online publication date: 3-Jul-2010
  • (2010) Compiling for VLIW DSPs. Handbook of Signal Processing Systems, 603-638. DOI: 10.1007/978-1-4419-6345-1_22. Online publication date: 16-Jul-2010
  • (2009) Computation and data transfer co-scheduling for interconnection bus minimization. Proceedings of the 2009 Asia and South Pacific Design Automation Conference, 311-316. DOI: 10.5555/1509633.1509716. Online publication date: 19-Jan-2009
  • (2009) Playing the trade-off game. ACM Transactions on Design Automation of Electronic Systems 14:3, 1-37. DOI: 10.1145/1529255.1529258. Online publication date: 4-Jun-2009
