TTAs: Missing the ILP complexity wall

https://doi.org/10.1016/S1383-7621(98)00046-0

Abstract

A common approach to enhance the performance of processors is to increase the number of function units which operate concurrently. We observe this development in all recent general purpose superscalar processors, and in VLIW (very long instruction word) processors used for more dedicated application domains, like the multi-media domain. This paper analyzes the data path complexity of ILP processors (in particular VLIWs), and shows that they soon may hit the complexity wall; their complexity gets out of control when scaling to very high performance. Several methods are investigated for reducing this complexity. Essentially these methods trade hardware for software complexity, i.e., performing as much as possible at compile time. Combining these methods results in a new architecture, called transport triggered architecture or TTA. The concept of transport triggering is outlined together with its characteristics. It will be shown that the application of this concept results in a number of hardware advantages, and introduces a number of new scheduling optimizations. Together they substantially reduce the ILP complexity bottleneck, which will be demonstrated by a number of experiments.

Introduction

In order to fulfil continuously increasing demands for more processing power, most processor manufacturers upgraded their general purpose processor architectures with superscalar capabilities. This allows them to stay binary compatible with what is already on the market. However, as is well known, the exploitation of instruction-level parallelism using superscalar techniques is rather limited. Primarily this is caused by the limited instruction window of the hardware instruction dispatch mechanism. Enlarging this window results in high hardware costs for dependence checking and resource allocation, extra pipeline stages (which increase branch penalties), and a possible increase of the achievable cycle time [15].

A different, but binary incompatible, approach to exploiting instruction level parallelism (ILP) is taken by VLIW architectures. Currently, several interesting VLIWs hit the market, like the Trimedia of Philips, the Mpact of Chromatic and the TMS320C6x of Texas Instruments. They deliver high performance at reduced cost. The superscalar hardware has been replaced by compile-time dependence checking and resource allocation. Despite their good properties, the data path of VLIWs is still too complex, in particular when they are scaled to very high performance. This makes it interesting to look at alternative architectures which avoid this complexity, but keep the good properties of VLIWs.

In this paper we analyze the data path complexity of VLIW processors. We demonstrate several methods to reduce this complexity by replacing run-time complexity by compile-time complexity. This results in a new architectural concept: the concept of transport triggering. Transport triggered architectures (TTAs) are based on this concept. It will be demonstrated that TTAs do not hit the ILP complexity wall.

This paper is structured as follows. Section 2 classifies different types of ILP architectures with respect to the amount of hardware they contain for parallelism detection. Section 3 introduces the required terminology concerning the complexity of the data path of ILP processors and applies this terminology to RISC processors. Section 4 treats the complexity of VLIW processors. Section 5 demonstrates several methods to reduce ILP processor data path complexity. Section 6 describes an approach we have taken to reduce complexity: transport triggered architectures are described. A quantitative evaluation of three main TTA advantages is presented in Section 7. Finally, Section 8 presents a summary and draws major conclusions.

Section snippets

ILP architectures and the role of the compiler

Exploiting parallelism has its price: the parallelism has to be detected and exploited efficiently by the architecture under consideration. Let us therefore look at which translation and interpretation steps have to be taken in order to execute a sequential program written in a HLL on a single ILP processor:

  1. Frontend compilation. Lexical analysis and parsing of the program, performing optimizations, and compilation to basic operations. During this step alias analysis is also performed.

  2. Determine

Data path complexity: Terminology

In general an instruction specifies one or more operations, and for each operation zero or more source operands and also zero or more destination operands. These operands are accessed either from the register file (RF) or from memory. The latter may require complex address arithmetic. Small immediates are directly accessible from the instruction register. The next discussion about data path complexity concentrates on the transport of these operands. We start with the data path of a

Complexity of VLIW architectures

In this section we apply the terminology introduced in the previous section to the data path of VLIW processors, and analyze its complexity for an arbitrary number of FUs (a similar analysis can be made for superscalar processors). VLIWs exploit ILP by having multiple FUs operating concurrently. The data path of a VLIW processor with two single-cycle FUs is shown in Fig. 8. As indicated, multiple FUs may share a bus for immediate values. The figure shows one such bus; i.e. only one immediate can be

Reducing complexity

As shown in the previous section, ILP processors run into problems when supporting the exploitation of large amounts of concurrency. The RF and bypass components in particular become complex when many FUs are supported. In the following subsections we explore methods to reduce this complexity.
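To give a feel for how quickly the RF and bypass components grow, the following is a minimal sketch under a simple cost model that is an assumption of this example, not the paper's own complexity measure: every FU reads two operands and writes one result per cycle, and a full bypass network connects every FU output to every FU input. The function name `vliw_datapath_cost` and its parameters are hypothetical.

```python
def vliw_datapath_cost(num_fus, reads_per_fu=2, writes_per_fu=1):
    """Worst-case RF port and bypass connection counts (assumed model)."""
    # Each FU may read all of its source operands from the RF in one cycle.
    read_ports = reads_per_fu * num_fus
    # Each FU may write its result to the RF in the same cycle.
    write_ports = writes_per_fu * num_fus
    # Full bypassing: every FU output can be forwarded to every FU input.
    bypass_connections = (writes_per_fu * num_fus) * (reads_per_fu * num_fus)
    return {
        "read_ports": read_ports,
        "write_ports": write_ports,
        "bypass_connections": bypass_connections,
    }


for n in (1, 2, 4, 8):
    print(n, vliw_datapath_cost(n))
```

Under these assumptions the RF port count grows linearly with the number of FUs, while the bypass connection count grows quadratically, which is why both components dominate the data path at high issue widths.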

Transport triggered architectures

The bypass circuit in Fig. 12 is still underutilized when a particular FU does not provide a result at a certain time; in other words, the bypass transport capacity is still to be designed for worst case traffic conditions. This is not a problem when at least certain code fragments require this amount of inter FU communication. However, it becomes interesting to reduce the bypass capacity (or inter FU connectivity) when the number of FU outputs is larger than the (worst case) communication
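The idea of programming data transports rather than operations can be illustrated with a toy model. The sketch below is an assumption-laden illustration, not the MOVE framework's actual interface: the class `AddUnit`, the port names `add.o` (operand), `add.t` (trigger), and `add.r` (result), the single-cycle latency, and the one-move-at-a-time execution are all hypothetical simplifications.

```python
class AddUnit:
    """Hypothetical adder FU with operand, trigger, and result registers."""

    def __init__(self):
        self.operand = 0
        self.result = 0

    def write(self, port, value):
        if port == "o":
            # Moving a value to the operand register has no side effect.
            self.operand = value
        elif port == "t":
            # Moving a value to the trigger register starts the operation.
            self.result = self.operand + value
        else:
            raise ValueError(f"unknown port: {port}")


def run(moves, registers, fu):
    """Execute (source, destination) moves one at a time."""
    for src, dst in moves:
        value = fu.result if src == "add.r" else registers[src]
        if dst.startswith("add."):
            fu.write(dst.split(".", 1)[1], value)
        else:
            registers[dst] = value
    return registers


# r3 = r1 + r2, expressed as three transports instead of one operation.
regs = {"r1": 3, "r2": 4, "r3": 0}
program = [("r1", "add.o"), ("r2", "add.t"), ("add.r", "r3")]
run(program, regs, AddUnit())
print(regs["r3"])
```

Because each transport is a separate, compiler-visible step, a scheduler is free to decide which results are actually moved to the RF and which are forwarded directly between FUs, which is where the reductions in register ports and bypass connectivity discussed in this paper come from.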

Experimental evaluation

In the former sections it was argued that making data transports visible at the architectural level results in a number of advantages. Three of the main advantages will be measured in this section: reduction of the number of required register ports, the extra concurrency available when splitting FUs, and the reduction of bypass connectivity. For these measurements we use a highly ILP optimizing TTA compiler and an architecture exploration tool, both developed by Hoogerbrugge [11]. The first two

Summary and conclusions

This paper analyzed the data path complexity problems of several ILP architectures. It introduced several complexity measures to perform this analysis. It showed that the data path of ILP architectures becomes very complex when scaled to high performance levels.

Methods were researched to reduce the complexity of the RF and the bypass circuitry. Their complexity could be substantially reduced by adding a new level of control, the data path control level, to the compiler, and removing this

References (22)

  • H. Adriani, F. Harmsze, H. Corporaal, The utilization of a fully configurable microprocessor development environment...
  • A. Agarwal et al., Sparcle: an evolutionary processor design for large-scale multiprocessors, IEEE Micro (1993)...
  • M. Arnold, R. Lamberts, H. Corporaal, High performance image processing using TTAs, Second Annual Conference of ASCI,...
  • Arvind, D. Culler, Dataflow architectures, Annual Reviews in Computer Science 1 (1986)...
  • H.B. Bakoglu, Circuits, Interconnections, and Packaging for VLSI, Addison-Wesley, Reading, MA,...
  • H. Corporaal, Microprocessor Architectures: from VLIW to TTA, John Wiley, 1997, ISBN...
  • H. Corporaal, J. Hoogerbrugge, Cosynthesis with the MOVE framework, CESA'96 IMACS Multiconference, Lille, France, 1996,...
  • D.E. Culler et al., Fine-grain parallelism with minimal hardware support: a compiler-controlled threaded abstract...
  • P.K. Dubey, K. O'Brien, C. Barton, Single-program speculative multithreading (SPSM) architecture: compiler-assisted...
  • J.L. Hennessy, D.A. Patterson, Computer Architecture, a Quantitative Approach, 2nd edition, Morgan Kaufmann, Los Altos,...
  • J. Hoogerbrugge, Code generation for transport triggered architectures, Ph.D. thesis, Delft University of Technology,...

Cited by (36)

    • MPSoC based on Transport Triggered Architecture for baseband processing of an LTE receiver

      2014, Journal of Systems Architecture
      Citation Excerpt :

      In the TTA instruction there is a slot for each bus to specify its associated move operation in a specific clock cycle. It more closely resembles to a VLIW architecture but with reduced complexity as compared to conventional VLIW architectures [15], as the processor is not programmed by operations but by defining the moves to functional units. A typical architecture consists of several buses, functional units, control unit, register files and load store units as shown in Fig. 4.

    • A high performance, area efficient TTA-like vertex shader architecture with optimized floating point arithmetic unit for embedded graphics applications

      2013, Microprocessors and Microsystems
      Citation Excerpt :

      Finally, a conclusion of this work is summarized in Section 6. Transport Triggered Architecture (TTA) [20] is a statically programmed ILP modular architecture with high resemblance to VLIW architectures at the point of similar instruction formats encoded horizontally by a number of fields. The main difference between TTAs and VLIWs is their programming method.

    • Allocation and Scheduling of Dataflow Graphs on Hybrid Dataflow/von Neumann Architectures

      2023, Proceedings - 2023 21st ACM/IEEE International Symposium on Formal Methods and Models for System Design, MEMOCODE 2023
    • Towards Buffers as a Scalable Alternative to Registers for Processor-Local Memory

      2023, MBMV 2023 - Methoden und Beschreibungssprachen zur Modellierung und Verifikation von Schaltungen und Systemen, 26. Workshop

    Henk Corporaal is Associate Professor in Computer Architecture at the Delft University of Technology (TUD), in the Netherlands. He has managed a number of research projects in the areas of computer architecture, processor hardware design, and parallel processing. A key project, MOVE, concerns the automatic generation of hardware and software for embedded systems. Corporaal gained a Ph. D. in Electrical Engineering from the TUD and an M.Sc. in Physics from the University of Groningen (The Netherlands). He lectures undergraduate, graduate and postgraduate courses on computer programming, computer architecture and parallel processing at the TUD and the Advanced School for Computing and Imaging. He has written a range of publications in areas such as computer architecture, embedded system design, run-time support for high level languages, MIMD computing, concurrent simulation, neural networks, and code generation for instruction level parallel processors.

    Fax: +31 15 2784898