TTAs: Missing the ILP complexity wall
Introduction
To meet continuously increasing demands for processing power, most processor manufacturers have upgraded their general-purpose processor architectures with superscalar capabilities. This allows them to stay binary compatible with what is already on the market. However, as is well known, the exploitation of instruction-level parallelism using superscalar techniques is rather limited, primarily because of the limited instruction window of the hardware instruction dispatch mechanism. Enlarging this window results in high hardware costs for dependence checking and resource allocation, extra pipeline stages (which increase branch penalties), and a possible increase of the achievable cycle time [15].
A different, but binary incompatible, approach to exploiting instruction-level parallelism (ILP) is taken by VLIW architectures. Several interesting VLIWs have recently reached the market, such as the Trimedia from Philips, the Mpact from Chromatic, and the TMS320C6x from Texas Instruments. They deliver high performance at reduced cost: the superscalar dispatch hardware is replaced by compile-time dependence checking and resource allocation. Despite these good properties, the data path of VLIWs is still too complex, in particular when scaled to very high performance. This makes it interesting to look at alternative architectures that avoid this complexity but keep the good properties of VLIWs.
In this paper we analyze the data path complexity of VLIW processors and demonstrate several methods to reduce this complexity by replacing run-time complexity with compile-time complexity. This results in a new architectural concept: transport triggering. Transport triggered architectures (TTAs) are based on this concept. It will be demonstrated that TTAs do not hit the ILP complexity wall.
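The transport-triggering concept can be illustrated with a minimal sketch (the class, register, and port names below are illustrative assumptions, not the syntax of any real TTA or compiler): the program specifies data transports, and an operation starts as a side effect of moving an operand into a function unit's trigger port.

```python
# Sketch of transport triggering: the program is a sequence of *moves*,
# and writing an FU's trigger port implicitly launches the operation.

class AdderFU:
    """A two-input adder FU: one plain operand port, one trigger port."""
    def __init__(self):
        self.operand = 0      # plain operand port (just latches the value)
        self.result = None    # result port, read by a later move

    def trigger(self, value):
        # the move into the trigger port starts the add as a side effect
        self.result = self.operand + value

regs = {"r1": 3, "r2": 4, "r3": None}
add = AdderFU()

# Program = three transports, not one "add" operation:
add.operand = regs["r1"]   # r1 -> add.operand
add.trigger(regs["r2"])    # r2 -> add.trigger  (starts the add)
regs["r3"] = add.result    # add.result -> r3

print(regs["r3"])  # 7
```

Because only the transports are architecturally visible, the compiler, rather than the hardware, decides which buses carry which operands in each cycle.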
This paper is structured as follows. Section 2 classifies different types of ILP architectures with respect to the amount of hardware they contain for parallelism detection. Section 3 introduces the required terminology concerning the complexity of the data path of ILP processors and applies this terminology to RISC processors. Section 4 treats the complexity of VLIW processors. Section 5 demonstrates several methods to reduce ILP processor data path complexity. Section 6 describes an approach we have taken to reduce complexity: transport triggered architectures are described. A quantitative evaluation of three main TTA advantages is presented in Section 7. Finally, Section 8 presents a summary and draws major conclusions.
Section snippets
ILP architectures and the role of the compiler
Exploiting parallelism has its price: the parallelism has to be detected and exploited efficiently by the architecture under consideration. Let us therefore look at the translation and interpretation steps that have to be taken in order to execute a sequential program, written in a high-level language (HLL), on a single ILP processor:
- 1.
Frontend compilation. Lexical analysis and parsing of the program, performing optimizations, and compilation to basic operations. During this step alias analysis is also performed.
- 2.
Determine
Data path complexity: Terminology
In general, an instruction specifies one or more operations and, for each operation, zero or more source operands and zero or more destination operands. These operands are accessed either from the register file (RF) or from memory; the latter may require complex address arithmetic. Small immediates are directly accessible from the instruction register. The following discussion of data path complexity concentrates on the transport of these operands. We start with the data path of a
Complexity of VLIW architectures
In this section we apply the terminology introduced in the previous section to the data path of VLIW processors, and analyze its complexity for an arbitrary number of FUs (a similar analysis can be made for superscalar processors). VLIWs exploit ILP by having multiple FUs operate concurrently. The data path of a VLIW processor with two single-cycle FUs is shown in Fig. 8. As indicated, multiple FUs may share a bus for immediate values. The figure shows one such bus; i.e. only one immediate can be
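The scaling trend behind this complexity analysis can be sketched with a simple count. This is a hedged sketch, not the paper's exact model: it assumes every FU reads two source operands from a shared RF and writes one result per cycle, with a fully connected bypass network.

```python
def vliw_datapath_costs(n_fus, srcs_per_fu=2, dsts_per_fu=1):
    """Port and bypass counts for a shared-RF VLIW data path.

    Assumes each FU reads all sources from the central RF and that
    every FU result can be bypassed to every FU source input.
    """
    read_ports = n_fus * srcs_per_fu            # linear in FU count
    write_ports = n_fus * dsts_per_fu           # linear in FU count
    # full bypass: (result ports) x (source ports) point-to-point paths
    bypass_paths = (n_fus * dsts_per_fu) * (n_fus * srcs_per_fu)
    return read_ports, write_ports, bypass_paths

for n in (2, 4, 8):
    print(n, vliw_datapath_costs(n))
# RF ports grow linearly, but bypass connectivity grows quadratically
# in the number of FUs -- the "complexity wall" the paper refers to.
```

The quadratic bypass term is what dominates when a VLIW is scaled up; the actual figures depend on the FU mix and the number of destination operands.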
Reducing complexity
As shown in the previous section, ILP processors run into problems when supporting the exploitation of large amounts of concurrency. Especially the RF and the bypass components become complex when many FUs are supported. In the following subsections we explore methods to reduce this complexity.
Transport triggered architectures
The bypass circuit in Fig. 12 is still underutilized when a particular FU does not provide a result at a certain time; in other words, the bypass transport capacity still has to be designed for worst-case traffic conditions. This is not a problem when at least certain code fragments require this amount of inter-FU communication. However, it becomes interesting to reduce the bypass capacity (or inter-FU connectivity) when the number of FU outputs is larger than the (worst case) communication
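As a rough sketch of why shared transport buses reduce connectivity (the port shapes and bus count below are illustrative assumptions, not figures from the paper): with full point-to-point bypass the number of paths grows quadratically in the number of FUs, whereas with a small set of shared buses the number of bus sockets grows only linearly for a fixed bus count.

```python
def full_bypass_paths(n_fus, srcs=2, dsts=1):
    # point-to-point: every FU result port to every FU source port
    return (n_fus * dsts) * (n_fus * srcs)

def bus_sockets(n_fus, n_buses, srcs=2, dsts=1):
    # shared transport buses: each FU port gets one socket per bus
    # (assumes a fully populated connection matrix; a real design
    # would prune sockets that the compiled code never uses)
    return n_fus * (srcs + dsts) * n_buses

print(full_bypass_paths(8))        # 128 point-to-point paths
print(bus_sockets(8, n_buses=4))   # 96 bus sockets
```

The bus count becomes a design parameter the architect can match to the measured transport traffic, instead of being fixed by the worst case.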
Experimental evaluation
The previous sections argued that making data transports visible at the architectural level results in a number of advantages. Three of the main advantages are measured in this section: the reduction of the number of required register ports, the extra concurrency available when splitting FUs, and the reduction of bypass connectivity. For these measurements we use a highly ILP-optimizing TTA compiler and an architecture exploration tool, both developed by Hoogerbrugge [11]. The first two
Summary and conclusions
This paper analyzed the data path complexity problems of several ILP architectures. It introduced several complexity measures to perform this analysis. It showed that the data path of ILP architectures becomes very complex when scaled to high performance levels.
Methods were researched to reduce the complexity of the RF and the bypass circuitry. Their complexity could be substantially reduced by adding a new level of control, the data path control level, to the compiler, and removing this
References (22)
- H. Adriani, F. Harmsze, H. Corporaal, The utilization of a fully configurable microprocessor development environment...
- A. Agarwal et al., Sparcle: an evolutionary processor design for large-scale multiprocessors, IEEE Micro (1993)...
- M. Arnold, R. Lamberts, H. Corporaal, High performance image processing using TTAs, Second Annual Conference of ASCI,...
- Arvind, D. Culler, Dataflow architectures, Annual Reviews in Computer Science 1 (1986)...
- H.B. Bakoglu, Circuits, Interconnections, and Packaging for VLSI, Addison-Wesley, Reading, MA,...
- H. Corporaal, Microprocessor Architectures: from VLIW to TTA, John Wiley, 1997, ISBN...
- H. Corporaal, J. Hoogerbrugge, Cosynthesis with the MOVE framework, CESA'96 IMACS Multiconference, Lille, France, 1996,...
- D.E. Culler et al., Fine-grain parallelism with minimal hardware support: a compiler-controlled threaded abstract...
- P.K. Dubey, K. O'Brien, C. Barton, Single-program speculative multithreading (SPSM) architecture: compiler-assisted...
- J.L. Hennessy, D.A. Patterson, Computer Architecture, a Quantitative Approach, 2nd edition, Morgan Kaufmann, Los Altos,...
Henk Corporaal is Associate Professor in Computer Architecture at the Delft University of Technology (TUD), in the Netherlands. He has managed a number of research projects in the areas of computer architecture, processor hardware design, and parallel processing. A key project, MOVE, concerns the automatic generation of hardware and software for embedded systems. Corporaal gained a Ph. D. in Electrical Engineering from the TUD and an M.Sc. in Physics from the University of Groningen (The Netherlands). He lectures undergraduate, graduate and postgraduate courses on computer programming, computer architecture and parallel processing at the TUD and the Advanced School for Computing and Imaging. He has written a range of publications in areas such as computer architecture, embedded system design, run-time support for high level languages, MIMD computing, concurrent simulation, neural networks, and code generation for instruction level parallel processors.
- 1 Fax: +31 15 2784898; e-mail: [email protected]