TTAs: Missing the ILP complexity wall

doi:10.1016/S1383-7621(98)00046-0

Journal of Systems Architecture

Volume 45, Issues 12–13, June 1999, Pages 949-973

https://doi.org/10.1016/S1383-7621(98)00046-0

Abstract

A common approach to enhance the performance of processors is to increase the number of function units which operate concurrently. We observe this development in all recent general purpose superscalar processors, and in VLIW (very long instruction word) processors used for more dedicated application domains, like the multi-media domain. This paper analyzes the data path complexity of ILP processors (in particular VLIWs), and shows that they soon may hit the complexity wall; their complexity gets out of control when scaling to very high performance. Several methods are investigated for reducing this complexity. Essentially these methods trade hardware for software complexity, i.e., performing as much as possible at compile time. Combining these methods results in a new architecture, called transport triggered architecture or TTA. The concept of transport triggering is outlined together with its characteristics. It will be shown that the application of this concept results in a number of hardware advantages, and introduces a number of new scheduling optimizations. Together they substantially reduce the ILP complexity bottleneck, which will be demonstrated by a number of experiments.

Introduction

In order to fulfil continuously increasing demands for more processing power, most processor manufacturers have upgraded their general purpose processor architectures with superscalar capabilities. This allows them to stay binary compatible with what is already on the market. However, as is well known, the exploitation of instruction-level parallelism using superscalar techniques is rather limited. This is primarily caused by the limited instruction window of the hardware instruction dispatch mechanism. Enlarging this window results in high hardware costs for dependence checking and resource allocation, extra pipeline stages (which increase branch penalties), and a possible increase of the achievable cycle time [15].
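The growth in dependence checking cost can be made concrete with a small sketch. It is not taken from the paper: it assumes a simplified model in which every instruction in the window has one destination and two source registers, and only read-after-write checks against earlier instructions in the same window are counted. The function name and parameters are hypothetical.

```python
def dependence_comparisons(window_size: int, sources_per_instr: int = 2) -> int:
    """Register-name comparisons needed to check each instruction in the window
    against the destination of every earlier instruction (assumed model:
    one destination per instruction, read-after-write checks only)."""
    earlier_pairs = window_size * (window_size - 1) // 2
    return sources_per_instr * earlier_pairs


if __name__ == "__main__":
    for window in (2, 4, 8, 16):
        print(window, dependence_comparisons(window))
```

Under these assumptions the comparison count grows quadratically with the window size, which is one way to see why enlarging the window quickly becomes expensive in hardware.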

A different, but binary incompatible, approach to exploiting instruction-level parallelism (ILP) is taken by VLIW architectures. Several interesting VLIWs are currently reaching the market, such as the Trimedia from Philips, the Mpact from Chromatic and the TMS320C6x from Texas Instruments. They deliver high performance at reduced cost. The superscalar hardware has been replaced by compile-time dependence checking and resource allocation. Despite their good properties, the data path of VLIWs is still too complex, in particular when they are scaled to very high performance. This makes it interesting to look at alternative architectures which avoid this complexity but keep the good properties of VLIWs.

In this paper we analyze the data path complexity of VLIW processors. We demonstrate several methods to reduce this complexity by replacing run-time complexity by compile-time complexity. This results in a new architectural concept: the concept of transport triggering. Transport triggered architectures (TTAs) are based on this concept. It will be demonstrated that TTAs do not hit the ILP complexity wall.
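To give a feel for what programming by data transports means, the following toy interpreter is a minimal sketch and not the paper's architecture. It assumes single-cycle function units with one operand register, one trigger register and one result register; the `FunctionUnit` class, the port names and the tuple encoding of moves are all hypothetical. Each instruction is a list of moves that happen in the same cycle, and writing to a trigger port is what starts an operation.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List

# Hypothetical encodings, chosen for this sketch only:
#   source:      an int immediate, ("rf", index), or ("fu", name) for an FU result
#   destination: ("rf", index) or ("fu", name, "operand" | "trigger")


@dataclass
class FunctionUnit:
    op: Callable[[int, int], int]
    operand: int = 0
    result: int = 0

    def write(self, port: str, value: int) -> None:
        if port == "operand":
            self.operand = value  # plain latch, no side effect
        elif port == "trigger":
            # The transport itself starts the operation.
            self.result = self.op(self.operand, value)
        else:
            raise ValueError(f"unknown port: {port}")


def read(src, fus: Dict[str, FunctionUnit], rf: List[int]) -> int:
    if isinstance(src, int):
        return src  # immediate
    if src[0] == "rf":
        return rf[src[1]]
    return fus[src[1]].result


def run(program, fus: Dict[str, FunctionUnit], rf: List[int]) -> List[int]:
    for instruction in program:
        # All moves of one instruction share a cycle: read every source first.
        pending = [(read(src, fus, rf), dst) for src, dst in instruction]
        # Operand and register writes are applied before trigger writes.
        pending.sort(key=lambda item: item[1][-1] == "trigger")
        for value, dst in pending:
            if dst[0] == "rf":
                rf[dst[1]] = value
            else:
                fus[dst[1]].write(dst[2], value)
    return rf


# (3 + 4) * 5, with the adder result moved straight to the multiplier.
PROGRAM = [
    [(3, ("fu", "add", "operand")), (4, ("fu", "add", "trigger"))],
    [(("fu", "add"), ("fu", "mul", "operand")), (5, ("fu", "mul", "trigger"))],
    [(("fu", "mul"), ("rf", 0))],
]
units = {
    "add": FunctionUnit(lambda a, b: a + b),
    "mul": FunctionUnit(lambda a, b: a * b),
}
registers = run(PROGRAM, units, [0] * 4)
print(registers[0])  # 35
```

In this sketch the intermediate sum never passes through the register file; the program moves it directly from one unit to the next, so the compiler rather than the hardware decides which transports take place.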

This paper is structured as follows. Section 2 classifies different types of ILP architectures with respect to the amount of hardware they contain for parallelism detection. Section 3 introduces the required terminology concerning the complexity of the data path of ILP processors and applies this terminology to RISC processors. Section 4 treats the complexity of VLIW processors. Section 5 demonstrates several methods to reduce ILP processor data path complexity. Section 6 describes an approach we have taken to reduce complexity: transport triggered architectures are described. A quantitative evaluation of three main TTA advantages is presented in Section 7. Finally, Section 8 presents a summary and draws major conclusions.

Section snippets

ILP architectures and the role of the compiler

Exploiting parallelism has its price: the parallelism has to be detected and exploited efficiently by the architecture under consideration. Let us therefore look at which translation and interpretation steps have to be taken in order to execute a sequential program written in a HLL on a single ILP processor:

1. Frontend compilation. Lexical analysis and parsing of the program, performing optimizations, and compilation to basic operations. During this step alias analysis is also performed.
2. Determine

Data path complexity: Terminology

In general an instruction specifies one or more operations, and for each operation zero or more source operands and also zero or more destination operands. These operands are accessed either from the register file (RF) or from memory. The latter may require complex address arithmetic. Small immediates are directly accessible from the instruction register. The next discussion about data path complexity concentrates on the transport of these operands. We start with the data path of a

Complexity of VLIW architectures

In this section we apply the terminology introduced in the previous section to the data path of VLIW processors, and analyze its complexity for an arbitrary number of FUs (a similar analysis can be made for superscalar processors). VLIWs exploit ILP by having multiple FUs operating concurrently. The data path of a VLIW processor with two single cycle FUs is shown in Fig. 8. As indicated, multiple FUs may share a bus for immediate values. The figure shows one such bus; i.e. only one immediate can be

Reducing complexity

As shown in the previous section, ILP processors run into problems when supporting the exploitation of large amounts of concurrency. The RF and bypass components in particular become complex when many FUs are supported. In the following subsections we explore methods to reduce this complexity.
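How quickly the RF and bypass components grow can be illustrated with a rough count. This is a sketch under stated assumptions, not the complexity measure used in the paper: every FU is assumed to read two operands and write one result per cycle, and the bypass network is assumed to connect every FU output to every FU input. The function name and its parameters are hypothetical.

```python
def vliw_datapath_cost(num_fus: int, reads_per_fu: int = 2, writes_per_fu: int = 1) -> dict:
    """Rough port and bypass counts for a VLIW with num_fus function units
    (assumed model: worst-case RF porting and a fully connected bypass)."""
    read_ports = reads_per_fu * num_fus
    write_ports = writes_per_fu * num_fus
    # Each FU output can be forwarded to each FU input.
    bypass_links = num_fus * (reads_per_fu * num_fus)
    return {
        "read_ports": read_ports,
        "write_ports": write_ports,
        "bypass_links": bypass_links,
    }


if __name__ == "__main__":
    for k in (1, 2, 4, 8):
        print(k, vliw_datapath_cost(k))
```

With these assumptions the number of register ports grows linearly with the number of FUs, while the number of bypass connections grows quadratically.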

Transport triggered architectures

The bypass circuit in Fig. 12 is still underutilized when a particular FU does not provide a result at a certain time; in other words, the bypass transport capacity is still to be designed for worst case traffic conditions. This is not a problem when at least certain code fragments require this amount of inter FU communication. However, it becomes interesting to reduce the bypass capacity (or inter FU connectivity) when the number of FU outputs is larger than the (worst case) communication

Experimental evaluation

In the previous sections it was argued that making data transports visible at the architectural level results in a number of advantages. Three of the main advantages will be measured in this section: reduction of the number of required register ports, the extra concurrency available when splitting FUs, and the reduction of bypass connectivity. For these measurements we use a highly ILP optimizing TTA compiler and an architecture exploration tool, both developed by Hoogerbrugge [11]. The first two

Summary and conclusions

This paper analyzed the data path complexity problems of several ILP architectures. It introduced several complexity measures to perform this analysis. It showed that the data path of ILP architectures becomes very complex when scaled to high performance levels.

Methods were researched to reduce the complexity of the RF and the bypass circuitry. Their complexity could be substantially reduced by adding a new level of control, the data path control level, to the compiler, and removing this

References (22)

H. Adriani, F. Harmsze, H. Corporaal, The utilization of a fully configurable microprocessor development environment...
A. Agarwal et al., Sparcle: an evolutionary processor design for large-scale multiprocessors, IEEE Micro (1993)...
M. Arnold, R. Lamberts, H. Corporaal, High performance image processing using TTAs, Second Annual Conference of ASCI,...
Arvind, D. Culler, Dataflow architectures, Annual Reviews in Computer Science 1 (1986)...
H.B. Bakoglu, Circuits, Interconnections, and Packaging for VLSI, Addison-Wesley, Reading, MA,...
H. Corporaal, Microprocessor Architectures: from VLIW to TTA, John Wiley, 1997, ISBN...
H. Corporaal, J. Hoogerbrugge, Cosynthesis with the MOVE framework, CESA'96 IMACS Multiconference, Lille, France, 1996,...
D.E. Culler et al., Fine-grain parallelism with minimal hardware support: a compiler-controlled threaded abstract...
P.K. Dubey, K. O'Brien, C. Barton, Single-program speculative multithreading (SPSM) architecture: compiler-assisted...
J.L. Hennessy, D.A. Patterson, Computer Architecture, a Quantitative Approach, 2nd edition, Morgan Kaufmann, Los Altos,...

J. Hoogerbrugge, Code generation for transport triggered architectures, Ph.D. thesis, Delft University of Technology,...


Henk Corporaal is Associate Professor in Computer Architecture at the Delft University of Technology (TUD), in the Netherlands. He has managed a number of research projects in the areas of computer architecture, processor hardware design, and parallel processing. A key project, MOVE, concerns the automatic generation of hardware and software for embedded systems. Corporaal gained a Ph. D. in Electrical Engineering from the TUD and an M.Sc. in Physics from the University of Groningen (The Netherlands). He lectures undergraduate, graduate and postgraduate courses on computer programming, computer architecture and parallel processing at the TUD and the Advanced School for Computing and Imaging. He has written a range of publications in areas such as computer architecture, embedded system design, run-time support for high level languages, MIMD computing, concurrent simulation, neural networks, and code generation for instruction level parallel processors.

¹: Fax: +31 15 2784898; e-mail: [email protected]
