Compiler-assisted energy optimization for clustered VLIW processors

doi:10.1016/j.jpdc.2012.04.005

Journal of Parallel and Distributed Computing

Volume 72, Issue 8, August 2012, Pages 944-959

https://doi.org/10.1016/j.jpdc.2012.04.005

Abstract

Clustered architecture processors are preferred for embedded systems because centralized register file architectures scale poorly in terms of clock rate, chip area, and power consumption. Although clustering helps by improving the clock speed, reducing the energy consumption of the logic, and making the design simpler, it introduces extra overheads by way of inter-cluster communication. This communication happens over long global wires having high load capacitance, which leads to delay in execution and significantly high energy consumption. Inter-cluster communication also introduces many short idle cycles, thereby significantly increasing the overall leakage energy consumption in the functional units. The trend towards miniaturization of devices (and associated reduction in threshold voltage) makes energy consumption in interconnects and functional units even worse, and limits the usability of clustered architectures in smaller technologies. However, technological advancements now permit the design of interconnects and functional units with varying performance and power modes. In this paper, we propose scheduling algorithms that aggregate the scheduling slack of instructions and communication slack of data values to exploit the low-power modes of functional units and interconnects. Finally, we present a synergistic combination of these algorithms that simultaneously saves energy in functional units and interconnects to improve the usability of clustered architectures by achieving better overall energy–performance trade-offs. Even with conservative estimates of the contribution of the functional units and interconnects to the overall processor energy consumption, the proposed combined scheme obtains on average 8% and 10% improvement in overall energy–delay product with 3.5% and 2% performance degradation for a 2-clustered and a 4-clustered machine, respectively. We present a detailed experimental evaluation of the proposed schemes. Our test bed uses the Trimaran compiler infrastructure.

Highlights

► Clustered processors solve the scalability problem and are popular in embedded domains.
► This paper proposes energy-aware scheduling algorithms for clustered architectures.
► The proposed algorithm aggregates the scheduling slack and the communication slack.
► It utilizes low-power modes for functional units and interconnects exploiting the aggregated slack.
► Our detailed experimental study shows significant savings for clustered architectures.

Introduction

The proliferation of embedded systems has opened up many new research issues. Design challenges posed by embedded processors are ostensibly different from those offered by general-purpose systems. Apart from very high performance, they also demand low power consumption, low cost, and less chip area to be practical. The ever-increasing trend towards miniaturization of devices makes utilizing huge transistor budgets in a manner that enables high clock speed, low design complexity, and less energy consumption [31] even more challenging. However, resolving this challenge can enable the deployment of embedded systems for many performance-demanding new embedded applications at a lower cost. Another challenge posed by this technological advancement is the rising level of leakage energy consumption in the logic. The increase in transistor density requires reducing the supply voltage in order to operate the circuit reliably. The reduction in supply voltage also requires a reduction in the threshold voltage in order to maintain the speedup, and this leads to an exponential rise in the leakage component of the energy consumption [34]. With the 65 nm and smaller technologies currently in fabrication, the leakage energy is on par with the dynamic energy consumption. In future technologies, the leakage energy will further dominate the overall energy consumption [52].

Distribution or clustering is the common design theme that is being employed in one form or another to meet these challenges. The basic idea is to design simpler and smaller components and put together a collection of these components interconnected using a communication fabric. Smaller components are simpler to design, enable faster clock speed, and incur less energy consumption. Different architectural philosophies [12], [6], [51], [21], [29], [50] have used distribution in its varied forms to tackle the scalability problem in the past. This trend is also expected to continue in the future with the ever-growing number of transistors on the chip.

Clustered VLIW architectures [7], [14], [12] apply the clustering philosophy in the context of VLIW architectures. These architectures are being widely adopted in embedded domains because they overcome the scalability problem associated with centralized VLIW architectures. A clustered VLIW architecture [12] has more than one register file and connects only a subset of functional units to a register file (see Fig. 1, Fig. 2). Groups of small computation clusters can be interconnected using some interconnection topology, and communication can be enabled using any of the various inter-cluster communication models [53]. Clustering avoids the area and power consumption problems of centralized register file architectures while retaining high clock speed, which can be leveraged to get better performance. Texas Instruments' VelociTI [54], HP/ST's Lx [13], Analog Devices' TigerSHARC [16], and BOPS' ManArray [47] are examples of architectures developed on the basis of the clustered ILP philosophy. IBM's eLite [9] is a research proposal for a novel clustered architecture. Clustered VLIW architectures continue to be popular in embedded domains and are part of some of the most popular and recent chips planned to power smart phones and tablets [45], apart from their presence in low-end phones [49].

Though clustering helps to combat the scalability problem by making components simpler and thereby increasing the clock rate and reducing the dynamic energy consumption of functional components, an interconnection network is required for the communication of data values among different components. This communication in clustered architectures happens over long wires having high load capacitance, which in effect takes more time and incurs more energy consumption [31], [19]. This problem is becoming increasingly severe with each evolving process technology. As a result, clustered architectures are becoming more communication bound in terms of performance and energy consumption. Apart from the interconnects, functional units are another major source of energy consumption in clustered architectures. Frequent accesses to functional units raise the temperature level and make the leakage energy consumption, which is specifically a concern in smaller technologies, even worse. Moreover, the contention for a limited number of slow interconnects leads to many short idle cycles, and this further increases the leakage energy consumption in functional units.

Clustered VLIW architectures rely on compile-time scheduling. Static scheduling simplifies the issue logic by eliminating the need for dedicated scheduling hardware. Thus, a significant fraction of the total energy consumption in clustered VLIW architectures is attributed to interconnects and functional units. Although the exact percentage depends upon the architecture and circuit details, earlier studies report that a very high percentage (25–30%) of total processor energy consumption is attributed to interconnects. Similarly, a large fraction (30–35%) of static energy consumption in a VLIW architecture is attributed to functional units [23]. An architecture-level model developed in [5] also confirms that the leakage energy consumption in functional units constitutes a noticeable fraction of the overall processor leakage energy consumption despite having a smaller transistor count compared to the caches. Thus, optimizing energy in interconnects and functional units in clustered architectures is becoming increasingly important from one process generation to another.

However, the functional units and interconnects are often underutilized in clustered VLIW architectures. Apart from other usual causes such as data dependencies, the underutilization of functional units is also due to the contention for a limited number of slow interconnect channels that introduces many short idle cycles for functional units. At the same time, since the functional units are distributed among clusters, there is also more contention for functional resources, which leads to underutilization of interconnects. Finally, the contention for functional and interconnect resources in clustered VLIW architecture combines in a synergistic fashion and leads to greater available slack in clustered architectures as compared to VLIW architectures.

The advancements in VLSI technology now enable designing interconnects and functional units with different power and performance modes. For example, [4], [35] show that, using 45 nm technology, it is possible to design wires consuming 1/5 the energy but having twice the delay. [3] proposes using an interconnect composed of wires with different characteristics to improve the $ED^{2}$ of a superscalar processor. Similarly, the capabilities of dual-threshold domino logic with sleep mode (that can transition between active mode and sleep mode and vice versa without any performance penalty [24] but with moderate energy penalty) can be utilized to perform leakage energy management for short idle cycles in functional units. One such purely hardware-based scheme in the context of a superscalar architecture is due to Dropsho et al. [11]. Their scheme puts any integer ALU into low-leakage mode after one cycle of idleness. Their results confirm the benefits of such an aggressive scheme in smaller technologies. However, being a purely hardware-based scheme, the benefits are severely (on average, by 30%) affected by frequent transitions from active mode to sleep mode and vice versa because of many short idle periods.
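The effect of many short idle periods on a sleep-after-idle policy can be illustrated with a minimal sketch. It is not the scheme of Dropsho et al. [11] itself: it assumes a hypothetical policy in which a unit stays active for the first idle cycle of a run and sleeps from the second idle cycle onward, and the function names (`idle_runs`, `sleep_stats`, `net_saving`) and the energy values (one unit of leakage saved per sleep cycle, two units per sleep/wake pair) are illustrative assumptions, not measured figures.

```python
from itertools import groupby


def idle_runs(busy):
    """Lengths of maximal runs of idle cycles (False entries) in a busy/idle trace."""
    return [len(list(group)) for is_busy, group in groupby(busy) if not is_busy]


def sleep_stats(busy, threshold=1):
    """Assumed policy: the unit sleeps once it has been idle for `threshold` cycles.

    Returns (cycles spent asleep, number of sleep/wake transition pairs).
    """
    sleep_cycles = 0
    transitions = 0
    for run in idle_runs(busy):
        if run > threshold:
            sleep_cycles += run - threshold
            transitions += 1
    return sleep_cycles, transitions


def net_saving(busy, leak_saved_per_cycle=1.0, transition_cost=2.0, threshold=1):
    """Leakage energy saved while asleep minus the assumed cost of transitions."""
    sleep_cycles, transitions = sleep_stats(busy, threshold)
    return sleep_cycles * leak_saved_per_cycle - transitions * transition_cost


# Two hypothetical 10-cycle traces with the same 4 busy cycles.
B, I = True, False
fragmented = [B, I, I, B, I, I, B, I, I, B]    # three short idle periods
consolidated = [B, B, B, B, I, I, I, I, I, I]  # one long idle period

for name, trace in (("fragmented", fragmented), ("consolidated", consolidated)):
    print(name, sleep_stats(trace), net_saving(trace))
```

Under these assumed costs, the fragmented trace pays for three transition pairs to gain three sleep cycles, while the consolidated trace pays for one pair to gain five. The same amount of idleness is worth more when it is contiguous.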

In this paper, we propose a compiler-directed approach that leverages these advancements in VLSI technology to improve the usability of clustered VLIW architecture in smaller technologies, targeting the two major sources of energy consumption, namely interconnects and functional units. Although there has been some work in the past to reduce the leakage energy consumption in functional units in the context of superscalar and VLIW architectures, to the best of our knowledge, there has been no such work in the context of clustered VLIW architectures specifically targeting smaller technologies. Regarding interconnects, the primary focus of research has been to reduce the latency of communication. We are not aware of any work that targets reducing the energy consumption in interconnects in clustered VLIW architectures. In the context of inter-cluster communication, we limit our focus to the most popular inter-cluster communication models [53], such as explicit inter-cluster communication through inter-cluster move instructions and extended operand inter-cluster communication models [53] found in commercial clustered processors such as Texas Instruments' VelociTI [54] and HP/ST's Lx [13]. The novelty of our approach also lies in an integrated scheduling algorithm that simultaneously reduces the energy consumption in functional units as well as interconnects. The contention for a limited number of functional and communication resources in a clustered VLIW architecture leads to increased cycles of execution on a clustered machine as compared to an equivalent VLIW machine. Our approach aggregates the scheduling slack of instructions and communication slack of data values in a synergistic fashion to convert the inherent idleness of functional and communication resources in a clustered architecture to energy gains. The major contributions of our approach can be stated as follows.

• We provide a scheduling algorithm for clustered VLIW architectures that exploits the scheduling slacks with an aim of reducing the number of transitions and associated overheads, thereby significantly improving the leakage energy consumption compared to the underlying architectural scheme.
• We provide another scheduling scheme for clustered architectures that exploits the communication slack of data values and the scheduling slack of instructions to reduce the energy consumption in interconnects while achieving better performance for clustered architectures. The proposed scheme provides performance comparable to dual-bandwidth clustered architectures at nearly half the energy cost.
• We provide an integrated scheme that simultaneously exploits the scheduling slack of instructions and the communication slack of data values to achieve better overall energy savings. This scheme converts any inherent performance loss due to contention for communication and computation resources into energy benefits.
• We have significantly extended the Trimaran Compiler Framework to faithfully model different clustered VLIW configurations and inter-cluster communication models. We have implemented these schemes in the extended Trimaran framework. We present a detailed performance analysis based on experimental evaluation of these algorithms for different clustered VLIW configurations and technology nodes. We specifically discern the benefits of a compiler-based scheme as compared to a hardware-only scheme and compare our results with those of some of the earlier algorithms. Readers interested in results in the restricted but more realistic context of commercially available real clustered machines such as C6X are referred to some of our earlier work [38], [37].
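The idea of exploiting communication slack can be sketched in a few lines. This is a simplified illustration under stated assumptions, not the scheduling algorithm of this paper. It assumes two hypothetical interconnect modes whose relative costs follow the 45 nm wire example cited in the introduction (the slow mode has twice the delay and one fifth the energy). It also ignores bus contention, and the names `Transfer`, `pick_mode`, and `total_energy` are invented for the example.

```python
from dataclasses import dataclass

# (name, latency in cycles, relative energy per transfer); values are assumptions.
FAST = ("fast", 1, 1.0)
SLOW = ("slow", 2, 0.2)  # twice the delay, one fifth the energy


@dataclass
class Transfer:
    ready: int   # cycle at which the producer's value becomes available
    needed: int  # cycle at which the consumer is scheduled to issue


def pick_mode(transfer):
    """Use the slow mode only when the communication slack hides its latency."""
    slack = transfer.needed - transfer.ready
    return SLOW if slack >= SLOW[1] else FAST


def total_energy(transfers):
    """Relative interconnect energy when each transfer uses its chosen mode."""
    return sum(pick_mode(t)[2] for t in transfers)


# Hypothetical inter-cluster transfers taken from some schedule.
transfers = [Transfer(ready=0, needed=1),  # no spare slack: must use the fast mode
             Transfer(ready=0, needed=3),  # two spare cycles: slow mode fits
             Transfer(ready=2, needed=4)]  # exactly enough slack for the slow mode

print([pick_mode(t)[0] for t in transfers])
print(total_energy(transfers), "vs all-fast", len(transfers) * FAST[2])
```

In this sketch the consumer issue cycles are fixed, so the schedule length is unchanged and only transfers with enough slack move to the low-energy mode. A real scheduler would also have to account for contention on each bus.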

It is important to mention here that the work and experimental results presented in this paper also focus on interconnect energy saving and integrated interconnect and functional unit energy savings. These results go significantly beyond some of the initial results presented in one of our earlier works [40] that focuses solely on scheduling to save energy in functional units. Additionally, in this paper, we also present results of savings offered by different algorithms across different technology nodes.

The rest of the paper is structured as follows. Section 2 describes the motivation for this work and presents some experimental evidence. Section 3 describes different scheduling algorithms for leakage energy management in functional units, energy optimization in interconnects, and the combined scheme to optimize energy in functional units as well as interconnects. Section 3 also describes the scheduler implementation in detail. Section 4 describes the scheduling algorithm with the help of examples. Section 5 describes our experimental setup, results, and a detailed analysis of results. Section 6 describes the related work in the area of scheduling for clustered architectures, energy-aware scheduling for VLIW architectures, architectural approaches for leakage energy management, and efficient interconnect design. We conclude in Section 7 with pointers to future directions.

Motivation

VLIW and clustered VLIW architectures are optimized for peak performance in order to meet the real-time performance requirements of embedded applications. However, the functional units are underutilized due to the inherent variations in the ILP of the programs. The idleness is even more pronounced for a clustered VLIW architecture because of the contention for a limited number of slow interconnects, which manifests itself in the form of many short idle cycles. The graph titled ‘Base’ in Fig. 3

The scheduling algorithm

The Elcor backend of the Trimaran infrastructure has a cycle scheduling algorithm designed and implemented for flat VLIW architectures [55], [1]. We have modified this algorithm to perform leakage energy optimization for clustered VLIW architectures. Another loop has been added inside the main scheduling loop of the cycle scheduler to perform cluster scheduling in an integrated fashion. The integrated approach [46], [28], [20] to cluster scheduling makes the cluster assignment decision during

Examples

In this section, we present two examples that illustrate how the available slack of instructions and communications is exploited by the proposed scheduling Algorithms 2 and 1, respectively, to get energy benefits without hurting performance.

Fig. 5 shows a portion of a data dependency graph and Fig. 7(a) shows two possible schedules for this dependency graph. We assume a two-clustered machine with each cluster having an adder, a multiplier, and a fast communication bus. Schedule 1 has ADD1 and

Setup

We have used the Trimaran suite [55] for our experimentation. Trimaran was developed to conduct state-of-the-art research in compilation techniques for ILP architectures with a specific focus on VLIW class of architectures. We have modified the Trimaran suite to generate and simulate code for a variety of clustered VLIW configurations. The machine description module has been upgraded to describe various clustering-related parameters such as the number of clusters, number and types of functional

Related Work

In this section, we briefly describe the earlier work done in the area of instruction scheduling for clustered architectures, architectural approaches for leakage energy management, energy-aware scheduling for VLIW architectures, and efficient cross-path design.

Conclusions and future directions

In this work, we have proposed energy-aware instruction scheduling algorithms that exploit the instruction slack and the communication slack to save energy in two major energy hungry components of clustered VLIW architecture, namely functional units and interconnects. We have also proposed a combined scheduling algorithm that simultaneously saves energy in functional units and interconnects. A detailed experimental evaluation using the Trimaran framework confirms that the proposed schemes are

Rahul Nagpal received his MS and Ph.D. in computer science from the Indian Institute of Science in 2004 and 2008, respectively. His primary research interests are performance–power tradeoffs and energy-aware decentralized computing, with a special focus on clustered and multi-core processors.

References (59)

S.G. Abraham, W.M. Meleis, I.D. Baev, Efficient backtracking instruction schedulers, in: Proc. of Intl. Conf. on...
A. Aleta, J.M. Codina, J. Sanchez, A. Gonzalez, Graph-partitioning based instruction scheduling for clustered...
R. Balasubramonian, N. Muralimanohar, K. Ramani, V. Venkatachalapathy, Microarchitectural wire management for...
K. Banerjee, A. Mehrotra, A power-optimal repeater insertion methodology for global interconnects in nanometer designs,...
J.A. Butts et al., A static power model for architects
R. Canal, J.M. Parcerisa, A. Gonzalez, Dynamic cluster assignment mechanisms, in: Proc. of Sixth IEEE Intl. Symp. on...
A. Capitanio et al., Partitioned register files for VLIWs: a preliminary analysis of tradeoffs
M. Chu et al., Region-based hierarchical operation partitioning for multicluster processors, SIGPLAN Notices (2003)
J. Derby, J. Moreno, A high-performance embedded DSP core with novel SIMD features, in: Proc. of 2003 Intl. Conf. on...
G. Desoli, Instruction assignment for clustered VLIW DSP compilers: a new approach, Technical Report, Hewlett-Packard,...
S. Dropsho et al., Managing static leakage energy in microprocessor functional units
P. Faraboschi, G. Brown, J.A. Fisher, G. Desoli, Clustered instruction-level parallel processors, Tech. Rep.,...
P. Faraboschi, G. Brown, J.A. Fisher, G. Desoli, F. Homewood, Lx: a technology platform for customizable VLIW embedded...
K.I. Farkas et al., The multicluster architecture: reducing cycle time through partitioning
K. Flautner et al., Drowsy caches: simple techniques for reducing leakage power
J. Fridman et al., The TigerSHARC DSP architecture, IEEE Micro (2000)
B.M.-S. Gokhan Memic, W. Hu, NetBench: a benchmarking suit for network processor, CARES Technical...
A. Gonzalez, Joan-Manuel Parcerisa, Julio Sahuquillo, J. Duato, Efficient interconnects for clustered...
R. Ho et al., The future of wires, Proceedings of the IEEE (2001)
K. Kailas, A. Agrawala, K. Ebcioglu, CARS: a new code generation framework for clustered ILP processors, in: Proc. of...
U. Kapasi et al., The imagine stream processor
S. Kaxiras et al., Cache decay: exploiting generational behavior to reduce cache leakage power
H.S. Kim, N. Vijaykrishnan, M. Kandemir, M.J. Irwin, Adapting instruction level parallelism for optimizing leakage in...
V. Kursun et al., Low swing dual threshold voltage domino logic
V.S. Lapinskii et al., Cluster assignment for high-performance embedded VLIW processors, ACM Transactions on Design Automation of Electronic Systems (2002)
C. Lee, M. Potkonjak, W.H. Mangione-Smith, MediaBench: a tool for evaluating and synthesizing multimedia and...
W. Lee, D. Puppin, S. Swenson, S. Amarasinghe, Convergent scheduling, in: Proc. of Intl. Symp. on Microarchitecture,...
R. Leupers, Instruction scheduling for clustered VLIW DSPs
P. Marcuello et al., Clustered speculative multithreaded processors

Y.N. Srikant received his B.E. in Electronics from Bangalore University, and M.E. and Ph.D. in Computer Science from the Computer Science and Automation department of the Indian Institute of Science. His area of interest is compiler design.

He is the editor of a handbook on advanced compiler design published by CRC Press in 2002 and 2008 (2nd ed.). His most recent research includes compilation for sensor networks, compiler optimizations for power reduction in embedded systems, efficient profiling techniques, and performance estimation of programs through program analysis.

He is currently a Professor in the Department of Computer Science and Automation at the Indian Institute of Science in Bangalore.
