Elsevier

Integration

Volume 39, Issue 2, March 2006, Pages 131-155
Low power synthesizable register files for processor and IP cores

https://doi.org/10.1016/j.vlsi.2004.08.001

Abstract

In this paper, low power architectures of register files at the register-transfer level (RTL) are presented. The proposed architectures are implemented using a standard hardware description language (HDL) and can be synthesized within a commercial semi-custom design flow. The presented register file architectures are ideally suited for synthesizable processor cores or IP blocks.

It is shown that significant power savings can be achieved in register files if a clock gating scheme different from the one usually applied is used. As an alternative, an architecture with register isolation is presented. The third proposed register file architecture is based on interleaving, known from signal processing implementations. Although interleaving is usually applied to multichannel algorithms, it is shown that this architecture can also be applied to certain single channel cases. Experimental results for all three register file architectures show that a significant power reduction can be achieved.

Introduction

In the last decade, the design methodology for complex CMOS circuits has moved from the traditional full custom design style to an automated semi-custom design flow. The main reason for this structural change is design productivity, which must cope with the exponential increase in design complexity according to Moore's Law. Even with the introduction of an automated semi-custom design flow, the ever increasing design complexity has made further productivity improvements necessary, see ‘International Technology Roadmap for Semiconductors’ [1]. To achieve this, design reuse with synthesizable processors and IP cores has become the focus of attention in recent years [2]. These cores are implemented as soft macros using a hardware description language (HDL) such as Verilog or VHDL. The basic concept behind the reuse methodology with IP cores is a technology-independent design without any reference to the target technology. The design can be transferred easily by synthesis of the HDL model from one process technology to the next, from one ASIC vendor to another, or from a semi-custom flow to an FPGA-based implementation for prototyping or vice versa. This portability is a very important aspect for this paper. In order to guarantee maximum reusability of soft macros, design rules for the HDL description are more restrictive than the rules which can be used in a semi-custom design flow in general.

As mentioned above, the design methodology has changed in recent years, while at the same time power dissipation has become a severe constraint on the physical level of a VLSI implementation due to increasing design complexity [3], [4], [5]. For mobile systems, the maximum operating time between battery recharges depends on the power dissipation of the device. For high performance systems, the costs for heat removal and packaging caused by significant power dissipation have become a major concern [6], [7]. In this paper, low power architectures of register files that are part of synthesizable processors or IP cores are studied. Register files are widely used, e.g., for FIFOs, data transfer buffers, elastic buffers, memories and for the storage of state values of signal processing applications like digital filters. Another classical application of register files is in processors [8]. The register files considered here are synthesized as part of the RTL core in a semi-custom design flow. Alternatively, a RAM could be used instead. As stated in the ‘Reuse Methodology Manual’ [9], RAMs are designed on the physical level like hard macros, so they are very technology dependent. This contradicts the goal of technology independence of soft macros, which are the focus of this work. Portability is an important issue, especially for external IP cores which an IP customer is not intended to modify. For example, if the design including the IP cores is transferred to an FPGA prototyping environment for emulation and is finally implemented in a semi-custom product, the use of synthesizable register files is beneficial. Therefore, synthesizable register files which are part of the design at RTL are often preferred to RAM if the area penalty is not too high. Other reasons to prefer synthesizable register files to RAMs are the maximum achievable clock rate and further design flow issues, which are discussed in detail in Section 2.

All register file architectures discussed in this paper use clock gating for register addressing, which is a standard methodology for low power register files. Clock gating is nowadays very well integrated into semi-custom design flows [10] and is often used to reduce the power dissipation of disabled modules [11], [12]. Automated clock gating inserts dedicated clock gating cells into the clock path such that edge-triggered flip-flops can be disabled by keeping their clock input signal constant. For a disabled flip-flop, the power dissipation due to changes of the data signal depends strongly on the state of the clock signal [13]. It is shown in Section 4 that, in contrast to usual sequential circuits, different clock gating schemes [11] lead to significant differences in power dissipation if they are applied to register files.
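The effect of clock gating on register addressing can be illustrated with a small behavioral model. The following Python sketch is not taken from the paper. It assumes a common latch-plus-AND gating cell (latch transparent while the clock is low, gated clock held low when disabled) and a register file in which only the addressed register receives a clock edge. The function names `gated_clock`, `rising_edges` and `clock_edges_at_flipflops` are hypothetical.

```python
def gated_clock(clk_samples, enable_samples):
    """Behavioral model of a latch + AND clock gating cell (assumed structure).

    The latch is transparent while clk is low and holds its value while clk
    is high, so enable changes during the high phase cannot cut a pulse short.
    """
    latched = 0
    out = []
    for clk, en in zip(clk_samples, enable_samples):
        if clk == 0:
            latched = en          # transparent phase of the latch
        out.append(clk & latched)  # AND gate keeps the output low when disabled
    return out


def rising_edges(signal):
    """Count 0 -> 1 transitions in a sampled signal."""
    return sum(1 for a, b in zip(signal, signal[1:]) if a == 0 and b == 1)


def clock_edges_at_flipflops(writes, num_regs, width, gated):
    """Count clock edges arriving at flip-flop clock pins.

    writes: one entry per cycle, a register address or None for no write.
    Without gating every flip-flop is clocked in every cycle; with gating
    only the flip-flops of the addressed register are clocked.
    """
    edges = 0
    for addr in writes:
        if gated:
            edges += width if addr is not None else 0
        else:
            edges += num_regs * width
    return edges


clk = [0, 1, 0, 1, 0, 1, 0, 1]
enable = [1, 1, 0, 0, 1, 1, 0, 0]
gclk = gated_clock(clk, enable)        # [0, 1, 0, 0, 0, 1, 0, 0]
pulses = rising_edges(gclk)            # 2 of the 4 clock pulses pass

writes = [0, 3, None, 1]               # hypothetical write sequence
ungated = clock_edges_at_flipflops(writes, num_regs=8, width=16, gated=False)
gated = clock_edges_at_flipflops(writes, num_regs=8, width=16, gated=True)
```

In this toy example the ungated register file clocks 512 flip-flop pins over four cycles, while the gated one clocks 48. The model only counts clock activity and says nothing about the data-dependent power of disabled flip-flops, which is the subject of Section 4.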

In Section 5, we propose a register file architecture using blocking logic in order to reduce power dissipation. Our method is similar to operand isolation, which sets the operands or primary inputs of arithmetic units to a predefined value if the output values are not used in the next clock cycle [11], [14]. In this work, however, we isolate the data inputs of all registers with blocking logic so that transitions of the data bus are not propagated to the data inputs of all registers, which yields a significant reduction of power dissipation. This blocking logic is divided into partitions. For RAMs, various concepts for partitioning are well known. For example, recent research focuses on the address logic [15], [16] but not on the data inputs as considered here.

Interleaving, known from multichannel implementations of algorithms [17], [18], [19], [20] as a means to reduce area, is extended to register files in Section 6. This extension is not self-evident for two reasons. Firstly, unlike in a usual register file or RAM, the data values cannot be accessed randomly in a register file architecture with interleaving. Secondly, the power dissipation is generally increased by interleaving because subsequent data samples of different channels are not correlated, see [21]. Nevertheless, it is shown here that the power dissipation of synthesizable register files with interleaving can be reduced.
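To make the interleaving idea concrete, the following Python sketch models a simple accumulator (one addition and one unit delay per channel) and its interleaved counterpart. It assumes the common construction in which the single unit delay is replaced by a chain of n delays shared by n channels. This construction and the names `accumulate` and `interleaved_accumulate` are illustrative assumptions, not the paper's implementation.

```python
from collections import deque


def accumulate(samples):
    """Single channel: y[k] = x[k] + y[k-1], one adder and one unit delay."""
    state, out = 0, []
    for x in samples:
        state = x + state
        out.append(state)
    return out


def interleaved_accumulate(channels):
    """n channels share one adder; the unit delay becomes a chain of n delays.

    The delay chain is read and written in a fixed order, so its contents
    cannot be addressed randomly as in a normal register file.
    """
    n = len(channels)
    delay_line = deque([0] * n)              # n registers in a fixed sequence
    out = [[] for _ in range(n)]
    for k in range(len(channels[0])):
        for ch in range(n):                  # samples arrive channel by channel
            y = channels[ch][k] + delay_line.popleft()
            delay_line.append(y)
            out[ch].append(y)
    return out


ch0 = [1, 2, 3]
ch1 = [10, 20, 30]
result = interleaved_accumulate([ch0, ch1])  # [[1, 3, 6], [10, 30, 60]]
```

The interleaved version produces the same per-channel outputs as two separate accumulators while using a single adder. Consecutive values passing through the shared delay chain belong to different channels, which is why their correlation, and hence the switching activity, differs from the single channel case.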

This paper is organized as follows. In Section 2, the limitations on a semi-custom design flow due to IP reuse concerning low power design for register files and memory are discussed. The power dissipation of flip-flops and synthesizable register files with special emphasis on clock gating is presented in Section 3. Based on these results, three different register file architectures are proposed to reduce the power dissipation. In Section 4, we show that the application of a certain clock gating scheme is especially suited for synthesizable register files. As an alternative to this proposal, a register file architecture with partitioning is given in Section 5. Finally, for signal processing applications, a register file architecture with interleaving for multichannel algorithms is examined in Section 6. It is shown that this approach can be extended to single channel applications in certain cases. The conclusions of this work are summarized in Section 7, which also addresses the question of whether and how the proposed register file architectures should be combined.

Section snippets

Synthesizable processors and IP cores

In industry, modeling guidelines for IP cores have evolved, which may or may not be published. StarCore, a DSP IP provider, published a white paper titled ‘Four golden rules for High-Quality Soft Macros’ [22]. Michael Keating from Synopsys and Pierre Bricaud from Mentor Graphics authored the book ‘Reuse Methodology Manual’ [9]. The Cadence SoC group offers similar guidelines to its customers [23]. Within Infineon Technologies AG, HDL coding guidelines for design reuse and for macro development exist

Power dissipation of synthesizable register files

In order to compare the register file architectures proposed in the following sections with conventional synthesizable register files concerning power dissipation, a power model for library flip-flops being the basic component of register files is presented in Section 3.1. With these results the power dissipation of a conventional register file is computed in Section 3.2.
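For orientation, the textbook first-order expression for the dynamic switching power of a CMOS node is given below. The flip-flop power model of Section 3.1 is not reproduced in this excerpt, so this expression is only a generic reference point and not the paper's model. The symbols are assumptions: $\alpha$ is the switching activity of the node, $C_L$ the switched capacitance, $V_{DD}$ the supply voltage and $f_{\mathrm{clk}}$ the clock frequency.

```latex
P_{\mathrm{dyn}} = \alpha \, C_L \, V_{DD}^{2} \, f_{\mathrm{clk}}
```

Under this expression, the techniques studied here can be read as ways of lowering $\alpha$ at the clock pins or the data inputs of the flip-flops.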

Low power register files using clock gating with a logic high disabled clock

Clock gating can be implemented in two ways. The clock input of the registers is disabled by the clock gating cell either with a logic high or low signal. Both versions are discussed in detail in [11]. For a logic low disabled clock, the clock gating cell can be implemented using an AND gate and a latch denoted as scheme 1 in [11], see Fig. 2a. The AND gate switches off the clock signal during the second half of the period, while the latch switches off the clock signal in the first half of the

Low power register files using register isolation

In this section, a method is presented to reduce the power dissipation of the register file due to transitions at the data input of the flip-flops. The method presented here can be used alternatively to the clock gating method with a logic high disabled clock of the previous section. This alternative is useful, if the method of the previous section cannot be applied.
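The effect of isolating the data inputs can be sketched with a simple transition count. The Python model below is an illustrative assumption, not the paper's implementation. It assumes AND-style blocking logic that forces the data inputs of all registers outside the addressed partition to 0, and it counts bit toggles at the data inputs of all registers. The names `hamming` and `data_input_toggles` and the partition size are hypothetical.

```python
def hamming(a, b):
    """Number of bit positions in which two words differ."""
    return bin(a ^ b).count("1")


def data_input_toggles(bus_values, write_addrs, num_regs, part_size, isolated):
    """Count bit toggles at the D inputs of all registers.

    Without isolation every register sees every data bus value. With
    isolation only the partition containing the addressed register sees the
    bus; all other D inputs are held at 0 (assumed blocking value).
    """
    prev = [0] * num_regs
    toggles = 0
    for value, addr in zip(bus_values, write_addrs):
        for reg in range(num_regs):
            if isolated:
                selected = (addr is not None
                            and reg // part_size == addr // part_size)
                d_in = value if selected else 0
            else:
                d_in = value
            toggles += hamming(prev[reg], d_in)
            prev[reg] = d_in
    return toggles


bus = [0xFF, 0x00, 0xFF, 0x00]   # worst-case toggling of an 8-bit data bus
addrs = [0, 0, 0, 0]             # all writes go to register 0
plain = data_input_toggles(bus, addrs, num_regs=8, part_size=2, isolated=False)
blocked = data_input_toggles(bus, addrs, num_regs=8, part_size=2, isolated=True)
```

In this example the count drops from 256 toggles without isolation to 64 with isolation, because only the two registers of the addressed partition follow the bus. Note that forcing inputs to 0 also causes transitions whenever the addressed partition changes, so the benefit depends on the access pattern and on the partition size.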

Low power register files using interleaving

In this section, the implementation of multichannel algorithms is considered. Data interleaving of n identical channels is a commonly used method to reduce the number of arithmetic units in parallel architectures [20]. An example with the block diagram of an algorithm with two channels is shown in Fig. 7a. In our example, the block diagram consists of additions and delays by one sample, denoted as unit delays D or z^-1. The block diagram with interleaving of n identical channels is obtained by

Conclusion

We presented several architectures to reduce the power consumption of synthesizable register files in RTL designs. Three architectures have been proposed: (1) register file architecture with clock gating using a logic high disabled clock, (2) register file architecture with register isolation and (3) a register file architecture with interleaving. It was shown that with register file architecture (1) a significant power reduction has been achieved, if the data transition probability at the data

References (27)

  • ...
  • ...
  • A.P. Chandrakasan et al., Minimizing power consumption in digital CMOS circuits, Proc. IEEE (1995)
  • G.K. Yeap, Practical Low Power VLSI Design (1998)
  • A.P. Chandrakasan et al., Low-power CMOS digital design, IEEE J. Solid-St. Circ. (1992)
  • S. Borkar, Low power design challenges for the decade
  • M. Pedram et al., Battery-powered digital CMOS design, IEEE Trans. VLSI Syst. (2002)
  • D.A. Patterson et al., Computer Organization and Design (1998)
  • M. Keating et al., Reuse Methodology Manual (1998)
  • I. Renu Mehra, Synopsys, power aware tools and methodologies for the basic industry
  • A. Raghunathan et al., Register transfer level power optimization with emphasis on glitch analysis and reduction, IEEE Trans. Comput. Aid. Design Integr. Circ. Syst. (1999)
  • F. Emnett, M. Biegel, Power reduction through RTL clock gating, in: Synopsys Users Group San Jose,...
  • T. Lang et al., Individual flip-flops with gated clocks for low power datapaths, IEEE Trans. Circ. Syst. II: Analog and Digital Signal Processing (1997)