Low power synthesizable register files for processor and IP cores
Introduction
In the last decade, the design methodology for complex CMOS circuits has moved from the traditional full-custom design style to an automated semi-custom design flow. The main reason for this structural change is design productivity, which must cope with the exponential increase in design complexity predicted by Moore's Law. Even with an automated semi-custom design flow, the ever-increasing design complexity has made further productivity improvements necessary; see the ‘International Technology Roadmap for Semiconductors’ [1]. To achieve this, design reuse with synthesizable processors and IP cores has become the focus of attention in recent years [2]. These cores are implemented as soft macros in a hardware description language (HDL) such as Verilog or VHDL. The basic concept behind the reuse methodology with IP cores is a technology-independent design without any reference to the target technology. By synthesizing the HDL model, the design can easily be transferred from one process technology to the next, from one ASIC vendor to another, or between a semi-custom flow and an FPGA-based implementation for prototyping. This portability is a central concern of this paper. To guarantee maximum reusability of soft macros, the design rules for the HDL description are more restrictive than the rules that generally apply in a semi-custom design flow.
As mentioned above, the design methodology has changed in recent years. At the same time, power dissipation has become a severe constraint on the physical level of a VLSI implementation due to increasing design complexity [3], [4], [5]. For mobile systems, the maximum operating time between battery recharges depends on the power dissipation of the device. For high-performance systems, the costs of heat removal and packaging caused by significant power dissipation have become a major concern [6], [7]. In this paper, low power architectures of register files that are part of synthesizable processors or IP cores are studied. Register files are widely used, e.g., for FIFOs, data transfer buffers, elastic buffers, memories and for the storage of state values in signal processing applications such as digital filters. Another classical application of register files is processors [8]. The register files considered here are synthesized as part of the RTL core in a semi-custom design flow. Alternatively, a RAM could be used. As stated in the ‘Reuse Methodology Manual’ [9], RAMs are designed on the physical level like hard macros, which makes them very technology dependent. This contradicts the goal of technology independence of soft macros, which are the focus of this work. Portability is an important issue, especially for external IP cores that an IP customer is not intended to modify. For example, if a design including the IP cores is transferred to an FPGA prototyping environment for emulation and is finally implemented in a semi-custom product, the use of synthesizable register files is beneficial. Therefore, synthesizable register files that are part of the design at RTL are often preferred to RAM if the area penalty is not too high. Other reasons to prefer synthesizable register files to RAMs are the maximum achievable clock rate and further design flow issues, which are discussed in detail in Section 2.
All register file architectures discussed in this paper use clock gating for register addressing, which is a standard methodology for low power register files. Clock gating is nowadays very well integrated into semi-custom design flows [10] and is often used to reduce the power dissipation of disabled modules [11], [12]. Automated clock gating inserts dedicated clock gating cells into the clock path so that edge-triggered flip-flops can be disabled by keeping their clock input signal constant. For a disabled flip-flop, the power dissipation due to changes of the data signal depends strongly on the level at which the clock signal is held [13]. It is shown in Section 4 that, in contrast to usual sequential circuits, different clock gating schemes [11] lead to significant differences in power dissipation when they are applied to register files.
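The behavior of a clock gating cell can be illustrated with a small behavioral model. The following Python sketch is not taken from the paper: the function names and the half-cycle sampling are hypothetical, and the logic high disabled variant is modeled with an OR gate and a latch, which is an assumption made here by analogy to the AND gate and latch cell for a logic low disabled clock.

```python
def gated_clock(clk, enable, idle_high=False):
    """Behavioral model of a latch-based clock gating cell (illustrative only).

    clk, enable: lists of 0/1 values sampled once per half clock cycle.
    idle_high=False: AND gate + latch, disabled clock is held at logic low.
    idle_high=True:  OR gate + latch (assumed), disabled clock is held at logic high.
    """
    latched = 0  # enable value stored in the latch
    out = []
    for c, en in zip(clk, enable):
        if idle_high:
            if c == 1:          # latch is transparent while the clock is high
                latched = en
            out.append(c | (1 - latched))
        else:
            if c == 0:          # latch is transparent while the clock is low
                latched = en
            out.append(c & latched)
    return out


def rising_edges(wave):
    """Count 0 -> 1 transitions, i.e. the edges that trigger a flip-flop."""
    return sum(1 for a, b in zip(wave, wave[1:]) if a == 0 and b == 1)


clk = [0, 1] * 4
low_idle = gated_clock(clk, [0, 0, 1, 1, 0, 0, 0, 0])
high_idle = gated_clock(clk, [0, 1, 1, 0, 0, 0, 0, 0], idle_high=True)
```

In both cases the flip-flop sees exactly one active clock edge for the single enabled cycle; the two variants differ only in the constant level at which the clock rests while the register is disabled.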
In Section 5, we propose a register file architecture that uses blocking logic to reduce power dissipation. Our method is similar to operand isolation, which sets the operands or primary inputs of arithmetic units to a predefined value if the output values are not used in the next clock cycle [11], [14]. In this work, however, we isolate the data inputs of the registers with blocking logic so that transitions of the data bus are not propagated to the data inputs of all registers, which yields a significant reduction of power dissipation. This blocking logic is divided into partitions. For RAMs, various partitioning concepts are well known. For example, recent research focuses on the address logic [15], [16] but not on the data inputs considered here.
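The effect of partitioned blocking logic on data input activity can be estimated with a simple toggle count. The sketch below is only an illustration under stated assumptions: the function names, the trace format, and the choice of forcing blocked partitions to the constant value 0 (as an AND-based blocking gate would) are hypothetical and not taken from the paper.

```python
def hamming(a, b):
    """Number of bit positions in which two integers differ."""
    return bin(a ^ b).count("1")


def data_input_toggles(bus_trace, num_regs, partition_size=None):
    """Count bit toggles at the data inputs of all registers (illustrative model).

    bus_trace: list of (address, data) per cycle; address None means no write,
               although the data bus may still change.
    partition_size=None: conventional register file, every register sees the bus.
    partition_size=k: blocking logic per group of k registers; only the group
               containing the addressed register sees the bus, all others see 0.
    """
    num_parts = 1 if partition_size is None else -(-num_regs // partition_size)
    seen = [0] * num_parts  # value currently present at each partition's inputs
    toggles = 0
    for addr, data in bus_trace:
        for p in range(num_parts):
            if partition_size is None:
                value, regs = data, num_regs
            else:
                active = addr is not None and addr // partition_size == p
                value = data if active else 0
                regs = min(partition_size, num_regs - p * partition_size)
            toggles += regs * hamming(seen[p], value)
            seen[p] = value
    return toggles


trace = [(0, 0xFF), (1, 0x00), (4, 0xFF), (None, 0x0F)]
conventional = data_input_toggles(trace, num_regs=8)
partitioned = data_input_toggles(trace, num_regs=8, partition_size=4)
```

For this hypothetical trace the partitioned variant produces fewer input toggles than the conventional one. The model ignores the power consumed by the blocking gates themselves, so it only indicates the trend, not the net saving.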
Interleaving is known from multichannel implementations of algorithms [17], [18], [19], [20], where it is used to reduce area, and is extended to register files in Section 6. This extension is not self-evident for two reasons. Firstly, unlike in a usual register file or RAM, the data values cannot be accessed randomly in a register file architecture with interleaving. Secondly, the power dissipation is generally increased by interleaving because subsequent data samples of different channels are not correlated, see [21]. Nevertheless, it is shown here that the power dissipation of synthesizable register files with interleaving can be reduced.
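The principle of interleaving can be sketched with a toy example. The following Python model is an assumption-laden illustration, not the architecture of Section 6: it uses a hypothetical per-channel accumulator y[k] = x[k] + y[k-1] and replaces its single unit delay by a delay line of n elements shared by n time-multiplexed channels.

```python
from collections import deque


def accumulate(channel):
    """Reference: one channel with a single unit delay, y[k] = x[k] + y[k-1]."""
    y, out = 0, []
    for x in channel:
        y = x + y
        out.append(y)
    return out


def accumulate_interleaved(channels):
    """Process n channels with one shared adder and a delay line of length n."""
    n = len(channels)
    delay_line = deque([0] * n)  # each unit delay becomes n delays
    # Interleave the samples: ch0[0], ch1[0], ..., ch0[1], ch1[1], ...
    interleaved = [x for samples in zip(*channels) for x in samples]
    out = [[] for _ in range(n)]
    for t, x in enumerate(interleaved):
        # The oldest stored value always belongs to the current channel,
        # so the delay line is read strictly in order, never at random.
        y = x + delay_line.popleft()
        delay_line.append(y)
        out[t % n].append(y)
    return out


channels = [[1, 2, 3], [10, 20, 30]]
result = accumulate_interleaved(channels)
```

The interleaved version returns the same per-channel results as running `accumulate` on each channel separately, while using a single adder. The strictly sequential access to the delay line reflects the remark above that data values cannot be accessed randomly.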
This paper is organized as follows. In Section 2, the limitations that IP reuse imposes on a semi-custom design flow with respect to low power design of register files and memory are discussed. The power dissipation of flip-flops and synthesizable register files, with special emphasis on clock gating, is presented in Section 3. Based on these results, three different register file architectures are proposed to reduce the power dissipation. In Section 4, we show that a certain clock gating scheme is especially suited for synthesizable register files. As an alternative to this proposal, a register file architecture with partitioning is given in Section 5. Finally, for signal processing applications, a register file architecture with interleaving for multichannel algorithms is examined in Section 6. It is shown that this approach can be extended to single channel applications in certain cases. The conclusions of this work are summarized in Section 7, which also addresses whether and how the proposed register file architectures should be combined.
Section snippets
Synthesizable processors and IP cores
In industry, modeling guidelines for IP cores have evolved, which may or may not be published. StarCore, a DSP IP provider, published a white paper titled ‘Four golden rules for High-Quality Soft Macros’ [22]. Michael Keating from Synopsys and Pierre Bricaud from Mentor Graphics authored the book ‘Reuse Methodology Manual’ [9]. The Cadence SoC group offers similar guidelines to its customers [23]. Within Infineon Technologies AG, HDL coding guidelines for design reuse and for macro development exist
Power dissipation of synthesizable register files
To compare the power dissipation of the register file architectures proposed in the following sections with that of conventional synthesizable register files, Section 3.1 presents a power model for library flip-flops, the basic component of register files. Based on these results, the power dissipation of a conventional register file is computed in Section 3.2.
Low power register files using clock gating with a logic high disabled clock
Clock gating can be implemented in two ways. The clock input of the registers is disabled by the clock gating cell either with a logic high or low signal. Both versions are discussed in detail in [11]. For a logic low disabled clock, the clock gating cell can be implemented using an AND gate and a latch denoted as scheme 1 in [11], see Fig. 2a. The AND gate switches off the clock signal during the second half of the period, while the latch switches off the clock signal in the first half of the
Low power register files using register isolation
In this section, a method is presented to reduce the power dissipation of the register file caused by transitions at the data inputs of the flip-flops. The method can be used as an alternative to the clock gating method with a logic high disabled clock described in the previous section. This alternative is useful if the method of the previous section cannot be applied.
Low power register files using interleaving
In this section, the implementation of multichannel algorithms is considered. Data interleaving of n identical channels is a commonly used method to reduce the number of arithmetic units in parallel architectures [20]. An example with the block diagram of an algorithm with two channels is shown in Fig. 7a. In our example, the block diagram consists of additions and delays by one sample, denoted as unit delays D or z⁻¹. The block diagram with interleaving of n identical channels is obtained by
Conclusion
We presented several architectures to reduce the power consumption of synthesizable register files in RTL designs. Three architectures have been proposed: (1) a register file architecture with clock gating using a logic high disabled clock, (2) a register file architecture with register isolation and (3) a register file architecture with interleaving. It was shown that with register file architecture (1) a significant power reduction is achieved, if the data transition probability at the data
References (27)
- ...
- ...
- et al., Minimizing power consumption in digital CMOS circuits, Proc. IEEE (1995)
- Practical Low Power VLSI Design (1998)
- et al., Low-power CMOS digital design, IEEE J. Solid-St. Circ. (1992)
- Low power design challenges for the decade
- et al., Battery-powered digital CMOS design, IEEE Trans. VLSI Syst. (2002)
- et al., Computer Organization and Design (1998)
- et al., Reuse Methodology Manual (1998)
- Synopsys, power aware tools and methodologies for the basic industry
- Register transfer level power optimization with emphasis on glitch analysis and reduction, IEEE Trans. Comput. Aid. Design Integr. Circ. Syst.
- Individual flip-flops with gated clocks for low power datapaths, IEEE Trans. Circ. Syst. II: Analog and Digital Signal Processing