Integration

Volume 49, March 2015, Pages 125-136

Synthesis-based design and implementation methodology of high-speed, high-performing unit: L2 cache unit design

https://doi.org/10.1016/j.vlsi.2014.10.001

Abstract

We propose a physical design methodology for synthesis using soft hierarchy, interior pin placement, pre-placing critical logic, and routing techniques on a very timing- and area-challenged unit, the L2 cache, with ~20 million synthesizable transistors. In any past and present standard design at IBM, this test case would stretch all front- and back-end design tools by two to three times due to data volume, congestion and timing criticality, which opens a new avenue to explore for Large Block Synthesis flow. The results confirm the ability to deliver a design in the shortest possible schedule at ~50% of Physical Design cost while still maintaining best-of-breed quality.

Introduction

Today’s high-speed, high-performing L2 cache unit design plays a crucial part in the microprocessor industry. Because of the physical limits on its area and its speed requirements, a high-speed and area-efficient L2 cache design demands custom macros, which require not only highly skilled custom circuit designers but also time to design. Fig. 1 shows a typical L2 cache unit micro-architecture, which contains four main functional areas: 1) cache array macros, 2) address macros, 3) data path macros and 4) control macros. The widths and depths of the array, address and data path macros depend on the micro-architecture of the cache unit. Traditionally, array macros are designed using full- and semi-custom approaches where every transistor is tailored for its specific application. Typically, the effort is on the order of 1–2 person-years (PY) per array. Similarly, most of the peripheral address and data path macros are hand-crafted custom or semi-custom macros. Although these data flow macros are not as critical as the arrays, area and timing pressure means they are still built with a custom design flow.

However, all of the random logic macros, especially for control areas, are built using automated tool flow, such as synthesis, in a typical microprocessor design methodology.

In the L2 cache unit, the ‘Array Banks’ are repetitive cache macros tiled together to build a cache function. This design demands about one-third of the highly skilled manpower for array design and two-thirds of the resources for the rest of the unit. This means that major reengineering is required when moving from one technology/design point to the next. This work aims to find a design methodology that improves the two-thirds of the unit area outside the arrays. In this paper, we present a Large Block Synthesis (LBS) design methodology and walk step by step through our proposed physical design methodology. We show how the development cycle and custom-circuit-design resources can be cut by ~50% while maintaining best-of-breed quality. We discuss the possible elimination of the timing and integration resources that are needed in today’s design methodology to deliver a high-quality, routed, timing-closed unit to the chip. We also explain how a synthesis-based design methodology improves power efficiency by looking at logic gates and drive strength across the macro boundary. We show how localized macro congestion can be resolved at the unit level by sharing routing layer resources. While we study the design methodology for the L2 unit, we explore, develop and recommend how pre-routing, pre-placement and ‘soft hierarchy’ (SH) techniques can be used as aids to resolve commonly encountered critical timing and routing issues. Lastly, we present data on different macro design topologies and show how macro design methodology has evolved over the last 16 years, as well as how LBS methodology is becoming the mainstream design flow at IBM.

Section snippets

Custom macro design approach

To meet design constraints such as area, timing and power, data flow macros most often need hand-crafted schematics and layout design. The custom design approach is extremely time consuming and can require as much as four times the development time of the synthesis flow. In general, a custom designer has to plan every detail, including physical placement of the design, up front. The problem with a custom design approach is that incorporating any late design change may occasionally cause a

Circuit design methodology

In this paper we present the L2 cache unit as a test case to produce data in support of our methodology development. We use IBM’s design tool flow and common library design components, including custom and growable arrays, which are briefly discussed below.

L2 cache unit design as test case

The L2 cache unit interfaces with both the core and the L3 cache. The part of the L2 cache unit that interfaces with the core operates at a 1:1 clock ratio, that is, at the same speed as the core, supporting high-performance core logic design. The rest of the L2 unit’s clock operates synchronously at 2:1, that is, at half of the core frequency, interfacing cache and directory support. The L2 cache unit represents an interesting test case for developing a synthesis methodology for the following

Synthesis methodology

A set of well-defined synthesis constraints (parms) is a prerequisite for closing timing and routing on any design. It is very important to write the logic so that it has good logical boundaries for timing closure and fits in the macro’s physical size with reasonable utilization for downstream Physical Design (PD) rule checks. A tool that is aware of power and timing optimization engines could cut design cycle time and the manual intervention needed for fixes. PDSrtl [1], a modified version of

Backend design rule check

In RLM (Random Logic Macro) design methodology, the layout is expected to be clean by construction. This means the routed design must pass all physical design verification (PDV) checks, such as DRC, LVS and YIELD, plus all electrical rule checks. However, it is not uncommon for routing constraints and congestion to leave the routed design with a handful of failures that need manual clean-up.
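The sign-off gate described above can be illustrated with a minimal sketch. This is not the authors' tool flow: the `CheckResult` type, the `signoff_status` function, the `ERC` label for the electrical rule checks, the `manual_fix_limit` threshold and all violation counts are hypothetical and chosen only for illustration. The sketch classifies a routed macro as clean, as needing manual clean-up of a handful of failures, or as needing another routing pass.

```python
from dataclasses import dataclass

# Check names follow the text; "ERC" is our shorthand for the electrical rule checks.
REQUIRED_CHECKS = ("DRC", "LVS", "YIELD", "ERC")


@dataclass(frozen=True)
class CheckResult:
    name: str        # which verification check produced this result
    violations: int  # number of failures the check reported


def signoff_status(results, manual_fix_limit=10):
    """Classify a routed macro from its verification results.

    manual_fix_limit is an assumed threshold for "a handful" of failures;
    it does not come from the paper.
    """
    by_name = {r.name: r.violations for r in results}
    missing = [c for c in REQUIRED_CHECKS if c not in by_name]
    if missing:
        # Clean by construction requires every check to have been run.
        raise ValueError(f"missing check results: {missing}")
    total = sum(by_name[c] for c in REQUIRED_CHECKS)
    if total == 0:
        return "clean"
    return "manual-cleanup" if total <= manual_fix_limit else "re-route"


# Made-up example: a few leftover routing failures after the router finishes.
results = [
    CheckResult("DRC", 3),
    CheckResult("LVS", 0),
    CheckResult("YIELD", 1),
    CheckResult("ERC", 0),
]
print(signoff_status(results))  # manual-cleanup
```

Treating a missing check as an error rather than as a pass reflects the requirement that the layout pass all checks, not merely the ones that happened to run.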

Bug fixes and design changes

While late bug fixes and new functional changes are commonly expected, a fully automated way to incorporate these Engineering Change Orders (ECOs) is an important factor in reducing turnaround time. This methodology gives designers several ways to implement these changes, such as schematic-based, VLSI Integration Model (VIM)-based and autoRouted-based approaches. Later in the design cycle, in phase 2, a fully automated way to perform these ECOs not only saves implementation time but also allows designers to work

Experimental results

To close the design, we focused on timing, area, power and congestion of the unit as discussed below.

Conclusions and future work

The L2 unit is routed and built in 22 nm SOI technology, POWER8™ (P8), with 13 layers of physical wiring. The traditional design is primarily done in static circuit design, with pulse mode being the default working mode for the master-slave latch as a power-reduction technique. Fig. 5 shows a traditional L2 unit design with macro partitioning, where all individual macros have been designed with either a custom, RLM or array design flow and then placed manually. Unit buffering and routing have been

References (9)

  • J. Friedrich

    Design methodology for the IBM Power7 microprocessor

    IBM J. Res. Dev

    (2011)
  • B. Stolt

    Design and implementation of the Power6 microprocessor

    IEEE J. Solid St. Circ

    (2008)
  • W. Liu

    Routing congestion estimation with real design constraints

    Proceedings of the 50th Annual Design Automation Conference

    (2013)
  • R. Berridge

    IBM POWER6 microprocessor physical design and design methodology

    IBM J. Res. Dev

    (2007)

Cited by (2)

  • A practical automated timing and physical design implementation methodology for the synchronous asynchronous interface and multi-voltage domain in high-speed synthesis

    2016, Microprocessors and Microsystems
    Citation Excerpt :

    Compared to high-effort custom design flow, our design flow saved almost ∼50% Physical Design (PD) resources as shown in Table 2. Additionally, savings can be even more if many macros are combined to make larger LBS for design productivity [1]. Fig. 15 shows the macro design methodology trend in the power series L2 cache unit (as an example) over the last 16 years by showing how we are moving away from the custom to RLM and to LBS based design methodology, where our proposed methodology enables multi-power and multi-clock domain in the automated physical design flow.

Mozammel Hossain (SM’13) received the B.S. degree in Electrical and Electronic Engineering from Bangladesh University of Engineering and Technology (BUET), Dhaka, Bangladesh, in 1989. He received the M.E. degree in Electrical Engineering from the City College of New York, NY, in 1993. He held various positions in digital circuit design with AMD, Mentor Graphics and Hewlett Packard between 1994 and 2003. Since 2003, he has been with IBM, where he led the POWER8™ Nest circuit design team and various units in the Nest of POWER6™ and POWER7™. He is currently leading the Nest circuit design team for POWER9™ and pursuing the Ph.D. degree in Electrical and Computer Engineering at Colorado State University, Fort Collins, CO.

Chirag Desai received the B.E. degree in Electronics and Telecom Engineering from the University of Bombay, Bombay, India, in 2001. He received the M.S. degree in Electrical Engineering from the University of Southern California, Los Angeles, CA, in 2004. From 2004 to 2009 he worked for Sun Microsystems, San Jose, CA, in the digital circuit design area. Since 2010, he has been with IBM, where he has led the L2 circuit design team for POWER8™ and POWER9™.

Tom Chen received the B.S. degree from Shanghai Jiao-Tong University, Shanghai, China, and the Ph.D. from the University of Edinburgh, Edinburgh, UK. From 1987 to 1989, he was with Philips Semiconductors as a member of its technical staff. From 1989 to 1990, he was an Assistant Professor at New Jersey Institute of Technology, Newark. Since 1990, he has been with the Department of Electrical and Computer Engineering, Colorado State University, Fort Collins, where he is currently a Professor. He has published over 100 papers in refereed conferences and journals in the areas of VLSI system architecture and VLSI CAD methodology. His research interests are in the areas of VLSI design and CAD methodology.

Vikas Agarwal received the B.Tech. degree in Electrical Engineering from the Indian Institute of Technology, Bombay, India, in 1996. He received the M.S. and Ph.D. degrees in Electrical and Computer Engineering from the University of Texas, Austin, TX, in 1998 and 2004, respectively. Since 2004, he has been with IBM, where he led the custom register file team for IBM’s POWER and Z processor design group. He is currently leading the compilable array team for POWER9™. Dr. Agarwal has more than 8 patents and 10 papers in relevant journals and conferences, and his work has been cited in over 300 publications.