
Optical Switching and Networking

Volume 22, November 2016, Pages 54-68

An optically-enabled chip–multiprocessor architecture using a single-level shared optical cache memory

https://doi.org/10.1016/j.osn.2016.05.001

Highlights

  • We present an optical-bus CMP architecture where an optical shared cache is used.

  • The optical cache resides in a separate chip and no on-chip cache is required.

  • The CPU-DRAM communication is realized completely in the optical domain.

  • Significant L1 miss rate reduction of up to 96% for certain cases is attained.

  • Average speed-up of 19.4% or capacity requirements reduction of ~63% is attained.

Abstract

We present an optical bus-based chip–multiprocessor architecture where the processing cores share an optical single-level cache implemented in a separate chip next to the Central-Processing-Unit (CPU) die. The interconnection system is realized through Wavelength-Division-Multiplexed optical interfaces connecting the shared cache with the cores and the Main Memory via spatially multiplexed waveguides. To evaluate the proposed approach, we perform system-level simulations of a wide range of parallel workloads using Gem5. The optical cache architecture is compared against a conventional one that uses dedicated on-chip Level-1 electronic caches and a shared Level-2 cache. Results show a significant Level-1 miss rate reduction of up to 96% for certain cases; on average, a performance speed-up of 19.4% or a cache capacity requirements reduction of ~63% is attained. Combined with high-bandwidth CPU-Dynamic Random Access Memory (DRAM) bus solutions based on optical interconnects, the proposed design is a promising architecture for bridging the gap between high-speed optically connected CPU-DRAM schemes and high-speed optical memory technologies.

Introduction

More than twenty years have passed since the speed mismatch between the Central Processing Unit (CPU) and Main Memory (MM), commonly referred to as the “Memory Wall”, was identified as one of the main barriers to increased computer performance [1]. Solutions such as the deployment of large on-chip cache memories, the widening of CPU-MM buses and prefetching from the MM have been devised to ease the limited off-chip bandwidth and the MM's high response latency imposed by the constraints of electronic technology [2]. Higher spatial-multiplexing degrees through wider buses allow for more efficient, simultaneous multi-bit data transfer within a single cycle, while prefetching trades bandwidth for a reduced average access delay by buffering data close to the CPU in anticipation of future requests, thus shortening processing stalls. However, the introduction of modern Chip Multi-Processor (CMP) configurations has further aggravated the bottleneck between the CPU and MM and led to larger two- or even three-level cache memory hierarchies that take up almost 40% of the total chip energy consumption [3] and more than 40% of chip real estate [4], [5], [6]. Taking into account the distance- and speed-dependent energy dissipation and the low bandwidth density associated with electronic technology, novel approaches are required against the Memory Wall.

Promising solutions emerge from the fields of optical interconnects and photonic integration thanks to their proven high-speed data-transfer capabilities. The introduction of these technologies in the interconnection system between the memory and processing elements is expected to relieve computing of some energy-demanding and slow electronics. Focusing on the CPU–MM interconnection system, the main effort has shifted to the replacement of the electronic buses with optical wires. With current technology, fetching a 256-bit operand from the MM module consumes more than 16 nJ [7] and requires four transmission stages (assuming a 64-bit wide bus at an operation speed just above 1 GHz). Bringing photonics into the game by replacing the electrical buses with optical wiring solutions, either over a Silicon-on-Insulator (SOI) platform or over an Optical Printed Circuit Board (OPCB), is expected to (1) reduce energy consumption down to 1 mW/Gbps [8], (2) raise operation speed to several tens of GHz and, at the same time, (3) dispense with the traditional Resistance-Capacitance (RC)-induced delay of electrical wiring. This roadmap is rapidly gaining interest, with several works demonstrating the benefits of switching from electronic to optical CPU-MM buses [9], [10], [11], [12], [13] and introducing novel fully functional all-optical interfaces for Dynamic Random Access Memory (DRAM) integration [9]. However, all these enhancements cannot mitigate the need for memory caching, as CMP dies will continue to struggle to find an optimum balance among processor, cache and interconnect circuitry considerations.
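To put these figures in perspective, the following short sketch works through the arithmetic of the paragraph above. The 64-bit bus width, the >16 nJ electrical fetch energy [7] and the 1 mW/Gbps optical figure [8] are taken from the text; the 10 Gbps per-wavelength line rate is a hypothetical assumption chosen only for illustration:

```python
# Worked example: fetching a 256-bit operand over an electrical vs. an optical bus.
# Figures marked "cited" come from the text; the optical line rate is an assumption.

OPERAND_BITS = 256
BUS_WIDTH_BITS = 64            # cited: 64-bit wide electrical bus
ELECTRICAL_ENERGY_NJ = 16.0    # cited: >16 nJ per 256-bit fetch [7]

stages = OPERAND_BITS // BUS_WIDTH_BITS
print(f"Electrical bus: {stages} transmission stages, >{ELECTRICAL_ENERGY_NJ} nJ")

OPTICAL_MW_PER_GBPS = 1.0      # cited: ~1 mW/Gbps for optical wiring [8]
LINE_RATE_GBPS = 10.0          # hypothetical per-wavelength line rate

power_mw = OPTICAL_MW_PER_GBPS * LINE_RATE_GBPS   # 10 mW while transmitting
time_ns = OPERAND_BITS / LINE_RATE_GBPS           # bits / (Gbit/s) = ns -> 25.6 ns
energy_nj = power_mw * time_ns / 1000.0           # mW * ns = pJ; /1000 -> ~0.26 nJ
print(f"Optical link: {time_ns:.1f} ns, ~{energy_nj:.2f} nJ per 256-bit fetch")
```

Under these assumptions a single-wavelength optical link already delivers the operand in one serial burst at roughly two orders of magnitude lower energy; WDM parallelism would shorten the transfer time further.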

Going a step further with the emerging technologies, the field of optics can also lead to novel solutions in the data-buffering domain, such as cache memories. Although the lack of electric charge places photons at a disadvantage when it comes to storage, a variety of optical flip-flop and Random Access Memory (RAM) cell technologies have appeared for storing information. These technologies exploit the Set-Reset flip-flop architectural layout while at the same time reducing the access delay. Representative all-optical flip-flop technologies include coupled Semiconductor Optical Amplifiers (SOAs) [14], III-V-on-SOI microdisk lasers [15], polarization bistable Vertical Cavity Surface Emitting Lasers (VCSELs) [16] and SOA-based Mach-Zehnder Interferometers (SOA-MZIs) [17].

Proceeding to multi-bit storage, Photonic Crystal (PhC) nanocavity technology has already demonstrated more than 100 bits of integrated storage capacity with significant benefits in terms of speed, energy consumption and footprint [18], [19]. Extending the elementary flip-flop operation, the first optical Static RAM (SRAM) cell allowed for fully functional random-access read/write operation at 5 Gbps [20]. In this configuration the cell deploys two SOA access gates and an SOA-MZI-based flip-flop, and can theoretically operate at speeds of up to 40 Gbps [21]. Next-generation SRAM cells introduced significant improvements in the number of active elements and in energy consumption through the introduction of wavelength diversity in the incoming signals [22].

Expanding our view from single elementary memory cells to complete optical RAM architectures, [23] has highlighted the benefits of employing Wavelength Division Multiplexing (WDM)-formatted data and address fields in RAM peripheral circuits such as row [24] and column decoders [24], [25]. Taking advantage of all these advances, we recently presented a complete and fully functional optical cache memory architecture that successfully performs both read and write operations directly in the optical domain [26]. In [26] we designed an all-optical cache memory that combines all the necessary optical subsystems, such as read/write selection modules, row and column decoders, 2D RAM banks and tag-comparison circuits. Physical-layer simulations carried out with the commercially available VPI Photonics simulation suite [27] indicate error-free operation at speeds up to 16 GHz for both direct [26] and 2-way associative [28] cache mapping schemes. However, its system-scale performance in CMP configurations has so far been evaluated only for the bodytrack and blackscholes benchmarks [29] from the PARSEC benchmark suite [30], providing only a limited amount of information about its architectural advantages in real CMP systems. In order to analyze its application potential in real CMP settings, a prerequisite is to follow the well-established practice of validating the proposed scheme against a broad range of workloads [31].
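As a rough illustration of what the row/column decoders and tag-comparison circuits of [26] operate on, the sketch below decomposes a memory address into tag, row, column and offset fields for a direct-mapped cache. The field widths are hypothetical and chosen only for readability; they are not the dimensions of the cache design in [26]:

```python
# Illustrative split of a 32-bit address into the fields a direct-mapped cache
# consumes: the tag is compared, the row/column fields are decoded to select a
# RAM bank cell. Field widths below are hypothetical, not taken from [26].

ADDR_BITS, ROW_BITS, COL_BITS, OFFSET_BITS = 32, 5, 5, 4
TAG_BITS = ADDR_BITS - ROW_BITS - COL_BITS - OFFSET_BITS  # 18 tag bits here

def decode(addr: int):
    """Split an address into (tag, row, column, offset) fields."""
    offset = addr & ((1 << OFFSET_BITS) - 1)
    col = (addr >> OFFSET_BITS) & ((1 << COL_BITS) - 1)
    row = (addr >> (OFFSET_BITS + COL_BITS)) & ((1 << ROW_BITS) - 1)
    tag = addr >> (OFFSET_BITS + COL_BITS + ROW_BITS)
    return tag, row, col, offset

tag, row, col, offset = decode(0xDEADBEEF)
print(f"tag={tag:#x} row={row} col={col} offset={offset}")
# A hit is declared when the stored tag of the selected line equals `tag`;
# in the WDM-enabled architecture these fields arrive as parallel wavelengths.
```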

This paper extends our previous work by presenting a detailed optical bus-based CMP architecture where all-optical Level-1 instruction (L1i) and Level-1 data (L1d) caches are shared among the processing cores and the MM. In this work we focus on the interconnection and system-level performance advantages of the shared all-optical cache memory scheme, using the physical-layer cache design of [26] as the basic building block. Both L1i and L1d caches are placed off-chip, sparing precious die area in favor of processing elements. The choice of a shared cache unit rests on the fact that the cycle time of the optical cache memories is a fraction of the cycle time of the electronic cores, thus making it possible to serve multiple concurrent core requests without stalling execution.
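To see why a single shared cache can keep up with several cores, consider the back-of-the-envelope sketch below. The 16 GHz cache cycle is the physical-layer figure reported for [26]; the 2 GHz core clock and the eight-core count (the indicative number used in Fig. 1(a)) are assumptions for illustration:

```python
# Back-of-the-envelope check: how many core requests a shared optical cache
# can absorb per core cycle. Cache speed is the 16 GHz figure of [26]; the
# core clock is a hypothetical value for an electronic core.

CACHE_CLOCK_GHZ = 16.0   # cited physical-layer operation speed [26]
CORE_CLOCK_GHZ = 2.0     # hypothetical electronic core clock
NUM_CORES = 8            # indicative core count, as in Fig. 1(a)

requests_per_core_cycle = CACHE_CLOCK_GHZ / CORE_CLOCK_GHZ
print(f"Shared cache completes {requests_per_core_cycle:.0f} accesses per core cycle,")
print(f"enough to serve one request from each of {NUM_CORES} cores without stalls.")
```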

The optical bus-based CMP architecture's system-scale performance is assessed for 12 parallel workloads, using the PARSEC benchmark suite [30] on top of the Gem5 simulator [32]. The simulation findings suggest that the shared optical cache architecture can substantially reduce the Level-1 (L1) cache miss rate (by up to 96% in certain cases), and either speed up execution (by 19.4% on average) or slash the required cache capacity (by ~63% on average). Following the proposed CMP architecture, the connection of the cache memory modules with both the (cache-free) CMP dies and the DRAM elements can be realized completely in the optical domain, relieving processor dies of interconnection and caching modules.
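As a hint of how a miss-rate reduction translates into execution speed-up, the sketch below evaluates the standard Average Memory Access Time (AMAT) model. Only the up-to-96% miss-rate reduction is taken from the simulation findings; the hit time, miss penalty and baseline miss rate are hypothetical placeholders, not values from the Gem5 runs:

```python
# Standard AMAT model: AMAT = hit_time + miss_rate * miss_penalty.
# All latencies and the baseline miss rate are hypothetical; only the
# up-to-96% miss-rate reduction is taken from the simulation findings.

def amat(hit_time_cyc: float, miss_rate: float, miss_penalty_cyc: float) -> float:
    """Average memory access time in core cycles."""
    return hit_time_cyc + miss_rate * miss_penalty_cyc

HIT_TIME, MISS_PENALTY = 1.0, 100.0   # hypothetical, in core cycles
baseline_miss_rate = 0.05             # hypothetical baseline L1 miss rate
optical_miss_rate = baseline_miss_rate * (1 - 0.96)  # cited 96% reduction

base = amat(HIT_TIME, baseline_miss_rate, MISS_PENALTY)
opt = amat(HIT_TIME, optical_miss_rate, MISS_PENALTY)
print(f"AMAT: {base:.2f} -> {opt:.2f} cycles ({base / opt:.2f}x faster memory access)")
```

How much of this memory-access gain surfaces as overall speed-up depends on each workload's memory intensity, which is why the averaged figure across the 12 PARSEC workloads (19.4%) is smaller than the per-access improvement.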

The rest of this paper is organized as follows: Section 2 describes the detailed physical-layer design of the bus-based CMP architecture, Section 3 presents the system-scale simulation results, Section 4 discusses the findings of this work and finally Section 5 concludes the paper.


Optical-bus-based CMP architecture with optical cache memories

Fig. 1(a) presents a typical example of a modern CMP with multi-level electronic caches and an indicative number of eight processing cores. Specifically, the standard approach is to place dedicated L1d and L1i caches at each core that run at the same speed as the core in order to maintain stall-free core operation on cache hits. L1d and L1i caches independently buffer the instruction and data fetch and store operations towards doubling the cache bandwidth and reducing interference between

Simulation results

To assess the performance of the optical cache module presented in the previous section and described in detail in [26], we consider and study a CMP system that integrates such an optical off-chip cache connected to the CPU and MM via optical buses.
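The two system configurations under comparison can be summarized as follows. This is a descriptive sketch only; the values shown are placeholders for illustration, not the exact Gem5 parameters of the study:

```python
# Descriptive summary of the two simulated configurations. Values are
# placeholders; the exact Gem5 settings are not reproduced here.

baseline_cmp = {
    "cores": 8,                                   # indicative, as in Fig. 1(a)
    "L1": "per-core electronic L1i + L1d",
    "L2": "shared electronic L2",
    "cpu_mm_bus": "electrical",
}

optical_cmp = {
    "cores": 8,
    "L1": "single shared off-chip optical L1i + L1d",
    "L2": None,                                   # no L2: flat, single-level hierarchy
    "cpu_mm_bus": "WDM optical interfaces over spatially multiplexed waveguides",
}
```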

The standard CMP design is based on electronic memory elements and deploys a multi-level cache hierarchy. The most common approach is to use one dedicated L1 cache per core (partitioned into separate L1i and L1d caches) and a common L2 cache shared

Discussion

The previous section's simulation results reveal the potential miss-rate reduction benefits that could be gained by adopting the shared high-speed optical cache topology in multi-core chip configurations. These translate to either a significant reduction in capacity requirements or an important speed-up in the overall system's performance.

The transition from traditional computing on CPUs to massively parallel general purpose computing on Graphics Processing Units (GPUs) has identified the cache

Conclusions

We have demonstrated an optical bus-based Chip Multiprocessor architecture where an all-optical cache memory is shared among the processing cores, suggesting a totally flat cache hierarchy. Placing the shared cache on a separate chip next to the processor die allows for better chip-area utilization in favor of the processing elements. All CPU-DRAM communication is realized completely in the optical domain by utilizing proper WDM optical interfaces combined with spatial-multiplexed optical

Acknowledgments

This work has been supported in part by the European Commission through the FP7-ICT-FET Open project RAMPLAS (Contract no. 270773) and FP7-PEOPLE-2013-IAPP-COMANDER (612257).

References (63)

  • S.A. McKee, Reflections on the memory wall, in: Proceedings of the 1st Conference on Computing Frontiers (CF '04), ACM,...
  • B. Ahsan, M. Zahran, Cache performance, system performance, and off-chip bandwidth… pick any two, in: Proceedings of...
  • K. Ali, M. Aboelaze, S. Datta, Modified hotspot cache architecture: a low energy fast cache for embedded processors,...
  • S. Borkar et al., The future of microprocessors, Commun. ACM (2011)
  • L. Zhao, R. Iyer, S. Makineni, J. Moses, R. Illikkal, D. Newell, Performance, area and bandwidth implications on...
  • P. Kongetira et al., Niagara: a 32-way multithreaded sparc processor, IEEE Micro (2005)
  • B. Dally, GPU Computing: To ExaScale and Beyond, SC 2010, New Orleans, USA, 2010. Available online at...
  • M. Duranton, Design for Silicon Photonics, Retrieved September 6, 2014 from...
  • H. Ji, K.Ho Ha, I. Joe, S. Gu Kim, K. Won Na, D. Jae Shin, S. Dong Suh, Y. Dong Park, C. Hee Chung, Optical interface...
  • D.J. Shin, K.S. Cho, H.C. Ji, B.S. Lee, S.G. Kim, J.K. Bok, S.H. Choi, Y.H. Shin, J.H. Kim, S.Y. Lee, K.Y. Cho, B.J....
  • K. Lee, D. Jae Shin, H. Ji, K. Na, S. Gu Kim, J. Bok, Y. You, S. Kim, I. Joe, S. Dong Suh, J. Pyo, Y. Shin, K. Ha, Y....
  • C. Baten et al., Building many-core processor-to-DRAM networks with monolithic CMOS silicon photonics, IEEE Micro (2009)
  • D. Brunina et al., An energy-efficient optically connected memory module for hybrid packet- and circuit-switched optical networks, IEEE J. Sel. Top. Quantum Electron. (2013)
  • C. Vagionas, D. Fitsios, G.T. Kanellos, N. Pleros, A. Miliou, All optical flip-flop with two coupled travelling...
  • L. Liu et al., An ultra-small, low-power, all-optical flip-flop memory on a silicon chip, Nat. Photonics (2010)
  • J. Sakaguchi et al., High switching-speed operation of optical memory based on polarization bistable vertical-cavity surface-emitting laser, IEEE J. Quantum Electron. (2010)
  • Y. Liu et al., Packaged and hybrid integrated all-optical flip-flop memory, Electron. Lett. (2006)
  • E. Kuramochi et al., Large-scale integration of wavelength-addressable all-optical memories on a photonic crystal chip, Nat. Photonics (2014)
  • K. Nozaki et al., Ultralow-power all-optical RAM based on nanocavities, Nat. Photonics (2012)
  • N. Pleros et al., Optical static RAM cell, IEEE Photonics Technol. Lett. (2009)
  • D. Fitsios et al., Memory speed analysis of optical RAM and optical flip-flop circuits based on coupled SOA-MZI gates, IEEE J. Sel. Top. Quantum Electron. (2012)
  • D. Fitsios et al., Dual-wavelength bit input optical RAM with three SOA XGM switches, IEEE Photonics Technol. Lett. (2012)
  • G.T. Kanellos et al., Bringing WDM into optical static RAM architectures, J. Light. Technol. (2013)
  • T. Alexoudi et al., Optical cache memory peripheral circuitry: row and column address selectors for optical static RAM banks, J. Light. Technol. (2013)
  • C. Vagionas, S. Markou, G. Dabos, T. Alexoudi, D. Tsiokos, A. Miliou, N. Pleros, G.T. Kanellos, Optical RAM row access...
  • P. Maniotis et al., Optical buffering for chip multiprocessors: a 16GHz optical cache memory architecture, J. Light. Technol. (2013)
  • VPI Photonics, 2014,...
  • P. Maniotis, D. Fitsios, G.T. Kanellos, N. Pleros, A 16GHz Optical Cache Memory Architecture for Set-Associative...
  • P. Maniotis, S. Gitzenis, L. Tassiulas, N. Pleros, A novel Chip-Multiprocessor Architecture with optically...
  • C. Bienia, K. Li, PARSEC 2.0: A new benchmark suite for chip-multiprocessors, in: Proceedings of 5th Annual Workshop on...
  • G. Hendry, S. Kamil, A. Biberman, J. Chan, B.G. Lee, M. Mohiyuddin, A. Jain, K. Bergman, L.P. Carloni, J. Kubiatowicz,...