ABSTRACT
As the number of on-chip accelerators grows rapidly to improve power efficiency, the buffer size required by the accelerators increases drastically. Existing solutions allow the accelerators to share a common pool of buffers and/or allocate buffers in the cache. In this paper we propose a Buffer-in-NUCA (BiN) scheme with the following contributions: (1) a dynamic, interval-based global buffer allocation method that assigns shared buffer space to the accelerators that can best utilize it, and (2) a flexible, low-overhead paged buffer allocation method that limits the impact of buffer fragmentation in a shared buffer, especially when buffers are allocated in a non-uniform cache architecture (NUCA) with distributed cache banks. Experimental results show that, compared to two representative schemes from prior work, BiN improves performance by 32% and 35% and reduces energy by 12% and 29%, respectively.
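To make the second contribution concrete, the sketch below illustrates the general idea behind paged buffer allocation in a banked NUCA: a logically contiguous accelerator buffer is backed by fixed-size pages drawn from whichever distributed cache banks have free space, so free space never fragments into unusably small contiguous runs. This is a minimal illustration of the paging concept only, not the authors' implementation; the page size, the `NucaBank`/`PagedBufferAllocator` names, and the round-robin bank-selection policy are all illustrative assumptions.

```python
# Illustrative sketch (not the paper's implementation) of paged buffer
# allocation over distributed NUCA banks. PAGE_SIZE and all class names
# are assumptions made for this example.

PAGE_SIZE = 2048  # bytes per buffer page (assumed)

class NucaBank:
    """One distributed cache bank whose capacity is carved into pages."""
    def __init__(self, bank_id, num_pages):
        self.bank_id = bank_id
        self.free_pages = list(range(num_pages))

class PagedBufferAllocator:
    def __init__(self, banks):
        self.banks = banks

    def allocate(self, nbytes):
        """Return a page table [(bank_id, page), ...] covering nbytes,
        or None if the banks together lack enough free pages."""
        need = -(-nbytes // PAGE_SIZE)  # ceiling division
        if need > sum(len(b.free_pages) for b in self.banks):
            return None
        table = []
        # Round-robin over banks: because the buffer is accessed through a
        # page table, its pages need not be contiguous or even co-located,
        # which is what limits fragmentation in the shared space.
        i = 0
        while len(table) < need:
            bank = self.banks[i % len(self.banks)]
            if bank.free_pages:
                table.append((bank.bank_id, bank.free_pages.pop()))
            i += 1
        return table

    def free(self, table):
        """Return a buffer's pages to their home banks."""
        for bank_id, page in table:
            self.banks[bank_id].free_pages.append(page)

# Example: 4 banks of 4 pages each; a 9000-byte buffer needs
# ceil(9000 / 2048) = 5 pages, scattered across the banks.
banks = [NucaBank(b, num_pages=4) for b in range(4)]
alloc = PagedBufferAllocator(banks)
buf = alloc.allocate(9000)
```

Even after many allocations and frees leave the free pages scattered, any request that fits in the total free page count still succeeds, since no contiguous run is ever required.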
BiN: a buffer-in-NUCA scheme for accelerator-rich CMPs