Abstract
The many-accelerator architecture, composed mostly of general-purpose cores and accelerator-like function units (FUs), has become an attractive alternative to homogeneous chip multiprocessors (CMPs) because of its superior power efficiency. However, many-accelerator processors exhibit far more complicated memory access patterns than general-purpose processors (GPPs), because the abundant on-chip FUs tend to generate highly concurrent memory streams with distinct locality and bandwidth demands. The disordered memory streams issued by diverse accelerators interfere with one another and cannot be handled efficiently by a conventional main memory interface, which provides an inflexible data fetching mode. Unlike traditional DRAM memory, our proposed Aggregation Memory System (AMS) adapts to the characteristics of the memory streams from different FUs: it provides the FUs with different data fetching sizes and preserves their memory access locality by interleaving their data across memory devices through sub-rank binding. Moreover, with our optimized memory scheduling policy, AMS can batch requests that have no sub-rank conflict into a read burst. Experimental results from trace-based simulation show that AMS delivers both a significant performance boost and energy savings.
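To make the batching idea concrete, the following is a minimal sketch of grouping pending requests so that no two requests in a batch touch the same sub-rank. It is not the scheduling policy of AMS, which the abstract does not specify. The `Request` type, its fields, and the greedy first-fit strategy are all assumptions chosen for illustration.

```python
from dataclasses import dataclass
from typing import FrozenSet, List


@dataclass(frozen=True)
class Request:
    """Hypothetical memory request; both fields are illustrative assumptions."""
    fu_id: int                 # id of the issuing function unit
    subranks: FrozenSet[int]   # sub-ranks that hold this request's data


def batch_requests(queue: List[Request]) -> List[List[Request]]:
    """Greedy first-fit grouping (an assumed strategy, not the paper's policy).

    Requests are visited in arrival order; each one joins the first batch
    whose occupied sub-ranks are disjoint from its own, otherwise it opens
    a new batch. Every returned batch is therefore free of sub-rank conflicts.
    """
    batches = []  # each entry tracks the occupied sub-ranks and its requests
    for req in queue:
        for batch in batches:
            if batch["used"].isdisjoint(req.subranks):
                batch["used"] |= req.subranks
                batch["reqs"].append(req)
                break
        else:
            # No existing batch can host this request without a conflict.
            batches.append({"used": set(req.subranks), "reqs": [req]})
    return [batch["reqs"] for batch in batches]


# Example queue: four requests from four hypothetical FUs.
queue = [
    Request(0, frozenset({0, 1})),
    Request(1, frozenset({2})),
    Request(2, frozenset({1, 3})),
    Request(3, frozenset({3})),
]
batched_fu_ids = [[r.fu_id for r in batch] for batch in batch_requests(queue)]
```

In this sketch, the request from FU 2 conflicts with FU 0 on sub-rank 1 and therefore starts a second batch, while the later request from FU 3 still fits into the first one. First-fit can reorder requests relative to arrival order; a real scheduler would also have to weigh fairness and DRAM timing constraints, which are ignored here.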
Additional information
Supported by the National Natural Science Foundation of China under Grant Nos. 61173006, 60921002, the National Basic Research 973 Program of China under Grant No. 2011CB302503, and the Strategic Priority Research Program of the Chinese Academy of Sciences under Grant No. XDA06010403.
Cite this article
Wang, Y., Zhang, L., Han, YH. et al. Reinventing Memory System Design for Many-Accelerator Architecture. J. Comput. Sci. Technol. 29, 273–280 (2014). https://doi.org/10.1007/s11390-014-1429-6
DOI: https://doi.org/10.1007/s11390-014-1429-6