ABSTRACT
With the trend toward specialization, an efficient memory-path design is vital to capitalize on customization in the data path. A monolithic memory hierarchy is often highly inefficient for irregular applications, which have traditionally been targeted at CPUs. New approaches and tools are required to offer application-specific memory customization that combines the benefits of cache and scratchpad memory.
This paper introduces a novel approach for automated application-specific on-chip memory assignment and tiling. The approach offers two major tools: (1) static memory access analysis and (2) variable-level memory assignment. Static memory access analysis operates at the LLVM abstraction level. It extracts target-independent pointer behaviors, measures access strides, and analyzes the prefetchability of variables. Variable-level memory assignment creates a memory allocation graph that assigns each variable to cache or scratchpad based on the variable's size and estimated locality. It also explores opportunities for tiling memory accesses. For exploration and results, this paper uses the MachSuite benchmarks (with both regular and irregular memory access behaviors) and the gem5-Aladdin tool for performance and power evaluation. The proposed approach optimizes the memory hierarchy by automatically combining the benefits of cache and (tiled) scratchpad at variable-level granularity for each application. The results demonstrate more than 45% improvement in our power-stall product, on average, over a monolithic cache or scratchpad design.
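The two steps named in the abstract, stride measurement and per-variable memory assignment, can be illustrated with a small sketch. This is not the paper's implementation: the `Variable` record, the address traces, the `spm_bytes` capacity, the `regular_threshold` cutoff, and the rule "regular goes to scratchpad, irregular goes to cache" are all hypothetical choices made for illustration, and the paper's analysis works statically on LLVM IR rather than on an address trace.

```python
from collections import Counter
from dataclasses import dataclass


@dataclass
class Variable:
    """Hypothetical per-variable record; not a structure from the paper."""
    name: str
    size_bytes: int
    addresses: list  # byte addresses in access order (illustrative trace)


def dominant_stride(addresses):
    """Return (most common stride, fraction of accesses that follow it)."""
    deltas = [b - a for a, b in zip(addresses, addresses[1:])]
    if not deltas:
        return 0, 1.0
    stride, count = Counter(deltas).most_common(1)[0]
    return stride, count / len(deltas)


def assign(var, spm_bytes=4096, regular_threshold=0.9):
    """Pick a memory for one variable. Both thresholds are assumptions."""
    _, regularity = dominant_stride(var.addresses)
    if regularity < regular_threshold:
        return "cache"             # irregular: leave locality to the cache
    if var.size_bytes <= spm_bytes:
        return "scratchpad"        # regular and small enough to fit
    return "tiled-scratchpad"      # regular but too large: bring in tiles


small = Variable("small", 1024, list(range(0, 1024, 4)))
large = Variable("large", 65536, list(range(0, 65536, 4)))
sparse = Variable("sparse", 1024, [0, 400, 8, 912, 64, 300])
decisions = {v.name: assign(v) for v in (small, large, sparse)}
```

Here `small` and `large` are accessed with a constant 4-byte stride, so the sketch treats them as prefetchable, while `sparse` has no repeating stride and falls back to the cache. A real flow would replace the fixed thresholds with the paper's locality estimate and memory allocation graph.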
Locality Aware Memory Assignment and Tiling
2018 55th ACM/ESDA/IEEE Design Automation Conference (DAC)