DOI: 10.1145/3195970.3196070
research-article

Locality aware memory assignment and tiling

Published: 24 June 2018

Abstract

With the trend toward specialization, an efficient memory-path design is vital to capitalize on customization in the data path. A monolithic memory hierarchy is often highly inefficient for irregular applications, which have traditionally targeted CPUs. New approaches and tools are required to offer application-specific memory customization that combines the benefits of cache and scratchpad memory.
This paper introduces a novel approach for automated, application-specific on-chip memory assignment and tiling. The approach offers two major tools: (1) static memory access analysis and (2) variable-level memory assignment. Static memory access analysis operates at the LLVM abstraction level: it extracts target-independent pointer behaviors, measures access strides, and analyzes the prefetchability of variables. Variable-level memory assignment creates a memory allocation graph that assigns each variable to cache or scratchpad based on its size and estimated locality; it also explores opportunities for tiling memory accesses. For exploration and results, this paper uses the MachSuite benchmarks (with both regular and irregular memory access behaviors) and the gem5-Aladdin tool for performance and power evaluation. The proposed approach optimizes the memory hierarchy by automatically combining the benefits of cache and (tiled) scratchpad at variable-level granularity for each application. The results demonstrate more than 45% improvement in the power-stall product, on average, over a monolithic cache or scratchpad design.
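
The abstract does not spell out the analysis or the assignment rule, so the following is only a minimal sketch of the general idea, not the authors' method. It assumes each variable comes with a hypothetical list of byte addresses in access order, estimates regularity as the share of accesses that follow the most common stride, and then applies an illustrative rule: irregular variables go to cache, regular variables that fit go to scratchpad, and regular variables that do not fit are marked for tiling. The names `VariableProfile`, `dominant_stride`, and `assign_memory`, as well as the capacity and threshold values, are assumptions made for this example.

```python
from collections import Counter
from dataclasses import dataclass


@dataclass
class VariableProfile:
    """Hypothetical per-variable summary used only in this sketch."""
    name: str
    size_bytes: int
    addresses: list  # byte addresses in access order (assumed input)


def dominant_stride(addresses):
    """Return (stride, fraction) for the most common address delta."""
    deltas = [b - a for a, b in zip(addresses, addresses[1:])]
    if not deltas:
        # Zero or one access: trivially regular.
        return 0, 1.0
    stride, count = Counter(deltas).most_common(1)[0]
    return stride, count / len(deltas)


def assign_memory(var, spm_capacity=4096, regular_threshold=0.9):
    """Choose 'cache', 'scratchpad', or 'tiled-scratchpad' for one variable.

    Both thresholds are illustrative assumptions, not values from the paper.
    """
    _, regularity = dominant_stride(var.addresses)
    if regularity < regular_threshold:
        # Irregular accesses are hard to prefetch; leave them to the cache.
        return "cache"
    if var.size_bytes <= spm_capacity:
        # Regular and small enough to fit entirely in the scratchpad.
        return "scratchpad"
    # Regular but too large: stream it through the scratchpad in tiles.
    return "tiled-scratchpad"


if __name__ == "__main__":
    profiles = [
        VariableProfile("coeffs", 1024, list(range(0, 1024, 4))),
        VariableProfile("image", 65536, list(range(0, 65536, 8))),
        VariableProfile("graph", 1024, [0, 40, 8, 512, 4, 300, 16]),
    ]
    for p in profiles:
        print(p.name, dominant_stride(p.addresses)[0], assign_memory(p))
```

A real flow would derive strides statically from LLVM IR pointer arithmetic rather than from a recorded trace; the trace-based version is used here only because it is short and self-contained.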

Cited By

  • (2022) Fine-Granular Computation and Data Layout Reorganization for Improving Locality. Proceedings of the 41st IEEE/ACM International Conference on Computer-Aided Design, pp. 1-9. DOI: 10.1145/3508352.3549386. Online publication date: 30-Oct-2022
  • (2020) LLVM-based automation of memory decoupling for OpenCL applications on FPGAs. Microprocessors & Microsystems 72:C. DOI: 10.1016/j.micpro.2019.102909. Online publication date: 1-Feb-2020
  • (2019) Scalable LLVM-Based Accelerator Modeling in gem5. IEEE Computer Architecture Letters 18:1, pp. 18-21. DOI: 10.1109/LCA.2019.2893932. Online publication date: 1-Jan-2019

Published In

DAC '18: Proceedings of the 55th Annual Design Automation Conference
June 2018
1089 pages
ISBN:9781450357005
DOI:10.1145/3195970

Publisher

Association for Computing Machinery

New York, NY, United States

Conference

DAC '18
DAC '18: The 55th Annual Design Automation Conference 2018
June 24 - 29, 2018
San Francisco, California

Acceptance Rates

Overall Acceptance Rate 1,770 of 5,499 submissions, 32%
