research-article

Superoptimized Memory Subsystems for Streaming Applications

Authors: Joseph G. Wingbermuehle, Roger D. Chamberlain

FPGA '15: Proceedings of the 2015 ACM/SIGDA International Symposium on Field-Programmable Gate Arrays

Pages 126 - 135

https://doi.org/10.1145/2684746.2689069

Published: 22 February 2015

Abstract

Because main memory is many times slower than modern processor cores, deep, multi-level cache hierarchies are ubiquitous in computers today. Similarly, applications deployed on ASICs and FPGAs are often hindered by slow external memories. Therefore, to achieve good performance, hardware designers must optimize main memory usage. Unfortunately, this process is often labor intensive and fails to explore the full range of potential memory designs. To address this issue for applications expressed in a streaming manner, we show that it is possible to generate automatically a superoptimized memory subsystem that can be deployed on an FPGA such that it performs better than a general-purpose memory subsystem. Rather than explore only simple memory subsystems, our superoptimizer is capable of exploring extremely complex designs consisting of multi-level caches and other components. Finally, we show that it is possible to deploy applications with superoptimized memory subsystems with minimal additional effort while achieving significant performance improvements over a naive memory subsystem.

Index Terms

Superoptimized Memory Subsystems for Streaming Applications
1. Hardware
  1. Integrated circuits
    1. Logic circuits
    2. Semiconductor memory

Published In

FPGA '15: Proceedings of the 2015 ACM/SIGDA International Symposium on Field-Programmable Gate Arrays

February 2015

292 pages

ISBN: 9781450333153

DOI: 10.1145/2684746

General Chair: George A. Constantinides, Imperial College

Program Chair: Deming Chen, University of Illinois at Urbana-Champaign

Copyright © 2015 ACM.

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than the author(s) must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected].

Sponsors

SIGDA: ACM Special Interest Group on Design Automation

Publisher

Association for Computing Machinery

New York, NY, United States

Qualifiers

Research-article

Funding Sources

VelociData, Inc.
Exegy, Inc.

Conference

FPGA '15: The 2015 ACM/SIGDA International Symposium on Field-Programmable Gate Arrays

Sponsor: SIGDA

February 22 - 24, 2015

Monterey, California, USA

Acceptance Rates

FPGA '15 Paper Acceptance Rate 20 of 102 submissions, 20%;

Overall Acceptance Rate 125 of 627 submissions, 20%

Bibliometrics

Total Citations: 4
Total Downloads: 251
Downloads (Last 12 months): 6
Downloads (Last 6 weeks): 0

Reflects downloads up to 10 Feb 2025

Cited By

F. Winterstein (2017). Background. Separation Logic for High-level Synthesis, pages 35-55. Online publication date: 28-Feb-2017. https://doi.org/10.1007/978-3-319-53222-6_3

F. Winterstein, K. Fleming, H. Yang, J. Wickerson, and G. Constantinides (2015). Custom-sized caches in application-specific memory hierarchies. 2015 International Conference on Field Programmable Technology (FPT), pages 144-151. Online publication date: Dec-2015. https://doi.org/10.1109/FPT.2015.7393141

J. Vasiljevic, R. Wittig, P. Schumacher, J. Fifield, F. Vallina, H. Styles, and P. Chow (2015). OpenCL library of stream memory components targeting FPGAs. 2015 International Conference on Field Programmable Technology (FPT), pages 104-111. Online publication date: Dec-2015. https://doi.org/10.1109/FPT.2015.7393134

J. Wingbermuehle, R. Cytron, and R. Chamberlain (2015). Superoptimizing Memory Subsystems for Multiple Objectives. Euro-Par 2015: Parallel Processing Workshops, pages 352-363. Online publication date: 18-Dec-2015. https://doi.org/10.1007/978-3-319-27308-2_29
