Superoptimized Memory Subsystems for Streaming Applications

Published: 22 February 2015

Abstract

Because main memory is many times slower than modern processor cores, deep multi-level cache hierarchies are ubiquitous in computers today. Similarly, applications deployed on ASICs and FPGAs are often hindered by slow external memories. To achieve good performance, hardware designers must therefore optimize main memory usage. Unfortunately, this process is often labor-intensive and fails to explore the full range of potential memory designs. To address this issue for applications expressed in a streaming manner, we show that it is possible to automatically generate a superoptimized memory subsystem that can be deployed on an FPGA and that outperforms a general-purpose memory subsystem. Rather than exploring only simple memory subsystems, our superoptimizer can explore extremely complex designs consisting of multi-level caches and other components. Finally, we show that applications can be deployed with superoptimized memory subsystems with minimal additional effort while achieving significant performance improvements over a naive memory subsystem.
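To make the idea of searching over memory designs concrete, the following is a minimal sketch and not the method described in the paper. It assumes a toy direct-mapped cache model, a made-up cost model (1 cycle per hit, 100 cycles plus one cycle per 4-byte word per miss), and a plain hill climb over two parameters. The names `simulate`, `search`, and `budget_bytes` are hypothetical and chosen only for illustration.

```python
import random


def simulate(trace, line_bytes, num_lines, hit_cost=1, miss_base=100):
    """Return total access time of a direct-mapped cache over an address trace.

    Cost model (an assumption for illustration): a hit costs hit_cost cycles;
    a miss costs miss_base plus one cycle per 4-byte word in the fetched line.
    """
    tags = [None] * num_lines
    total = 0
    for addr in trace:
        block = addr // line_bytes      # which memory block holds this address
        index = block % num_lines       # direct-mapped: one candidate slot
        if tags[index] == block:
            total += hit_cost
        else:
            tags[index] = block         # replace whatever was in the slot
            total += miss_base + line_bytes // 4
    return total


def search(trace, budget_bytes, steps=200, seed=0):
    """Hill-climb over (line_bytes, num_lines) within a capacity budget."""
    rng = random.Random(seed)
    best = (4, 1)                       # smallest configuration as the start
    best_cost = simulate(trace, *best)
    for _ in range(steps):
        line_bytes, num_lines = best
        grow = rng.random() < 0.5
        if rng.random() < 0.5:          # mutate the line size
            line_bytes = line_bytes * 2 if grow else max(4, line_bytes // 2)
        else:                           # mutate the number of lines
            num_lines = num_lines * 2 if grow else max(1, num_lines // 2)
        if line_bytes * num_lines > budget_bytes:
            continue                    # candidate exceeds the on-chip budget
        cost = simulate(trace, line_bytes, num_lines)
        if cost <= best_cost:           # never accept a worse design
            best, best_cost = (line_bytes, num_lines), cost
    return best, best_cost


# A purely sequential stream of 4-byte words, as a stand-in for a streaming kernel.
trace = [4 * i for i in range(1000)]
best, cost = search(trace, budget_bytes=1024)
print(best, cost)
```

On a sequential trace like this one, larger lines amortize the miss penalty over more words, so the search tends to move toward wider lines. A real superoptimizer would search a much richer space of components and would evaluate candidates against the application's actual memory behavior.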

      Published In

      FPGA '15: Proceedings of the 2015 ACM/SIGDA International Symposium on Field-Programmable Gate Arrays
      February 2015
      292 pages
      ISBN: 9781450333153
      DOI: 10.1145/2684746
      Publisher

      Association for Computing Machinery

      New York, NY, United States

      Author Tags

      1. cache
      2. fpga
      3. memory subsystem
      4. streaming
      5. superoptimization

      Qualifiers

      • Research-article

      Funding Sources

      • VelociData, Inc.
      • Exegy, Inc.

      Conference

      FPGA '15
      Acceptance Rates

      FPGA '15 Paper Acceptance Rate 20 of 102 submissions, 20%;
      Overall Acceptance Rate 125 of 627 submissions, 20%

      Cited By

      • (2017) Background. Separation Logic for High-level Synthesis, pp. 35-55. DOI: 10.1007/978-3-319-53222-6_3. Online publication date: 28-Feb-2017
      • (2015) Custom-sized caches in application-specific memory hierarchies. 2015 International Conference on Field Programmable Technology (FPT), pp. 144-151. DOI: 10.1109/FPT.2015.7393141. Online publication date: Dec-2015
      • (2015) OpenCL library of stream memory components targeting FPGAs. 2015 International Conference on Field Programmable Technology (FPT), pp. 104-111. DOI: 10.1109/FPT.2015.7393134. Online publication date: Dec-2015
      • (2015) Superoptimizing Memory Subsystems for Multiple Objectives. Euro-Par 2015: Parallel Processing Workshops, pp. 352-363. DOI: 10.1007/978-3-319-27308-2_29. Online publication date: 18-Dec-2015
