ABSTRACT
In high-level synthesis, the performance of pipelined designs is often limited by the number of memory banks available to the synthesis system. Using multiple memory banks can improve the performance of accelerated applications, but programmers currently must assign data structures to specific memory banks on the accelerator by hand. This paper describes Automatic Memory Partitioning, a method for automatically partitioning data structures across multiple memory banks to increase parallelism and performance. We use source code instrumentation to collect memory traces and detect linear memory access patterns. The memory traces are used to split data structures into disjoint memory regions and to determine which segments may benefit from parallel memory access. We present an ILP-based (integer linear programming) algorithm for allocating memory segments to memory banks. Experiments show significant improvements in performance while using a minimal number of memory banks.
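The two steps named in the abstract can be illustrated with a minimal sketch. It is not the paper's implementation: the function names `linear_pattern` and `assign_banks`, the trace format (a list of integer addresses per access site), and the cost model (one unit per pair of parallel-accessed segments that share a bank) are all assumptions made for illustration. In particular, the exhaustive search below stands in for the ILP formulation, which the abstract does not spell out.

```python
from itertools import product


def linear_pattern(addresses):
    """Return (base, stride) if addresses[i] == base + i * stride for all i.

    Returns None when the trace is too short or not linear.
    Hypothetical helper: the trace format is an assumption.
    """
    if len(addresses) < 2:
        return None
    base = addresses[0]
    stride = addresses[1] - addresses[0]
    for i, addr in enumerate(addresses):
        if addr != base + i * stride:
            return None
    return base, stride


def assign_banks(num_segments, num_banks, parallel_pairs):
    """Assign each segment to a bank, minimizing same-bank conflicts.

    parallel_pairs lists (a, b) segment indices that would benefit from
    being accessed in the same cycle. This is a brute-force search used
    as a stand-in for an ILP solver; it is only practical for tiny inputs.
    """
    best, best_cost = None, None
    for assignment in product(range(num_banks), repeat=num_segments):
        # A conflict occurs when two parallel-accessed segments share a bank.
        cost = sum(1 for a, b in parallel_pairs if assignment[a] == assignment[b])
        if best_cost is None or cost < best_cost:
            best, best_cost = assignment, cost
    return best, best_cost


if __name__ == "__main__":
    print(linear_pattern([100, 104, 108, 112]))
    print(assign_banks(3, 2, [(0, 1), (1, 2)]))
```

With three segments where segment 1 is accessed alongside both 0 and 2, two banks are enough to remove every conflict in this toy model, which mirrors the abstract's goal of using few banks.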
Index Terms
- Automatic memory partitioning: increasing memory parallelism via data structure partitioning