research-article

Automatic memory partitioning: increasing memory parallelism via data structure partitioning

Authors:
Yosi Ben-Asher
Haifa University, Haifa, Israel

Nadav Rotem
Haifa University, Haifa, Israel

CODES/ISSS '10: Proceedings of the eighth IEEE/ACM/IFIP international conference on Hardware/software codesign and system synthesis, October 2010, Pages 155–162. https://doi.org/10.1145/1878961.1878989

Published: 24 October 2010

ABSTRACT

In high-level synthesis, pipelined designs are often restricted by the number of memory banks available to the synthesis system. Using multiple memory banks can improve the performance of accelerated applications. Currently, programmers must manually assign data structures to specific memory banks on the accelerator. This paper describes Automatic Memory Partitioning, a method for automatically partitioning data structures into multiple memory banks for increased parallelism and performance. We use source code instrumentation to collect memory traces in order to detect linear memory access patterns. The memory traces are used to split data structures into disjoint memory regions and to determine which segments may benefit from parallel memory access. We present an ILP-based algorithm for allocating memory segments into multiple memory banks. Experiments show significant improvements in performance while using a minimal number of memory banks.
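
The two steps named in the abstract, detecting linear access patterns in a trace and allocating segments to banks, can be illustrated with a minimal Python sketch. This is not the authors' implementation: the function names `linear_pattern` and `assign_banks`, the trace format (a plain list of addresses), and the conflict list are all hypothetical, and an exhaustive search stands in for the ILP solver the paper describes.

```python
from itertools import product


def linear_pattern(addresses):
    """Return (base, stride) if the trace is base + i * stride, else None.

    Hypothetical helper: the trace is assumed to be a list of integer
    addresses issued by a single memory instruction.
    """
    if len(addresses) < 2:
        return None
    stride = addresses[1] - addresses[0]
    for prev, cur in zip(addresses, addresses[1:]):
        if cur - prev != stride:
            return None
    return addresses[0], stride


def assign_banks(segments, conflicts, num_banks):
    """Map each segment to a bank, minimizing conflicting pairs per bank.

    `conflicts` lists pairs of segments that would benefit from being
    accessed in parallel (an assumption of this sketch). Brute force is
    used here in place of an ILP solver, so it only suits tiny inputs.
    """
    best, best_cost = None, None
    for choice in product(range(num_banks), repeat=len(segments)):
        mapping = dict(zip(segments, choice))
        # Cost: number of conflicting pairs forced to share one bank.
        cost = sum(1 for a, b in conflicts if mapping[a] == mapping[b])
        if best_cost is None or cost < best_cost:
            best, best_cost = mapping, cost
    return best, best_cost


if __name__ == "__main__":
    print(linear_pattern([0, 4, 8, 12]))   # strided trace
    print(linear_pattern([0, 4, 9]))       # irregular trace
    print(assign_banks(["a", "b", "c"], [("a", "b"), ("b", "c")], 2))
```

With two banks, segments `a` and `c` can share a bank while `b` sits alone, so both conflicting pairs can be accessed in parallel; with a single bank, every conflict remains.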

Index Terms

1. Hardware
  1. Electronic design automation
    1. High-level and register-transfer level synthesis
  2. Hardware validation

Published in
CODES/ISSS '10: Proceedings of the eighth IEEE/ACM/IFIP international conference on Hardware/software codesign and system synthesis
October 2010
348 pages
ISBN: 9781605589053
DOI: 10.1145/1878961
Program Chairs: Tony Givargis (University of California, Irvine, CA), Adam Donlin (Xilinx, USA)
Copyright © 2010 ACM
Publisher
Association for Computing Machinery
New York, NY, United States
Author Tags
fpga
memory
parallelism
Acceptance Rates
Overall Acceptance Rate: 280 of 864 submissions, 32%