DOI: 10.1145/2540708.2540725
research-article

RowClone: fast and energy-efficient in-DRAM bulk data copy and initialization

Published: 07 December 2013

Abstract

Several system-level operations trigger bulk data copy or initialization. Even though these bulk data operations do not require any computation, current systems transfer a large quantity of data back and forth on the memory channel to perform them. As a result, bulk data operations incur high latency and consume significant bandwidth and energy, degrading both system performance and energy efficiency.
In this work, we propose RowClone, a new and simple mechanism that performs bulk copy and initialization completely within DRAM, eliminating the need to transfer any data over the memory channel for such operations. Our key observation is that DRAM can internally and efficiently transfer a large quantity of data (multiple KBs) between a row of DRAM cells and the associated row buffer. Based on this, our primary mechanism quickly copies an entire row of data from a source row to a destination row by first copying the data from the source row to the row buffer and then from the row buffer to the destination row, via two back-to-back activate commands. This mechanism, which we call the Fast Parallel Mode of RowClone, reduces the latency and energy consumption of a 4KB bulk copy operation by 11.6x and 74.4x, respectively, and of a 4KB bulk zeroing operation by 6.0x and 41.5x, respectively. To efficiently copy data between rows that do not share a row buffer, we propose a second mode of RowClone, the Pipelined Serial Mode, which uses the shared internal bus of a DRAM chip to quickly copy data between two banks. RowClone requires only a 0.01% increase in DRAM chip area.
We quantitatively evaluate the benefits of RowClone by focusing on fork, a frequently invoked system call, and five other copy- and initialization-intensive applications. Our results show that RowClone can significantly improve both single-core and multi-core system performance, while also significantly reducing main memory bandwidth and energy consumption.
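
The Fast Parallel Mode described in the abstract can be illustrated with a toy software model. The sketch below is not the authors' implementation and models no timing or energy. The names ToyBank, copy_fpm, and copy_via_channel, the 4096-byte row size, and the explicit precharge steps are assumptions made for illustration. It only shows the data path: a first activate latches the source row into the row buffer, and a second activate issued without an intervening precharge writes the buffer into the destination row, so no bytes cross the memory channel.

```python
ROW_BYTES = 4096  # assumed row size; the abstract only says "multiple KBs"


class ToyBank:
    """Toy model of DRAM rows that share one row buffer (not a timing model)."""

    def __init__(self, num_rows, row_bytes=ROW_BYTES):
        self.rows = [bytearray(row_bytes) for _ in range(num_rows)]
        self.row_buffer = None  # None means precharged: no row is open
        self.channel_bytes = 0  # bytes that crossed the memory channel

    def precharge(self):
        """Close the open row and clear the row buffer."""
        self.row_buffer = None

    def activate(self, row):
        if self.row_buffer is None:
            # Normal activation: the row's contents are latched into the buffer.
            self.row_buffer = bytearray(self.rows[row])
        else:
            # Back-to-back activation (assumed behavior of this toy model):
            # the buffer still holds the previous row and overwrites this one.
            self.rows[row][:] = self.row_buffer

    def read_row(self, row):
        """Conventional read: the whole row travels over the channel."""
        self.precharge()
        self.activate(row)
        data = bytes(self.row_buffer)
        self.channel_bytes += len(data)
        self.precharge()
        return data

    def write_row(self, row, data):
        """Conventional write: the whole row travels over the channel."""
        self.precharge()
        self.activate(row)
        self.row_buffer[:] = data
        self.rows[row][:] = self.row_buffer
        self.channel_bytes += len(data)
        self.precharge()


def copy_via_channel(bank, src, dst):
    """Baseline copy: read the source row out, then write it back in."""
    bank.write_row(dst, bank.read_row(src))


def copy_fpm(bank, src, dst):
    """Fast Parallel Mode sketch: two back-to-back activates, no channel traffic."""
    bank.precharge()
    bank.activate(src)  # source row -> row buffer
    bank.activate(dst)  # row buffer -> destination row
    bank.precharge()
```

In this model, copying one row with copy_fpm leaves channel_bytes unchanged, while copy_via_channel adds twice the row size (one read plus one write). That traffic difference is the effect the abstract attributes to keeping the copy inside DRAM; the latency and energy figures quoted above come from the paper's evaluation, not from this sketch.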



Published In

MICRO-46: Proceedings of the 46th Annual IEEE/ACM International Symposium on Microarchitecture
December 2013
498 pages
ISBN: 9781450326384
DOI: 10.1145/2540708
Publisher

Association for Computing Machinery

New York, NY, United States

Author Tags

  1. DRAM
  2. bulk operations
  3. energy
  4. in-memory processing
  5. memory bandwidth
  6. page copy
  7. page initialization
  8. performance

Qualifiers

  • Research-article

Conference

MICRO-46

Acceptance Rates

MICRO-46 Paper Acceptance Rate 39 of 239 submissions, 16%;
Overall Acceptance Rate 484 of 2,242 submissions, 22%

Cited By

  • (2025) Marionette: A RowHammer Attack via Row Coupling. Proceedings of the 30th ACM International Conference on Architectural Support for Programming Languages and Operating Systems, Volume 1 (637-652). DOI: 10.1145/3669940.3707242. Online publication date: 30-Mar-2025
  • (2025) STAR: A Mixed Analog Stochastic In-DRAM Convolutional Neural Network Accelerator. IEEE Design & Test 42:1 (47-55). DOI: 10.1109/MDAT.2024.3447580. Online publication date: Feb-2025
  • (2024) PIMCoSim: Hardware/Software Co-Simulator for Exploring Processing-in-Memory Architectures. Electronics 13:23 (4795). DOI: 10.3390/electronics13234795. Online publication date: 5-Dec-2024
  • (2024) Memory Scraping Attack on Xilinx FPGAs: Private Data Extraction from Terminated Processes. 2024 Design, Automation & Test in Europe Conference & Exhibition (DATE) (1-6). DOI: 10.23919/DATE58400.2024.10546527. Online publication date: 25-Mar-2024
  • (2024) PIMSys: A Virtual Prototype for Processing in Memory. Proceedings of the International Symposium on Memory Systems (26-33). DOI: 10.1145/3695794.3695797. Online publication date: 30-Sep-2024
  • (2024) Sectored DRAM: A Practical Energy-Efficient and High-Performance Fine-Grained DRAM Architecture. ACM Transactions on Architecture and Code Optimization. DOI: 10.1145/3673653. Online publication date: 14-Jun-2024
  • (2024) Energy Harvesting-assisted Ultra-Low-Power Processing-in-Memory Accelerator for ML Applications. Proceedings of the Great Lakes Symposium on VLSI 2024 (633-638). DOI: 10.1145/3649476.3660392. Online publication date: 12-Jun-2024
  • (2024) SHERLOCK: Scheduling Efficient and Reliable Bulk Bitwise Operations in NVMs. Proceedings of the 61st ACM/IEEE Design Automation Conference (1-6). DOI: 10.1145/3649329.3658485. Online publication date: 23-Jun-2024
  • (2024) H3DM: A High-bandwidth High-capacity Hybrid 3D Memory Design for GPUs. Proceedings of the ACM on Measurement and Analysis of Computing Systems 8:1 (1-28). DOI: 10.1145/3639038. Online publication date: 21-Feb-2024
  • (2024) A Quantitative Analysis and Guidelines of Data Streaming Accelerator in Modern Intel Xeon Scalable Processors. Proceedings of the 29th ACM International Conference on Architectural Support for Programming Languages and Operating Systems, Volume 2 (37-54). DOI: 10.1145/3620665.3640401. Online publication date: 27-Apr-2024
