Ambit: in-memory accelerator for bulk bitwise operations using commodity DRAM technology

Authors:
Vivek Seshadri (Microsoft Research India and Carnegie Mellon University)
Donghyuk Lee (NVIDIA Research and Carnegie Mellon University)
Thomas Mullins (Intel and Carnegie Mellon University)
Hasan Hassan (ETH Zürich)
Amirali Boroumand (Carnegie Mellon University)
Jeremie Kim (ETH Zürich and Carnegie Mellon University)
Michael A. Kozuch (Intel)
Onur Mutlu (ETH Zürich and Carnegie Mellon University)
Phillip B. Gibbons (Carnegie Mellon University)
Todd C. Mowry (Carnegie Mellon University)

MICRO-50 '17: Proceedings of the 50th Annual IEEE/ACM International Symposium on Microarchitecture, October 2017, Pages 273–287, https://doi.org/10.1145/3123939.3124544

Published: 14 October 2017

ABSTRACT

Many important applications trigger bulk bitwise operations, i.e., bitwise operations on large bit vectors. In fact, recent works design techniques that exploit fast bulk bitwise operations to accelerate databases (bitmap indices, BitWeaving) and web search (BitFunnel). Unfortunately, in existing architectures, the throughput of bulk bitwise operations is limited by the memory bandwidth available to the processing unit (e.g., CPU, GPU, FPGA, processing-in-memory).
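To make "bulk bitwise operation" concrete, here is a minimal sketch of a bitmap-index query, using Python's arbitrary-precision integers as bit vectors. The table, column values, and the helpers `bitmap_for` and `rows_from_bitmap` are hypothetical; the code runs on the CPU and only illustrates the kind of operation that Ambit proposes to execute inside DRAM.

```python
from typing import Sequence


def bitmap_for(values: Sequence[str], target: str) -> int:
    """Build a bitmap in which bit i is set when row i holds `target`."""
    bits = 0
    for i, v in enumerate(values):
        if v == target:
            bits |= 1 << i
    return bits


def rows_from_bitmap(bits: int) -> list[int]:
    """Return the ids of rows whose bit is set, in ascending order."""
    rows = []
    i = 0
    while bits:
        if bits & 1:
            rows.append(i)
        bits >>= 1
        i += 1
    return rows


# Hypothetical table with two low-cardinality columns (one entry per row).
country = ["IN", "US", "IN", "CH", "IN", "US"]
plan = ["pro", "free", "free", "pro", "pro", "pro"]

# One bitmap per (column, value) pair.
in_bitmap = bitmap_for(country, "IN")   # rows 0, 2, 4
pro_bitmap = bitmap_for(plan, "pro")    # rows 0, 3, 4, 5

# WHERE country = 'IN' AND plan = 'pro' reduces to one bitwise AND over
# the two bit vectors; with millions of rows this becomes a bulk operation.
result = in_bitmap & pro_bitmap
print(rows_from_bitmap(result))  # [0, 4]
```

Because every bit of the result depends only on the same bit position of the inputs, the work is limited by how fast the bit vectors can be streamed, which is the memory-bandwidth bottleneck the abstract describes.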

To overcome this bottleneck, we propose Ambit, an Accelerator-in-Memory for bulk bitwise operations. Unlike prior works, Ambit exploits the analog operation of DRAM technology to perform bitwise operations completely inside DRAM, thereby exploiting the full internal DRAM bandwidth. Ambit consists of two components. First, simultaneous activation of three DRAM rows that share the same set of sense amplifiers enables the system to perform bitwise AND and OR operations. Second, with modest changes to the sense amplifier, the system can use the inverters present inside the sense amplifier to perform bitwise NOT operations. With these two components, Ambit can perform any bulk bitwise operation efficiently inside DRAM. Ambit largely exploits existing DRAM structure, and hence incurs low cost on top of commodity DRAM designs (1% of DRAM chip area). Importantly, Ambit uses the modern DRAM interface without any changes, and therefore it can be directly plugged onto the memory bus.
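The full paper explains the triple-row activation as a bitwise majority: each sense amplifier settles to the majority value of the three cells on its bitline, so MAJ(A, B, C) = AB + BC + CA. Fixing the third row to all 0s yields AND, fixing it to all 1s yields OR, and adding NOT makes the set functionally complete. The sketch below is a purely behavioral model under stated assumptions: `WIDTH`, the control rows `ZEROS`/`ONES`, and all function names are hypothetical, and it ignores timing, charge sharing, and how operands are staged into the rows that get activated.

```python
WIDTH = 16                  # hypothetical row width in bits (real DRAM rows are far wider)
MASK = (1 << WIDTH) - 1
ZEROS, ONES = 0, MASK       # hypothetical pre-initialized control rows


def triple_row_activate(a: int, b: int, c: int) -> int:
    """Behavioral model: every bit position resolves to MAJ = AB + BC + CA."""
    return ((a & b) | (b & c) | (c & a)) & MASK


def bulk_and(a: int, b: int) -> int:
    # MAJ(A, B, 0) = A AND B
    return triple_row_activate(a, b, ZEROS)


def bulk_or(a: int, b: int) -> int:
    # MAJ(A, B, 1) = A OR B
    return triple_row_activate(a, b, ONES)


def bulk_not(a: int) -> int:
    # Models reading the complemented value via the sense amplifier's inverter.
    return ~a & MASK


def bulk_xor(a: int, b: int) -> int:
    # Any other operation is composed from AND/OR/NOT, e.g.
    # A XOR B = (A AND NOT B) OR (NOT A AND B).
    return bulk_or(bulk_and(a, bulk_not(b)), bulk_and(bulk_not(a), b))


a = 0b1100_1010_0000_1111
b = 0b1010_0110_1111_0000
assert bulk_and(a, b) == a & b
assert bulk_or(a, b) == a | b
assert bulk_xor(a, b) == a ^ b
```

Composing XOR this way takes several primitive steps, which is one reason the measured speedup varies across the seven bulk bitwise operations the abstract averages over.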

Our extensive circuit simulations show that Ambit works as expected even in the presence of significant process variation. Averaged across seven bulk bitwise operations, Ambit improves performance by 32X and reduces energy consumption by 35X compared to state-of-the-art systems. When integrated with Hybrid Memory Cube (HMC), a 3D-stacked DRAM with a logic layer, Ambit improves performance of bulk bitwise operations by 9.7X compared to processing in the logic layer of the HMC. Ambit improves the performance of three real-world data-intensive applications, 1) database bitmap indices, 2) BitWeaving, a technique to accelerate database scans, and 3) bit-vector-based implementation of sets, by 3X-7X compared to a state-of-the-art baseline using SIMD optimizations. We describe four other applications that can benefit from Ambit, including a recent technique proposed to speed up web search. We believe that large performance and energy improvements provided by Ambit can enable other applications to use bulk bitwise operations.
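The third application listed above, bit-vector-based sets, maps set algebra directly onto bulk bitwise operations: union is OR, intersection is AND, and difference is AND with NOT. The `BitSet` class below is a hypothetical illustration, not the paper's benchmark code, and it assumes both operands share the same universe size.

```python
from typing import Iterable


class BitSet:
    """Set of integers in [0, universe) stored as a single bit vector."""

    def __init__(self, universe: int, items: Iterable[int] = ()) -> None:
        self.universe = universe
        self.bits = 0
        for x in items:
            if not 0 <= x < universe:
                raise ValueError(f"{x} outside [0, {universe})")
            self.bits |= 1 << x

    def _new(self, bits: int) -> "BitSet":
        out = BitSet(self.universe)
        out.bits = bits
        return out

    def union(self, other: "BitSet") -> "BitSet":
        return self._new(self.bits | other.bits)           # bulk OR

    def intersection(self, other: "BitSet") -> "BitSet":
        return self._new(self.bits & other.bits)           # bulk AND

    def difference(self, other: "BitSet") -> "BitSet":
        mask = (1 << self.universe) - 1
        return self._new(self.bits & (~other.bits & mask))  # AND with NOT

    def to_list(self) -> list[int]:
        return [i for i in range(self.universe) if (self.bits >> i) & 1]


a = BitSet(16, [1, 3, 5, 7, 9])
b = BitSet(16, [3, 4, 5, 6])
print(a.union(b).to_list())         # [1, 3, 4, 5, 6, 7, 9]
print(a.intersection(b).to_list())  # [3, 5]
print(a.difference(b).to_list())    # [1, 7, 9]
```

The cost of each operation scales with the universe size rather than with the number of elements, so this representation pays off when sets are dense relative to the universe and the bitwise operations themselves are cheap.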

Index Terms

1. Computer systems organization
  1. Architectures
    1. Parallel architectures
      1. Single instruction, multiple data
2. Hardware
  1. Integrated circuits
    1. Reconfigurable logic and FPGAs
      1. Hardware accelerators
    2. Semiconductor memory
      1. Dynamic memory

Published in
MICRO-50 '17: Proceedings of the 50th Annual IEEE/ACM International Symposium on Microarchitecture
October 2017
850 pages
ISBN: 9781450349529
DOI: 10.1145/3123939
General Chairs: Hillery Hunter (IBM Research), Jaime Moreno (IBM Research)
Program Chairs: Joel Emer (NVIDIA and MIT), Daniel Sanchez (MIT)
Copyright © 2017 ACM
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from permissions@acm.org.
Publisher
Association for Computing Machinery
New York, NY, United States
Author Tags
DRAM
bulk bitwise operations
databases
energy
memory bandwidth
performance
processing-in-memory
Qualifiers
- research-article
Acceptance Rates
Overall Acceptance Rate: 484 of 2,242 submissions, 22%

Article Metrics
- 252 Total Citations
- 2,847 Total Downloads
- Downloads (Last 12 months): 996
- Downloads (Last 6 weeks): 151
