DOI:10.1145/3087556.3087582
research-article

Concurrent Data Structures for Near-Memory Computing

Published: 24 July 2017

Abstract

The performance gap between memory and CPU has grown exponentially. To bridge this gap, hardware architects have proposed near-memory computing (also called processing-in-memory, or PIM), where a lightweight processor (called a PIM core) is located close to memory. Due to its proximity to memory, a memory access from a PIM core is much faster than that from a CPU core. New advances in 3D integration and die-stacked memory make PIM viable in the near future. Prior work has shown significant performance improvements by using PIM for embarrassingly parallel and data-intensive applications, as well as for pointer-chasing traversals in sequential data structures. However, current server machines have hundreds of cores, and algorithms for concurrent data structures exploit these cores to achieve high throughput and scalability, with significant benefits over sequential data structures. Thus, it is important to examine how PIM performs with respect to modern concurrent data structures and understand how concurrent data structures can be developed to take advantage of PIM.
This paper is the first to examine the design of concurrent data structures for PIM. We show two main results: (1) naive PIM data structures cannot outperform state-of-the-art concurrent data structures, such as pointer-chasing data structures and FIFO queues; and (2) novel designs for PIM data structures, using techniques such as combining, partitioning, and pipelining, can outperform traditional concurrent data structures, with a significantly simpler design.
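To make two of the techniques named above concrete, here is a minimal Python sketch of partitioning and combining. It is an illustration under stated assumptions, not the paper's algorithm: ordinary threads stand in for PIM cores, a `queue.Queue` stands in for the CPU-to-PIM message channel, and the names `Partition` and `PartitionedSet` are hypothetical. It says nothing about performance; it only shows the shape of the design, in which each key range is owned by exactly one worker, so the data itself needs no locks.

```python
import queue
import threading
from bisect import bisect_right


class Partition:
    """One key range, owned by a single worker thread (stand-in for a PIM core)."""

    def __init__(self):
        self.keys = set()            # touched only by the owning worker
        self.inbox = queue.Queue()   # requests sent by client ("CPU") threads
        threading.Thread(target=self._serve, daemon=True).start()

    def _serve(self):
        while True:
            batch = [self.inbox.get()]   # block until a request arrives
            while True:                  # combining: drain everything already pending
                try:
                    batch.append(self.inbox.get_nowait())
                except queue.Empty:
                    break
            # Serve the whole batch sequentially; no other thread sees self.keys.
            for op, key, reply in batch:
                if op == "add":
                    result = key not in self.keys
                    self.keys.add(key)
                elif op == "remove":
                    result = key in self.keys
                    self.keys.discard(key)
                else:                    # "contains"
                    result = key in self.keys
                reply.put(result)


class PartitionedSet:
    """Set of integers split into key ranges, one Partition per range."""

    def __init__(self, boundaries):
        # boundaries [100, 200] gives three ranges: < 100, 100..199, >= 200
        self.boundaries = sorted(boundaries)
        self.partitions = [Partition() for _ in range(len(self.boundaries) + 1)]

    def _call(self, op, key):
        # Partitioning: route the request to the worker that owns the key's range.
        part = self.partitions[bisect_right(self.boundaries, key)]
        reply = queue.Queue(maxsize=1)
        part.inbox.put((op, key, reply))
        return reply.get()               # wait for the owner's answer

    def add(self, key):
        return self._call("add", key)

    def remove(self, key):
        return self._call("remove", key)

    def contains(self, key):
        return self._call("contains", key)
```

A client would create `PartitionedSet([100, 200])` and call `add`, `remove`, and `contains` from any number of threads. Requests for different ranges proceed in parallel on different workers, while requests for the same range are batched and applied one after another by its owner.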

Cited By

  • (2024) PyGim: An Efficient Graph Neural Network Library for Real Processing-In-Memory Architectures. Proceedings of the ACM on Measurement and Analysis of Computing Systems 8:3 (1-36). DOI:10.1145/3700434. Online publication date: 13-Dec-2024
  • (2024) PIM-STM: Software Transactional Memory for Processing-In-Memory Systems. Proceedings of the 29th ACM International Conference on Architectural Support for Programming Languages and Operating Systems, Volume 2 (897-911). DOI:10.1145/3620665.3640428. Online publication date: 27-Apr-2024
  • (2024) RADAR: A Skew-Resistant and Hotness-Aware Ordered Index Design for Processing-in-Memory Systems. IEEE Transactions on Parallel and Distributed Systems 35:9 (1598-1614). DOI:10.1109/TPDS.2024.3424853. Online publication date: Sep-2024

Information & Contributors

Information

Published In

SPAA '17: Proceedings of the 29th ACM Symposium on Parallelism in Algorithms and Architectures
July 2017
392 pages
ISBN:9781450345934
DOI:10.1145/3087556
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

Publisher

Association for Computing Machinery

New York, NY, United States

Author Tags

  1. concurrent data structures
  2. near-memory computing
  3. parallel programs
  4. processing-in-memory

Qualifiers

  • Research-article

Conference

SPAA '17

Acceptance Rates

SPAA '17 Paper Acceptance Rate 31 of 127 submissions, 24%;
Overall Acceptance Rate 447 of 1,461 submissions, 31%

Bibliometrics & Citations

Bibliometrics

Article Metrics

  • Downloads (Last 12 months): 89
  • Downloads (Last 6 weeks): 12
Reflects downloads up to 17 Jan 2025

Citations

Cited By

  • (2024) AIO: An Abstraction for Performance Analysis Across Diverse Accelerator Architectures. 2024 ACM/IEEE 51st Annual International Symposium on Computer Architecture (ISCA) (487-500). DOI:10.1109/ISCA59077.2024.00043. Online publication date: 29-Jun-2024
  • (2024) MATSA: An MRAM-Based Energy-Efficient Accelerator for Time Series Analysis. IEEE Access 12 (36727-36742). DOI:10.1109/ACCESS.2024.3373311. Online publication date: 2024
  • (2023) ESH: Design and Implementation of an Optimal Hashing Scheme for Persistent Memory. Applied Sciences 13:20 (11528). DOI:10.3390/app132011528. Online publication date: 20-Oct-2023
  • (2023) PIM-tree: A Skew-resistant Index for Processing-in-Memory (Abstract). Proceedings of the 2023 ACM Workshop on Highlights of Parallel Computing (13-14). DOI:10.1145/3597635.3598029. Online publication date: 18-Jul-2023
  • (2023) Fabric-Centric Computing. Proceedings of the 19th Workshop on Hot Topics in Operating Systems (118-126). DOI:10.1145/3593856.3595907. Online publication date: 22-Jun-2023
  • (2023) Evaluating Machine Learning Workloads on Memory-Centric Computing Systems. 2023 IEEE International Symposium on Performance Analysis of Systems and Software (ISPASS) (35-49). DOI:10.1109/ISPASS57527.2023.00013. Online publication date: Apr-2023
  • (2023) DIMM-Link: Enabling Efficient Inter-DIMM Communication for Near-Memory Processing. 2023 IEEE International Symposium on High-Performance Computer Architecture (HPCA) (302-316). DOI:10.1109/HPCA56546.2023.10071005. Online publication date: Feb-2023