research-article

Bounded incoherence: a programming model for non-cache-coherent shared memory architectures

Authors:
Yuxin Ren, The George Washington University
Gabriel Parmer, The George Washington University
Dejan Milojicic, Hewlett Packard Labs

PMAM '20: Proceedings of the Eleventh International Workshop on Programming Models and Applications for Multicores and Manycores, February 2020, Article No.: 1, Pages 1–10, https://doi.org/10.1145/3380536.3380541

Published: 22 February 2020


ABSTRACT

Cache coherence in modern computer architectures makes programming easier by keeping data shared across multiple processors consistent. Unfortunately, it can also limit scalability because competing memory accesses generate cache coherency traffic. Rack-scale systems introduce shared memory across a whole rack, but without inter-node cache coherence. This poses memory management and concurrency control challenges for applications, which must explicitly manage cache lines. To fully utilize rack-scale systems for low-latency and scalable computation, applications need to maintain cached memory accesses in spite of non-coherency.

This paper introduces Bounded Incoherence, a programming and memory consistency model that enables cached access to shared data structures in non-cache-coherent memory. It ensures that updates to memory on one node become visible on all other nodes within a bounded amount of time. We evaluate this memory model on a modified PowerGraph graph processing framework and improve its performance by 30% on eight sockets by enabling cached access to data structures.
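The bounded-visibility guarantee described above can be illustrated with a small simulation. The sketch below is not the authors' implementation: it assumes a logical clock measured in ticks, a write-through policy, and a per-node cache whose entries are refetched once they reach a hypothetical age limit `bound`. The names `SharedMemory`, `Node`, and `bound` are invented for illustration only.

```python
from dataclasses import dataclass, field


@dataclass
class SharedMemory:
    """Stands in for rack-wide memory that every node can reach."""
    cells: dict = field(default_factory=dict)

    def write(self, key, value):
        self.cells[key] = value

    def read(self, key):
        return self.cells[key]


@dataclass
class Node:
    """A node with a private, non-coherent cache of shared memory."""
    memory: SharedMemory
    bound: int  # hypothetical staleness bound, in logical ticks
    cache: dict = field(default_factory=dict)  # key -> (value, tick fetched)

    def read(self, key, now):
        entry = self.cache.get(key)
        # Serve from the cache only while the entry is younger than the bound.
        if entry is not None and now - entry[1] < self.bound:
            return entry[0]
        # Otherwise discard the stale copy and refetch from shared memory.
        value = self.memory.read(key)
        self.cache[key] = (value, now)
        return value

    def write(self, key, value, now):
        # Write through so other nodes observe the update on their next refetch.
        self.memory.write(key, value)
        self.cache[key] = (value, now)


mem = SharedMemory()
a = Node(mem, bound=3)
b = Node(mem, bound=3)

a.write("x", 1, now=0)
first = b.read("x", now=0)    # b caches x at tick 0
a.write("x", 2, now=1)        # no coherence traffic reaches b
stale = b.read("x", now=2)    # still served from b's cache
fresh = b.read("x", now=3)    # cached copy reached the bound, so b refetches
```

In this toy model a reader may see a stale value, but never one older than `bound` ticks, which is the trade the abstract describes: cached reads without coherence traffic, in exchange for bounded delay in seeing remote updates.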


Index Terms

1. Computer systems organization
  1. Architectures
    1. Parallel architectures
      1. Multicore architectures
2. Computing methodologies
  1. Parallel computing methodologies


Published in
PMAM '20: Proceedings of the Eleventh International Workshop on Programming Models and Applications for Multicores and Manycores
February 2020
85 pages
ISBN:9781450375221
DOI:10.1145/3380536
Editors:
Quan Chen, Shanghai Jiao Tong University, China
Zhiyi Huang, University of Otago, New Zealand
Min Si, Argonne National Laboratory
Copyright © 2020 ACM
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]
Publisher
Association for Computing Machinery
New York, NY, United States
Author Tags
non-cache-coherent shared memory
rack-scale architectures
scalability
Acceptance Rates
PMAM '20 Paper Acceptance Rate: 8 of 15 submissions, 53%
Overall Acceptance Rate: 53 of 97 submissions, 55%
Article Metrics
- Total Citations: 2
- Total Downloads: 240
- Downloads (Last 12 months): 48
- Downloads (Last 6 weeks): 7