
SALSA: scalable and low synchronization NUMA-aware algorithm for producer-consumer pools

Authors:
Elad Gidron (Technion, Haifa, Israel)
Idit Keidar (Technion, Haifa, Israel)
Dmitri Perelman (Technion, Haifa, Israel)
Yonathan Perez (Technion, Haifa, Israel)

SPAA '12: Proceedings of the twenty-fourth annual ACM symposium on Parallelism in algorithms and architectures, June 2012, Pages 151–160
https://doi.org/10.1145/2312005.2312035

Published: 25 June 2012

ABSTRACT

We present a highly scalable non-blocking producer-consumer task pool, designed with a special emphasis on lightweight synchronization and data locality. The core building block of our pool is SALSA (Scalable And Low Synchronization Algorithm), an algorithm for a single-consumer container with task-stealing support. Each consumer operates on its own SALSA container and steals tasks from other containers when necessary. We implement an elegant self-tuning policy for task insertion, which does not push tasks to overloaded SALSA containers, thus decreasing the likelihood of stealing.
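The pool structure described above can be pictured with a small sequential sketch. It is not the authors' implementation: the names `Container`, `Pool`, `put`, `get`, and the fixed `capacity` threshold used to decide that a container is overloaded are all assumptions made for illustration, and the sketch ignores concurrency entirely.

```python
from collections import deque


class Container:
    """Hypothetical stand-in for one consumer's task container."""

    def __init__(self, capacity):
        self.capacity = capacity  # assumed overload threshold, not from the paper
        self.tasks = deque()

    def is_overloaded(self):
        return len(self.tasks) >= self.capacity


class Pool:
    """One container per consumer; producers avoid overloaded containers."""

    def __init__(self, num_consumers, capacity=4):
        self.containers = [Container(capacity) for _ in range(num_consumers)]

    def put(self, task, preferred):
        """Insert into the preferred container, skipping overloaded ones."""
        n = len(self.containers)
        for offset in range(n):
            target = (preferred + offset) % n
            container = self.containers[target]
            if not container.is_overloaded():
                container.tasks.append(task)
                return target
        # Every container is overloaded: fall back to the preferred one.
        self.containers[preferred].tasks.append(task)
        return preferred

    def get(self, consumer_id):
        """Take from the consumer's own container; steal only if it is empty."""
        own = self.containers[consumer_id]
        if own.tasks:
            return own.tasks.popleft()
        for victim in self.containers:
            if victim is not own and victim.tasks:
                return victim.tasks.pop()  # steal from the opposite end
        return None


pool = Pool(num_consumers=2, capacity=2)
# The third task is redirected because container 0 has reached its threshold.
placements = [pool.put(f"t{i}", preferred=0) for i in range(3)]
```

Because the producer redirects the third task to container 1, consumer 1 finds work locally before it ever needs to steal, which is the effect the insertion policy aims for.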

SALSA manages large chunks of tasks, which improves locality and facilitates stealing. SALSA uses a novel approach for coordination among consumers, without strong atomic operations or memory barriers in the fast path. It invokes only two CAS operations during a chunk steal.
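To make the chunk-based idea concrete, here is a hedged sketch of one way a chunk steal could spend exactly two CAS operations: one to transfer ownership of the whole chunk and one to claim the first remaining task. The abstract does not spell out the steps, so this layout is an assumption, as are the names `AtomicCell`, `Chunk`, `take_fast`, and `steal`. Python cannot express memory barriers, and the lock-based `AtomicCell` merely stands in for a hardware CAS word, so the sketch only shows where the CAS calls fall, not the algorithm's real memory-ordering behavior.

```python
import threading

TAKEN = object()  # sentinel marking a slot whose task has been consumed


class AtomicCell:
    """Lock-based stand-in for a hardware CAS word (illustration only)."""

    cas_calls = 0  # global count of CAS invocations, for the demo below
    _lock = threading.Lock()

    def __init__(self, value):
        self._value = value

    def load(self):
        return self._value

    def store(self, value):
        self._value = value  # plain write, no CAS

    def compare_and_set(self, expected, new):
        with AtomicCell._lock:
            AtomicCell.cas_calls += 1
            if self._value == expected:
                self._value = new
                return True
            return False


class Chunk:
    """A block of tasks owned by a single consumer at a time."""

    def __init__(self, owner, tasks):
        self.owner = AtomicCell(owner)
        self.slots = [AtomicCell(t) for t in tasks]
        self.idx = -1  # index of the last task taken from this chunk


def take_fast(chunk, me):
    """Owner fast path: plain reads and writes only, no CAS."""
    nxt = chunk.idx + 1
    if chunk.owner.load() != me or nxt >= len(chunk.slots):
        return None
    task = chunk.slots[nxt].load()
    chunk.idx = nxt
    chunk.slots[nxt].store(TAKEN)
    return None if task is TAKEN else task


def steal(chunk, victim, thief):
    """Take over a whole chunk with two CAS operations (assumed layout)."""
    # CAS 1: transfer ownership of the entire chunk to the thief.
    if not chunk.owner.compare_and_set(victim, thief):
        return None
    nxt = chunk.idx + 1
    if nxt >= len(chunk.slots):
        return None
    task = chunk.slots[nxt].load()
    # CAS 2: claim the first remaining task in case the old owner races for it.
    claimed = task is not TAKEN and chunk.slots[nxt].compare_and_set(task, TAKEN)
    chunk.idx = nxt
    return task if claimed else None


chunk = Chunk(owner="c0", tasks=["a", "b", "c", "d"])
first = take_fast(chunk, "c0")                  # owner takes a task without CAS
cas_before = AtomicCell.cas_calls
stolen = steal(chunk, victim="c0", thief="c1")  # thief takes over the chunk
cas_used = AtomicCell.cas_calls - cas_before
after_steal = take_fast(chunk, "c0")            # old owner is locked out
thief_next = take_fast(chunk, "c1")             # thief continues on the fast path
```

Stealing at chunk granularity means the cost of the two CAS operations is amortized over every task left in the chunk, which the thief then consumes on the CAS-free path.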

Our evaluation demonstrates that a pool built using SALSA containers scales linearly with the number of threads and significantly outperforms other FIFO and non-FIFO alternatives.


Index Terms

1. Computing methodologies
  1. Concurrent computing methodologies
    1. Concurrent programming languages
2. Software and its engineering
  1. Software notations and tools
    1. General programming languages
      1. Language types
        Concurrent programming languages


Published in
SPAA '12: Proceedings of the twenty-fourth annual ACM symposium on Parallelism in algorithms and architectures
June 2012
348 pages
ISBN: 9781450312134
DOI: 10.1145/2312005
General Chair: Guy Blelloch (Carnegie Mellon University, USA)
Program Chair: Maurice Herlihy (Brown University, USA)
Copyright © 2012 ACM
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from ACM.
Publisher
Association for Computing Machinery
New York, NY, United States
Author Tags
concurrent data structures
multi-core
Qualifiers
- research-article
Conference

Acceptance Rates
Overall Acceptance Rate: 447 of 1,461 submissions, 31%