ABSTRACT
We present a highly scalable non-blocking producer-consumer task pool, designed with a special emphasis on lightweight synchronization and data locality. The core building block of our pool is SALSA (Scalable And Low Synchronization Algorithm), a single-consumer container with task-stealing support. Each consumer operates on its own SALSA container, stealing tasks from other containers if necessary. We implement an elegant self-tuning policy for task insertion, which does not push tasks to overloaded SALSA containers, thus decreasing the likelihood of stealing.
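The pool layout described above can be illustrated with a minimal single-threaded sketch. It is not the SALSA implementation: the class name `TaskPoolSketch`, the fixed `overload_threshold`, and the round-robin fallback order are all assumptions made for illustration, and plain deques stand in for the concurrent containers.

```cpp
#include <cassert>
#include <cstddef>
#include <deque>
#include <optional>
#include <vector>

// Hypothetical, single-threaded model of the pool layout: one container per
// consumer. Real SALSA containers are concurrent; synchronization is omitted.
class TaskPoolSketch {
public:
    TaskPoolSketch(std::size_t consumers, std::size_t overload_threshold)
        : containers_(consumers), threshold_(overload_threshold) {
        assert(consumers > 0);
    }

    // Producer side: try the preferred container first, then the others in
    // round-robin order, skipping any container whose size has reached the
    // (assumed) overload threshold. Returns the index of the container that
    // accepted the task, or nullopt if every container is overloaded.
    std::optional<std::size_t> produce(int task, std::size_t preferred) {
        for (std::size_t i = 0; i < containers_.size(); ++i) {
            std::size_t c = (preferred + i) % containers_.size();
            if (containers_[c].size() < threshold_) {
                containers_[c].push_back(task);
                return c;
            }
        }
        return std::nullopt;
    }

    // Consumer side: take from the consumer's own container first (i == 0);
    // only when it is empty, look at the other containers and steal.
    std::optional<int> consume(std::size_t self) {
        for (std::size_t i = 0; i < containers_.size(); ++i) {
            std::size_t c = (self + i) % containers_.size();
            if (!containers_[c].empty()) {
                int task = containers_[c].front();
                containers_[c].pop_front();
                return task;
            }
        }
        return std::nullopt;
    }

    std::size_t size(std::size_t container) const {
        return containers_[container].size();
    }

private:
    std::vector<std::deque<int>> containers_;
    std::size_t threshold_;
};
```

Because producers avoid containers that are already at the threshold, tasks spread across consumers at insertion time, so a consumer is less likely to find its own container empty and fall back to stealing.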
SALSA manages large chunks of tasks, which improves locality and facilitates stealing. SALSA uses a novel approach for coordination among consumers, without strong atomic operations or memory barriers in the fast path. It invokes only two CAS operations during a chunk steal.
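As a rough illustration of chunk-granularity stealing, the sketch below transfers ownership of a whole chunk with a single compare-and-swap on an owner field. It does not reproduce the paper's full steal protocol (which, per the abstract, uses two CAS operations); `ChunkSketch`, `try_steal_chunk`, the integer consumer ids, and the chunk size of 4 are hypothetical names and values chosen for this example.

```cpp
#include <array>
#include <atomic>
#include <cassert>
#include <cstddef>

// Hypothetical chunk: a fixed block of task slots plus an owner field.
struct ChunkSketch {
    static constexpr std::size_t kTasks = 4;  // assumed chunk size
    std::array<int, kTasks> tasks{};          // task payloads (ints here)
    std::atomic<int> owner;                   // id of the owning consumer

    explicit ChunkSketch(int initial_owner) : owner(initial_owner) {}
};

// Try to move the whole chunk from `victim` to `thief` with one CAS on the
// owner field. Returns true if this caller won the chunk, and false if the
// owner had already changed (for example, another thief got there first).
inline bool try_steal_chunk(ChunkSketch& chunk, int victim, int thief) {
    assert(victim != thief);
    int expected = victim;
    return chunk.owner.compare_exchange_strong(expected, thief);
}
```

When several consumers race for the same chunk, at most one CAS succeeds, so the chunk ends up with exactly one new owner. The owner itself takes tasks from its chunk without any CAS in this model, which is the property that makes stealing in large units attractive.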
Our evaluation demonstrates that a pool built using SALSA containers scales linearly with the number of threads and significantly outperforms other FIFO and non-FIFO alternatives.
Index Terms
- SALSA: scalable and low synchronization NUMA-aware algorithm for producer-consumer pools