Abstract
Synchronization in parallel programs is a major performance bottleneck in multiprocessor systems. Shared data is protected by locks, and a lot of time is spent on the competition arising at the lock hand-off. In order to be serialized, requests to the same cache line can either be bounced (NACKed) or buffered in the coherence controller. In this paper, we focus mainly on systems whose coherence controllers buffer requests. In a lock hand-off, a burst of requests to the same line arrives at the coherence controller. During lock hand-off only the requests from the winning processor contribute to progress of the computation, since the winning processor is the only one that will advance the work. This key observation leads us to propose a hardware mechanism we call request bypassing, which allows requests from the winning processor to bypass the requests buffered in the coherence controller keeping the lock line. We present an inexpensive implementation of request bypassing that reduces the time spent on all the execution phases of a critical section (acquiring the lock, accessing shared data, and releasing the lock) and which, as a consequence, speeds up the whole parallel computation. This mechanism requires neither compiler or programmer support nor ISA or coherence protocol changes. By simulating a 32-processor system, we show that using request bypassing does not degrade but rather improves performance in three applications with low synchronization rates, while in those having a large amount of synchronization activity (the remaining four), we see reductions in execution time and in lock stall time ranging from 14% to 39% and from 52% to 71%, respectively. We compare request bypassing with a previously proposed technique called read combining and with a system that bounces requests, observing a significantly lower execution time with the bypassing scheme. Finally, we analyze the sensitivity of our results to some key hardware and software parameters.
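The effect of request bypassing on service order can be illustrated with a toy model. The sketch below is not the hardware implementation evaluated in the paper: the class `LineBuffer`, the function `service_order`, the `winner` field, and the rule used to identify the winning processor (the first request served after the burst) are all hypothetical names and assumptions made for illustration only.

```python
from collections import deque


class LineBuffer:
    """Toy model of the requests buffered for one cache line (hypothetical)."""

    def __init__(self, bypassing):
        self.bypassing = bypassing
        self.pending = deque()  # buffered requests, identified by processor id
        self.winner = None      # assumed: id of the processor that won the lock

    def arrive(self, proc):
        # With bypassing, a request from the winner jumps ahead of the
        # buffered requests; otherwise requests are kept in arrival order.
        if self.bypassing and proc == self.winner:
            self.pending.appendleft(proc)
        else:
            self.pending.append(proc)

    def serve(self):
        return self.pending.popleft()


def service_order(bypassing):
    """Return the order in which requests to the lock line are served."""
    buf = LineBuffer(bypassing)
    for proc in (1, 2, 3):      # burst of requests at the lock hand-off
        buf.arrive(proc)
    order = [buf.serve()]       # the first request served wins the lock
    buf.winner = order[0]       # assumption: this is how the winner is known
    buf.arrive(buf.winner)      # winner's follow-up request (e.g. the release)
    while buf.pending:
        order.append(buf.serve())
    return order


print(service_order(False))     # plain buffering
print(service_order(True))      # buffering with request bypassing
```

In this model the winner's follow-up request waits behind the losing processors when requests are simply buffered, and is served immediately when bypassing is enabled, which is the behaviour the abstract attributes to the mechanism.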
Sahelices, B., de Dios, A., Ibáñez, P. et al. Efficient Handling of Lock Hand-off in DSM Multiprocessors with Buffering Coherence Controllers. J. Comput. Sci. Technol. 27, 75–91 (2012). https://doi.org/10.1007/s11390-012-1207-2
DOI: https://doi.org/10.1007/s11390-012-1207-2