ABSTRACT
Many modern multicore architectures support shared memory for ease of programming and relaxed memory models to deliver high performance. With relaxed memory models, memory accesses can be reordered dynamically, and the reordering can be observed by other processors. Fence instructions are therefore provided to enforce the memory orderings that are critical to the correctness of a program. However, fence instructions are costly because they cause the processor to stall. Prior work has observed that most executions of fence instructions are unnecessary. In this paper we propose address-aware fence, a hardware solution for reducing the overhead of fence instructions without resorting to speculation. Address-aware fence enforces only the memory orderings that are necessary to maintain the effect that a traditional fence strives to enforce. This is achieved by dynamically checking a condition that determines when an execution of a fence must take effect and delay the memory accesses that follow it. When a fence instruction is encountered, the necessary memory addresses are first collected to form a watchlist; then only the memory accesses to addresses contained in the watchlist are delayed. Memory accesses whose addresses are not in the watchlist are allowed to complete without waiting for the pending memory accesses from before the fence to complete. Our experiments on a group of concurrent lock-free algorithms and SPLASH-2 benchmarks show that address-aware fence eliminates nearly all of the overhead due to fences and achieves an average improvement of 12.2% over programs with traditional fences.
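The watchlist behavior described in the abstract can be illustrated with a small software model. This is only a sketch under stated assumptions, not the paper's hardware design: the names `Access`, `traditional_fence`, and `address_aware_fence` are hypothetical, the addresses are toy values, and the watchlist is simply passed in as a set because the abstract does not say how the necessary addresses are collected.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Access:
    kind: str  # "load" or "store"
    addr: int  # memory address (toy value, for illustration only)


def traditional_fence(pending_before, post_fence):
    """Return (delayed, allowed) for a traditional fence.

    While any pre-fence access is still pending, every post-fence
    access has to wait.
    """
    if pending_before:
        return list(post_fence), []
    return [], list(post_fence)


def address_aware_fence(pending_before, watchlist, post_fence):
    """Return (delayed, allowed) for the address-aware behavior.

    Assumption: `watchlist` is given. Only post-fence accesses whose
    address is in the watchlist wait for the pending pre-fence accesses.
    """
    if not pending_before:
        return [], list(post_fence)
    delayed = [a for a in post_fence if a.addr in watchlist]
    allowed = [a for a in post_fence if a.addr not in watchlist]
    return delayed, allowed


pending = [Access("store", 0x100)]          # pre-fence access not yet complete
after = [Access("load", 0x200), Access("load", 0x300), Access("store", 0x400)]
watchlist = {0x200}                          # assumed, not derived

trad_delayed, trad_allowed = traditional_fence(pending, after)
aa_delayed, aa_allowed = address_aware_fence(pending, watchlist, after)
print(len(trad_delayed), len(aa_delayed))    # prints: 3 1
```

In this toy run the traditional fence stalls all three post-fence accesses, while the address-aware variant delays only the load to the watched address and lets the other two proceed.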