research-article

Heterogeneous-race-free memory models

Authors: Derek R. Hower, Blake A. Hechtman, Bradford M. Beckmann, Benedict R. Gaster, Steven K. Reinhardt, David A. Wood

ASPLOS '14: Proceedings of the 19th international conference on Architectural support for programming languages and operating systems

Pages 427 - 440

https://doi.org/10.1145/2541940.2541981

Published: 24 February 2014

Abstract

Commodity heterogeneous systems (e.g., integrated CPUs and GPUs) now support a unified, shared memory address space for all components. Because the latency of global communication in a heterogeneous system can be prohibitively high, heterogeneous systems (unlike homogeneous CPU systems) provide synchronization mechanisms that guarantee ordering only among a subset of threads, which we call a scope. Unfortunately, the consequences and semantics of these scoped operations are not yet well understood. Without a formal and approachable model to reason about the behavior of these operations, we risk an array of portability and performance issues.

In this paper, we embrace scoped synchronization with a new class of memory consistency models that add scoped synchronization to data-race-free models like those of C++ and Java. Called sequential consistency for heterogeneous-race-free (SC for HRF), the new models guarantee SC for programs with "sufficient" synchronization (no data races) of "sufficient" scope. We discuss two such models. The first, HRF-direct, works well for programs with highly regular parallelism. The second, HRF-indirect, builds on HRF-direct by allowing synchronization using different scopes in some cases involving transitive communication. We show quantitatively that HRF-indirect encourages forward-looking programs with irregular parallelism, demonstrating up to a 10% performance increase in a task runtime for GPUs.
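
The difference between the two models can be illustrated with a small race checker over a single execution trace. The sketch below is not taken from the paper: it assumes a simplified reading in which HRF-direct orders two conflicting accesses only if a chain of program order and synchronization of one single scope connects them, while HRF-indirect also accepts chains that mix scopes. The scope names `wg` and `device`, the trace encoding, and the function names are hypothetical, and scope membership (which threads actually belong to a scope instance) is not modeled.

```python
# Toy race checker contrasting a simplified HRF-direct with HRF-indirect.
# An event is (thread, op, variable, scope); op is "W"/"R" for ordinary
# data accesses (scope None) or "rel"/"acq" for scoped synchronization.
# All names and the encoding are illustrative assumptions.

def closure(n, edges):
    """Transitive closure of a relation over events 0..n-1 (Warshall)."""
    reach = [[False] * n for _ in range(n)]
    for a, b in edges:
        reach[a][b] = True
    for k in range(n):
        for i in range(n):
            if reach[i][k]:
                for j in range(n):
                    if reach[k][j]:
                        reach[i][j] = True
    return reach

def find_races(trace, model):
    """Return index pairs of conflicting data accesses left unordered."""
    n = len(trace)
    # Program order: earlier events of a thread precede its later events.
    po = [(i, j) for i in range(n) for j in range(i + 1, n)
          if trace[i][0] == trace[j][0]]
    # Synchronization order, kept separately per scope: a release pairs
    # with a later acquire of the same variable using the same scope.
    so = {}
    for i in range(n):
        for j in range(i + 1, n):
            ti, oi, vi, si = trace[i]
            tj, oj, vj, sj = trace[j]
            if (oi, oj) == ("rel", "acq") and vi == vj and si == sj and ti != tj:
                so.setdefault(si, []).append((i, j))
    if model == "direct":
        # One happens-before relation per scope; scopes are never mixed.
        orders = [closure(n, po + e) for e in so.values()] or [closure(n, po)]
    elif model == "indirect":
        # A single relation that may chain synchronization across scopes.
        merged = [e for edges in so.values() for e in edges]
        orders = [closure(n, po + merged)]
    else:
        raise ValueError("model must be 'direct' or 'indirect'")
    races = []
    for i in range(n):
        for j in range(i + 1, n):
            ti, oi, vi, _ = trace[i]
            tj, oj, vj, _ = trace[j]
            conflict = (ti != tj and vi == vj
                        and oi in ("W", "R") and oj in ("W", "R")
                        and "W" in (oi, oj))
            if conflict and not any(order[i][j] for order in orders):
                races.append((i, j))
    return races

# T0 hands X to T1 using scope "wg"; T1 then hands off to T2 using "device".
trace = [
    ("T0", "W",   "X", None),
    ("T0", "rel", "A", "wg"),
    ("T1", "acq", "A", "wg"),
    ("T1", "rel", "B", "device"),
    ("T2", "acq", "B", "device"),
    ("T2", "R",   "X", None),
]

print(find_races(trace, "direct"))    # [(0, 5)]
print(find_races(trace, "indirect"))  # []
```

In this trace the write and the read of `X` are connected only through a chain that uses two different scopes, so the simplified HRF-direct check reports a race while the HRF-indirect check does not. This is the transitive-communication case the abstract refers to.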




Published In


ASPLOS '14: Proceedings of the 19th international conference on Architectural support for programming languages and operating systems

February 2014

780 pages

ISBN:9781450323055

DOI:10.1145/2541940

General Chairs: Rajeev Balasubramonian (University of Utah), Al Davis (University of Utah)
Program Chair: Sarita Adve (University of Illinois at Urbana-Champ)

ACM SIGPLAN Notices, Volume 49, Issue 4 (ASPLOS '14)
April 2014
729 pages
ISSN:0362-1340
EISSN:1558-1160
DOI:10.1145/2644865
Editors: Mark W. Bailey (Hamilton College, Clinton, NY), Rajeev Balasubramonian (University of Utah), Al Davis (University of Utah), Sarita Adve (University of Illinois at Urbana-Champ)

ACM SIGARCH Computer Architecture News, Volume 42, Issue 1 (ASPLOS '14)
March 2014
729 pages
ISSN:0163-5964
DOI:10.1145/2654822
Editor: Doug DeGroot

Copyright © 2014 ACM.

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

Sponsors

In-Cooperation

SIGBED: ACM Special Interest Group on Embedded Systems

Publisher

Association for Computing Machinery

New York, NY, United States


Conference

ASPLOS '14: Architectural Support for Programming Languages and Operating Systems

March 1 - 5, 2014

Salt Lake City, Utah, USA

Acceptance Rates

ASPLOS '14 Paper Acceptance Rate 49 of 217 submissions, 23%;

Overall Acceptance Rate 535 of 2,713 submissions, 20%


