DOI: 10.1145/2541940.2541981
research-article

Heterogeneous-race-free memory models

Published: 24 February 2014

ABSTRACT

Commodity heterogeneous systems (e.g., integrated CPUs and GPUs) now support a unified, shared memory address space for all components. Because the latency of global communication in a heterogeneous system can be prohibitively high, heterogeneous systems (unlike homogeneous CPU systems) provide synchronization mechanisms that only guarantee ordering among a subset of threads, which we call a scope. Unfortunately, the consequences and semantics of these scoped operations are not yet well understood. Without a formal and approachable model to reason about the behavior of these operations, we risk an array of portability and performance issues.

In this paper, we embrace scoped synchronization with a new class of memory consistency models that add scoped synchronization to data-race-free models like those of C++ and Java. Called sequential consistency for heterogeneous-race-free (SC for HRF), the new models guarantee SC for programs with "sufficient" synchronization (no data races) of "sufficient" scope. We discuss two such models. The first, HRF-direct, works well for programs with highly regular parallelism. The second, HRF-indirect, builds on HRF-direct by allowing synchronization using different scopes in some cases involving transitive communication. We quantitatively show that HRF-indirect encourages forward-looking programs with irregular parallelism by showing up to a 10% performance increase in a task runtime for GPUs.
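
The difference between the two models can be illustrated with a small sketch. The Python below is a simplified toy model, not the formal definitions from the paper: the thread naming `(work_group, thread_id)`, the two scope names, and all function names are assumptions made for illustration. It treats a synchronization edge as usable only when both threads belong to the scope it names, then checks whether two threads are ordered under a "single scope only" rule (in the spirit of HRF-direct) or under a rule that lets a transitive chain mix scopes (in the spirit of HRF-indirect).

```python
# Toy model (assumed, not the paper's formalism): a thread is a tuple
# (work_group, thread_id); a sync edge is (releaser, acquirer, scope).
SCOPES = ("work_group", "device")


def in_same_scope(a, b, scope):
    """Return True if threads a and b belong to one instance of `scope`."""
    if scope == "device":
        return True
    if scope == "work_group":
        return a[0] == b[0]
    raise ValueError(f"unknown scope: {scope}")


def valid_edges(sync_edges):
    """Keep only edges whose scope actually contains both threads."""
    return [(s, d, sc) for s, d, sc in sync_edges if in_same_scope(s, d, sc)]


def reachable(edges, src, dst):
    """Depth-first reachability over directed (src, dst) pairs."""
    seen, stack = set(), [src]
    while stack:
        node = stack.pop()
        if node == dst:
            return True
        if node in seen:
            continue
        seen.add(node)
        stack.extend(d for s, d in edges if s == node)
    return False


def ordered_direct(sync_edges, a, b):
    """Direct-style rule: every edge in the chain must use the same scope."""
    edges = valid_edges(sync_edges)
    return any(
        reachable([(s, d) for s, d, sc in edges if sc == scope], a, b)
        for scope in SCOPES
    )


def ordered_indirect(sync_edges, a, b):
    """Indirect-style rule: a transitive chain may mix different scopes."""
    return reachable([(s, d) for s, d, _ in valid_edges(sync_edges)], a, b)


# t0 and t1 share a work-group; t2 is in a different one.
t0, t1, t2 = (0, 0), (0, 1), (1, 0)
chain = [(t0, t1, "work_group"), (t1, t2, "device")]
print(ordered_direct(chain, t0, t2), ordered_indirect(chain, t0, t2))
```

In this toy chain, `t0` hands data to `t1` with work-group scope and `t1` hands it on to `t2` with device scope. The single-scope rule does not order `t0` before `t2`, while the mixed-scope rule does, which is the kind of transitive communication the abstract attributes to HRF-indirect.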


Published in

ASPLOS '14: Proceedings of the 19th international conference on Architectural support for programming languages and operating systems
February 2014
780 pages
ISBN: 9781450323055
DOI: 10.1145/2541940

Copyright © 2014 ACM


Publisher

Association for Computing Machinery, New York, NY, United States


Acceptance Rates

ASPLOS '14 paper acceptance rate: 49 of 217 submissions, 23%. Overall acceptance rate: 535 of 2,713 submissions, 20%.
