ABSTRACT
Commodity heterogeneous systems (e.g., integrated CPUs and GPUs) now support a unified, shared memory address space for all components. Because the latency of global communication in a heterogeneous system can be prohibitively high, heterogeneous systems (unlike homogeneous CPU systems) provide synchronization mechanisms that only guarantee ordering among a subset of threads, which we call a scope. Unfortunately, the consequences and semantics of these scoped operations are not yet well understood. Without a formal and approachable model to reason about the behavior of these operations, we risk an array of portability and performance issues.
In this paper, we embrace scoped synchronization with a new class of memory consistency models that add scoped synchronization to data-race-free models like those of C++ and Java. Called sequential consistency for heterogeneous-race-free (SC for HRF), the new models guarantee SC for programs with "sufficient" synchronization (no data races) of "sufficient" scope. We discuss two such models. The first, HRF-direct, works well for programs with highly regular parallelism. The second, HRF-indirect, builds on HRF-direct by allowing synchronization using different scopes in some cases involving transitive communication. We show quantitatively that HRF-indirect encourages forward-looking programs with irregular parallelism, demonstrating up to a 10% performance increase in a task runtime for GPUs.
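The difference between the two models can be illustrated with a toy reachability check. This is only a sketch under strong assumptions, not the paper's formal definitions: each thread is collapsed into a single node, a synchronization edge is a hypothetical `(releaser, acquirer, scope)` triple, and the scope labels `"work-group"` and `"device"` are illustrative names. The direct-style check orders two threads only through a chain of edges that all use the same scope, while the indirect-style check also accepts chains that mix scopes.

```python
from collections import defaultdict


def reachable(edges, src, dst):
    """Return True if dst can be reached from src along directed edges."""
    graph = defaultdict(list)
    for a, b in edges:
        graph[a].append(b)
    seen, stack = {src}, [src]
    while stack:
        node = stack.pop()
        if node == dst:
            return True
        for nxt in graph[node]:
            if nxt not in seen:
                seen.add(nxt)
                stack.append(nxt)
    return False


def ordered_direct(sync, src, dst):
    """Assumed direct-style rule: the whole chain must use one scope."""
    scopes = {scope for _, _, scope in sync}
    return any(
        reachable([(a, b) for a, b, s in sync if s == scope], src, dst)
        for scope in scopes
    )


def ordered_indirect(sync, src, dst):
    """Assumed indirect-style rule: a chain may mix scopes transitively."""
    return reachable([(a, b) for a, b, _ in sync], src, dst)


# Hypothetical scenario: t0 releases to t1 at work-group scope,
# then t1 releases to t2 at device scope.
sync = [("t0", "t1", "work-group"), ("t1", "t2", "device")]

print(ordered_direct(sync, "t0", "t2"))    # chain mixes two scopes
print(ordered_indirect(sync, "t0", "t2"))  # transitive chain is accepted
```

In this toy scenario, t0's writes are ordered before t2's reads only under the indirect-style rule; under the direct-style rule, t0 and t2 would have to synchronize through a common scope to avoid a race.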
Heterogeneous-race-free memory models
ASPLOS '14