DOI: 10.1145/3145617.3158212
short-paper

Principles of Memory-Centric Programming for High Performance Computing

Published: 12 November 2017

ABSTRACT

The memory wall challenge, the growing disparity between CPU speed and memory speed, has been one of the most critical and long-standing challenges in computing. In high performance computing, achieving efficient execution of parallel applications often requires more tuning and optimization effort to improve data and memory access than to manage parallelism. The situation is further complicated by the recent expansion of the memory hierarchy, which is becoming deeper and more diversified with the adoption of new memory technologies and architectures such as 3D-stacked memory, non-volatile random-access memory (NVRAM), and hybrid software and hardware caches.

We believe it is important to elevate the notion of memory-centric programming, in relation to the compute-centric and data-centric programming paradigms, in order to exploit modern memory systems, whose capabilities are unprecedented and still growing. Memory-centric programming refers to the notion and techniques of exposing the hardware memory system and its hierarchy, which could include DRAM and NUMA regions, shared and private caches, scratchpad memory, 3D-stacked memory, non-volatile memory, and remote memory, to the programmer via portable programming abstractions and APIs. These interfaces seek to improve the dialogue between programmers and system software, and to enable compiler optimizations, runtime adaptation, and hardware reconfiguration with regard to data movement, beyond what can be achieved using existing parallel programming APIs. In this paper, we provide an overview of memory-centric programming concepts and principles for high performance computing.
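
As a rough illustration of what "exposing the memory hierarchy via a portable abstraction" might look like, the sketch below models a few memory kinds and lets the caller state an ordered placement preference. It is not the interface proposed in the paper: the names MemoryKind, MemorySpace, MemorySystem, and allocate are hypothetical, and the model only tracks capacity rather than touching real hardware.

```python
from dataclasses import dataclass
from enum import Enum


class MemoryKind(Enum):
    """Hypothetical labels for levels of a memory hierarchy."""
    SCRATCHPAD = "scratchpad"
    STACKED_3D = "3d-stacked"
    DRAM = "dram"
    NVRAM = "nvram"


@dataclass
class MemorySpace:
    """One memory region with a fixed capacity (toy bookkeeping only)."""
    kind: MemoryKind
    capacity_bytes: int
    used_bytes: int = 0

    def free_bytes(self) -> int:
        return self.capacity_bytes - self.used_bytes


class MemorySystem:
    """Toy model of a machine's memory spaces (an assumption, not a real API)."""

    def __init__(self, spaces):
        self.spaces = list(spaces)

    def allocate(self, nbytes, preferred):
        """Reserve nbytes in the first preferred kind that has room.

        Returns the MemoryKind actually used; raises MemoryError if none
        of the preferred kinds can hold the request.
        """
        for kind in preferred:
            for space in self.spaces:
                if space.kind is kind and space.free_bytes() >= nbytes:
                    space.used_bytes += nbytes
                    return space.kind
        raise MemoryError(f"no preferred memory space can hold {nbytes} bytes")


# Example: a small fast tier backed by a larger DRAM tier.
system = MemorySystem([
    MemorySpace(MemoryKind.STACKED_3D, capacity_bytes=16),
    MemorySpace(MemoryKind.DRAM, capacity_bytes=64),
])
order = [MemoryKind.STACKED_3D, MemoryKind.DRAM]
hot = system.allocate(12, preferred=order)    # fits in the fast tier
spill = system.allocate(12, preferred=order)  # fast tier is full, falls back to DRAM
```

The point of the sketch is the division of labour: the programmer states intent (which kinds of memory are acceptable, in what order), and the layer underneath decides the actual placement, which is where a compiler or runtime could adapt to the machine at hand.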

          Published in

          MCHPC'17: Proceedings of the Workshop on Memory Centric Programming for HPC
          November 2017
          43 pages
          ISBN: 9781450351317
          DOI: 10.1145/3145617

          Copyright © 2017 ACM

          © 2017 Association for Computing Machinery. ACM acknowledges that this contribution was authored or co-authored by an employee, contractor or affiliate of the United States government. As such, the United States Government retains a nonexclusive, royalty-free right to publish or reproduce this article, or to allow others to do so, for Government purposes only.

          Publisher

          Association for Computing Machinery

          New York, NY, United States

          Qualifiers

          • short-paper
          • Research
          • Refereed limited
