Principles of Memory-Centric Programming for High Performance Computing

ABSTRACT
The memory wall -- the growing disparity between processor speed and memory speed -- has long been one of the most critical challenges in computing. In high performance computing, achieving efficient execution of parallel applications often demands more tuning and optimization effort for data placement and memory access than for managing parallelism itself. The situation is further complicated by the recent expansion of the memory hierarchy, which is becoming deeper and more diversified with the adoption of new memory technologies and architectures such as 3D-stacked memory, non-volatile random-access memory (NVRAM), and hybrid software/hardware caches.
The authors believe it is important to elevate the notion of memory-centric programming, alongside the compute-centric and data-centric programming paradigms, in order to exploit modern memory systems of unprecedented depth and diversity. Memory-centric programming refers to the notion and techniques of exposing the hardware memory system and its hierarchy -- which may include DRAM and NUMA regions, shared and private caches, scratchpad memory, 3D-stacked memory, non-volatile memory, and remote memory -- to the programmer via portable programming abstractions and APIs. These interfaces seek to improve the dialogue between programmers and system software, and to enable compiler optimizations, runtime adaptation, and hardware reconfiguration with regard to data movement, beyond what can be achieved with existing parallel programming APIs. In this paper, we provide an overview of memory-centric programming concepts and principles for high performance computing.