DOI: 10.1145/3572848.3577497

Merchandiser: Data Placement on Heterogeneous Memory for Task-Parallel HPC Applications with Load-Balance Awareness

Published: 21 February 2023

ABSTRACT

The emergence of heterogeneous memory (HM) provides a cost-effective, high-performance solution for memory-consuming HPC applications, and deciding where to place data objects on HM is critical for performance. We reveal a data-placement performance problem on HM that manifests as load imbalance among tasks in task-parallel HPC applications. The problem stems from unawareness of parallel-task semantics and the incorrect assumption that bringing frequently accessed pages into fast memory always improves performance. To address this problem, we introduce Merchandiser, a load balance-aware page management system. Rather than being application-agnostic, Merchandiser introduces task semantics during memory profiling. Using this limited task semantics, Merchandiser coordinates tasks' usage of HM so that all tasks finish quickly, instead of optimizing any individual task in isolation. Merchandiser is highly automated for high usability. Evaluated with memory-consuming HPC applications, Merchandiser reduces load imbalance and improves performance by an average of 17.1% and 15.4% (up to 26.0% and 23.2%) over a hardware-based solution and an industry-quality software-based solution, respectively.
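The abstract's core observation can be illustrated with a toy model (hypothetical, not Merchandiser's actual algorithm): parallel tasks finish when the slowest one does, so a fixed fast-memory budget is better spent on the straggler task than on the hottest pages. The linear-speedup runtime model and all numbers below are assumptions for illustration only.

```python
def runtime(base, placed, total, speedup=0.5):
    # Assumed linear model: a task's runtime shrinks with the fraction of its
    # pages resident in fast memory, up to a 50% speedup when fully resident.
    return base * (1 - speedup * placed / total)

def hotness_first(tasks, budget):
    # Place pages of the most frequently accessed tasks first,
    # ignoring which task is on the critical path.
    placed = [0] * len(tasks)
    for i in sorted(range(len(tasks)), key=lambda j: -tasks[j]["hotness"]):
        take = min(budget, tasks[i]["pages"])
        placed[i], budget = take, budget - take
    return placed

def balance_aware(tasks, budget):
    # Greedily give one page at a time to the currently slowest task.
    placed = [0] * len(tasks)
    while budget > 0:
        i = max(range(len(tasks)),
                key=lambda j: runtime(tasks[j]["base"], placed[j], tasks[j]["pages"]))
        if placed[i] == tasks[i]["pages"]:
            break  # the slowest task is fully in fast memory; makespan cannot improve
        placed[i] += 1
        budget -= 1
    return placed

def makespan(tasks, placed):
    # Task-parallel execution finishes when the slowest task does.
    return max(runtime(t["base"], p, t["pages"]) for t, p in zip(tasks, placed))

tasks = [{"base": 10, "pages": 50, "hotness": 90},   # fast but frequently accessed task
         {"base": 30, "pages": 50, "hotness": 10}]   # slow, rarely accessed task
budget = 50  # fast-memory capacity, in pages
print(makespan(tasks, hotness_first(tasks, budget)))   # 30.0: the hot task wins the budget
print(makespan(tasks, balance_aware(tasks, budget)))   # 15.0: the budget goes to the straggler
```

In this toy setup, hotness-based placement speeds up a task that was never the bottleneck, leaving the makespan untouched, while the balance-aware policy halves it.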


Published in
          PPoPP '23: Proceedings of the 28th ACM SIGPLAN Annual Symposium on Principles and Practice of Parallel Programming
          February 2023
          480 pages
ISBN: 9798400700156
DOI: 10.1145/3572848

          Copyright © 2023 ACM


          Publisher

          Association for Computing Machinery

          New York, NY, United States



          Qualifiers

          • research-article

Acceptance Rates

Overall acceptance rate: 230 of 1,014 submissions, 23%
