DOI: 10.1145/3572848.3577497
Open access

Merchandiser: Data Placement on Heterogeneous Memory for Task-Parallel HPC Applications with Load-Balance Awareness

Published: 21 February 2023


Heterogeneous memory (HM) offers a cost-effective, high-performance solution for memory-consuming HPC applications, and deciding where data objects are placed on HM is critical to performance. We reveal a performance problem related to data placement on HM that manifests as load imbalance among tasks in task-parallel HPC applications. The problem has two root causes: data placement is unaware of parallel-task semantics, and it relies on the incorrect assumption that bringing frequently accessed pages to fast memory always leads to better performance. To address this problem, we introduce Merchandiser, a load balance-aware page management system. Rather than being application-agnostic, Merchandiser introduces task semantics during memory profiling. Using these limited task semantics, Merchandiser coordinates HM usage among tasks so that all tasks finish fast, instead of considering only an individual task. Merchandiser is highly automated to enable high usability. Evaluating with memory-consuming HPC applications, we show that Merchandiser reduces load imbalance and improves performance by 17.1% and 15.4% on average (up to 26.0% and 23.2%) compared with a hardware-based solution and an industry-quality software-based solution, respectively.
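The load-imbalance effect described in the abstract can be illustrated with a toy model. The sketch below is not Merchandiser's algorithm. It assumes a hypothetical cost model (each page access costs 1 unit in fast memory and 3 units in slow memory), two made-up tasks, and two illustrative policies named `hotness_first` and `balance_aware`. It only shows why placing the globally hottest pages in fast memory need not shorten a parallel phase, whose duration is set by the slowest task.

```python
from statistics import mean, pstdev

FAST, SLOW = 1.0, 3.0  # assumed relative cost per page access (hypothetical)

def task_times(tasks, fast_pages):
    """Time per task: compute time plus access cost of each of its pages."""
    times = {}
    for name, (compute, pages) in tasks.items():
        mem = sum(cnt * (FAST if (name, i) in fast_pages else SLOW)
                  for i, cnt in enumerate(pages))
        times[name] = compute + mem
    return times

def hotness_first(tasks, capacity):
    """Application-agnostic policy: the globally hottest pages go to fast memory."""
    all_pages = [((name, i), cnt)
                 for name, (_, pages) in tasks.items()
                 for i, cnt in enumerate(pages)]
    all_pages.sort(key=lambda p: p[1], reverse=True)
    return {pid for pid, _ in all_pages[:capacity]}

def balance_aware(tasks, capacity):
    """Illustrative policy: give each fast-memory slot to the currently slowest task."""
    fast = set()
    for _ in range(capacity):
        times = task_times(tasks, fast)
        for name in sorted(times, key=times.get, reverse=True):
            left = [(cnt, i) for i, cnt in enumerate(tasks[name][1])
                    if (name, i) not in fast]
            if left:
                _, i = max(left)  # hottest remaining page of that task
                fast.add((name, i))
                break
        else:
            break  # every page is already in fast memory
    return fast

def cv(times):
    """Coefficient of variation of task times, used here as an imbalance metric."""
    values = list(times.values())
    return pstdev(values) / mean(values)

# name -> (compute time, access count per page); all values are made up
tasks = {"A": (10, [100, 90]), "B": (200, [60, 50, 40])}
hot = task_times(tasks, hotness_first(tasks, capacity=2))
bal = task_times(tasks, balance_aware(tasks, capacity=2))
print("hotness-first:", hot, "makespan", max(hot.values()), "CV", round(cv(hot), 3))
print("balance-aware:", bal, "makespan", max(bal.values()), "CV", round(cv(bal), 3))
```

In this made-up instance, the hotness-first policy spends both fast-memory pages on task A, which was never the bottleneck, so the phase still waits for task B. The balance-aware policy finishes the phase sooner even though it leaves a hotter page in slow memory.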



Cited By

  • (2024) Efficient Tensor Offloading for Large Deep-Learning Model Training based on Compute Express Link. SC24: International Conference for High Performance Computing, Networking, Storage and Analysis, 1-18. DOI: 10.1109/SC41406.2024.00100. Online publication date: 17-Nov-2024
  • (2024) A Review of Memory Management Mechanisms Based on Hot Page Monitoring. 2024 3rd International Conference on Artificial Intelligence and Computer Information Technology (AICIT), 1-4. DOI: 10.1109/AICIT62434.2024.10730021. Online publication date: 20-Sep-2024
  • (2023) Scalable Tuning of (OpenMP) GPU Applications via Kernel Record and Replay. Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis, 1-14. DOI: 10.1145/3581784.3607098. Online publication date: 12-Nov-2023
  • (2023) TrainBF: High-Performance DNN Training Engine Using BFloat16 on AI Accelerators. Euro-Par 2023: Parallel Processing, 458-473. DOI: 10.1007/978-3-031-39698-4_31. Online publication date: 28-Aug-2023





Published In

PPoPP '23: Proceedings of the 28th ACM SIGPLAN Annual Symposium on Principles and Practice of Parallel Programming
February 2023
480 pages
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than the author(s) must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from permissions@acm.org.



Association for Computing Machinery

New York, NY, United States





Author Tags

  1. data placement
  2. heterogeneous memory
  3. load balance
  4. parallel computing


  • Research-article

Funding Sources

  • CCF
  • OAC


PPoPP '23

Acceptance Rates

Overall Acceptance Rate 230 of 1,014 submissions, 23%




Article Metrics

  • Downloads (Last 12 months): 464
  • Downloads (Last 6 weeks): 38
Reflects downloads up to 03 Mar 2025

