MaPHeA: A Framework for Lightweight Memory Hierarchy-aware Profile-guided Heap Allocation

Published: 13 December 2022

Abstract

Hardware performance monitoring units (PMUs) are a standard feature in modern microprocessors, providing a rich set of microarchitectural event samplers. Recently, numerous profile-guided optimization (PGO) frameworks have exploited them to achieve much lower profiling overhead than conventional instrumentation-based frameworks. However, existing PGO frameworks mainly focus on optimizing the layout of binaries; they overlook the rich information that the PMU provides about data access behavior across the memory hierarchy. Thus, we propose MaPHeA, a lightweight Memory hierarchy-aware Profile-guided Heap Allocation framework applicable to both HPC and embedded systems. MaPHeA guides and applies the optimized allocation of dynamically allocated heap objects with very low profiling overhead and without additional user intervention to improve application performance. To demonstrate the effectiveness of MaPHeA, we apply it to three use cases: optimizing heap object allocation in an emerging DRAM-NVM heterogeneous memory system (HMS), selectively utilizing huge pages, and controlling the cacheability of objects with low temporal locality. In an HMS, by identifying frequently accessed heap objects and placing them in the fast DRAM region, MaPHeA improves the performance of memory-intensive graph-processing and Redis workloads by 56.0% on average over the default configuration that uses DRAM as a hardware-managed cache of slow NVM. By identifying large heap objects that cause frequent TLB misses and allocating them to huge pages, MaPHeA increases the performance of the read and update operations of Redis by 10.6% over the transparent huge-page implementation of Linux. Also, by distinguishing the objects that cause cache pollution due to their low temporal locality and applying write-combining to them, MaPHeA improves the performance of STREAM and RADIX workloads by 20.0% on average over the system without cacheability control.
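
The abstract does not spell out how profile data is turned into placement decisions. As a rough illustration only, the sketch below shows one plausible way to map sampled access counts to a DRAM/NVM placement. The `SiteProfile` record, the `place_hot_sites` function, and the greedy samples-per-byte policy are all hypothetical names and assumptions made for this example, not MaPHeA's actual algorithm.

```python
from dataclasses import dataclass


@dataclass
class SiteProfile:
    """Hypothetical per-allocation-site summary built from PMU samples."""
    site: str        # label for the allocation site (assumed naming)
    size_bytes: int  # total bytes allocated from this site
    samples: int     # sampled memory accesses attributed to its objects


def place_hot_sites(profiles, dram_budget_bytes):
    """Greedy sketch: the densest sites (samples per byte) go to DRAM
    until the budget is exhausted; everything else falls back to NVM."""
    ranked = sorted(
        profiles,
        key=lambda p: p.samples / max(p.size_bytes, 1),  # avoid divide-by-zero
        reverse=True,
    )
    placement = {}
    remaining = dram_budget_bytes
    for p in ranked:
        if p.samples > 0 and p.size_bytes <= remaining:
            placement[p.site] = "DRAM"
            remaining -= p.size_bytes
        else:
            placement[p.site] = "NVM"
    return placement


if __name__ == "__main__":
    demo = [
        SiteProfile("graph_edges", 100, 1000),
        SiteProfile("scratch_buf", 400, 400),
        SiteProfile("cold_log", 50, 0),
    ]
    print(place_hot_sites(demo, dram_budget_bytes=450))
```

A real framework would also have to attribute sampled addresses back to allocation sites and apply the decision at allocation time; both steps are outside the scope of this sketch.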

Published In

ACM Transactions on Embedded Computing Systems, Volume 22, Issue 1
January 2023
512 pages
ISSN: 1539-9087
EISSN: 1558-3465
DOI: 10.1145/3567467
Editor: Tulika Mitra

Publisher

Association for Computing Machinery

New York, NY, United States

Publication History

Published: 13 December 2022
Online AM: 31 March 2022
Accepted: 20 March 2022
Revised: 10 January 2022
Received: 15 October 2021
Published in TECS Volume 22, Issue 1

Author Tags

  1. Profile-guided optimization
  2. heap allocation
  3. heterogeneous memory system
  4. huge page

Qualifiers

  • Research-article
  • Refereed

Funding Sources

  • R&D program of MOTIE/KEIT
  • Engineering Research Center Program through the National Research Foundation of Korea (NRF)
  • Korean Government MSIT
  • Inha University Research
