research-article

NUMAlloc: A Faster NUMA Memory Allocator

Authors:
Hanmei Yang

University of Massachusetts, Amherst, USA

University of Massachusetts, Amherst, USA
View Profile

,
Xin Zhao

University of Massachusetts, Amherst, USA

University of Massachusetts, Amherst, USA
View Profile

,
Jin Zhou

University of Massachusetts, Amherst, USA

University of Massachusetts, Amherst, USA
View Profile

,
Wei Wang

University of Texas at San Antonio, USA

University of Texas at San Antonio, USA
View Profile

,
Sandip Kundu

University of Massachusetts, Amherst, USA

University of Massachusetts, Amherst, USA
View Profile

,
Bo Wu

Colorado School of Mines, USA

Colorado School of Mines, USA
View Profile

,
Hui Guan

University of Massachusetts, Amherst, USA

University of Massachusetts, Amherst, USA
View Profile

,
Tongping Liu

University of Massachusetts, Amherst, USA

University of Massachusetts, Amherst, USA
View Profile

ISMM 2023: Proceedings of the 2023 ACM SIGPLAN International Symposium on Memory ManagementJune 2023Pages 97–110https://doi.org/10.1145/3591195.3595276

Published:06 June 2023Publication History

ISMM 2023: Proceedings of the 2023 ACM SIGPLAN International Symposium on Memory Management

Pages 97–110

ABSTRACT

The NUMA architecture accommodates the hardware trend of an increasing number of CPU cores. It requires the cooperation of memory allocators to achieve good performance for multithreaded applications. Unfortunately, existing allocators do not support NUMA architecture well. This paper presents a novel memory allocator – NUMAlloc, that is designed for the NUMA architecture. is centered on a binding-based memory management. On top of it, proposes an “origin-aware memory management” to ensure the locality of memory allocations and deallocations, as well as a method called “incremental sharing” to balance the performance benefits and memory overhead of using transparent huge pages. According to our extensive evaluation, NUMAlloc has the best performance among all evaluated allocators, running 15.7% faster than the second-best allocator (mimalloc), and 20.9% faster than the default Linux allocator with reasonable memory overhead. NUMAlloc is also scalable to 128 threads and is ready for deployment.

Supplemental Material

Available for Download

zip

pldiws23ismmmain-p64-p-archive.zip (3.2 MB)

This is an appendix for the paper titled "NUMAlloc: A Faster NUMA Memory Allocator" submitted to ISMM 2023. The appendix provides a report of the standard deviation of the performance data presented in the paper.

References

2017. CORAL-2 Benchmarks. https://asc.llnl.gov/coral-2-benchmarks Google Scholar
2020. perf: Linux profiling with performance counters. https://perf.wiki.kernel.org/index.php/Main_Page Google Scholar
Martin Aigner, Christoph M. Kirsch, Michael Lippautz, and Ana Sokolova. 2015. Fast, Multicore-scalable, Low-fragmentation Memory Allocation Through Large Virtual Memory and Global Data Structures. In Proceedings of the 2015 ACM SIGPLAN International Conference on Object-Oriented Programming, Systems, Languages, and Applications (OOPSLA 2015). 451–469. isbn:978-1-4503-3689-5 https://doi.org/10.1145/2814270.2814294 Google ScholarDigital Library
Martin Aigner, Christoph M. Kirsch, Michael Lippautz, and Ana Sokolova. 2016. scalloc. https://github.com/cksystemsgroup/scalloc Google Scholar
Periklis Akritidis. 2010. Cling: A Memory Allocator to Mitigate Dangling Pointers. In 19th USENIX Security Symposium, Washington, DC, USA, August 11-13, 2010, Proceedings. 177–192. http://www.usenix.org/events/sec10/tech/full_papers/Akritidis.pdf Google Scholar
Andreas Kleen at SUSE LINUX. 2012. "A NUMA API for LINUX". http://developer.amd.com/wordpress/media/2012/10/LibNUMA-WP-fv1.pdf Google Scholar
Avi Kivity. 2016. Automatic NUMA balancing may reduce performance. https://github.com/scylladb/scylla/issues/1120 Google Scholar
Emery D. Berger, Kathryn S. McKinley, Robert D. Blumofe, and Paul R. Wilson. 2000. Hoard: a scalable memory allocator for multithreaded applications. In ASPLOS-IX: Proceedings of the ninth international conference on Architectural support for programming languages and operating systems. 117–128. isbn:1-58113-317-0 https://doi.org/10.1145/378993.379232 Google ScholarDigital Library
Christian Bienia, Sanjeev Kumar, Jaswinder Pal Singh, and Kai Li. 2008. The PARSEC Benchmark Suite: Characterization and Architectural Implications. In Proceedings of the 17th International Conference on Parallel Architectures and Compilation Techniques (PACT ’08). 72–81. isbn:9781605582825 https://doi.org/10.1145/1454115.1454128 Google ScholarDigital Library
Sergey Blagodurov, Sergey Zhuravlev, Mohammad Dashti, and Alexandra Fedorova. 2011. A Case for NUMA-aware Contention Management on Multicore Systems. In Proceedings of the 2011 USENIX Conference on USENIX Annual Technical Conference. 1–1. http://dl.acm.org/citation.cfm?id=2002181.2002182 Google ScholarDigital Library
W. Bolosky, R. Fitzgerald, and M. Scott. 1989. Simple but Effective Techniques for NUMA Memory Management. In Proceedings of the Twelfth ACM Symposium on Operating Systems Principles (SOSP ’89). Association for Computing Machinery, 19–31. isbn:0897913388 https://doi.org/10.1145/74850.74854 Google ScholarDigital Library
Christopher Cantalupo, Vishwanath Venkatesan, Jeff Hammond, Krzysztof Czurlyo, and Simon David Hammond. 2015. memkind: An Extensible Heap Memory Manager for Heterogeneous Memory Platforms and Mixed Memory Policies. (No. SAND2015-1862C).. Sandia National Lab.(SNL-NM), Albuquerque, NM. Google Scholar
William Cohen. 2014. Examining Huge Pages or Transparent Huge Pages performance. https://developers.redhat.com/blog/2014/03/10/examining-huge-pages-or-transparent-huge-pages-performance Google Scholar
Jonathan Corbet. 2012. AutoNUMA: The Other Approach to NUMA Scheduling. https://lwn.net/Articles/488709/ Google Scholar
Mohammad Dashti, Alexandra Fedorova, Justin Funston, Fabien Gaud, Renaud Lachaize, Baptiste Lepers, Vivien Quema, and Mark Roth. 2013. Traffic Management: A Holistic Approach to Memory Placement on NUMA Systems. In Proceedings of the Eighteenth International Conference on Architectural Support for Programming Languages and Operating Systems (ASPLOS ’13). 381–394. isbn:978-1-4503-1870-9 https://doi.org/10.1145/2451116.2451157 Google ScholarDigital Library
SQL Developers.. 2019. How SQLite Is Tested. ". https://www.sqlite.org/testing.html Google Scholar
Matthias Diener. 2015. Automatic task and data mapping in shared memory architectures. Google Scholar
Matthias Diener, Eduardo HM Cruz, and Philippe OA Navaux. 2015. Locality vs. Balance: Exploring data mapping policies on NUMA systems. In 2015 23rd Euromicro International Conference on Parallel, Distributed, and Network-Based Processing. 9–16. Google ScholarDigital Library
Andi Drebes, Antoniu Pop, Karine Heydemann, Albert Cohen, and Nathalie Drach. 2016. Scalable Task Parallelism for NUMA: A Uniform Abstraction for Coordinated Scheduling and Memory Management. In Proceedings of the 2016 International Conference on Parallel Architectures and Compilation, PACT 2016, Haifa, Israel, September 11-15, 2016, Ayal Zaks, Bilha Mendelson, Lawrence Rauchwerger, and Wen-mei W. Hwu (Eds.). ACM, 125–137. https://doi.org/10.1145/2967938.2967946 Google ScholarDigital Library
Jason Evans. 2011. Scalable memory allocation using jemalloc. ". https://www.facebook.com/notes/facebook-engineering/scalable-memory-allocation-using-jemalloc/480222803919/ Google Scholar
OpenBSD Foundation. 2012. "OpenBSD". ". https://www.openbsd.org Google Scholar
The Apache Software Foundation. 2020. ab - Apache HTTP server benchmarking tool. ". https://httpd.apache.org/docs/2.4/programs/ab.html Google Scholar
David Gay and Alexander Aiken. 1998. Memory Management with Explicit Regions. In Proceedings of the ACM SIGPLAN ’98 Conference on Programming Language Design and Implementation (PLDI), Montreal, Canada, June 17-19, 1998. 313–323. https://doi.org/10.1145/277650.277748 Google ScholarDigital Library
Sanjay Ghemawat and Paul Menage. 2007. "TCMalloc : Thread-Caching Malloc". ". http://goog-perftools.sourceforge.net/doc/tcmalloc.html Google Scholar
Lokesh Gidra, Gaël Thomas, Julien Sopena, Marc Shapiro, and Nhan Nguyen. 2015. NumaGiC: A Garbage Collector for Big Data on Big NUMA Machines. In Proceedings of the Twentieth International Conference on Architectural Support for Programming Languages and Operating Systems (ASPLOS ’15). ACM, New York, NY, USA. 661–673. isbn:978-1-4503-2835-7 https://doi.org/10.1145/2694344.2694361 Google ScholarDigital Library
Mel Gorman. 2012. Foundation for automatic NUMA balancing. ". https://lwn.net/Articles/523065/ Google Scholar
David R Hanson. 1980. A portable storage management system for the Icon programming language. Software: Practice and Experience, 10, 6 (1980), 489–500. Google ScholarCross Ref
A.H. Hunter, Chris Kennelly, Paul Turner, Darryl Gove, Tipp Moseley, and Parthasarathy Ranganathan. 2021. Beyond malloc efficiency to fleet efficiency: a hugepage-aware memory allocator. In 15th USENIX Symposium on Operating Systems Design and Implementation (OSDI 21). USENIX Association, 257–273. isbn:978-1-939133-22-9 https://www.usenix.org/conference/osdi21/presentation/hunter Google Scholar
Intel Corporation. [n. d.]. Intel VTune Performance Analyzer. http://www.intel.com/software/products/vtune Google Scholar
Stefan Kaestle, Reto Achermann, Timothy Roscoe, and Tim Harris. 2015. Shoal: Smart Allocation and Replication of Memory for Parallel Programs. In Proceedings of the 2015 USENIX Conference on Usenix Annual Technical Conference (USENIX ATC ’15). USENIX Association, Berkeley, CA, USA. 263–276. isbn:978-1-931971-225 http://dl.acm.org/citation.cfm?id=2813767.2813787 Google Scholar
Patryk Kaminski. 2012. NUMA aware heap memory manager. http://amd-dev.wpengine.netdna-cdn.com/wordpress/media/2012/10/NUMA_aware_heap_memory_manager_article_final.pdf Google Scholar
Alex Katranov and Anton Potapov. 2021. oneAPI Threading Building Blocks. https://github.com/oneapi-src/oneTBB Google Scholar
Alex Katranov and Michael Voss. 2020. Optimize Intel oneAPI Threading Building Blocks for NUMA Architectures. https://www.intel.com/content/www/us/en/developer/videos/onetbb-optimizing-for-numa-architectures.html Google Scholar
Chris Kennelly and Paul Burton. 2021. TCMalloc: Implement NUMA awareness. https://github.com/google/tcmalloc/commit/ef7a3f8d794c42705bf4327ca79fa17186904801 Google Scholar
Seyeon Kim. 2013. Node-oriented dynamic memory management for real-time systems on ccNUMA architecture systems. Ph. D. Dissertation. University of York. Google Scholar
Bradley C Kuszmaul. 2015. SuperMalloc: a super fast multithreaded malloc for 64-bit machines. In Proceedings of the 2015 International Symposium on Memory Management. 41–55. Google ScholarDigital Library
Renaud Lachaize, Baptiste Lepers, and Vivien Quéma. 2012. MemProf: A Memory Profiler for NUMA Multicore Systems. In Proceedings of the 2012 USENIX Conference on Annual Technical Conference (USENIX ATC’12). USENIX Association, Berkeley, CA, USA. 5–5. http://dl.acm.org/citation.cfm?id=2342821.2342826 Google ScholarDigital Library
Christoph Lameter. 2013. Numa (non-uniform memory access): An overview. Queue, 11, 7 (2013), 40–51. Google ScholarDigital Library
Per-Åke Larson and Murali Krishnan. 1998. Memory Allocation for Long-Running Server Applications. SIGPLAN Not., 34, 3 (1998), Oct., 176–185. issn:0362-1340 https://doi.org/10.1145/301589.286880 Google ScholarDigital Library
Doug Lea. 1988. The GNU C Library. ". http://www.gnu.org/software/libc/libc.html Google Scholar
Daan Leijen. 2020. mimalloc. https://github.com/microsoft/mimalloc Google Scholar
Baptiste Lepers, Vivien Quéma, and Alexandra Fedorova. 2015. Thread and Memory Placement on NUMA Systems: Asymmetry Matters. In Proceedings of the 2015 USENIX Conference on Usenix Annual Technical Conference (USENIX ATC ’15). USENIX Association, Berkeley, CA, USA. 277–289. isbn:978-1-931971-225 http://dl.acm.org/citation.cfm?id=2813767.2813788 Google ScholarDigital Library
Xu Liu and John Mellor-Crummey. 2014. A Tool to Analyze the Performance of Multithreaded Programs on NUMA Architectures. In Proceedings of the 19th ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming (PPoPP ’14). ACM, New York, NY, USA. 259–272. isbn:978-1-4503-2656-8 https://doi.org/10.1145/2555243.2555271 Google ScholarDigital Library
Sandra Loosemore, Richard M. Stallman, Roland McGrath, Andrew Oram, and Ulrich Drepper. 2019. The GNU C Library Reference Manual. https://www.gnu.org/software/libc/manual/2.28/pdf/libc.pdf Google Scholar
Martin Maas, David G. Andersen, Michael Isard, Mohammad Mahdi Javanmard, Kathryn S. McKinley, and Colin Raffel. 2020. Learning-based Memory Allocation for C++ Server Workloads. In ASPLOS ’20: Architectural Support for Programming Languages and Operating Systems, Lausanne, Switzerland, March 16-20, 2020. 541–556. https://doi.org/10.1145/3373376.3378525 Google ScholarDigital Library
Zoltan Majo and Thomas R. Gross. 2011. Memory Management in NUMA Multicore Systems: Trapped Between Cache Contention and Interconnect Overhead. In Proceedings of the International Symposium on Memory Management (ISMM ’11). ACM, New York, NY, USA. 11–20. isbn:978-1-4503-0263-0 https://doi.org/10.1145/1993478.1993481 Google ScholarDigital Library
Zoltan Majo and Thomas R. Gross. 2013. (Mis)understanding the NUMA memory system performance of multithreaded workloads. In 2013 IEEE International Symposium on Workload Characterization (IISWC). 11–22. https://doi.org/10.1109/IISWC.2013.6704666 Google ScholarCross Ref
Zoltan Majo and Thomas R. Gross. 2015. A Library for Portable and Composable Data Locality Optimizations for NUMA Systems. In Proceedings of the 20th ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming (PPoPP 2015). ACM, New York, NY, USA. 227–238. isbn:978-1-4503-3205-7 https://doi.org/10.1145/2688500.2688509 Google ScholarDigital Library
C. McCurdy and J. Vetter. 2010. Memphis: Finding and fixing NUMA-related performance problems on multi-core platforms. In 2010 IEEE International Symposium on Performance Analysis of Systems Software (ISPASS). 87–96. https://doi.org/10.1109/ISPASS.2010.5452060 Google ScholarCross Ref
Gene Novark and Emery D. Berger. 2010. DieHarder: securing the heap. In Proceedings of the 17th ACM conference on Computer and communications security (CCS ’10). ACM, New York, NY, USA. 573–584. isbn:978-1-4503-0245-6 https://doi.org/10.1145/1866307.1866371 Google ScholarDigital Library
Takeshi Ogasawara. 2009. NUMA-aware Memory Manager with Dominant-thread-based Copying GC. In Proceedings of the 24th ACM SIGPLAN Conference on Object Oriented Programming Systems Languages and Applications (OOPSLA ’09). ACM, New York, NY, USA. 377–390. isbn:978-1-60558-766-0 https://doi.org/10.1145/1640089.1640117 Google ScholarDigital Library
Sean Reifschneider. 2013. "Pure python memcached client". ". https://pypi.python.org/pypi/python-memcached Google Scholar
Kirill Rogozhin. 2014. Controlling memory consumption with Intel® Threading Building Blocks (Intel® TBB) scalable allocator. ". https://software.intel.com/content/www/us/en/develop/articles/controlling-memory-consumption-with-intel-threading-building-blocks-intel-tbb-scalable.html Google Scholar
Scott Schneider, Christos D. Antonopoulos, and Dimitrios S. Nikolopoulos. 2006. Scalable Locality-Conscious Multithreaded Memory Allocation. In Proceedings of the 5th International Symposium on Memory Management (ISMM ’06). Association for Computing Machinery, New York, NY, USA. 84–94. isbn:1595932216 https://doi.org/10.1145/1133956.1133968 Google ScholarDigital Library
Sam Silvestro, Hongyu Liu, Corey Crosser, Zhiqiang Lin, and Tongping Liu. 2017. FreeGuard: A Faster Secure Heap Allocator. In Proceedings of the 2017 ACM SIGSAC Conference on Computer and Communications Security, CCS 2017, Dallas, TX, USA, October 30 - November 03, 2017. 2389–2403. https://doi.org/10.1145/3133956.3133957 Google ScholarDigital Library
Sam Silvestro, Hongyu Liu, Tianyi Liu, Zhiqiang Lin, and Tongping Liu. 2018. Guarder: An Efficient Heap Allocator with Strongest and Tunable Security. In Proceedings of The 27th USENIX Security Symposium (Security’18). Google Scholar
M. M. Tikir and J. K. Hollingsworth. 2005. NUMA-Aware Java Heaps for Server Applications. In 19th IEEE International Parallel and Distributed Processing Symposium. 108b–108b. issn:1530-2075 https://doi.org/10.1109/IPDPS.2005.299 Google ScholarDigital Library
François Trahay, Manuel Selva, Lionel Morel, and Kevin Marquet. 2018. NumaMMA: NUMA MeMory Analyzer. In Proceedings of the 47th International Conference on Parallel Processing (ICPP 2018). Association for Computing Machinery, New York, NY, USA. Article 19, 10 pages. isbn:9781450365109 https://doi.org/10.1145/3225058.3225094 Google ScholarDigital Library
Mehul Wagle, Daniel Booss, Ivan Schreter, and Daniel Egenolf. 2015. NUMA-aware memory management with in-memory databases. In Technology Conference on Performance Evaluation and Benchmarking. 45–60. Google Scholar
Sean Williams, Latchesar Ionkov, Michael Lang, and Jason Lee. 2018. Heterogeneous Memory and Arena-Based Heap Allocation. In Proceedings of the Workshop on Memory Centric High Performance Computing, MCHPC@SC 2018, Dallas, TX, USA, November 11, 2018. 67–71. https://doi.org/10.1145/3286475.3286568 Google ScholarDigital Library
Ting Yang, Tongping Liu, Emery D. Berger, Scott F. Kaplan, and J. Eliot B. Moss. 2008. Redline: first class support for interactivity in commodity operating systems. In Proceedings of the 8th USENIX conference on Operating systems design and implementation (OSDI’08). USENIX Association, Berkeley, CA, USA. 73–86. http://dl.acm.org/citation.cfm?id=1855741.1855747 Google ScholarDigital Library
Zhang Yang, Aiqing Zhang, and Zeyao Mo. 2019. JArena: Partitioned Shared Memory for NUMA-awareness in Multi-threaded Scientific Applications. arXiv preprint arXiv:1902.07590. Google Scholar
Xin Zhao, Jin Zhou, Hui Guan, Wei Wang, Xu Liu, and Tongping Liu. 2021. NumaPerf: Predictive NUMA Profiling. In Proceedings of the ACM International Conference on Supercomputing (ICS ’21). ACM, 52–62. isbn:9781450383356 https://doi.org/10.1145/3447818.3460361 Google ScholarDigital Library
L. Zhu, H. Jin, and X. Liao. 2016. A Tool to Detect Performance Problems of Multi-threaded Programs on NUMA Systems. In 2016 IEEE Trustcom/BigDataSE/ISPA. 1145–1152. https://doi.org/10.1109/TrustCom.2016.0187 Google ScholarCross Ref

Index Terms

NUMAlloc: A Faster NUMA Memory Allocator
1. Software and its engineering
  1. Software organization and properties
    1. Contextual software domains
      1. Operating systems
        Memory management

Recommendations

Redesign the Memory Allocator for Non-Volatile Main Memory
Special Issue on Hardware and Algorithms for Learning On-a-chip and Special Issue on Alternative Computing Systems

The non-volatile memory (NVM) has the merits of byte-addressability, fast speed, persistency and low power consumption, which make it attractive to be used as main memory. Commonly, user process dynamically acquires memory through memory allocators. ...
Read More
A bounded memory allocator for software-defined global address spaces
ISMM '16

This paper presents a memory allocator targeting manycore architec- tures with distributed memory. Among the family of Multi Processor System on Chip (MPSoC), these devices are composed of multiple nodes linked by an on-chip network; most nodes have ...
Read More
The intelligent memory allocator selector

Memory fragmentation is a serious obstacle preventing efficient memory usage. Garbage collectors may solve the problem; however, they cause serious performance impact, memory and energy consumption. Therefore, various memory allocators have been ...
Read More

Comments

Login options

Check if you have access through your login credentials or your institution to get full access on this article.

Full Access

Get this Publication

Published in
ISMM 2023: Proceedings of the 2023 ACM SIGPLAN International Symposium on Memory Management
June 2023
175 pages
ISBN:9798400701795
DOI:10.1145/3591195
General Chair:
Stephen M. Blackburn
Google, Australia / Australian National University, Australia
,
Program Chair:
Erez Petrank
Technion, Israel
Copyright © 2023 ACM
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than the author(s) must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected].
Sponsors
In-Cooperation
Publisher
Association for Computing Machinery
New York, NY, United States
Publication History
- Published: 6 June 2023
Permissions
Request permissions about this article.
Request Permissions

Check for updates
Author Tags
Memory Allocation
NUMA Architecture
Qualifiers
- research-article
Conference

Acceptance Rates
Overall Acceptance Rate72of156submissions,46%
Funding Sources
Other Metrics
View Article Metrics

Article Metrics
- 1
  Total Citations
  View Citations
- 392
  Total Downloads
- Downloads (Last 12 months)392
- Downloads (Last 6 weeks)31
Other Metrics
View Author Metrics
Cited By
View all

PDF Format

View or Download as a PDF file.

PDF

eReader

View online with eReader.

eReader

NUMAlloc: A Faster NUMA Memory Allocator

ISMM 2023: Proceedings of the 2023 ACM SIGPLAN International Symposium on Memory Management

ABSTRACT

Supplemental Material

Available for Download

References

Cited By

Index Terms

Recommendations

Redesign the Memory Allocator for Non-Volatile Main Memory

A bounded memory allocator for software-defined global address spaces

The intelligent memory allocator selector