ABSTRACT
The NUMA architecture accommodates the hardware trend of an increasing number of CPU cores, but it requires the cooperation of memory allocators to achieve good performance for multithreaded applications. Unfortunately, existing allocators do not support the NUMA architecture well. This paper presents NUMAlloc, a novel memory allocator designed for the NUMA architecture. NUMAlloc is centered on binding-based memory management. On top of that, it proposes “origin-aware memory management” to ensure the locality of memory allocations and deallocations, as well as a method called “incremental sharing” to balance the performance benefits and memory overhead of using transparent huge pages. According to our extensive evaluation, NUMAlloc has the best performance among all evaluated allocators, running 15.7% faster than the second-best allocator (mimalloc) and 20.9% faster than the default Linux allocator, with reasonable memory overhead. NUMAlloc also scales to 128 threads and is ready for deployment.
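The abstract does not spell out how origin-aware memory management works, so the following is only a toy model of one plausible reading, not NUMAlloc's implementation. It assumes that the heap is split into one fixed address region per node, that an object's origin node can therefore be computed from its address, and that a freed object is cached by the freeing thread only when it is local to that thread's node. All names and constants (`ToyAllocator`, `origin_node`, `REGION_SIZE`, and so on) are hypothetical.

```python
from collections import defaultdict

# All constants are illustrative assumptions, not values from the paper.
NUM_NODES = 2            # hypothetical number of NUMA nodes
REGION_SIZE = 1 << 20    # hypothetical bytes of address space per node
HEAP_BASE = 0x10000000   # hypothetical start address of the heap
OBJ_SIZE = 64            # single size class, to keep the model small


def origin_node(addr):
    """Return the node whose region contains addr (pure address arithmetic)."""
    return (addr - HEAP_BASE) // REGION_SIZE


class ToyAllocator:
    def __init__(self):
        # Next unused address inside each node's region (bump pointer).
        self.bump = [HEAP_BASE + n * REGION_SIZE for n in range(NUM_NODES)]
        # Freed objects parked on their origin node, shared by its threads.
        self.node_free = defaultdict(list)
        # Per-thread caches holding only objects local to that thread's node.
        self.thread_free = defaultdict(list)

    def malloc(self, thread_id, thread_node):
        # Try the thread cache, then the node's list, then fresh local memory.
        if self.thread_free[thread_id]:
            return self.thread_free[thread_id].pop()
        if self.node_free[thread_node]:
            return self.node_free[thread_node].pop()
        addr = self.bump[thread_node]
        if addr + OBJ_SIZE > HEAP_BASE + (thread_node + 1) * REGION_SIZE:
            raise MemoryError("node region exhausted")
        self.bump[thread_node] = addr + OBJ_SIZE
        return addr

    def free(self, thread_id, thread_node, addr):
        node = origin_node(addr)
        if node == thread_node:
            # Local object: safe to reuse from this thread's cache.
            self.thread_free[thread_id].append(addr)
        else:
            # Remote object: send it back to the node it came from.
            self.node_free[node].append(addr)
```

In this model a thread bound to node 1 that frees an object allocated on node 0 never reuses it; the object goes back to node 0's list, so later allocations on either node are still served from local memory.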
Supplemental Material
This is an appendix for the paper titled "NUMAlloc: A Faster NUMA Memory Allocator" submitted to ISMM 2023. The appendix provides a report of the standard deviation of the performance data presented in the paper.