DOI: 10.1145/3591195.3595276
research-article

NUMAlloc: A Faster NUMA Memory Allocator

Published: 06 June 2023

ABSTRACT

The NUMA architecture accommodates the hardware trend of an increasing number of CPU cores. It requires the cooperation of memory allocators to achieve good performance for multithreaded applications. Unfortunately, existing allocators do not support the NUMA architecture well. This paper presents a novel memory allocator, NUMAlloc, that is designed for the NUMA architecture. NUMAlloc is centered on binding-based memory management. On top of it, NUMAlloc proposes “origin-aware memory management” to ensure the locality of memory allocations and deallocations, as well as a method called “incremental sharing” to balance the performance benefits and memory overhead of using transparent huge pages. According to our extensive evaluation, NUMAlloc has the best performance among all evaluated allocators, running 15.7% faster than the second-best allocator (mimalloc) and 20.9% faster than the default Linux allocator, with reasonable memory overhead. NUMAlloc is also scalable to 128 threads and is ready for deployment.
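
The abstract names binding-based and origin-aware management without giving details, so the following is only an illustrative sketch of one way such a scheme could work, not NUMAlloc's actual implementation. It assumes the heap is one contiguous range split into equal per-node regions, so that the node an object came from can be computed from its address, and that a freed object goes back to that node's free list regardless of which thread frees it. All names and constants (`HEAP_BASE`, `REGION_SIZE`, `NodeHeap`, `malloc_on`, `free`) are hypothetical, and the physical binding of regions and threads to nodes is simulated rather than performed.

```python
from dataclasses import dataclass, field

# All constants are assumptions chosen for illustration only.
NUM_NODES = 4                 # assumed number of NUMA nodes
REGION_SIZE = 1 << 30         # assumed size of each node's heap region (1 GiB)
HEAP_BASE = 0x1000_0000_0000  # assumed start address of the reserved heap range
OBJ_SIZE = 64                 # a single size class keeps the sketch short


def origin_node(addr: int) -> int:
    """Compute the node that owns an address from its offset in the heap."""
    offset = addr - HEAP_BASE
    if not 0 <= offset < NUM_NODES * REGION_SIZE:
        raise ValueError("address outside the managed heap")
    return offset // REGION_SIZE


@dataclass
class NodeHeap:
    """Per-node heap: a bump pointer plus a free list of returned objects."""
    node: int
    bump: int = 0
    free_list: list = field(default_factory=list)

    def alloc(self) -> int:
        # Reuse a freed object from this node first, so memory stays local.
        if self.free_list:
            return self.free_list.pop()
        if self.bump + OBJ_SIZE > REGION_SIZE:
            raise MemoryError("node region exhausted")
        addr = HEAP_BASE + self.node * REGION_SIZE + self.bump
        self.bump += OBJ_SIZE
        return addr


heaps = [NodeHeap(n) for n in range(NUM_NODES)]


def malloc_on(node: int) -> int:
    """Allocate from the heap of the node the calling thread is assumed bound to."""
    return heaps[node].alloc()


def free(addr: int) -> None:
    """Return an object to its origin node's heap, whichever thread frees it."""
    heaps[origin_node(addr)].free_list.append(addr)
```

In this sketch, an object obtained with `malloc_on(2)` and later passed to `free` from a thread on another node still lands on node 2's free list, so the next `malloc_on(2)` reuses it locally. A real allocator would additionally have to bind each region's pages and each thread to a node through operating-system facilities, which the sketch leaves out.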

    • Published in

      ISMM 2023: Proceedings of the 2023 ACM SIGPLAN International Symposium on Memory Management
      June 2023
      175 pages
      ISBN: 9798400701795
      DOI: 10.1145/3591195

      Copyright © 2023 ACM

      Publisher

      Association for Computing Machinery

      New York, NY, United States

      Acceptance Rates

      Overall Acceptance Rate: 72 of 156 submissions, 46%
