DOI: 10.1145/3620666.3651350
research-article
Open Access

Characterizing a Memory Allocator at Warehouse Scale

Published: 27 April 2024

ABSTRACT

Memory allocation constitutes a substantial component of warehouse-scale computation. Optimizing the memory allocator not only reduces the datacenter tax, but also improves application performance, leading to significant cost savings.

We present the first comprehensive characterization study of TCMalloc, a memory allocator used by warehouse-scale applications in Google's production fleet. Our characterization reveals a profound diversity in the memory allocation patterns, allocated object sizes and lifetimes, for large-scale datacenter workloads, as well as in their performance on heterogeneous hardware platforms. Based on these insights, we optimize TCMalloc for warehouse-scale environments. Specifically, we propose optimizations for each level of its cache hierarchy that include usage-based dynamic sizing of allocator caches, leveraging hardware topology to mitigate inter-core communication overhead, and improving allocation packing algorithms based on statistical data. We evaluate these design choices using benchmarks and fleet-wide A/B experiments in our production fleet, resulting in a 1.4% improvement in throughput and a 3.4% reduction in RAM usage for the entire fleet. For the applications with the highest memory allocation usage, we observe up to 8.1% and 6.3% improvement in throughput and memory usage respectively. At our scale, even a single percent CPU or memory improvement translates to significant savings in server costs.
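As a rough illustration of what "usage-based dynamic sizing of allocator caches" could look like, the sketch below periodically moves capacity from the cache that missed least to the cache that missed most. This is a toy model under stated assumptions, not TCMalloc's implementation: the `CacheStats` record, the `rebalance` function, and the `step` and `min_capacity` parameters are all hypothetical names introduced here.

```python
from dataclasses import dataclass


@dataclass
class CacheStats:
    """Hypothetical bookkeeping for one allocator cache over one resize interval."""
    capacity: int    # bytes this cache is allowed to hold
    misses: int = 0  # requests that fell through to the next cache tier


def rebalance(caches, step, min_capacity):
    """Shift up to `step` bytes from the least-missing cache to the most-missing one.

    Total capacity across all caches is conserved, no cache shrinks below
    `min_capacity`, and miss counters are reset for the next interval.
    """
    if len(caches) >= 2:
        grow = max(caches, key=lambda c: c.misses)
        shrink = min(caches, key=lambda c: c.misses)
        # Only move capacity when usage actually differs between the two caches.
        if grow.misses > shrink.misses:
            moved = min(step, shrink.capacity - min_capacity)
            if moved > 0:
                shrink.capacity -= moved
                grow.capacity += moved
    for cache in caches:
        cache.misses = 0


# Usage: three caches of equal size, one of them under heavy allocation load.
caches = [CacheStats(256, misses=90), CacheStats(256, misses=3), CacheStats(256, misses=10)]
rebalance(caches, step=64, min_capacity=128)
print([c.capacity for c in caches])  # [320, 192, 256]
```

The design choice being illustrated is that a fixed memory budget is redistributed according to observed demand rather than split statically; the actual policy, intervals, and thresholds used in production are described in the paper itself and are not reproduced here.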

Published in

ASPLOS '24: Proceedings of the 29th ACM International Conference on Architectural Support for Programming Languages and Operating Systems, Volume 3
April 2024
1106 pages
ISBN: 9798400703867
DOI: 10.1145/3620666

Copyright © 2024 Copyright held by the owner/author(s)

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for third-party components of this work must be honored. For all other uses, contact the owner/author(s).

Publisher: Association for Computing Machinery, New York, NY, United States

Overall Acceptance Rate: 535 of 2,713 submissions, 20%