ABSTRACT
Memory allocation constitutes a substantial component of warehouse-scale computation. Optimizing the memory allocator not only reduces the datacenter tax, but also improves application performance, leading to significant cost savings.
We present the first comprehensive characterization study of TCMalloc, a memory allocator used by warehouse-scale applications in Google's production fleet. Our characterization reveals a profound diversity in the memory allocation patterns of large-scale datacenter workloads, including allocated object sizes and lifetimes, as well as in their performance on heterogeneous hardware platforms. Based on these insights, we optimize TCMalloc for warehouse-scale environments. Specifically, we propose optimizations for each level of its cache hierarchy that include usage-based dynamic sizing of allocator caches, use of hardware topology to mitigate inter-core communication overhead, and improved allocation packing algorithms based on statistical data. We evaluate these design choices using benchmarks and fleet-wide A/B experiments in production, resulting in a 1.4% improvement in throughput and a 3.4% reduction in RAM usage for the entire fleet. For the applications with the highest memory allocation usage, we observe up to an 8.1% improvement in throughput and up to a 6.3% reduction in memory usage. At our scale, even a single percent CPU or memory improvement translates to significant savings in server costs.